Help Needed: Media Site Scraper Plugin
#1
Hi everybody,
I have been thinking about developing a universal media content scraper plugin: a general framework that uses regular expressions to extract categories and media links from multimedia websites. XOT-Uzg gave me the idea. The plugin should have an architecture similar to the IMDb and TV.com scrapers for movie information. An XML file would carry all the regular expressions needed to extract media links from the HTML.

The plugin would list all installed XML scraper files in its base directory, so if TVLinks.xml and Joox.xml were installed, it would list TVLinks and Joox. When the user browses into one of these entries, the plugin opens the website, runs the regexp searches defined in that XML file, and displays the resulting media/directory links.

What do you all think of this idea? Is the XML regexp format really a good idea, or would you prefer a framework with Python plugins as in XOT-Uzg? Would XML be too slow for a plugin? I would like this thread to be a collective brainstorm on the framework. Any suggestions?
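To make the XML idea a bit more concrete, here is a rough sketch of how a scraper definition could be loaded and applied. The XML layout (a `<scraper>` element with `<rule type="...">` children), the site name ExampleSite, and the sample HTML are all made up for illustration; nothing here is an existing XBMC format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical scraper definition. Element and attribute names are
# assumptions for this sketch, not an existing XBMC scraper schema.
SCRAPER_XML = r"""
<scraper name="ExampleSite">
  <rule type="category"><![CDATA[<a class="cat" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>]]></rule>
  <rule type="media"><![CDATA[<a class="video" href="(?P<url>[^"]+\.flv)">(?P<title>[^<]+)</a>]]></rule>
</scraper>
"""


def load_rules(xml_text):
    """Parse the XML and return (scraper name, [(rule type, compiled regex)])."""
    root = ET.fromstring(xml_text)
    rules = [(rule.get("type"), re.compile(rule.text.strip()))
             for rule in root.findall("rule")]
    return root.get("name"), rules


def scrape(html, rules):
    """Run every rule over the HTML and collect the matches as dicts."""
    items = []
    for kind, pattern in rules:
        for match in pattern.finditer(html):
            items.append({
                "type": kind,
                "url": match.group("url"),
                "title": match.group("title").strip(),
            })
    return items


# Made-up page content standing in for a downloaded website.
SAMPLE_HTML = """
<a class="cat" href="/comedy">Comedy</a>
<a class="video" href="/media/clip1.flv">Clip One</a>
"""

name, rules = load_rules(SCRAPER_XML)
items = scrape(SAMPLE_HTML, rules)
for item in items:
    print(name, item["type"], item["title"], item["url"])
```

Wrapping each regexp in CDATA avoids having to escape `<` and `"` inside the XML, and named groups keep the rule independent of group order.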
The plugin should:
  • Extract media URLs (video and audio)
  • Extract browsable categories
  • Extract image information for the items
  • Give developers the chance to develop media scrapers very fast, without all the python headaches. Basic regexp skills should be enough.
  • Have a repository of scraper XMLs, so people can install them automatically via a script.
  • Make user interaction possible (Virtual keyboard to enter a search term for youtube for example)
  • Have an icon embedded as base64 in the XMLs for each scraper
  • Set the user agent for urllib2 (some sites like joox.net block the urllib user agent)
  • Offer the possibility for very complicated scrapers to utilize python scripts. This seems to be necessary for TVLinks, because they use a very bad kind of URL hiding (see the code of TVLinks plugin). Or is there a way to express these tasks without python with an XML rule?
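For the user agent point in the list above, a minimal sketch could look like the following. It uses `urllib.request`, the Python 3 successor of urllib2; the user agent string and the helper names `build_request` and `fetch` are only illustrative.

```python
import urllib.request

# Assumed value: any browser-like string; this exact one is illustrative.
USER_AGENT = "Mozilla/5.0 (compatible; MediaScraper/0.1)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download a page using the custom user agent (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("http://example.com/")
print(req.get_header("User-agent"))
```

The user agent could also be made a per-scraper setting in the XML, so that only sites that need it get a special value.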
#2
It could work well.

I personally like using sgmllib for handling HTML, but I suck at regex. That's why AMT's showtimes scrapers and XBMC Lyrics scrapers are separate modules.

I like having the plugins separate. As a plugin you can set any sub category as a source in files view. So having one plugin to do multiple sites would work. But in Library view it would not. Not really a big deal though.

Quote: Have an icon embedded as base64 in the XMLs for each scraper
I think it's better to have a .tbn file with the same name as the XML, so TVLinks.xml and TVLinks.tbn, like the XBMC scraper system does.

Yes you would need to have a player module, with some special routines for certain sites.

So maybe a structure like this:
Code:
Modules-
  TVLinks-
    scraper.py
    scraper.tbn
    player.py [optional]

The main.py file would then load the folders under Modules as categories.
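A rough sketch of that loading step, assuming the folder layout above. The function name `discover_scrapers` and the returned dict keys are made up, and the demo builds a throwaway Modules tree in a temporary directory instead of touching a real install. It accepts either scraper.py or scraper.xml as the scraper file.

```python
import os
import tempfile


def discover_scrapers(modules_dir):
    """Return one entry per sub-folder that contains a scraper file."""
    found = []
    for name in sorted(os.listdir(modules_dir)):
        folder = os.path.join(modules_dir, name)
        if not os.path.isdir(folder):
            continue
        files = set(os.listdir(folder))
        if "scraper.py" in files or "scraper.xml" in files:
            thumb = os.path.join(folder, "scraper.tbn")
            found.append({
                "name": name,
                # Thumbnail and player module are optional.
                "thumb": thumb if "scraper.tbn" in files else None,
                "has_player": "player.py" in files,
            })
    return found


# Demo: build a temporary Modules tree and list it.
with tempfile.TemporaryDirectory() as modules_dir:
    layout = {
        "TVLinks": ["scraper.py", "scraper.tbn", "player.py"],
        "Joox": ["scraper.xml"],
        "NotAScraper": ["readme.txt"],
    }
    for folder, filenames in layout.items():
        os.makedirs(os.path.join(modules_dir, folder))
        for filename in filenames:
            open(os.path.join(modules_dir, folder, filename), "w").close()
    result = discover_scrapers(modules_dir)

for entry in result:
    print(entry["name"], entry["has_player"])
```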

I like copying XBMC scraper format, so maybe a combination, instead of scraper.py, have scraper.xml.

I think I just said the same thing you did :)
For python coding questions first see http://mirrors.xbmc.org/docs/python-docs/