2010-09-04, 12:10
A little while ago I mentioned wanting to be able to write Python scrapers. Toward that end, I've been doing some work on cleaning up the general CScraper interface so it's less dependent on having XML scrapers. With my changes, CScraper::Run is replaced by a set of functions that return higher-level objects. (Trac ticket #10078.)
For example: OLD: a client that wants to match an album calls scraper.Run("CreateAlbumSearchUrl") with the album/artist name, parses the returned URL, and passes it to scraper.Run("GetAlbumSearchResults"), which returns XML with titles and URLs. The caller parses these into CMusicAlbumInfo objects and presents them to the user for selection, then calls scraper.Run("GetAlbumDetails") with the user's selection and parses the returned XML into a CAlbum object.
NEW: clients call scraper.FindAlbum(title), which returns a vector of CMusicAlbumInfo objects for the user to select from. The CScraperUrl of the selected item is passed to scraper.GetAlbumDetails, and a CAlbum is returned.
The new implementation of CScraper::FindAlbum calls both the CreateAlbumSearchUrl and GetAlbumSearchResults scraper functions, but it wouldn't have to for, say, a Python scraper, which could do its own fetching via callbacks. Also, a Python scraper would more naturally return a list of objects (or dicts) rather than XML. The next step to accomplish this is either to make the CScraper media functions pure virtual and create various overrides, or to use a bridge pattern. I'd also be interested in working with jfcarroll's multi-language addon code if possible, so that other languages can be used.
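To make that concrete, here's a rough sketch of what such a Python scraper might look like. Everything here is hypothetical - the binding doesn't exist yet - but it shows the two ideas above: the scraper does its own fetching via a host-supplied callback, and it returns a plain list of dicts rather than XML for the C++ side to re-parse.

```python
import re

def parse_album_results(html):
    """Turn a search-results page into a list of dicts instead of
    returning XML for the caller to parse (hypothetical markup)."""
    pattern = re.compile(
        r'<a href="(?P<url>/album/\d+)">(?P<title>[^<]+)</a>')
    return [m.groupdict() for m in pattern.finditer(html)]

def find_album(artist, album, fetch):
    # 'fetch' is a callback supplied by the host (XBMC) or by a test
    # harness; the scraper builds its own search URL instead of going
    # through CreateAlbumSearchUrl/GetAlbumSearchResults.
    query = f"{artist} {album}".replace(" ", "+")
    html = fetch(f"https://example.com/search?q={query}")
    return parse_album_results(html)

# A test harness can inject a canned page instead of hitting the web:
canned = '<a href="/album/42">Dark Side of the Moon</a>'
results = find_album("Pink Floyd", "Dark Side", lambda url: canned)
# results == [{"url": "/album/42", "title": "Dark Side of the Moon"}]
```

Note how the injectable fetch callback also solves half of the testing problem below for free.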
One question I ran into was testing. At present there don't seem to be any automated scraper tests. I would like to build a (Python) framework for such testing (and perhaps integrate it into 'make test' later), but then I have two problems: (1) it's hard to run scrapers if XBMC isn't running, and (2) there isn't enough test data.
For the first problem, I would like to move the scraper code (everything in the modified CScraper "and below") into a library (probably dynamic, but static would be fine too). This will allow a test harness (and other clients) to load it and pass it data. There appears to be a lot of dynamic library code in the source tree - presumably I'd want to create a wrapper class inheriting from DllDynamic and load it with one of DllLoader/SectionLoader/DllContainerLoader at an appropriate point.
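If the scraper code does end up in a dynamic library, a Python test harness could load it directly with ctypes. A minimal sketch - the library name and any exported entry points are pure assumptions, since today the scraper code is compiled into XBMC itself:

```python
import ctypes
import ctypes.util

# Hypothetical library name; no such library is built today, so the
# harness falls back gracefully when it can't be found/loaded.
lib_path = ctypes.util.find_library("xbmcscraper") or "libxbmcscraper.so"
try:
    scraper_lib = ctypes.CDLL(lib_path)
    # e.g. scraper_lib.ScraperFindAlbum(b"artist", b"album")
except OSError:
    scraper_lib = None  # library not built yet
```

A C ABI wrapper around the CScraper entry points would keep the ctypes side trivial, whichever of DllLoader/SectionLoader/DllContainerLoader XBMC itself ends up using.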
For the second, I hope I can ask for and get test data from people when I'm ready for it: directory listings (no actual media files) and matching music/video database dumps. I'd pull the results of successful scrapes together into test data, which I can run against the current XBMC, XBMC with my modifications, and any Python conversions of existing XML scrapers. I would also fetch and save a copy of the web pages used for testing as part of the test data, since (1) web sites change and (2) it's not good citizenship to hit a search site too many times just to run tests (though some optional live tests would be a good idea).
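The saved-pages idea might look something like the following sketch: a fixture directory holding an index from URL to saved file, and a fetch callback that replays from disk so tests never touch the live site. The layout (a urls.json index) is just an assumption for illustration.

```python
import json
from pathlib import Path

def replay_fetch(fixture_dir):
    """Return a fetch callback that serves previously saved web pages
    from disk, keyed by URL via a urls.json index (hypothetical layout).
    Tests pass this to the scraper instead of a live HTTP fetcher."""
    fixture_dir = Path(fixture_dir)
    index = json.loads((fixture_dir / "urls.json").read_text())
    def fetch(url):
        # KeyError here means the scraper requested a URL that was
        # never captured - itself a useful test failure.
        return (fixture_dir / index[url]).read_text()
    return fetch
```

The same callback shape would work for optional live tests: just swap in a real HTTP fetcher.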
Something else to throw out: when Python scrapers are possible, and presuming I successfully convert the current set of XML scrapers to equally-efficient Python with the same results, is there a good reason to keep the XML scraper system around? One might be "It's easier to teach people XML than Python", but I don't think that's terribly valid: anyone who can learn the necessary details about XML, functions, variables, regexes, chaining, etc. to write an XML scraper should be able to adapt those regexes to a Python template - only a subset of real programming would be required for most.
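Since an XML scraper is mostly a list of regexes plus output expressions, the "adapt the regexes to a Python template" step could be largely mechanical. A hedged sketch of what such a template might look like (the patterns and field names are invented for illustration):

```python
import re

# Each entry loosely mirrors one <RegExp> block from an XML scraper:
# a pattern plus the field its capture group fills in. Converting a
# scraper would mostly mean carrying these pairs over.
RULES = [
    (re.compile(r'<h1 class="title">([^<]+)</h1>'), "title"),
    (re.compile(r'<span class="year">(\d{4})</span>'), "year"),
]

def scrape_details(page):
    """Apply each rule to the page and collect matched fields into a
    dict - the Python analogue of an XML scraper's output buffers."""
    details = {}
    for pattern, field in RULES:
        m = pattern.search(page)
        if m:
            details[field] = m.group(1)
    return details

page = '<h1 class="title">Example Album</h1><span class="year">1973</span>'
# scrape_details(page) == {"title": "Example Album", "year": "1973"}
```

Anything an XML scraper can express with chained RegExp blocks fits this shape; scrapers needing real logic would just write ordinary Python around it.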