Streamlined Scraper Over-ride
#1
A feature that I would welcome, and that would probably be welcomed in general, is the ability to manually override, in-app, where a specific file is scraped from: namely, an editable text field in "Movie Information" where a URL can be pasted, so the system would simply scrape from that page without attempting to grok the film itself.

From personal experience, and from reviewing a variety of threads, .nfo files with URLs are inelegant at best and inconsistent at worst.

On many an occasion, right now included, I've sat pulling my hair out because the system will not scrape a page no matter how many ways I've tried to get it to. Creating .nfo files from scratch seems like a manual fix for an over-automated system that is failing. This would be a half-step where the onus is on the user to locate a suitable page to scrape from a supported source.

Support can be limited to the minimum (IMDb, TMDb, TVDB), and the content type can be determined with a drop-down or radio buttons in situ. Again, by design, the system does not need to interpret or intuit "is this the right choice?".

A genuine override: "take my word for it, this is the movie in question, now please populate the database".
#2
What people are supposed to do, in the event that a scanned movie gets ID'ed wrong from the file name, is just bring up the movie info screen and select "refresh", and then you will be presented with a list of all matches. You can even do additional manual searches for the title, and then select the exact entry that goes with that video file.

The point of using NFOs with URLs is when you don't want to (or are unable to, for some reason) correct the actual file name itself, and want a way to automatically correct the info as it gets scanned. Normally you just use the correct file name and year and you never have to touch NFO files, at all.
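For concreteness (assuming I have the mechanism right), a URL-style .nfo is just a plain-text file with the same base name as the video (e.g. Movie.nfo next to Movie.mkv) whose content is the URL of the matching entry; the example ID below is The Shawshank Redemption, picked purely for illustration:

```
https://www.imdb.com/title/tt0111161/
```

At scan time the scanner picks the URL up and resolves it to the scraper's internal ID for that entry, bypassing the filename-based lookup.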

If an .nfo file with a URL doesn't work, that means the URL points to a page the scraper can't use. The scraper isn't reading the actual HTML on the page; it's just using the URL to find the exact ID for the movie. I don't know why you think they're "inconsistent". They either work or they don't.

There is very little that is being "interpreted" or being "automated" in the entire scanning process.
#3
uhm, of course the scraper is reading the html pages. where do you think the information comes from? (and where do you think the term scraping originates)? it doesn't *parse* the html files however.
#4
Correct me if I am wrong, but I believe our default scrapers are using an API for most of the work in talking to the scraper site. I am well aware that some scrapers are able to grab the actual raw HTML page that a person would see, and extract the information from that.
#5
(2014-12-04, 11:10)ironic_monkey Wrote: uhm, of course the scraper is reading the html pages. where do you think the information comes from? (and where do you think the term scraping originates)? it doesn't *parse* the html files however.

That's how it was until a few years ago. These days every metadata site (IMDb, TMDb, TVDB, ...) provides an API which works with XML or JSON or whatever to deliver the details in a structured manner. That is much easier to handle than scraping an HTML site which changes every now and then.
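To illustrate why a structured API is easier to handle than raw HTML, here is a minimal sketch: the field names are modeled loosely on TMDb's movie endpoint, but the payload is hard-coded (no network call, no API key), so treat it as an illustration rather than real TMDb output.

```python
import json

# Hard-coded sample of what a TMDb-style JSON reply for a movie lookup
# might look like (field names modeled on TMDb; values invented).
sample_response = json.dumps({
    "id": 603,
    "title": "The Matrix",
    "release_date": "1999-03-30",
    "genres": [{"id": 878, "name": "Science Fiction"}],
})

def extract_details(payload: str) -> dict:
    """Pull the fields a scraper cares about from a structured API reply.

    No regex, no HTML parsing: the structure is already machine-readable,
    so a site redesign can't silently break the extraction.
    """
    data = json.loads(payload)
    return {
        "title": data["title"],
        "year": data["release_date"][:4],
        "genres": [g["name"] for g in data["genres"]],
    }

print(extract_details(sample_response))
```

The same lookup against the HTML page would need fragile pattern matching that breaks whenever the site's markup changes.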

So what you are suggesting would probably not work, because as a user you'd have to figure out the exact API URL for the movie (or episode) that needs to be scraped. Just going to IMDb, looking up your movie and entering that URL wouldn't work. Furthermore, Kodi is built around a 10-foot interface, so most people won't even have a keyboard attached, which would be necessary for copy & paste (I'm assuming you don't want to hand-type a complex URL).
Always read the online manual (wiki), FAQ (wiki) and search the forum before posting.
Do not e-mail Team Kodi members directly asking for support. Read/follow the forum rules (wiki).
Please read the pages on troubleshooting (wiki) and bug reporting (wiki) before reporting issues.
#6
Oh really I must have forgotten how to read the scrapers i invented.

Whether it is xml or html is rather irrelevant as long as it ain't parsed. IMDB API costs money and requires you to hide your source.

As far as API usage goes, nothing has changed since i initially created this back in 2007. TVDB used the api from the first moment, IMDB scrapes html. for this reason the scraper system was designed as is - not based on parsing API outputs - but a regex system to handle both API driven and non-API driven backends. some stuff has been added afterwards (xslt and json) but those are not used in your examples.
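A toy sketch of the regex approach described above: the same pattern-matching machinery can pull a field out of raw HTML or out of API XML, which is why a regex-driven scraper system can serve both kinds of backend without a real parser. The snippets below are invented examples, not actual IMDb or TVDB output, and the regexes are illustrative rather than the real scraper expressions.

```python
import re

# Invented stand-ins for an HTML page and an XML API reply.
html_page = '<meta property="og:title" content="Blade Runner (1982)" />'
xml_api = '<Series><SeriesName>Twin Peaks</SeriesName></Series>'

# One mechanism, two backends: a regex match against whatever text
# the backend returns, HTML or XML alike.
title_from_html = re.search(r'og:title" content="([^"(]+)', html_page).group(1).strip()
title_from_xml = re.search(r'<SeriesName>([^<]+)</SeriesName>', xml_api).group(1)

print(title_from_html)  # Blade Runner
print(title_from_xml)   # Twin Peaks
```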
#7
Thank you for the information.

EDIT: redacted previous statement. My bad.
#8
i never meant to bitch. sorry if i came off that way. i just jumped on wrong information, that is all. pointing out an error != bitching from my pov.
#9
My examples were probably badly chosen but my point stays the same:
Providing the possibility to enter a URL where the scraper should get the information from is IMO a bad idea because the user has to know how/where the scraper gets the data from. For IMDb it scrapes the HTML page, for TVDB it uses some API, for TADB it uses a JSON API, for MusicBrainz it uses an XML API and the specific URLs used by the scraper are usually not meant to be found/handled by users manually.
The only thing that might make sense is to allow specifying a unique identifier, for a scraper already defined in Kodi, for an item. But isn't this what we already support through NFOs? (I've never used this myself so I don't really know.)
#10
sure thing, never meant to contend that point. and yeah, that's what url nfos are. the scrapers have a dedicated function to translate such urls into something they understand. the first taker is used, with some prioritisation logic around it (default scraper, scraper for the source, and so on).
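Conceptually, that URL-translation step can be sketched as below. The patterns and function name here are hypothetical stand-ins, not the real scraper expressions; the point is only the "first taker wins" dispatch the post describes.

```python
import re

# Hypothetical per-scraper patterns that claim a pasted URL and extract
# the site-specific ID the scraper actually queries with.
URL_PATTERNS = [
    (re.compile(r"imdb\.com/title/(tt\d+)"), "imdb"),
    (re.compile(r"themoviedb\.org/movie/(\d+)"), "tmdb"),
]

def translate_nfo_url(url):
    """Return (scraper, id) from the first scraper that claims the URL."""
    for pattern, scraper in URL_PATTERNS:
        match = pattern.search(url)
        if match:
            return scraper, match.group(1)
    return None  # no taker: this is the case where a URL NFO "doesn't work"

print(translate_nfo_url("https://www.imdb.com/title/tt0133093/"))
```

When no pattern claims the URL the lookup simply fails, which matches the earlier observation that URL NFOs "either work or they don't".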
#11
Sorry for misreading your comment, ironic_monkey.
