Improved code for imdb lookup of directories
#1
i had an annoying problem: the imdb lookup failed to match about half of my movies because the folders still have their original scene names.

this addition to imdb.cpp geturl() corrected this:
Quote:
removeallafter(szmovie, " custom ");
removeallafter(szmovie, " dvdscr ");
removeallafter(szmovie, " unrated ");
removeallafter(szmovie, " multisubs ");
removeallafter(szmovie, " ws ");
removeallafter(szmovie, " swedish ");
removeallafter(szmovie, " pal ");
removeallafter(szmovie, " ntsc ");
removeallafter(szmovie, " 2003 ");
removeallafter(szmovie, " 2004 ");
removeallafter(szmovie, " 2005 ");
removeallafter(szmovie, " 2006 ");
and
Quote:
removeallafter(szmovie, "+custom+");
removeallafter(szmovie, "+dvdscr+");
removeallafter(szmovie, "+unrated+");
removeallafter(szmovie, "+multisubs+");
removeallafter(szmovie, "+ws+");
removeallafter(szmovie, "+swedish+");
removeallafter(szmovie, "+pal+");
removeallafter(szmovie, "+ntsc+");
removeallafter(szmovie, "+2003+");
removeallafter(szmovie, "+2004+");
removeallafter(szmovie, "+2005+");
removeallafter(szmovie, "+2006+");
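
the two lists differ only in the separator, so the calls could be driven from one table. a minimal sketch, assuming removeallafter() cuts the string at the first match and drops the match itself (that behaviour is an assumption). truncateAt and stripSceneTags are made-up names that do not exist in imdb.cpp, and std::string is used here instead of the szmovie buffer:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for removeallafter(): cut `name` at the first
// occurrence of `marker`, dropping the marker and everything after it.
void truncateAt(std::string& name, const std::string& marker)
{
    const std::string::size_type pos = name.find(marker);
    if (pos != std::string::npos)
        name.erase(pos);
}

// Apply every tag once per separator style (" tag " and "+tag+")
// instead of listing each call twice.
std::string stripSceneTags(std::string name)
{
    static const std::vector<std::string> tags = {
        "custom", "dvdscr", "unrated", "multisubs",
        "ws", "swedish", "pal", "ntsc"
    };
    for (const std::string& tag : tags)
    {
        truncateAt(name, " " + tag + " ");
        truncateAt(name, "+" + tag + "+");
    }
    return name;
}
```

like the original calls, this only matches a tag with a separator on both sides, so a tag at the very end of the name is left alone.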

as noted in imdb.cpp, the year notations should be handled by matching any four digits, but this fix improves things greatly until that is done. for me the success rate went up from 52/96 to 96/96.
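
for reference, matching any four digits instead of listing each year could look roughly like this. it is only a sketch: findYearToken and stripFromYear are hypothetical names, and it assumes the separators have already been turned into ' ' or '+':

```cpp
#include <cctype>
#include <string>

// Returns the index of the first token made of exactly four digits, where
// tokens are delimited by ' ' or '+'. Returns std::string::npos if none.
std::string::size_type findYearToken(const std::string& name)
{
    std::string::size_type start = 0;
    while (start <= name.size())
    {
        std::string::size_type end = name.find_first_of(" +", start);
        if (end == std::string::npos)
            end = name.size();
        if (end - start == 4)
        {
            bool allDigits = true;
            for (std::string::size_type i = start; i < end; ++i)
                if (!std::isdigit(static_cast<unsigned char>(name[i])))
                    allDigits = false;
            if (allDigits)
                return start;
        }
        start = end + 1;
    }
    return std::string::npos;
}

// Cuts the name at the first four-digit token and drops the separator
// that was left in front of it.
std::string stripFromYear(std::string name)
{
    const std::string::size_type pos = findYearToken(name);
    if (pos != std::string::npos)
        name.erase(pos);
    while (!name.empty() && (name.back() == ' ' || name.back() == '+'))
        name.pop_back();
    return name;
}
```

note that this cuts at any four-digit token, so a title that itself contains such a number would be cut short too.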

please commit these changes as i'm sure most people will notice an improvement.
#2
i'm not 100% sure how this routine works (i need to take the time to have a decent look at it, it's a bit cryptic), but if you give me an idea of the sort of file names you are dealing with, it would be a great help in writing a routine that works better than the current one (and is more obvious from a code perspective).

in the meantime, i'll add all except the year groups.

cheers,
jonathan
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


#3
thanks! example name: the.worst.example.movie.ever.1986.internal.complete.ntsc.dvdr-wow
the name always starts with the title. what follows next varies.
here's some logic that works quite well:

Quote: imdb.cpp geturl()

// cleaning up
- if the string is a file name (not folder), remove last '.' and everything after it.
- replace all '.' '-' '_' with ' '
- make string lower case

// remove common unusable stuff that people add themselves
- remove strings included in () or []

// the next step is not needed if files are always stacked when doing a lookup.
- remove strings: 'cd1' 'cd2' 'cd3' 'cd4'

// this alone will fix almost half of all now failing searches! (year, if present, always comes right after title)
- remove first four digit string in the name and everything after it.

// if four digit string is not found:
- loop for: (spaces within quotes are intentional) 'dvd' 'stv' 'divx' 'xvid' 'svcd' 'ac3' 'dts' 'internal' ' proper ' 'limited' 'rerip' 'custom' 'unrated' 'multisubs' ' ws ' ' tc ' ' ts ' 'swedish' ' pal ' 'ntsc' 'complete'
- in the loop, look for match in the string and remove it and everything after it.

// multiple spaces get created in the loop since some matches do not include spaces.
// this is preferred since 'dvd' then matches 'dvd' 'dvdr' 'dvdrip' 'dvdscr'... creating a shorter loop.
// some matches need the spaces to prevent removal of parts of the title. ex. 'ws' would match too much.
- replace multiple spaces in the string with single spaces.

- trim trailing spaces.

// use results
- replace all spaces with '+' for use in search url.
- set strurl
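
the steps above could be sketched roughly as follows. this is an illustration under assumptions, not the code in imdb.cpp: cleanForSearch and its isFile parameter are made-up names, the year is taken to be a standalone four-digit token, 'cd1'..'cd4' are only removed as whole words, and leading spaces are trimmed as well as trailing ones:

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: turns a file or directory name into a search string.
std::string cleanForSearch(std::string name, bool isFile)
{
    // File names only: drop the last '.' and everything after it.
    if (isFile)
    {
        const std::string::size_type dot = name.rfind('.');
        if (dot != std::string::npos)
            name.erase(dot);
    }

    // Separators become spaces, everything else goes to lower case.
    for (char& c : name)
    {
        if (c == '.' || c == '-' || c == '_')
            c = ' ';
        else
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    // Remove text in () or [] and the disc markers cd1..cd4.
    static const std::regex bracketed(R"re(\([^)]*\)|\[[^\]]*\])re");
    static const std::regex discMarker(R"re(\bcd[1-4]\b)re");
    name = std::regex_replace(name, bracketed, " ");
    name = std::regex_replace(name, discMarker, " ");

    // A standalone four-digit token is treated as the year: cut there.
    static const std::regex yearToken(R"re(\b\d{4}\b)re");
    std::smatch year;
    if (std::regex_search(name, year, yearToken))
    {
        name.erase(static_cast<std::string::size_type>(year.position(0)));
    }
    else
    {
        // No year found: cut at each release tag that occurs in the name.
        // Tags written with spaces only match as whole words.
        static const std::vector<std::string> tags = {
            "dvd", "stv", "divx", "xvid", "svcd", "ac3", "dts", "internal",
            " proper ", "limited", "rerip", "custom", "unrated", "multisubs",
            " ws ", " tc ", " ts ", "swedish", " pal ", "ntsc", "complete"
        };
        for (const std::string& tag : tags)
        {
            const std::string::size_type pos = name.find(tag);
            if (pos != std::string::npos)
                name.erase(pos);
        }
    }

    // Collapse runs of spaces, then trim both ends.
    static const std::regex spaces(" +");
    name = std::regex_replace(name, spaces, " ");
    name.erase(0, name.find_first_not_of(' '));
    while (!name.empty() && name.back() == ' ')
        name.pop_back();

    // Spaces become '+' for use in the search url.
    std::replace(name.begin(), name.end(), ' ', '+');
    return name;
}
```

with the example name from this post, cleanForSearch("the.worst.example.movie.ever.1986.internal.complete.ntsc.dvdr-wow", false) gives "the+worst+example+movie+ever".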

this works both for release names and common renaming schemes people use.

there's no way of knowing in what order the seeds in the loop appear in the file/directory name.
however, there are a few that can never both appear in the same name.
if needed i can group them and each group could loop until first match is found.

/jonte



#4
the "full" scan uses code which does this. it shouldn't be difficult to have the "ad-hoc" lookup do it as well.
#5
yep, but there is still some stuff that isn't taken care of.

the only problem i have with the above logic is that we need to be careful about removing strings of 4 digits, as numbers that look like years can be a normal part of a movie name (e.g. the movies 1942 and 2001: a space odyssey).

perhaps there are some other key identifiers (e.g. the year appears after a particular separator, or before a particular identifier group)?

cheers,
jonathan


#6
wouldn't it be easier to just rename your files to the actual movie name?
#7
Quote (senergy @ feb. 16 2006, 04:35): wouldn't it be easier to just rename your files to the actual movie name?
i agree :p
#8
that's what i do, though of course mine are on my nas (smb), but...

hellraiser i.nfo
hellraiser ii hellbound.nfo
hellraiser iii hell on earth.nfo
hellraiser iv bloodline.nfo
hellraiser v inferno.nfo
hellraiser vi hellseeker.nfo
hellraiser vii deader.nfo
hide & seek.nfo
hitch.nfo
house of wax.nfo
jay & silent bob strike back.nfo
johnny english.nfo


etc. etc. more detailed info than that is what i use imdb for ;)

** the above mentioned are for personal backup only. they are in fact owned and originals were purchased.
#9
as with so many other things, i'd rather do something that takes a little work once than have to do an easy task over and over until the end of time. ;)

having said that, i think it's much better to remove all 4-digit numbers than none at all. removing them will improve the search for almost all movies at the cost of not correctly recognizing a handful.

both of jonathan's examples, 1942 and 2001: a space odyssey, would work anyway, since the first word is always part of the title and never the release year.

any logic attempting to figure out if a year in the middle of a name is part of the title or not would fail more often than a simple rule of removing all years that are not the first 4 chars.

secondly, if a year is found, there's never any need to continue cleaning, since the year always comes right after the title.
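
a small sketch of that rule, assuming the name has already been lower-cased and split on spaces. isYearToken and titleBeforeYear are hypothetical names, and the foundYear flag is only there so the caller can skip the rest of the cleaning:

```cpp
#include <cctype>
#include <sstream>
#include <string>

// True if `token` consists of exactly four decimal digits.
bool isYearToken(const std::string& token)
{
    if (token.size() != 4)
        return false;
    for (char c : token)
        if (!std::isdigit(static_cast<unsigned char>(c)))
            return false;
    return true;
}

// Keeps the words up to (not including) the first four-digit token that is
// not the first word. `foundYear` is set to true when such a token was hit.
std::string titleBeforeYear(const std::string& name, bool& foundYear)
{
    std::istringstream words(name);
    std::string word;
    std::string title;
    bool first = true;
    foundYear = false;
    while (words >> word)
    {
        if (!first && isYearToken(word))
        {
            foundYear = true;
            break;  // the year ends the title, nothing more to clean
        }
        if (!first)
            title += ' ';
        title += word;
        first = false;
    }
    return title;
}
```

a name that starts with four digits keeps them as the title and is only cut at a later four-digit token, if there is one.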

/jonte