
dteirney · 2010-10-03, 22:15

tafypz Wrote:Sorry for the radio silence guys, I started a class structure from the vdr plugin. I am painfully slow right now (I just had elbow surgery and therefore I am drugged up and typing with my left hand only). I am on the ball, I will provide a first epg via mythXML. Depending on the performance (and dataset limitations) I might investigate a solution using SQL.

The existing Guide information extracted via the myth:// classes might be re-usable if you go the SQL route. Libcmyth has a bunch of places where SQL is used to get data from the mythconverg database if the Myth Protocol doesn't support it. Have a look at the MythDirectory.cpp class to see what method is being called via DllLibCMyth to get the existing Guide.

PhracturedBlue · 2010-10-04, 00:59

Personally, I'd recommend against trying to directly use the mythtv SQL db. The sql db gets modified more frequently than even the mythprotocol does, and some of the SQL queries needed to obtain info are truly terrifying in complexity. Do as you like, but I think you'll find the long-term maintenance of the SQL queries to be high-effort.

I think I'd try the Python API as both xbmc and Myth offer one with a robust interface. I've only spent a small amount of time with xbmc's and none with Myth, so this might not be as easy as I think, but it is likely the direction I would approach 1st.

linuxluemmel · 2010-10-04, 08:32

+100%
The database approach is IMHO the wrong way. I guess using the Python API would be safer.

dteirney · 2010-10-04, 09:48

PhracturedBlue Wrote:The patch in #7197 works very well for me with my SchedulesDirect programs. I added an additional patch to that ticket to do episode-title matching for programs I have that don't have SD info. The title match with TheTVDB works quite well, except that punctuation is sometimes different, and for multi-part episodes, thetvdb often gives the same airdate as well as adding a '(#)' at the end of the title meaning that neither method works properly for these cases.

Does it make sense to further tweak the title matching in the VideoInfoScanner? That fallback mechanism helped to match some of my library since some shows have a subtitle, but nothing else, e.g. no original air date or no season and episode information.

Stripping out punctuation would be relatively easy and dealing with a # on the end should be easy as well (probably also stripped). A CompareFuzzyMatch type method would be handy here.

If you're wondering why your myth:// TV Shows folders had all sorts of weird "x episodes" entries in them after scanning to the library, it was due to a bug that was fixed in http://trac.xbmc.org/changeset/34421. That was a weird one to track down.

dteirney · 2010-10-04, 11:57

dteirney Wrote:Stripping out punctuation would be relatively easy and dealing with a # on the end should be easy as well (probably also stripped). A CompareFuzzyMatch type method would be handy here

After a quick search on the interweb I didn't find any easy "Fuzzy Match" library for strings. Found some for words though.

A brutal algorithm that might do OK:
1) remove all punctuation
2) collapse all spaces to a single space and trim front and back
3) remove any trailing 's' characters, prior to a space, to reduce issues with plurals
4) compare ignoring case

http://www.cplusplus.com/reference/clibrary/cctype/
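For illustration only, the four steps above could be sketched like this. It is Python rather than the C/C++ that XBMC itself would need, and the names `normalize_title` and `titles_match` are made up for this sketch. It also assumes that a trailing 's' on the last word should be dropped too, which the list above does not spell out.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_title(title: str) -> str:
    """Apply the four 'brutal' steps to one title (hypothetical helper)."""
    # 1) remove all punctuation
    text = title.translate(_PUNCT_TABLE)
    # 2) collapse runs of whitespace to one space and trim both ends
    text = " ".join(text.split())
    # 3) remove trailing 's' characters from each word to reduce plural issues
    #    (a word made only of 's' characters is kept as-is)
    words = [word.rstrip("sS") or word for word in text.split(" ")]
    # 4) fold case so the comparison ignores it
    return " ".join(words).casefold()


def titles_match(a: str, b: str) -> bool:
    """True when both titles are equal after normalization."""
    return normalize_title(a) == normalize_title(b)
```

Note that step 3 is deliberately crude: it turns "Boss" into "Bo" as well, which is harmless as long as both sides get the same treatment.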

markhoney · 2010-10-04, 13:07

My PHP XMLTV grabber (which used to be at xmltv.co.nz until Sky TV's lawyers sent me a threatening letter!) uses a scoring method to determine whether there's an adequate match for a given movie or TV episode. Would this code be useful? It uses the similar_text function in PHP, and there's details at:

http://php.net/manual/en/function.similar-text.php

It looks like you can grab the C source for this from "ext\standard\string.c" in the PHP source.
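As a rough idea of what that function does, here is a Python sketch of the approach as I understand it: find the longest common substring, then recurse on the pieces to its left and right, and report the matched characters as a percentage of the combined length. This is not a port of the PHP source, and the names `similar_chars` and `similarity_percent` are invented here.

```python
def _longest_common_substring(a: str, b: str):
    """Return (start_in_a, start_in_b, length) of the first longest common run."""
    best_i = best_j = best_len = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_i, best_j, best_len = i, j, k
    return best_i, best_j, best_len


def similar_chars(a: str, b: str) -> int:
    """Count matching characters: longest common run plus recursion on both sides."""
    i, j, n = _longest_common_substring(a, b)
    if n == 0:
        return 0
    left = similar_chars(a[:i], b[:j])
    right = similar_chars(a[i + n:], b[j + n:])
    return left + n + right


def similarity_percent(a: str, b: str) -> float:
    """Matched characters as a percentage of the combined length of both strings."""
    if not a and not b:
        return 0.0
    return similar_chars(a, b) * 2 * 100.0 / (len(a) + len(b))
```

The brute-force substring search is slow for long inputs, but episode titles are short, so that should not matter much here.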

I also tried the levenshtein function, which you may well be able to find a library for, but I didn't find this to be as good.

My program basically did TheTVDB and TheMovieDB lookups for programs/movies and checked all the results through similar_text. I created a score for each result (adding the score from the lookup to the score from text matching), and as the TV episode description matches were a lot fuzzier than the movie title matches I did some extra shenanigans to try to boost the confidence of a good match (removing common words, checking the relative lengths of the strings, shifting "The" from the end to the beginning of a name, etc).

I saved all results to a DB to calibrate where I could draw the line for a positive match, and found with Movies that most scored a 200 match (100 for search score and 100 for similar text) and that anything above 190 was good, and most above 185 were good.

With TV shows, for matching a show I found the same kind of result as for Movies, but for an episode a text match of 50 or so was good enough.
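To make the scoring concrete, here is a small Python sketch of the idea, not the actual PHP grabber code. It assumes the lookup hands back a search score out of 100, and it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for similar_text, so the numbers will not be identical to PHP's. The function names and the exact comparison operators at the cutoffs are guesses for illustration.

```python
from difflib import SequenceMatcher

MOVIE_CUTOFF = 190         # combined score: "anything above 190 was good"
EPISODE_TEXT_CUTOFF = 50   # text score alone: "50 or so was good enough"


def text_score(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (stand-in for similar_text)."""
    matcher = SequenceMatcher(None, a.lower(), b.lower(), autojunk=False)
    return 100.0 * matcher.ratio()


def movie_is_match(search_score: float, wanted: str, found: str) -> bool:
    """Add the lookup's search score to the text score and apply the movie cutoff."""
    return search_score + text_score(wanted, found) > MOVIE_CUTOFF


def episode_is_match(wanted: str, found: str) -> bool:
    """Episode text is fuzzier, so a much lower text-only cutoff is applied."""
    return text_score(wanted, found) >= EPISODE_TEXT_CUTOFF
```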

Of course, if you were able to use my XMLTV listings for NZ you'd have all that information coming from MythTV, but unfortunately Sky don't want me running my website!

PhracturedBlue · (This post was last modified: 2010-10-04, 16:10 by PhracturedBlue.)

dteirney Wrote:Does it make sense to further tweak the title matching in the VideoInfoScanner? That fallback mechanism helped to match some of my library since some shows have a subtitle, but nothing else, e.g. no original air date or no season and episode information.

This was my next thought.
I was thinking to do it the same way movies are done, but then realized that was all server-side matching.

My thought was:
strip off punctuation and all leading articles (a, an, the) from both sources, then use something like Contains() to see if the title I have is a substring of the tvdb string. This is a heuristic that would work well for the set of programs I currently have looked at, but I'm not sure it is a great general-purpose answer. It also may not work as well in languages other than English.
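A minimal sketch of that substring heuristic, in Python for brevity rather than as XBMC code. The helper names `simplify` and `is_contained` are hypothetical, and the article list is English-only, as noted above.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
_ARTICLES = ("a ", "an ", "the ")  # English-only assumption


def simplify(title: str) -> str:
    """Lowercase, drop punctuation, tidy spaces, and strip one leading article."""
    text = " ".join(title.lower().translate(_PUNCT_TABLE).split())
    for article in _ARTICLES:
        if text.startswith(article):
            return text[len(article):]
    return text


def is_contained(myth_title: str, tvdb_title: str) -> bool:
    """True when the simplified myth title is a substring of the tvdb title."""
    needle = simplify(myth_title)
    # An empty needle would be "contained" in everything, so reject it.
    return bool(needle) and needle in simplify(tvdb_title)
```

For example, `is_contained("The Big Bang", "Big Bang, The (1)")` holds, because both sides reduce to strings where "big bang" appears.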

But something like 'similar_text' looks quite promising. It looks like it is pretty slow, but I don't imagine we're doing enough compares or long enough ones to make that an issue.

Assuming we can find a string-compare function that works, I'd like to adapt it to the case where multiple episodes contain the same air-date as well.

FYI, I found this, which describes lots of methods:
http://staffwww.dcs.shef.ac.uk/people/S....trics.html
And from there this library:
http://sourceforge.net/projects/simmetrics/

I think I'll put together a test case and run it against my db to see what I can get.

dteirney, I don't think it makes sense to hold up the existing work before figuring this out. Looks like spiff is ok with what you've got so far, so let's try to get that in, then work on further heuristics to improve things. At least in North America, the patches as-is have a very high success rate on anything recorded with SD.

Edit:
The other thing to remember, is that when I'm done, all myth stuff should be available in the library regardless of whether there was a DB match, so I am more focused on episodes than tv-shows where myth's info can be almost as good as what you get from thetvdb.

bobtheman · 2010-10-04, 16:57

I'm surprised the xbmc devs haven't looked into creating our own DVR/PVR solution. This would make more sense. Is there ever any voting or highly suggested feature requests that get debated and included into the core software or development cycle?

bobtheman · 2010-10-04, 17:00

Oops, I retract my prior post, looks like this is a feature in progress.

http://xbmc.org/theuni/2010/02/06/coming-soon/

http://wiki.xbmc.org/?title=GSoC_-_Unified_PVR_Frontend

http://bloggingabout.com/xbmc-addons-pvr-frontend.html

dteirney · 2010-10-04, 22:12

PhracturedBlue Wrote:dteirney, I don't think it makes sense to hold up the existing work before figuring this out. Looks like spiff is ok with what you've got so far, so let's try to get that in, then work on further heuristics to improve things. At least in North America, the patches as-is have a very high success rate on anything recorded with SD.

Agreed, I'll be putting in the patches for the myth:// related areas this week and creating the separate patches for the VideoInfoScanner area so spiff can review and then hopefully give the thumbs up.

Thanks in advance for any further work you do to improve the text matching on the episode title.

PhracturedBlue · (This post was last modified: 2010-10-05, 04:09 by PhracturedBlue.)

Ok, I hacked up a quick test to run the similar-text algorithm against my subtitle-list and thetvdb episode-list.

All I did was:
1) lowercase both strings
2) take the best score from running the similar_text function

This ignored any airtime filtering we could have done.

I got 2 incorrect matches (thetvdb has a significantly different episode name than mythtv). Score was a 55% match (against the wrong title).

I got 7 bogus matches (these episodes don't exist in thetvdb, so there was no chance of a successful match). Score was a 73% match (on the wrong title).

I had one recording which was actually a 2-parter in a single recording. The match code chose one of the 2 parts (the 2nd) but due to the significantly different length, match score was 66%

The rest matched correctly. Almost all of these get a score of 100%, but the worst case was 82%. For my recordings, these would have gone to 100% had I removed punctuation and leading articles.

It took 124 secs to compare 130 recorded episodes against 713 episodes from thetvdb, but I was executing a system call for each compare, so it wasn't very efficient. This was using the raw C code implementation though, so there wasn't any interpreter involved.

I will next actually implement this in xbmc. Merged with the airtime check, it should be very accurate for my data-sets.

Notes:
* We should probably try an exact compare before trying a fuzzy compare for performance
* The 'score' I computed above is the # of matching chars / # of chars in the subtitle (from myth). Defining a proper scoring system will be important if you want a cutoff. If we always take the best-match, this is irrelevant.
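
To show what that score looks like in practice, here is a Python sketch (not the C test program). It uses `difflib.SequenceMatcher` from the standard library as a stand-in for similar_text, since both count characters in common runs, so exact values may differ slightly from the PHP-derived code. `subtitle_score` and `best_match` are names made up for this sketch.

```python
from difflib import SequenceMatcher


def matching_chars(a: str, b: str) -> int:
    """Number of characters that fall inside common runs of a and b."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks)


def subtitle_score(myth_subtitle: str, tvdb_title: str) -> float:
    """Matching chars divided by the length of the myth subtitle (0.0 to 1.0)."""
    a, b = myth_subtitle.lower(), tvdb_title.lower()
    if not a:
        return 0.0
    if a == b:
        # Cheap exact compare first; skips the fuzzy work entirely.
        return 1.0
    return matching_chars(a, b) / len(a)


def best_match(myth_subtitle: str, tvdb_titles):
    """Pick the tvdb title with the highest score, or None for an empty list."""
    return max(tvdb_titles,
               key=lambda title: subtitle_score(myth_subtitle, title),
               default=None)
```

Because the denominator is only the myth subtitle's length, a short subtitle fully contained in a long tvdb title still scores 1.0, which is worth keeping in mind when picking a cutoff.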

Edit:
I reimplemented the code entirely into a single 'c' program, and I was able to execute the entire query in less than 1 second. That is ~100,000 comparisons per second (on my hardware). I think this is pretty reasonable, so I won't worry much about performance any more.

tafypz · 2010-10-06, 04:45

I added ticket 10445, which contains the basic pvr addon classes.
This does compile but no methods are implemented. It is based off the vdr plugin, so I may not have removed all vdr-specific members. I am taking care of epg.
I don't know if creating a ticket was the appropriate way to share this, I may get yelled at.

PhracturedBlue · 2010-10-06, 06:33

I've updated ticket #7197 with my fuzzy match heuristic. I documented its behavior there, but here it is again:

1) check for an episode id match
2) check if there is exactly 1 date that matches
3) check for a case-insensitive exact string match
4) if multiple dates matched, use case-insensitive similar_text() to find best one of the candidates
5) if no dates matched, use a case-insensitive similar_text() to find best match in entire episode list. Result must match at least 80% of the characters in myth's subtitle string

This algorithm gets everything right on my data set, and gets nothing wrong, so I consider it a success. The 80% was designed to be conservative. I'd rather not match something I should than do an incorrect match.

Once I have the code in place to read in all un-scanned myth data, that should fill the rest of the gaps, but that will be a separate patch.

Note that I chose not to strip out leading articles or punctuation. The 80% rule seems to do a good enough job not to need further heuristics.
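
For anyone who wants to see the five steps as code, here is a rough Python sketch. It is not the patch on the ticket: the `Episode` record, its field names, and `match_episode` are all hypothetical, `difflib.SequenceMatcher` stands in for similar_text(), and details such as whether step 3 looks at the whole episode list are assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Episode:
    """Hypothetical record; field names are illustrative only."""
    episode_id: str
    air_date: str
    title: str


def coverage(subtitle: str, title: str) -> float:
    """Fraction of the myth subtitle's characters found in the candidate title."""
    a, b = subtitle.lower(), title.lower()
    if not a:
        return 0.0
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / len(a)


def match_episode(rec: Episode, candidates: List[Episode],
                  cutoff: float = 0.80) -> Optional[Episode]:
    # 1) episode id match
    if rec.episode_id:
        for ep in candidates:
            if ep.episode_id == rec.episode_id:
                return ep
    # 2) exactly one candidate shares the air date
    same_date = [ep for ep in candidates
                 if rec.air_date and ep.air_date == rec.air_date]
    if len(same_date) == 1:
        return same_date[0]
    # 3) case-insensitive exact title match
    for ep in candidates:
        if ep.title.lower() == rec.title.lower():
            return ep
    # 4) several dates matched: best fuzzy score among those candidates
    if same_date:
        return max(same_date, key=lambda ep: coverage(rec.title, ep.title))
    # 5) no dates matched: best fuzzy score overall, subject to the 80% rule
    best = max(candidates, key=lambda ep: coverage(rec.title, ep.title),
               default=None)
    if best is not None and coverage(rec.title, best.title) >= cutoff:
        return best
    return None
```

Returning None when nothing clears the cutoff mirrors the conservative intent: an unmatched recording is preferable to a wrong match.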

dteirney · 2010-10-06, 11:40

tafypz Wrote:I added ticket 10445, which contains the basic pvr addon classes.
This does compile but no methods are implemented. It is based off the vdr plugin, so I may not have removed all vdr-specific members. I am taking care of epg.
I don't know if creating a ticket was the appropriate way to share this, I may get yelled at.

Trac ticket was fine. Please submit an svn diff next time though. I've ripped out even more stuff than you had and wired up the build artifacts through to the top level ./configure and make.

There is a stripped-down patch that compiles, so I suggest someone tests it to make sure XBMC doesn't barf on it before I commit. No more time today to test...

dteirney · 2010-10-06, 11:41

PhracturedBlue Wrote:I've updated ticket #7197 with my fuzzy match heuristic. I documented its behavior there, but here it is again:

1) check for an episode id match
2) check if there is exactly 1 date that matches
3) check for a case-insensitive exact string match
4) if multiple dates matched, use case-insensitive similar_text() to find best one of the candidates
5) if no dates matched, use a case-insensitive similar_text() to find best match in entire episode list. Result must match at least 80% of the characters in myth's subtitle string

This algorithm gets everything right on my data set, and gets nothing wrong, so I consider it a success. The 80% was designed to be conservative. I'd rather not match something I should than do an incorrect match.

Once I have the code in place to read in all un-scanned myth data, that should fill the rest of the gaps, but that will be a separate patch.

Note that I chose not to strip out leading articles or punctuation. The 80% rule seems to do a good enough job not to need further heuristics.

Fantastic work! Thanks. I'll have to make a number of commits and patches for external review to get all this work in. Might take me a wee while. Will be worth it though. Thanks again.