Clean scraping API
#61
Just to reiterate, this is one of the most exciting projects for XBMC that I somehow didn't understand or know about for virtually the entire length of its development. Topfs2 and garbear, at least one beer or reasonably priced alcoholic drink of your choice is on me at next devcon. :)
#62
Nice calculations :) And you bring up a good discussion about "randomly" selecting movies/episodes. Neither service has an API for random items, so that could be difficult. Besides, is P(time|movie) not biased to begin with, given that a user is more likely to download a popular movie than an unpopular movie? Go to a torrent tracker, and you'll clearly see that popular movies are downloaded more, and therefore contribute heavier weights to the time histogram.

If we're calculating P(movie|time) and P(time|movie), I would argue in favor of using the GSOC statistics over random movie samples, precisely because their bias is directly in the domain of the evidence we're considering, and it only helps push the results closer to the truth.
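To make the P(time|movie) to P(movie|time) inversion concrete, here is a minimal sketch in Python. The samples are made-up placeholders, not the GSOC statistics, and the bin width and the names `to_bin` and `posterior` are assumptions for illustration only.

```python
from collections import Counter

BIN_MINUTES = 10  # assumed histogram bin width


def to_bin(runtime_minutes):
    """Map a runtime in minutes to the start of its histogram bin."""
    return int(runtime_minutes // BIN_MINUTES) * BIN_MINUTES


def posterior(samples, runtime_minutes):
    """Estimate P(class | runtime bin) from labeled (class, runtime) samples."""
    target = to_bin(runtime_minutes)
    class_totals = Counter(label for label, _ in samples)
    joint = {}
    for label, total in class_totals.items():
        hits = sum(1 for l, m in samples if l == label and to_bin(m) == target)
        likelihood = hits / total      # P(time | class), read off the histogram
        prior = total / len(samples)   # P(class), biased the same way users are
        joint[label] = likelihood * prior
    evidence = sum(joint.values())     # P(time)
    if evidence == 0:
        return {}                      # no sample fell into this bin
    return {label: p / evidence for label, p in joint.items()}


# Placeholder data, NOT real statistics.
samples = [
    ("movie", 95), ("movie", 98), ("movie", 121), ("movie", 44),
    ("episode", 42), ("episode", 43), ("episode", 45), ("episode", 22),
]

result = posterior(samples, 44)
```

Because both the histogram and the prior come from what users actually have in their libraries, the popularity bias ends up in the estimate rather than being averaged away.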

Anyways, topfs2, I'd like to know, what's the next step for the little-scraper-that-could?
#63
(2013-04-10, 09:28)natethomas Wrote: Just to reiterate, this is one of the most exciting projects for XBMC that I somehow didn't understand or know about for virtually the entire length of its development. Topfs2 and garbear, at least one beer or reasonably priced alcoholic drink of your choice is on me at next devcon. :)

reasonably priced? don't be a cheapskate nate. I take my drinks shaken, not stirred ;)

and don't get too excited, but PR:16 is able to scrape metadata from roms by directly parsing headers from their binary data. guess whose branch that one's for?
#64
My guess is something like a recursive partition tree would work well for these data (rpart in R), though even something like naive Bayes would probably work out OK. You have some very good discriminators for episodes vs other (based on regexp matches) and discriminating musicvideos vs movies is pretty clear cut based on time I should think. Long-format music videos (concert footage etc.) will be trickier - things like audio track format might be useful here (higher quality track more indicative of a concert?) Further, things like foldernames can be grepped for typical things like "Movie", "Music", "Concert", "TV", "Show" (perhaps with other common languages included) which would help. You'd maybe use something like fstrcmp() and take the maximum value from each portion of the path?
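A rough sketch of the "maximum over each portion of the path" idea, in Python. `difflib.SequenceMatcher` stands in for fstrcmp() here (it is a different similarity measure), and the function names are assumptions for illustration.

```python
import difflib
import re

KEYWORDS = ("movie", "music", "concert", "tv", "show")  # taken from the post


def similarity(a, b):
    """Similarity in [0, 1]; difflib is a stand-in for fstrcmp()."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def path_scores(path):
    """For each keyword, keep the best similarity over all path portions."""
    parts = [p for p in re.split(r"[\\/]+", path) if p]
    return {
        kw: max((similarity(kw, part) for part in parts), default=0.0)
        for kw in KEYWORDS
    }


scores = path_scores("/media/Movies/Avatar (2009)/avatar.mkv")
```

Each score could then be fed to the classifier as a numeric predictor, one per keyword.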

topfs2: perhaps you could summarise what information (predictors) you have for each item that might be useful? Is it essentially all the DB variables? What else on top of the DB stuff?

Cheers,
Jonathan
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


#65
(2013-04-10, 09:42)garbear Wrote: and don't get too excited, but PR:16 is able to scrape metadata from roms by directly parsing headers from their binary data. guess whose branch that one's for?

Nice. This could solve the bin/cue platform identification problem. A quick look with a hex editor shows that psx and segacd have platform info stored in the disc images.
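As a rough illustration of what "platform info stored in the disc images" could look like in code, here is a sketch that scans the start of an image for marker strings. This is not the code from PR:16. The two signatures are assumptions based on commonly documented disc headers, and scanning a fixed window instead of reading exact offsets is a simplification.

```python
SIGNATURES = {
    b"SEGADISCSYSTEM": "segacd",  # assumed Sega CD boot-sector marker
    b"PLAYSTATION": "psx",        # assumed ISO 9660 system identifier on PSX discs
}
SCAN_BYTES = 64 * 1024  # large enough to reach sector 16 with raw 2352-byte sectors


def identify_platform(header):
    """Return a platform name if a known signature appears in the header bytes."""
    for magic, platform in SIGNATURES.items():
        if magic in header:
            return platform
    return None


def identify_image(path):
    """Read the start of a disc image (e.g. the .bin of a bin/cue pair)."""
    with open(path, "rb") as f:
        return identify_platform(f.read(SCAN_BYTES))
```

A real implementation would probably check the exact header offsets for each sector layout, which is stricter than a substring search.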
#66
(2013-04-10, 10:05)jmarshall Wrote: My guess is something like a recursive partition tree would work well for these data (rpart in R), though even something like naive Bayes would probably work out OK. You have some very good discriminators for episodes vs other (based on regexp matches) and discriminating musicvideos vs movies is pretty clear cut based on time I should think. Long-format music videos (concert footage etc.) will be trickier - things like audio track format might be useful here (higher quality track more indicative of a concert?) Further, things like foldernames can be grepped for typical things like "Movie", "Music", "Concert", "TV", "Show" (perhaps with other common languages included) which would help. You'd maybe use something like fstrcmp() and take the maximum value from each portion of the path?

topfs2: perhaps you could summarise what information (predictors) you have for each item that might be useful? Is it essentially all the DB variables? What else on top of the DB stuff?

Some predictors I can think of:

+ File runtime
+ Container and track formats
+-- Resolution
+ File name
+-- Presence of year in filename could indicate movie
+-- Split file (.cd1/2)
+-- Episode-based regexes, adjacent files with similar file name structure
+-- Presence of music video artist in music database
+ Tokenized path as a feature vector
+ Scraper results - catch-22 dilemma, but scraper results do verify our decision

For folder string matching, I think tying ourselves to a particular lexicon will cause too many headaches. Alternatively, we could let a text classifier find a language-neutral pattern, much like a modern spam filter. I'm unfamiliar with decision trees (and always looking to learn), so I don't know if they would perform better than naive Bayes classifiers here, but I imagine it's the same idea. For instance, I almost added "organizing movies by resolution" as a predictor (I used to do this). With a text classifier, it would pick up on this automatically, in addition to "Movies" and "TV Shows" in folder names, as part of its normal operation.

A hypothetical question: to broaden our information base, what about proactively running 3 queries, each assuming a movie, a music video and a tv show? An empty/undesirable response would indicate a negative/poor test result. Of course this is hypothetical; a better idea might be to submit a query to a different type of scraper retroactively if one type returns poor results.
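The retroactive variant could look roughly like this. The three `search_*` functions are hypothetical stand-ins for real scrapers, and treating an empty list as a "poor" result is an assumption.

```python
def scrape_with_fallback(title, scrapers):
    """Try each (media_type, search) pair in order; stop at the first non-empty result."""
    for media_type, search in scrapers:
        matches = search(title)
        if matches:
            return media_type, matches
    return None, []


# Hypothetical stand-ins for real scrapers.
def search_movies(title):
    return [{"title": "Avatar", "year": 2009}] if title.lower() == "avatar" else []


def search_musicvideos(title):
    return []


def search_tvshows(title):
    return [{"title": "Louie"}] if title.lower() == "louie" else []


SCRAPERS = [
    ("movie", search_movies),
    ("musicvideo", search_musicvideos),
    ("tvshow", search_tvshows),
]

media_type, matches = scrape_with_fallback("Louie", SCRAPERS)
```

The order of `SCRAPERS` would ideally follow the classifier's ranking, so the most likely type is queried first.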
#67
(2013-04-10, 09:34)garbear Wrote: Nice calculations :) And you bring up a good discussion about "randomly" selecting movies/episodes. Neither service has an API for random items, so that could be difficult. Besides, is P(time|movie) not biased to begin with, given that a user is more likely to download a popular movie than an unpopular movie? Go to a torrent tracker, and you'll clearly see that popular movies are downloaded more, and therefore contribute heavier weights to the time histogram.

If we're calculating P(movie|time) and P(time|movie), I would argue in favor of using the GSOC statistics over random movie samples, precisely because their bias is directly in the domain of the evidence we're considering, and it only helps push the results closer to the truth.

That is a very valid point. Given that, it might actually be better to use the stats. Sadly, I have none on episodes though (a limitation of JSON-RPC at the time).

(2013-04-10, 09:34)garbear Wrote: Anyways, topfs2, I'd like to know, what's the next step for the little-scraper-that-could?

I want to do two things: A) solve the complex object problem (https://github.com/topfs2/heimdall/issues/7), as that would solve so many problems, and B) get it into the XBMC repo as a library.

I think A and B could be done in parallel, so nothing is really stopping either of them.

(2013-04-10, 10:05)jmarshall Wrote: My guess is something like a recursive partition tree would work well for these data (rpart in R), though even something like naive Bayes would probably work out OK. You have some very good discriminators for episodes vs other (based on regexp matches) and discriminating musicvideos vs movies is pretty clear cut based on time I should think. Long-format music videos (concert footage etc.) will be trickier - things like audio track format might be useful here (higher quality track more indicative of a concert?) Further, things like foldernames can be grepped for typical things like "Movie", "Music", "Concert", "TV", "Show" (perhaps with other common languages included) which would help. You'd maybe use something like fstrcmp() and take the maximum value from each portion of the path?

Yeah, getting the extra features out of the path was really the main goal of the stats gathering. Right now we have the paths and know the title and type of each item, so we can probably extract a bunch of features from that. P(Movie|MovieInFilePath), for example, would be very doable to gather.
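For what it's worth, an estimate like P(Movie|MovieInFilePath) comes down to counting over the labeled paths. A minimal sketch, with invented records and an assumed (path, type) record shape:

```python
def conditional_probability(records, token, media_type):
    """Estimate P(media_type | token appears in path) from (path, type) records."""
    with_token = [kind for path, kind in records if token in path.lower()]
    if not with_token:
        return None  # token never seen, so there is nothing to estimate
    return with_token.count(media_type) / len(with_token)


# Invented examples, not the collected statistics.
records = [
    ("/media/Movies/Avatar.mkv", "movie"),
    ("/media/Movies/Heat.mkv", "movie"),
    ("/media/Movies/Extras/trailer.mkv", "musicvideo"),
    ("/media/TV/Louie/s01e01.mkv", "episode"),
]

p_movie = conditional_probability(records, "movie", "movie")
```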

(2013-04-10, 10:05)jmarshall Wrote: topfs2: perhaps you could summarise what information (predictors) you have for each item that might be useful? Is it essentially all the DB variables? What else on top of the DB stuff?

https://github.com/topfs2/script.statist...tion.py#L6

Those are the ones available. These stats were mostly geared towards getting enough information to test/train file path extraction.
But now that JSON-RPC is even more powerful, we could perhaps do a second round after we have analyzed this completely?
If you have problems please read this before posting


"Well Im gonna download the code and look at it a bit but I'm certainly not a really good C/C++ programer but I'd help as much as I can, I mostly write in C#."
#68
(2013-04-10, 13:29)topfs2 Wrote: That is a very valid point. Given that, it might actually be better to use the stats. Sadly, I have none on episodes though (a limitation of JSON-RPC at the time).

mwahaha oh but you do. you collected tvshowid, season and episode, which can be used to resolve, not file runtime per se, but runtime nonetheless :)
#69
(2013-04-10, 13:43)garbear Wrote:
(2013-04-10, 13:29)topfs2 Wrote: That is a very valid point. Given that, it might actually be better to use the stats. Sadly, I have none on episodes though (a limitation of JSON-RPC at the time).
mwahaha oh but you do. you collected tvshowid, season and episode, which can be used to resolve, not file runtime per se, but runtime nonetheless :)

Oh that is true, didn't even think of that!

EDIT: Oops, embarrassing moment :) Seems like it didn't store it :) https://github.com/topfs2/script.statist...ion.py#L71 Not sure if there was a reason behind that or just an oversight.
EDIT2: Just remembered: the tvshowid is a local ID in the database, which is why it wasn't pushed.
#70
A naive Bayes classifier typically makes the assumption that your predictors are independent. This is a huge assumption, and it greatly simplifies finding the probability distribution you're interested in for Bayes' theorem, namely P(predictors|class). However, such classifiers can perform incredibly badly if the assumption is wrong, though IIRC only if the dependent predictors end up telling you conflicting things.
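Written out, with C for the class (movie, episode, music video) and x_1, ..., x_n for the predictors (notation assumed here, not taken from the thread), the first line is Bayes' theorem and the second line holds only if the predictors are conditionally independent given C:

```latex
\begin{align}
P(C \mid x_1, \dots, x_n)
  &= \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)} \\
  &= \frac{P(C) \prod_{i=1}^{n} P(x_i \mid C)}{P(x_1, \dots, x_n)}
\end{align}
```

The denominator is the same for every class, so picking the most likely class only requires comparing P(C) times the product of the individual P(x_i | C) terms.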

Classification trees basically allow a hierarchical structure to the prediction - i.e. they find the predictor that tells you the most about the likely class first, and split the data based on that. Then in each branch it repeats. This allows each branch to have different predictors (interactions between predictors) and they also deal fine with factors vs numerical variables.
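The "find the predictor that tells you the most first" step can be sketched with information gain, which is the ID3-style criterion (rpart uses its own splitting rules, so this is an illustration and not what rpart does). The feature names and rows below are invented.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction from splitting the rows on one categorical feature."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Pick the feature that tells us the most about the class."""
    return max(rows[0], key=lambda f: information_gain(rows, labels, f))


# Invented toy data.
rows = [
    {"episode_regex": True, "long_runtime": False},
    {"episode_regex": True, "long_runtime": False},
    {"episode_regex": False, "long_runtime": True},
    {"episode_regex": False, "long_runtime": False},
]
labels = ["episode", "episode", "movie", "musicvideo"]

first_split = best_split(rows, labels)
```

A full tree would apply `best_split` again inside each resulting branch, which is how different branches end up using different predictors.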

I suspect both will give a fairly solid classification rate - there's so much data here that it'll be hard to screw it up too badly I would think :)

Cheers,
Jonathan
#71
Yeah, we could easily use a decision tree instead. When we have enough tests we could run it through a trainer; ID3 is easy and exists in numpy IIRC.

And decision trees are usually insanely quick to classify with, but that is true for most classifiers :)

Considering the sheer number of data points we have (1 million!), I think we should be able to try out both and see which behaves best. The exception is music videos, as we have so few data points for those.
#72
(2013-04-10, 12:08)garbear Wrote: + File runtime
+ Container and track formats
+-- Resolution
+ File name
+-- Presence of year in filename could indicate movie
+-- Split file (.cd1/2)
+-- Episode-based regexes, adjacent files with similar file name structure
+-- Presence of music video artist in music database
+ Tokenized path as a feature vector
+ Scraper results - catch-22 dilemma, but scraper results do verify our decision

These are great!

I'm not quite sure what you mean by "Presence of music video artist in music database" and "tokenized path as feature vector" though.

As for the rest, I think most can be obtained from the current data, at least the file name ones.
#73
(2013-04-10, 12:08)garbear Wrote: +-- Presence of year in filename could indicate movie

"Could" being the keyword here. I would not rely on it, though. Many TV shows these days have a year as part of their title. Doctor Who (2005), Louie (2010), The Newsroom (2012) and Merlin (2008) just to name a few.
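A quick sketch shows why the year is only weak evidence. The regex and the 1900-2099 range are assumptions for illustration.

```python
import re

# A standalone four-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"(?<!\d)(19|20)\d{2}(?!\d)")


def has_year(name):
    """True if the name contains something that looks like a year (1900-2099)."""
    return YEAR_RE.search(name) is not None


movie_hit = has_year("Avatar (2009).mkv")
tvshow_hit = has_year("Doctor Who (2005) s01e01.avi")
resolution_hit = has_year("concert 1920x1080.mkv")
```

All three calls return True, so the feature fires for a movie, a TV show and a resolution string alike. It is probably better treated as one predictor among many than as a rule.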
#74
(2013-04-11, 11:32)N3MIS15 Wrote:
(2013-04-10, 12:08)garbear Wrote: +-- Presence of year in filename could indicate movie

"Could" being the keyword here. I would not rely on it, though. Many TV shows these days have a year as part of their title. Doctor Who (2005), Louie (2010), The Newsroom (2012) and Merlin (2008) just to name a few.

This is why we will run it through an AI training algorithm. Such algorithms often find connections we couldn't even fathom.
#75
I'm assuming that we're using a probabilistic model to pick our parameters, like runtime distributions, instead of picking them by hand. I would argue for doing the same with file paths. That is, instead of hand-picking a lexicon, guessing which words to include across xbmc's 50 languages, and choosing fuzzy-string-matching parameters/algorithms/heuristics, we treat path tokens as features. So /home/movies/avatar becomes ["home", "movies", "avatar"] and /home/tv shows/adventure time becomes ["home", "tv", "shows", "adventure", "time"]. As a result, "home", "avatar", "adventure" and "time" would have little effect on our class variables, whereas "movies", "tv" and "shows" would carry a heavier weight (à la some spam filters).
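Roughly what that could look like as a multinomial naive Bayes over path tokens. This is a sketch only: the class name, the add-one smoothing and the four training paths are assumptions, and a real run would train on the collected paths instead.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(path):
    """Split a path into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", path.lower())


class PathClassifier:
    """Multinomial naive Bayes over path tokens with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # class -> token -> count
        self.class_counts = Counter()             # class -> number of paths
        self.vocabulary = set()

    def train(self, path, label):
        tokens = tokenize(path)
        self.token_counts[label].update(tokens)
        self.class_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, path):
        tokens = tokenize(path)
        total_paths = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_paths in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_paths / total_paths)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = PathClassifier()
clf.train("/home/movies/avatar", "movie")
clf.train("/home/movies/heat", "movie")
clf.train("/home/tv shows/adventure time", "episode")
clf.train("/home/tv shows/louie", "episode")

guess_movie = clf.classify("/mnt/movies/alien")
guess_episode = clf.classify("/mnt/tv shows/futurama")
```

Tokens that show up equally often in every class, like "home", shift all class scores by similar amounts, so in practice the decision is driven by tokens like "movies" and "tv".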

My other idea was to assume a higher probability of it being a music video if filename_is(Deadmau5 - Professional Griefers) and artist_in_database(Deadmau5).
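As a sketch, the two checks could be combined into one boolean feature. The "Artist - Title" filename pattern and the `known_artists` set (lowercase artist names pulled from the music database) are assumptions for illustration.

```python
import os
import re

ARTIST_TITLE_RE = re.compile(r"^\s*(?P<artist>.+?)\s+-\s+(?P<title>.+?)\s*$")


def split_artist_title(filename):
    """Parse 'Artist - Title.ext' into (artist, title), or None if it does not fit."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    match = ARTIST_TITLE_RE.match(stem)
    return (match.group("artist"), match.group("title")) if match else None


def looks_like_music_video(filename, known_artists):
    """True if the filename is 'Artist - Title' and the artist is already known."""
    parsed = split_artist_title(filename)
    if parsed is None:
        return False
    return parsed[0].lower() in known_artists


known_artists = {"deadmau5"}  # hypothetical contents of the music database
hit = looks_like_music_video("Deadmau5 - Professional Griefers.mp4", known_artists)
```

The result would be one more predictor that raises the music video probability, not a decision on its own.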