2013-04-11, 12:05
I'm assuming that we're using a probabilistic model to pick our parameters like runtime distributions, etc, instead of picking them by hand - I would argue for doing the same with file paths. That is, instead of picking a lexicon by hand, and trying to guess which words to include across xbmc's 50 languages, and which fuzzy-string-matching parameters/algorithms/heuristics, we treat path tokens as features, so /home/movies/avatar becomes ["home", "movies", "avatar"] and /home/tv shows/adventure time becomes ["home", "tv", "shows", "adventure", "time"], and as a result "home", "avatar" "adventure" and "time" would have little effect on our class variables, whereas "movies", "tv" and "show" would carry a heavier weight (à la some spam filters).
My other idea was to assume a higher probability of it being a music video if filename_is(Deadmau5 - Professional Griefers) and artist_in_database(Deadmau5).
My other idea was to assume a higher probability of it being a music video if filename_is(Deadmau5 - Professional Griefers) and artist_in_database(Deadmau5).