2013-04-10, 08:38
I've done an initial analysis of the data gathered during GSoC and I just wanted to add some calculated values to the example above.
Code:
P(Episode | Video) = 0.843797662418
P(Movie | Video) = 0.154526991252
P(MusicVideo | Video) = 0.00167534633044
I got these numbers simply by looking at the posted lengths of the data, e.g.:
Code:
len(Episode) = 640650
len(Movie) = 117324
len(MusicVideo) = 1272
len(Video) = 233118
len(TotalVideos) = len(Episode) + len(Movie) + len(MusicVideo)
Here len(Video) counts the videos that are not in the database (unscraped for some reason, or simply missed).
I left len(Video) out of the total since I want the probabilities to add up to 1.
Code:
P(Episode | Video) = len(Episode) / len(TotalVideos)
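As a rough sketch, the numbers above can be reproduced in Python from the posted counts. The names counts, total_videos and priors are just mine for illustration and don't come from any existing code. The unscraped len(Video) count is left out, as described above.

```python
# Counts taken from the posted data. The unscraped videos (len(Video))
# are deliberately left out so the probabilities add up to 1.
counts = {
    "Episode": 640650,
    "Movie": 117324,
    "MusicVideo": 1272,
}

total_videos = sum(counts.values())

# P(type | Video) estimated as the relative frequency of each type.
priors = {kind: n / total_videos for kind, n in counts.items()}

for kind, p in priors.items():
    print(f"P({kind} | Video) = {p:.6f}")
```

Adding len(Video) as a fourth entry would turn this into a distribution that also covers unscraped files, but then the three known types would no longer sum to 1 on their own.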
I also took a look at the runtimes of movies (a sample of 10k movies), where I think it could be valid to assume a normal distribution with µ=105.624426079 and σ=22.9860217337.
I have noticed that the movie database seems to contain some TV shows as well, which might be what is causing the distribution to shift slightly to the left.
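To show how such a fit could be used, here is a minimal sketch with Python's statistics.NormalDist (Python 3.8+). It assumes the normal approximation holds and that the runtimes are in minutes, which I haven't verified. The helper names movie_runtime_density and share_between are made up for this example.

```python
from statistics import NormalDist

# Parameters from the 10k-movie sample; the unit is assumed to be minutes.
movie_runtime = NormalDist(mu=105.624426079, sigma=22.9860217337)


def movie_runtime_density(runtime):
    """Density of the fitted normal at the given runtime."""
    return movie_runtime.pdf(runtime)


def share_between(low, high):
    """Share of movies the model expects with low <= runtime <= high."""
    return movie_runtime.cdf(high) - movie_runtime.cdf(low)


print(movie_runtime_density(105))
print(share_between(80, 130))
```

The density is only meaningful when compared against a similar fit for episodes, which I don't have numbers for yet.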
I want to add a disclaimer here, since this is just a very quick, initial analysis, and it has been a long time since I did any statistics.
Randomly selected movies/episodes from tmdb and tvdb might also be a better indicator of the runtime distribution.
Cheers,
Tobias