Hi
I went through some papers I could find related to this topic.
One of the first papers I came across was this one:
https://dl.acm.org/citation.cfm?id=2661714.2661729.
It tries to detect intros/outros using the two types of input available to us: video frames and audio signals.
- Video Frames: There is usually a black screen just after the intro ends, so the method averages the greyscale values of each video frame over the first few minutes and then takes the minimum of these averages. Since the black screen has the lowest intensity, the frame with the minimum value gives us the time the intro ends.
- Audio Signals: The black screen above is accompanied by a 0.5-1 second silence gap. They compute the root mean square (RMS) of the sound energy and the zero-crossing rate (ZCR) over the first few minutes; if both stay below a certain threshold for 0.5-1 seconds, that interval is marked as the silent gap.
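The black-frame step can be sketched roughly as below. This assumes the frames have already been decoded to greyscale numpy arrays (e.g. with OpenCV); the function name and the skip parameter are my own, not from the paper:

```python
import numpy as np

def detect_intro_end(frames, fps, skip_seconds=15):
    """Rough sketch of the black-frame heuristic.

    frames: iterable of greyscale frames as 2-D numpy arrays
            (decoding, e.g. via OpenCV, is assumed to happen elsewhere).
    fps: frame rate of the video.
    skip_seconds: ignore the opening seconds, since some episodes
                  start on a black screen and would match immediately.
    """
    # Mean intensity per frame; a black screen gives the lowest value.
    means = np.array([frame.mean() for frame in frames])
    skip = int(skip_seconds * fps)
    darkest = skip + int(np.argmin(means[skip:]))
    return darkest / fps  # timestamp in seconds of the darkest frame
```

Running this over only the first few minutes of the episode keeps the cost down, since the intro is expected early.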
The above is for a single episode, but the results can be averaged over the whole series if the intro/outro timings are roughly the same across episodes.
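Combining the per-episode detections could be as simple as the sketch below; I use the median rather than a plain mean (my choice, not the paper's) so that a single bad detection does not skew the series-level estimate:

```python
import numpy as np

def series_intro_end(per_episode_times):
    """Combine per-episode intro-end timestamps (in seconds) into one
    series-level estimate. The median is robust to episodes where
    detection failed, e.g. a cold open with its own black frame."""
    return float(np.median(per_episode_times))
```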
I tried to implement this in a fairly primitive way. The rough script I wrote is here:
https://gist.github.com/mohit-0212/31ffa...9f845b51a3. It only uses video frames so far, as I could not figure out a suitable threshold for the audio signals. I also wrote a separate script for the audio signals (
https://gist.github.com/mohit-0212/540bf...b88e68ff4b), which for now just plots the sound-energy RMS and the ZCR, but it can be made useful once a threshold pattern is estimated. Both are written using the Python libraries available.
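Once thresholds are estimated, the silence check could look something like the sketch below. The window length and both threshold values are placeholders I picked for illustration, not tuned numbers:

```python
import numpy as np

def find_silence(samples, sr, win=0.05,
                 rms_thresh=0.01, zcr_thresh=0.05, min_dur=0.5):
    """Scan mono audio for a run of windows where both RMS energy and
    zero-crossing rate stay below their thresholds for at least
    min_dur seconds (the 0.5-1 s gap from the paper).

    Returns the start time in seconds of the first such gap, or None.
    All thresholds here are illustrative placeholders.
    """
    n = int(win * sr)                      # samples per window
    needed = int(np.ceil(min_dur / win))   # consecutive quiet windows
    run = 0
    for i in range(0, len(samples) - n + 1, n):
        w = samples[i:i + n]
        rms = np.sqrt(np.mean(w ** 2))
        # Fraction of sample-to-sample sign changes in the window.
        zcr = np.mean(np.abs(np.diff(np.sign(w))) > 0)
        if rms < rms_thresh and zcr < zcr_thresh:
            run += 1
            if run >= needed:
                return (i - (run - 1) * n) / sr  # start of the quiet run
        else:
            run = 0
    return None
```

The detected gap could then be cross-checked against the black-frame timestamp from the video-frame script.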
So I tested it (using only video frames) on a few TV show videos I had, and it worked roughly fine for most of them. I took the first 5 minutes of each episode to account for the intro sequence.
One of the exceptions was the Breaking Bad intro sequence, which is very short and, most importantly, both begins and ends with a black screen, so the detected intro end for this video was actually the intro's start time.
Another problem was that some episodes start with a black screen, so the detected time came out around ~0.x seconds, which is incorrect; I filtered this out by ignoring the first 15 seconds of the video file.
Also, scanning the video frames takes some time, and it increases a bit with video resolution, so we need to figure out how this service should function for a Kodi user (i.e. how to carry out the processing efficiently, with minimal or no delay for the user).
Overall it is a decent method for an initial attempt and can be refined through further discussion; or we can come up with a totally different approach, whatever suits our requirements for the platform.
Any input/feedback on this, or anything you would like me to do or look at, would be great.