Better default stacking algorithm for XBMC files-mode view
#1
I think XBMC needs a better stacking algorithm for stacking files and finding sequence patterns in filenames. I came to this opinion after the current algorithms failed to stack some of my filenames, and after the problems that caused. I was also inspired and challenged by the needs/examples given in a thread here. I have already done most of the work of creating the algorithm, and I can share it with you if you want to implement it in XBMC. I would do it myself, but it would take me an extensive amount of time to learn C++ and the other Xbox/XBMC specifics. My algorithm only needs PCRE or the Python regex library, plus simple language constructs and decision structures.

I encourage you to test it thoroughly and post your comments in this thread. I have created a demonstration of the algorithm in PHP here:
http://tinyurl.com/ypgmt2

The idea is simple and is based entirely on numbers: if two strings have the same content except for a certain number (which can be in any position), they are considered a sequence.
movie 1
Movie 2 //no

Movie 1
movie 2 //yes

In strict mode these examples won't match; however, the concept is still the same...
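The core idea can be sketched in Python roughly as follows. This is only an illustration, not the code behind the PHP demo: the function name is made up, and folding case before comparing is an assumption taken from the demo description further down the thread.

```python
import re

NUMBER = re.compile(r"\d+")

def differs_by_one_number(a: str, b: str) -> bool:
    """Return True if a and b are identical except for one run of digits.

    Assumption: the comparison ignores case.
    """
    for ma in NUMBER.finditer(a):
        # Replace the candidate number with a placeholder so its position still matters.
        rest_a = (a[:ma.start()] + "\0" + a[ma.end():]).casefold()
        for mb in NUMBER.finditer(b):
            rest_b = (b[:mb.start()] + "\0" + b[mb.end():]).casefold()
            if rest_a == rest_b and ma.group() != mb.group():
                return True
    return False
```

Because every run of digits is tried as the candidate, the number may sit anywhere in the string, and names with several numbers still work as long as exactly one of them differs.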

Here's some info I typed up quickly, but you can really see what it can do by testing it yourself.
Quote: Rules:
Not Strict (unchecked):
Parses all strings for sequences of numbers. Any sequence counts; it does not matter in which part of the string it occurs.
Some example sequences:
ocean12
ocean13

o324234sdf
o234890790834sdf


Strict:
Strict will still look for patterns of sequences in any part of the string, but will be more strict on what it looks for.
Any number that follows a-z or a-z\s* must have (part|pt|dvd|cd|title|file|disc) before it,
unless the number is part of a 1-1 or 1/1 style sequence: \s*\d+/\d+ or \s*\d+-\d+
Also, this will not count:
sdfdfdsfpart1 or asdfdsfpart 1
but this will:
sdfdf part1
and so will this:
sdfdf_part_1

The main advantage of strict mode is that it will not treat transformers01, or blade 1 and blade 2, as a sequence, while keeping all the powerful sequence-finding capability of non-strict mode.
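A rough Python sketch of the strict keyword rule, to make the examples above concrete. The exact expression used in the demo is not shown in this thread, so the pattern below is an assumption: the keyword must not be glued to a preceding letter, and only whitespace or underscores may sit between the keyword and the number. The 1-1 and 1/1 exception is left out.

```python
import re

# Hypothetical pattern, not the one used in the PHP demo.
STRICT_ID = re.compile(
    r"(?<![a-z])(?:part|pt|dvd|cd|title|file|disc)[\s_]*(?P<id>\d+)",
    re.IGNORECASE,
)

def strict_id(name: str):
    """Return the sequence number if a keyword precedes it, else None."""
    m = STRICT_ID.search(name)
    return m.group("id") if m else None
```

With this pattern "sdfdf part1" and "sdfdf_part_1" yield an identifier, while "sdfdfdsfpart1", "transformers01" and "blade 1" do not.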


In both strict and non-strict modes, this TV show episode identifier will never count as part of a sequence:
\d+\s*x\s*\d+
like:
01 X 08
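One way to apply this exclusion, sketched in Python: blank out anything that looks like an episode identifier before searching for numbers. The helper name and the masking approach are assumptions for illustration; only the \d+\s*x\s*\d+ expression comes from the rules above.

```python
import re

EPISODE = re.compile(r"\d+\s*x\s*\d+", re.IGNORECASE)
NUMBER = re.compile(r"\d+")

def candidate_numbers(name: str) -> list[str]:
    """Return the digit runs that may act as sequence identifiers."""
    # Overwrite episode identifiers with spaces so their digits are never candidates.
    masked = EPISODE.sub(lambda m: " " * len(m.group()), name)
    return NUMBER.findall(masked)
```

Masking with spaces instead of deleting keeps the positions of the remaining characters unchanged.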


My stacking algorithm can encompass pretty much all realistic sequence patterns within strings, giving far more possibilities than XBMC's current stacking algorithms offer. If you have comments/questions/suggestions, post them in the XBMC thread.

-Enjoy! :) -plex



Right now it only deals with numbers, but I'm thinking about implementing other types like 1a, 1b, or Roman numerals.
Spread the knowledge, nothing else.
#2
For reference => http://forum.xbmc.org/showthread.php?tid=31494

Discussion about the default stacking expressions for videos in XBMC
#3
Glad to see someone taking the time to improve things :)

The key with any system is that it needs to be flexible to allow advanced users to add their own patterns that don't necessarily match what you may be expecting.

Some of our default regexps catch too much (thus the other thread) and there's a couple of them that will be removed probably today.

What does this offer over our current one, which is essentially an arbitrary list of regexps that are checked for matching?

Cheers
Jonathan
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


#4
Quote: What does this offer over our current one, which is essentially an arbitrary list of regexps that are checked for matching?

Didn't I just say that in my first post? :) Read some! :)

Quote: Some of our default regexps catch too much
Really? I thought it caught too little; most of that thread was people saying what it didn't catch. Give me an example of something it catches too much of.


The point of this algorithm is to be one-size-fits-all: to catch them all without being too specific. Of course, people will have their own ideas of what counts as a "sequence", so supporting that is a good idea. I added a custom pattern box to the demonstration, which you can use to make it look for whatever you consider a sequence; you can also see the strict and non-strict sequences above.

The pattern must have delimiters on each end (~, /, #, whatever). You can add any of the modifiers available for PCRE in PHP (i, x, m, s).
It only runs the pattern on one string or filename at a time, so you can see why modifier 's' won't be of much use. Also, you must mark your identifier in the pattern with the named subgroup 'id', i.e.: (?P<id>pattern for identifier)


Here's an example:
Let's say you consider 'bob2' to be a sequence identifier, in the same way that the strict and non-strict modes consider \d+ to be a sequence identifier.
Enter for custom pattern:
~(?P<id>\bbob2)~

Enter for haystack:
file bob1bob2 bob2
file bob1bob2 BOB2

It takes the sequence identifier out of each haystack string and compares what remains. If the remainders are the same (ignoring case and trimming the string), the strings are considered part of the same sequence.
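In Python terms, the comparison described here might look like the sketch below. The function names are made up, and compiling the pattern with re.IGNORECASE is an assumption (the equivalent of adding the 'i' modifier in the demo), needed so that BOB2 in the second haystack line is found.

```python
import re

def remainder(pattern: re.Pattern, name: str):
    """Remove the span of the named group 'id' and normalise what is left."""
    m = pattern.search(name)
    if m is None or m.group("id") is None:
        return None
    start, end = m.span("id")
    # Ignore case and trim, as described in the post.
    return (name[:start] + name[end:]).strip().casefold()

def same_sequence(pattern: re.Pattern, a: str, b: str) -> bool:
    ra, rb = remainder(pattern, a), remainder(pattern, b)
    return ra is not None and ra == rb

# The custom pattern from the example, without the PHP delimiters.
BOB = re.compile(r"(?P<id>\bbob2)", re.IGNORECASE)
```

Note that \b keeps the pattern from matching the bob2 inside bob1bob2, so only the trailing word is removed from each line.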


You can try the old regexes from XBMC to see how it works with those.
#5
Nope, it catches too much. Pretty much every post was describing the format they use, of which the majority are already caught with the first 3 regexps in SVN. The 5th expression in SVN for instance can catch sequels (Die Hard 2, Die Hard), thus the reason I started the thread.

I read your post, and went to the site, and I still don't see why this is better than the arbitrary regexps that we currently have.

As I understand it, our system currently works as follows:

1. We have a list of regexps, with a single pattern to search for that contains a sequence (alpha-numeric, such as " - part 1") that identifies which part of a movie that file is.
2. We run through the list and run each regexp on the filename.
3. If we have a match in the regexp (i.e. it matches and we have a sequence to order by) then we then run through the rest of the files in the list to get matches (note: possible improvement here in that we match any regexp, not necessarily the one that matched the original, plus the "identify sequence" code probably could do with a cleanup - see CUtil::GetVolumeFromFilename()).
4. Any that also match are then grouped together, and sorted by their sequence. We also remove the sequence from the filename.
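The four steps above can be sketched in Python roughly like this. It is a simplification, not the actual XBMC code: the two expressions are hypothetical stand-ins for the SVN defaults, and files are grouped by their cleaned name in one pass instead of re-scanning the list for each match.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins, NOT XBMC's actual default expressions.
STACK_REGEXPS = [
    re.compile(r"[ ._-]+(?:cd|dvd|part|pt|disc)[ ._-]*(\d+)", re.IGNORECASE),
    re.compile(r"[ ._-]+(\d+)of\d+", re.IGNORECASE),
]

def volume_and_title(filename):
    """Steps 1-2: run each regexp in order; the first match wins."""
    for rx in STACK_REGEXPS:
        m = rx.search(filename)
        if m:
            # Remove only the volume subgroup and compare case-insensitively.
            title = filename[:m.start(1)] + filename[m.end(1):]
            return int(m.group(1)), title.casefold()
    return None

def build_stacks(filenames):
    """Steps 3-4: group files whose cleaned names match, sorted by volume."""
    stacks = defaultdict(list)
    for name in filenames:
        hit = volume_and_title(name)
        if hit:
            volume, title = hit
            stacks[title].append((volume, name))
    return [[n for _, n in sorted(members)]
            for members in stacks.values() if len(members) > 1]
```

As in step 3, a file matched by one expression can end up stacked with a file matched by a different one, provided the cleaned names agree.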

Perhaps you could explain in clear detail that a complete noob could understand exactly why your method is better than above?

Thanks,
Jonathan
#6
jmarshall Wrote:Nope, it catches too much. Pretty much every post was describing the format they use, of which the majority are already caught with the first 3 regexps in SVN. The 5th expression in SVN for instance can catch sequels (Die Hard 2, Die Hard), thus the reason I started the thread.
I agree, the point of the other thread was to find the most common naming standards that usually come with 'scene releases' and that people want stacked. We DO NOT want XBMC's default out-of-the-box stacking expressions to catch things like movie sequels (Die Hard 2, Die Hard, Oceans 11, Oceans 12).

IMO it is better by default to force the average user to rename his/her files to a common naming standard for stacking, if needed, than for XBMC to be too flexible and therefore catch too much by default. Advanced users will still always be able to create their own stacking expressions via advancedsettings.xml
#7
A couple of questions, since I don't fully understand the current algorithm/stacking methodology you are using.
These questions will also explain the improvements and differences in my algorithm.

1. Will this stack?
name part 9 part 1
name part 9 part 2

Basically I am asking whether it can handle multiple sequence identities within one string and figure out the correct one. Mine does this; if yours can't, it should.

2.
Quote: This tag used to be called <videostacking>.
Contains regular expressions for use in matching filenames in a "stack" of video files. The regular expression must have a (...) surrounding the volume label portion. Text matching is compared case-insensitive. Anything matched by the regular expression will be removed from the titlename. If more than one (...) section is used, the first one will be the prefix, the second one the volume label, and the third one (if it exists) will be the suffix. Use this to keep extensions after matching. If more than one expression matches a particular filename, the first one will be used.
g_advancedSettings.m_videoStackRegExps.push_back("()([ab])(\\....)$");
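To make the group rules in the quote concrete, here is a Python sketch of how one expression might be interpreted. It follows the wording of the quoted documentation (everything matched is dropped from the title except the prefix and suffix groups); the function name is made up and this is not the XBMC implementation.

```python
import re

def split_volume(rx: re.Pattern, filename: str):
    """Return (title, volume) for a match, or None if rx does not match."""
    m = rx.search(filename)
    if m is None:
        return None
    if rx.groups == 1:
        # A single (...) group is the volume label.
        prefix, volume, suffix = "", m.group(1), ""
    else:
        # Group 1 is the prefix, group 2 the volume, group 3 (if any) the suffix.
        prefix, volume = m.group(1), m.group(2)
        suffix = m.group(3) if rx.groups >= 3 else ""
    title = filename[:m.start()] + prefix + suffix + filename[m.end():]
    return title.casefold(), volume

# The expression from the quote, unescaped from its C++ string literal.
AB_SUFFIX = re.compile(r"()([ab])(\....)$", re.IGNORECASE)
```

With this expression, "movie-a.avi" and "Movie-b.avi" both reduce to the title "movie-.avi", with volumes "a" and "b", and the suffix group is what keeps the extension.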

When I say sequence identity, I mean what you call a "volume".
It says anything matched will be removed from the filename. Do you mean everything matched by the whole expression, or just the volume (subgroup 1)?
Whatever the answer, I think you should ONLY remove the volume from the filename, because most expressions will try to describe the context around the volume, and that description should not be removed from the filename, just the volume. Especially with custom patterns, it will be very hard for users to describe advanced sequences while matching only the volume in the full pattern, so only the volume subgroup should be removed.

3. Can it handle the volume being in any part of the filename (not just at the end or the beginning)? I think that if two files contain the same kind of sequence identity, and the filenames are duplicates of each other once the sequence identity is removed from both, then this is still a sequence. Mine can do this.

4. Do sequences have to be in exact progression of each other?
(this is dictated by the style of sequence identity: Roman numerals are I, II, III, numbers are 1, 2, 3, etc.)
I think a sequence doesn't have to be in exact progression, so something like this should still match, in case parts of the sequence are missing, etc.
name part 12323
name part 23423423434 //these should match, my algo matches this

5. Another difference is that mine does more with less as far as expressions are concerned. This isn't THAT big of a deal unless the current approach is a real memory hog, but any time a shorter algorithm and quicker expression does the same or more with less, I think this is a good thing.

6. Another thing I view differently is what should be stacked and what should be treated as default/common sequence identifiers. We can all agree that scene release sequence identifiers are common. However, I believe the range of sequence identifiers will expand in the future, if it hasn't already. What is seen in file browsing mode doesn't just come from scene releases. Take for example the wealth of plugins using the browse mode, like powerflv, which will index filenames with different sequence identifiers than what is currently set as default.
Here are some things that I think should be added to the range of "common" sequences:
name part1 //you don't have to have a space or other character between part & sequence ID
name part2

name lalapart2 //'part' must not have any letter chars directly before it, so it should be its own word
name lalapart2

name lala890part2
name lala234part2 //acceptable

name lala_*(&*&*(part2 //acceptable
name lala (a couple tabs here) part2 //acceptable

These should be valid sequences:
1.2
1.3
1-2
1-3
1/2
2/2

These should be able to come directly after letter chars, or directly after letter chars and multiple spaces
name1/2 //acceptable
name (spaces) 1/2 //acceptable


No multiple-number progression sequence will be counted if it comes directly after letters, or after letters and then spaces (this is where ocean 11, ocean 12 comes in), UNLESS it has one of these keywords before it:
(dvd|cd|part|pt|title)
name 1 //no
name1 //no
name part1 //yes
name part_dvd_1 //yes

but these types are okay directly after letters, or letters with spaces (like I mentioned before):
name 1/2 //yes
name 1.3 //yes
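The paired-number forms can be picked out with a single expression. A small Python sketch, assuming the whole pair (for example 1/2) is treated as the identifier, which the rules above do not spell out:

```python
import re

# Two numbers joined by '.', '-' or '/', allowed directly after letters or spaces.
PAIR_ID = re.compile(r"(?P<id>\d+[./-]\d+)")

def pair_id(name: str):
    """Return the paired-number identifier, or None for a bare number."""
    m = PAIR_ID.search(name)
    return m.group("id") if m else None
```

A bare number such as "name 1" or "name1" yields nothing here, so it would still need one of the keywords to count.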



All these rules can be seen in the expressions of strict & unstrict which I shared with you.
These types of rules can always be forced through using your own custom expressions in advanced settings, but it would be better if they were "default".


However, the most important things are the questions/topics in points 1, 2 and 3 above.
I wouldn't be able to customize any of these rules in advanced settings if those fundamental techniques aren't present in the stacking algorithm.
#8
Another thing that should be present in the stacking algorithm is a check that the sequence uses a single sequence ID type. Mine does this quite easily right now because it only looks for one sequence type (multiple-number progression), but there are other sequence types like a, b, c and Roman numerals.
e.g.:
name part 1 //these should not be stacked, because they are 2 different sequence types
name part a

name part a //should match
name part b
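A possible type check, sketched in Python. The three categories and the rule that a single letter counts as a letter rather than a Roman numeral (so "i" or "c" alone is ambiguous) are assumptions for illustration only.

```python
import re

def id_type(volume: str):
    """Classify a sequence identifier as 'number', 'letter' or 'roman'."""
    if re.fullmatch(r"\d+", volume):
        return "number"
    if re.fullmatch(r"[a-z]", volume, re.IGNORECASE):
        return "letter"
    if re.fullmatch(r"[ivxlcdm]+", volume, re.IGNORECASE):
        return "roman"
    return None

def same_type(a: str, b: str) -> bool:
    """Only identifiers of the same known type may share a stack."""
    t = id_type(a)
    return t is not None and t == id_type(b)
```

So "1" and "a" would not be stacked together, while "a" and "b" would.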
#9
Thanks for taking the time to write that up. I agree that the system we use has to be as good as possible at identifying what the user expects. This is of course tricky - too much fuzzy logic makes it hard to pinpoint corner cases where things don't work as expected, and also makes it hard to explain to users how it works (so they can work around it or teach the engine what they want). Not enough fuzzy logic means it fails to match for too many users, which is just as bad.

In answer to your queries:

1. The current stacking regexps will hit the part 9 so they won't be stacked, as the resulting "clean" filenames won't match (see 2 and 3). There's probably a tweak that can be made to the regexps so that this part doesn't match, but I'm not sure I see the point in this particular example.

2. It removes subgroup 1 from the filename, or in the case of 3 subgroups, removes subgroup 2 from the filename. In either case, just the volume identifier subgroup is removed when looking for matches.

3. Yes - the only check for a match is that, once the volume identifier is removed, the resulting file titles must match (case-insensitive).

4. We don't care about the resulting volume sequence at all - it's only ever considered for sorting within the stack. Repeat volume sequences for instance will still be stacked, as will sequences with missing values as you suggest.

5. This is arguable until profiled of course - you may have a fancier regexp, but a fancier regexp may take longer to compute than multiple simple regexps. In either case, we're not generally running this on huge lists ;)

6.

name part1 // these match with expression 1
name part2

name lalapart2 // these don't (no separator before part - the regexp could of course be changed to simply support any non-alpha)
name lalapart2
name lala890part2
name lala234part2
name lala_*(&*&*(part2
name lala (a couple tabs here) part2

The others can be taken care of with an additional regexp no doubt.

Some ideas for improvement of our current setup:

* Only run the matching expression on other filenames to find sequences. This rules out things like name-cd1.avi and name-part1.avi being part of the same stack.
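A minimal Python sketch of this first idea: make the index of the expression that matched part of the grouping key, so files only stack with files matched by the same expression. The two patterns are hypothetical, and for simplicity the whole match is removed from the title here.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only.
PATTERNS = [
    re.compile(r"-cd(\d+)", re.IGNORECASE),
    re.compile(r"-part(\d+)", re.IGNORECASE),
]

def stack_key(filename: str):
    """Return (pattern index, cleaned title) for the first matching pattern."""
    for index, rx in enumerate(PATTERNS):
        m = rx.search(filename)
        if m:
            title = filename[:m.start()] + filename[m.end():]
            return index, title.casefold()
    return None

def group(filenames):
    stacks = defaultdict(list)
    for name in filenames:
        key = stack_key(name)
        if key is not None:
            stacks[key].append(name)
    return dict(stacks)
```

Here name-cd1.avi and name-part1.avi both clean to name.avi, but they land in different groups because different expressions matched them.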

* Remove the suspect regexps (expressions 4 and 5 - certainly 5 anyway).

Cheers,
Jonathan
