Cleandatetime
#1
The default regex for cleandatetime fails on a lot of common file name patterns. To clarify some details for anyone searching for info:

CUtil::CleanString first pulls the <cleandatetime> regex that you can specify in advancedsettings.xml.

Only one regex string is allowed in that field.

The first captured group is taken as the title. The second captured group is taken as the year (and is passed to the scraper in buffer $$2). Any additional groups are discarded. If the regex doesn't match at all, the year is left empty and the entire file name string is passed on to the <cleanstrings> portion of name handling. If it does match, everything other than the first group and the year is discarded, so the title is generally whatever comes before the start of the year info.
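
To make the group handling concrete, here is a minimal Python sketch of the behaviour described above. It is not Kodi's code: `split_title_year` is a hypothetical helper, the pattern is the default regex quoted at the end of this post, and Python's `re` engine is assumed to behave like Kodi's regex engine for this pattern.

```python
import re

# Default <cleandatetime> pattern, copied from the end of this post.
DEFAULT = (
    r"(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-1][0-9])"
    r"([ _\,\.\(\)\[\]\-][^0-9]|$)"
)

def split_title_year(name, pattern=DEFAULT):
    """Hypothetical helper: return (title, year), or (name, None) on no match."""
    m = re.search(pattern, name)
    if m is None:
        # No match: the whole name is kept and no year is set.
        return name, None
    # Group 1 is the title, group 2 the year; any further groups are ignored.
    return m.group(1), m.group(2)

for name in ("My Movie 2004", "My Movie 2004 Extended", "My Movie"):
    print(name, "->", split_title_year(name))
```

Note that in the second name the text after the year is simply dropped, which matches the "everything else is discarded" rule.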


I'll list a number of possible year labels on films and explain what happens with the default regex and with mine (shown below). Most of the films aren't real; I'm just listing different patterns.

'no match' means it will use the entirety of the provided file name, and not provide any year. Otherwise, I will show the captured title, then a slash, then the year that was determined.

My Movie
- default: no match
- mine: no match

My Movie 2004
- default: My Movie / 2004
- mine: My Movie / 2004

My Movie (2004)
- default: no match
- mine: My Movie / 2004

My_Movie_2004
- default: My_Movie / 2004
- mine: My_Movie / 2004

My Movie[2004]
- default: no match
- mine: My Movie / 2004

My TV Show (2004-2005)
- default: no match
- mine: My TV Show / 2004

My TV Show ( 2004 - 2005 )
- default: My TV Show ( 2004 / 2005
- mine: My TV Show / 2004

2001: A Space Odyssey
- default: no match
- mine: no match

2001: A Space Odyssey (1968)
- default: no match
- mine: 2001: A Space Odyssey / 1968

Knives: 2000 Ways to Kill Someone
- default: Knives: / 2000
- mine: no match

Knives: 2000 Ways to Kill Someone.2001
- default: Knives: 2000 Ways to Kill Someone / 2001
- mine: Knives: 2000 Ways to Kill Someone / 2001

Knives: 2000 Ways to Kill Someone-2001
- default: Knives: 2000 Ways to Kill Someone / 2001
- mine: Knives: 2000 Ways to Kill Someone / 2001

Knives: 2000 Ways to Kill Someone[2001]
- default: Knives: / 2000
- mine: Knives: 2000 Ways to Kill Someone / 2001

1999.S00E01
- default: no match
- mine: no match

1999.S00E01.1974
- default: 1999.S00E01 / 1974
- mine: 1999.S00E01 / 1974

1999 - S00E01 (1974)
- default: no match
- mine: 1999 - S00E01 / 1974

Umika - Sincerity [AKROSS_Con_2012]
- default: no match
- mine: Umika - Sincerity / 2012

Oasis - Falling Down (East of the Eden version)[2008][h264]
- default: Oasis - Falling Down (East of the Eden version / 2008
- mine: Oasis - Falling Down (East of the Eden version) / 2008

The 1975 Show (1975)
- default: The / 1975
- mine: The 1975 Show / 1975

The Tonight Show of 1995 (1995)
- default: The Tonight Show of / 1995
- mine: The Tonight Show of / 1995



As you can see, there are quite a few patterns that are just broken using the default regex.

The following is the regex that I've built up to handle as many different cases as feasible, from the various testing that I've been able to manage. It handles everything that I've been able to throw at it except for that last pattern, and I'm not sure there's any reasonable way to deal with that except completely disallowing dates that are only preceded by spaces (something I would not object to, but since the default allows simple spaces as delimiters, I'm allowing that in mine).

Code:
<cleandatetime>(.+?)(?:\s*(?:(?:[[({])(?:[^])}]*)(?:_|\b))|[ _.,-]\s*)((?:19|20)\d{2})(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?\b(?!(?:\s*\w)+)[^\\/]*?$</cleandatetime>

A version that doesn't allow simple spaces to be a delimiter for a year:
Code:
<cleandatetime>(.+?)(?:\s*(?:(?:[[({])(?:[^])}]*)(?:_|\b))|[_.,-]\s*)((?:19|20)\d{2})(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?\b(?!(?:\s*\w)+)[^\\/]*?$</cleandatetime>
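
The only difference between the two versions is the missing space in the delimiter class. A small Python comparison shows the trade-off; this is a sketch under the same assumptions as any offline test (Python's `re` standing in for Kodi's engine, `\[` written in place of the bare `[` inside the class, and `clean` being a hypothetical helper).

```python
import re

# Shared tail of both patterns: year, optional "-YYYY" range, trailing checks.
TAIL = (
    r"((?:19|20)\d{2})"
    r"(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?"
    r"\b(?!(?:\s*\w)+)[^\\/]*?$"
)
BRACKET = r"\s*(?:(?:[\[({])(?:[^])}]*)(?:_|\b))"

WITH_SPACE = r"(.+?)(?:" + BRACKET + r"|[ _.,-]\s*)" + TAIL
NO_SPACE = r"(.+?)(?:" + BRACKET + r"|[_.,-]\s*)" + TAIL

def clean(name, pattern):
    """Hypothetical helper: (title, year), or (name, None) when nothing matches."""
    m = re.search(pattern, name)
    return (m.group(1), m.group(2)) if m else (name, None)

for name in ("My Movie 2004", "The Tonight Show of 1995 (1995)"):
    print(name)
    print("  with space:", clean(name, WITH_SPACE))
    print("  no space:  ", clean(name, NO_SPACE))
```

So the stricter version keeps "1995" in the title of the last test case, at the cost of no longer picking up a bare "My Movie 2004".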


And for reference, here's the default regex:

Code:
<cleandatetime>(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-1][0-9])([ _\,\.\(\)\[\]\-][^0-9]|$)</cleandatetime>
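
One more thing that can be read straight off the default pattern: its year group is `19[0-9][0-9]|20[0-1][0-9]`, so it only accepts 1900 through 2019, whereas `(?:19|20)\d{2}` in my versions accepts 1900 through 2099. A quick Python check (a sketch; `year_of` is a hypothetical helper and Python's `re` is assumed to match Kodi's engine here):

```python
import re

# Default pattern exactly as quoted above.
DEFAULT = (
    r"(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-1][0-9])"
    r"([ _\,\.\(\)\[\]\-][^0-9]|$)"
)

def year_of(name):
    """Hypothetical helper: the captured year, or None when the regex fails."""
    m = re.search(DEFAULT, name)
    return m.group(2) if m else None

for name in ("My Movie 1999", "My Movie 2019", "My Movie 2020"):
    print(name, "->", year_of(name))
```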


Edit: Fixed the regex slightly.
#2
THANK YOU, internet person from 2014!
Wonder Womand 1984 sent me on a quest for "wtf is wrong with this scraper" and this post is where I found salvation.
#3
(2021-03-16, 00:44)apo86 Wrote: Wonder Womand 1984
Could it be because you spelt it wrong?
#4
No, it was definitely cleandatetime stripping the year from the movie title. I'm more diligent with my file names than with my forum posts ;)