Cleandatetime
The default regex for cleandatetime is really bad. To clarify some details for anyone searching for info:

CUtil::CleanString first applies the <cleandatetime> regex, which you can override in advancedsettings.xml.

Only one regex string is allowed in that field.

The first captured group is treated as the title. The second captured group is treated as the year (and is passed to the scraper in buffer $$2). Any additional groups are discarded. If the regex doesn't match at all, no year is set and the entire file name string is passed on to the <cleanstrings> portion of name handling. If it does match, only the first group (generally everything before the start of the year info) and the year are kept; everything else is discarded.
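
To make the group handling concrete, here is a minimal Python sketch of the behaviour described above. It is not Kodi's actual C++ code: `split_title_year` is a hypothetical helper, Python's `re` module is standing in for Kodi's regex engine, and the pattern is the default regex quoted at the end of this post.

```python
import re

# Default <cleandatetime> pattern, copied from the end of this post.
DEFAULT = (
    r"(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-1][0-9])"
    r"([ _\,\.\(\)\[\]\-][^0-9]|$)"
)


def split_title_year(filename, pattern=DEFAULT):
    """Hypothetical helper mirroring the description in this post.

    Group 1 becomes the title, group 2 the year, and any further
    groups are ignored. Without a match the whole name is kept and
    no year is set.
    """
    m = re.search(pattern, filename)
    if m is None:
        return filename, None
    return m.group(1), m.group(2)


if __name__ == "__main__":
    for name in ["My Movie", "My Movie 2004", "My_Movie_2004"]:
        print(name, "->", split_title_year(name))
```

The third group in the default pattern is only there to check what follows the year; its contents are thrown away, which is why the helper never reads it.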


I'll list a number of ways a year can appear in a file name, and explain what happens with the default regex and with mine (shown below). Most of these films aren't real; I'm just listing different patterns.

'no match' means it will use the entirety of the provided file name, and not provide any year. Otherwise, I will show the captured title, then a slash, then the year that was determined.

My Movie
- default: no match
- mine: no match

My Movie 2004
- default: My Movie / 2004
- mine: My Movie / 2004

My Movie (2004)
- default: no match
- mine: My Movie / 2004

My_Movie_2004
- default: My_Movie / 2004
- mine: My_Movie / 2004

My Movie[2004]
- default: no match
- mine: My Movie / 2004

My TV Show (2004-2005)
- default: no match
- mine: My TV Show / 2004

My TV Show ( 2004 - 2005 )
- default: My TV Show ( 2004 / 2005
- mine: My TV Show / 2004

2001: A Space Odyssey
- default: no match
- mine: no match

2001: A Space Odyssey (1968)
- default: no match
- mine: 2001: A Space Odyssey / 1968

Knives: 2000 Ways to Kill Someone
- default: Knives: / 2000
- mine: no match

Knives: 2000 Ways to Kill Someone.2001
- default: Knives: 2000 Ways to Kill Someone / 2001
- mine: Knives: 2000 Ways to Kill Someone / 2001

Knives: 2000 Ways to Kill Someone-2001
- default: Knives: 2000 Ways to Kill Someone / 2001
- mine: Knives: 2000 Ways to Kill Someone / 2001

Knives: 2000 Ways to Kill Someone[2001]
- default: Knives: / 2000
- mine: Knives: 2000 Ways to Kill Someone / 2001

1999.S00E01
- default: no match
- mine: no match

1999.S00E01.1974
- default: 1999.S00E01 / 1974
- mine: 1999.S00E01 / 1974

1999 - S00E01 (1974)
- default: no match
- mine: 1999 - S00E01 / 1974

Umika - Sincerity [AKROSS_Con_2012]
- default: no match
- mine: Umika - Sincerity / 2012

Oasis - Falling Down (East of the Eden version)[2008][h264]
- default: Oasis - Falling Down (East of the Eden version / 2008
- mine: Oasis - Falling Down (East of the Eden version) / 2008

The 1975 Show (1975)
- default: The / 1975
- mine: The 1975 Show / 1975

The Tonight Show of 1995 (1995)
- default: The Tonight Show of / 1995
- mine: The Tonight Show of / 1995



As you can see, there are quite a few patterns that are just broken using the default regex.

The following is the regex I've built up to handle as many cases as feasible, based on the testing I've been able to do. It handles everything I've thrown at it except that last pattern. I'm not sure there's any reasonable way to deal with that one short of disallowing years that are preceded only by spaces. I wouldn't object to that, but since the default allows plain spaces as delimiters, mine does too.

Code:
<cleandatetime>(.+?)(?:\s*(?:(?:[[({])(?:[^])}]*)(?:_|\b))|[ _.,-]\s*)((?:19|20)\d{2})(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?\b(?!(?:\s*\w)+)[^\\/]*?$</cleandatetime>

A version that doesn't allow simple spaces to be a delimiter for a year:
Code:
<cleandatetime>(.+?)(?:\s*(?:(?:[[({])(?:[^])}]*)(?:_|\b))|[_.,-]\s*)((?:19|20)\d{2})(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?\b(?!(?:\s*\w)+)[^\\/]*?$</cleandatetime>
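
To see what dropping the space delimiter changes, here is a small Python sketch that runs both variants on a few names from the list above. Assumptions: Python's `re` is standing in for Kodi's regex engine, `[` and `]` are backslash-escaped inside the character classes so Python doesn't warn about nested sets (the meaning should be unchanged), and `split` is a hypothetical helper.

```python
import re

# First pattern from this post, with [ and ] escaped inside the
# character classes for Python's benefit.
WITH_SPACE = (
    r"(.+?)(?:\s*(?:(?:[\[({])(?:[^\])}]*)(?:_|\b))|[ _.,-]\s*)"
    r"((?:19|20)\d{2})"
    r"(?:(?:_|\s)*-(?:_|\s)*(?:19|20)\d{2})?"
    r"\b(?!(?:\s*\w)+)[^\\/]*?$"
)
# Second pattern: identical except that a plain space is no longer
# accepted as the delimiter in front of the year.
NO_SPACE = WITH_SPACE.replace("[ _.,-]", "[_.,-]")


def split(pattern, filename):
    """Return (title, year), or None when the pattern doesn't match."""
    m = re.search(pattern, filename)
    return (m.group(1), m.group(2)) if m else None


if __name__ == "__main__":
    samples = [
        "My Movie 2004",
        "My_Movie_2004",
        "My Movie (2004)",
        "The Tonight Show of 1995 (1995)",
    ]
    for name in samples:
        print(name)
        print("  with space:", split(WITH_SPACE, name))
        print("  no space:  ", split(NO_SPACE, name))
```

The trade-off shows up in the first and last samples: without the space delimiter "My Movie 2004" no longer yields a year, but "The Tonight Show of 1995 (1995)" keeps its full title.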


And for reference, here's the default regex:

Code:
<cleandatetime>(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-1][0-9])([ _\,\.\(\)\[\]\-][^0-9]|$)</cleandatetime>
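
Most of the default's failures come from its last group: after the year it wants either the end of the string or a separator followed by a non-digit, so a year closed by a bracket at the very end of the name is rejected and the regex falls back to any earlier year-like number. Its year alternative `20[0-1][0-9]` also stops at 2019. A quick check in Python (again only a stand-in for Kodi's regex engine; `default_split` is a hypothetical helper):

```python
import re

# Default <cleandatetime> pattern, copied verbatim from above.
DEFAULT = (
    r"(.+[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-1][0-9])"
    r"([ _\,\.\(\)\[\]\-][^0-9]|$)"
)


def default_split(filename):
    """Return (title, year) from the default pattern, or None."""
    m = re.search(DEFAULT, filename)
    return (m.group(1), m.group(2)) if m else None


if __name__ == "__main__":
    samples = [
        "My Movie (2004)",                          # ")" ends the name
        "The 1975 Show (1975)",                     # falls back to the first 1975
        "Knives: 2000 Ways to Kill Someone[2001]",  # falls back to 2000
        "My Movie 2021",                            # year outside 1900-2019
    ]
    for name in samples:
        print(name, "->", default_split(name))
```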


Edit: Fixed the regex slightly.
Cleandatetime - by Kinematics - 2014-12-26, 10:33