Can I modify how the Movie Database Scraper add on resolves movie names
#1
Hello,

I have a large collection of movies that I recorded from the TV. The movies are created with filenames that are not recognised by the Movie DB add-on. 

Here are some examples:

php:
Filename: 20150813 2159 - NBC +1 - Prometheus.ts
Movie Name: Prometheus
Filename: 20150831 2159 - CBS+1 - The Bourne Identity.ts
Movie Name: The Bourne Identity

I understand that the best practice would be to rename all the files and even better to have each movie in its own folder with the name and year of the movie.

I haven't included any debug log because I know why it's not working... the filenames are not recognised by the regular expressions in the tmdb.xml files.

Can someone advise me on how I can modify the tmdb.xml file (the MovieDB Add-On) so it will recognise my particular movie filename convention??

I read this post about How to Write Media Scrapers, and this Reddit post also gives a lot of info about registering changes to an add-on's .xml file.

I already have the regular expression I just need to know where to put it..

Here is what I think I need to do but I need some help please.

I was thinking can I add a new regular expression in the tmdb.xml file something like this...

xml:
<CreateSearchUrl dest="3">
<RegExp input="$$7" output="&lt;url&gt;https://api.tmdb.org/3/search/movie?api_key=f090bb54758cabf231fb605d3e3e0468&amp;amp;query=\1&amp;amp;year=$$4&amp;amp;language=$INFO[language]&lt;/url&gt;" dest="3">
         <RegExp input="$$1" output="\1" dest="7">
              <expression>(\d{8} \d{4} -.*?- ?(.*)\.ts$)</expression>
         </RegExp>
</RegExp>
</CreateSearchUrl>

There are SIX tmdb.xml files in my Kodi (LibreELEC) installation..

php:
/storage/.kodi/addons/metadata.common.themoviedb.org/tmdb.xml
/storage/.kodi/addons/metadata.themoviedb.org/tmdb.xml
/storage/.kodi/addons/metadata.tvshows.themoviedb.org/tmdb.xml
/usr/share/kodi/addons/metadata.common.themoviedb.org/tmdb.xml
/usr/share/kodi/addons/metadata.themoviedb.org/tmdb.xml
/usr/share/kodi/addons/metadata.tvshows.themoviedb.org/tmdb.xml

After some more reading it looks like..

../metadata.common.themoviedb.org/tmdb.xml
and
../metadata.themoviedb.org/tmdb.xml

are a companion pair that work together and...

../metadata.tvshows.themoviedb.org/tmdb.xml
surely isn't what I need since I am scraping for movie metadata.

The <CreateSearchUrl> tag is in...
../metadata.themoviedb.org/tmdb.xml

So now my questions are:

1. What is the difference between...
/storage/.kodi/addons/metadata.themoviedb.org/tmdb.xml
and
/usr/share/kodi/addons/metadata.themoviedb.org/tmdb.xml

2. Do I need to make changes to both files?

3. Is there any special way to "re-install" the add-on after or will it just work when I make the changes to the xml file?

Cheers,

Flex
Reply
#2
1. The "storage" file will be the one that came in the Kodi installation.  The "usr" copy will be the most up-to-date version, downloaded after the fact. 

2. You only need to change the usr file.  It always supersedes the install copy.

3.  It will just work after you make the changes.  Only if you have to edit the addon.xml, or the settings.xml, do you need to restart kodi to see the changes.

That said, as it stands, your regex won't work.  The scraper is never passed the filename directly - it first passes through two filters, <cleandatetime> and <cleanstrings>, and is then percent-encoded to make it easier to put directly into a search URL.  I don't think the cleandatetime or cleanstrings will affect you majorly, except that the file extension is definitely lost along the way, so you don't need that in the regex, and the percent encoding means the spaces will need to be replaced by %20.

Note also, cleandatetime and cleanstrings can't be altered to help you in this case because they discard everything to the right of a match.  They assume the file is basically along the form of "Title (extra crap) (year) (extra crap).ext" and try to grab the year and remove all the extra crap so only the title makes it to the scraper.  In your case the bit to discard is on the left, so editing the scraper is potentially the best option, but does have issues. 

First, unless you're careful, any regularly named files, should you have any, will no longer match (you can always just stick your regexp after the others in the search function, and if the file happens to match that, it'll overwrite whatever the earlier ones produced, otherwise, they'll stand and be used in the search url). 

Second, you'll lose your changes whenever the scraper updates. To prevent that, you'd need to copy the whole metadata.themoviedb.org folder, rename it, and edit the addon.xml so the id attribute matches the new folder name, and change the name attribute to distinguish it from the original.  You may also need to "enable" the new copy in your addon settings in Kodi.  That way you'd have an independent version to edit as you like, that won't ever get updated (which is a whole different issue).  You can also use this to avoid the first issue, assuming the files are kept separate, since the regularly named files can use the original scraper, while the recorded files use the modified version.
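
As a rough illustration of the copy-and-rename step, here is a minimal Python sketch. It assumes the add-on's addon.xml has a root <addon> element carrying the id and name attributes; the function name clone_addon and the new id and name passed to it are hypothetical, chosen only for this example.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def clone_addon(src_dir, new_id, new_name):
    """Copy an add-on folder next to the original and give the copy its own id and name."""
    src = Path(src_dir)
    dst = src.with_name(new_id)      # the folder name has to match the add-on id
    shutil.copytree(src, dst)        # raises if the destination already exists

    addon_xml = str(dst / "addon.xml")
    tree = ET.parse(addon_xml)
    root = tree.getroot()            # assumed to be the <addon> element
    root.set("id", new_id)
    root.set("name", new_name)
    # ElementTree discards XML comments when it rewrites the file.
    tree.write(addon_xml, encoding="utf-8", xml_declaration=True)
    return dst
```

It could be called as clone_addon("/storage/.kodi/addons/metadata.themoviedb.org", "metadata.themoviedb.org.recordings", "TMDb (recordings)"), after which the copy would still need enabling in Kodi as described above.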
Reply
#3
(2020-02-24, 00:51)scudlee Wrote: Note also, cleandatetime and cleanstrings can't be altered to help you in this case because they discard everything to the right of a match.  They assume the file is basically along the form of "Title (extra crap) (year) (extra crap).ext" and try to grab the year and remove all the extra crap so only the title makes it to the scraper.  In your case the bit to discard is on the left, so editing the scraper is potentially the best option, but does have issues.
Hello,

Thanks a lot for the detailed response.. saved me much time. If I understand you correctly surely "cleandatetime" will very often match what it thinks is a year in my filenames and then discard the movie name. These are more example filenames I have in my collection...

html:

20180305 2059 - Film4 - Carol.ts
20171024 1519 - Film4 - Patton.ts
20190410 2059 - Film4 - Gladiator.ts
20180511 2314 - Film4 - Northern Soul.ts
20150526 2059 - BBC Four - We Need to Talk About Kevin.ts

Looking at the cleandatetime RegExp:
xml:
<video> <cleandatetime>(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?</cleandatetime> </video>

Any movie that has a time in its filename between 7 pm and 9 pm will have the movie name truncated... as in the first, third and fifth movies above.
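
To check this outside Kodi, here is a small Python sketch that runs the cleandatetime expression quoted above against the five filenames. It assumes Python's re module behaves like Kodi's PCRE for this particular pattern and that the expression is applied as an unanchored search; the helper name split_title_year is made up for the example.

```python
import re

# The cleandatetime expression quoted above, split over three lines for readability.
CLEANDATETIME = re.compile(
    r"(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-9][0-9])"
    r"([ _\,\.\(\)\[\]\-]|[^0-9]$)?"
)

FILENAMES = [
    "20180305 2059 - Film4 - Carol.ts",
    "20171024 1519 - Film4 - Patton.ts",
    "20190410 2059 - Film4 - Gladiator.ts",
    "20180511 2314 - Film4 - Northern Soul.ts",
    "20150526 2059 - BBC Four - We Need to Talk About Kevin.ts",
]


def split_title_year(name):
    """Return (title, year) as group 1 and group 2 would give them, or (name, None) if no match."""
    m = CLEANDATETIME.search(name)
    if m is None:
        return name, None
    return m.group(1), m.group(2)


for name in FILENAMES:
    print(name, "->", split_title_year(name))
```

In this sketch the first, third and fifth names come back with the recording date as the "title" and 2059 as the "year", while the other two are left alone.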

This might be a show stopper for me unless there is a work-around? You say "editing the scraper is potentially the best option" but aside from adding my regex into <CreateSearchUrl> would you have any other suggestion for me (apart from renaming the files)?

Maybe if I cloned the MovieDB Add-on and modified it so that cleandatetime doesn't run and then have that as a second scraper that runs after the original MovieDB scraper to capture movies with non-standard filenames. Can you over-ride the behaviour of cleandatetime by putting a different regexp for it in advancedsettings.xml ? Maybe this file doesn't just work on individual scrapers but all that are installed? Where does cleandatetime get called from? Is there code running behind the scenes in python or something? Would be a pity if cleandatetime and cleanstrings could not be user over-ridden.

Thank you,

Flex
Reply
#4
<cleandatetime> (and <cleanstrings>) are global settings being used by Kodi directly, but they are user-editable.  All you need to do is create your own advancedsettings.xml (wiki) file in your userdata folder, with the following contents:
xml:
<advancedsettings>
    <video>
        <cleandatetime></cleandatetime>
    </video>
</advancedsettings>
Whatever you put in your own advancedsettings.xml will overwrite the default values, so the empty regex should kill the cleandatetime completely. (You could also just put <cleandatetime/>.)
You could also throw in a <cleanstrings/> to kill that too, if necessary.

Again, that might have a negative impact for any files that would benefit from the filtering, but as you say, instead of killing, you could just make a different cleandatetime regex that just fails to match on the recorded files.
Reply
#5
Scudlee, thank you so much for your advice, so well explained.

I'll write a different cleandatetime regex for advancedsettings.xml that ignores file names that are TV recordings in my collection. I realise this is a hacky way of doing things but it's great to have the option to make the MovieDB Add-on work with my collection as-is instead of me having to change everything around to suit it.

I'll report back when I have a working solution.

Flex
Reply
#6
Quick question... what standard of Regular Expressions are compatible with the MovieDB xml files? Is PCRE supported?

Cheers,

Flex
Reply
#7
Yes, it's PCRE.  Although it doesn't necessarily interact with the scraper code perfectly.  You can't, for example, reference named sub-patterns in the outputs, or possibly even ones past two digits long.
Reply
#8
As mentioned above, I would like to add a new regexp to the tmdb.xml file so it correctly matches the non standard file names of my movie recordings when looking up metadata from api.tmdb.org

This file is in two locations:
/usr/share/kodi/addons/metadata.themoviedb.org/tmdb.xml
.. and
/storage/.kodi/addons/metadata.themoviedb.org/tmdb.xml

I was advised to make changes to the tmdb.xml in the /usr/share directory path.

It won't let me do that because /usr/share is on a system partition that is read only. I have read a very complicated workaround using squashfs to make a writable copy of the system partition which you then copy back to your device running Kodi after making changes. There is another solution that involves copying the add-on files to the /storage directory that is writable and creating a totally new copy of the add-on with necessary changes and then installing that add-on with a new name and version number.

Which is the best approach?

Is there really no easy way to just make one small change to /usr/share/kodi/addons/metadata.themoviedb.org/tmdb.xml ?? Why do (slightly different versions of) themoviedb Add-On files also exist in the /storage/.kodi path if the Add-On doesn't run from there but from the /usr/share.. path?

I am using LibreELEC.

Cheers,

Flex
Reply
#9
To recap...

I want to modify the MovieDB scraper so it will add my movie collection, along with metadata, to my Kodi library..
1. If movie names are named correctly according to the Kodi file naming scheme
2. If movie names have this format: 20180511 2314 - Film4 - Northern Soul.ts

I have created the file:
cpp:
/storage/.kodi/userdata/advancedsettings.xml

With these contents:
xml:
<advancedsettings version="1.0">
<video>
  <cleandatetime>(?&lt;= - )([^-]*?)(?=\.ts)</cleandatetime>
</video>
</advancedsettings>

This correctly identifies my .ts files and adds them to the Kodi library.

But it only works for those .ts files.

Now I want to modify the <cleandatetime> regex to use that regex above IFF the filename ends in .ts ELSE using the original <cleandatetime> regex.

I'm not going to ask for help about how to build this regex (although if you want to tell me I don't mind!)

The main thing I want to ask is some further explanation of the logic that kodi uses to build the CreateSearchUrl:

Is this definitely the <cleandatetime> regex?
xml:
<video>
  <cleandatetime>(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?</cleandatetime>
</video>

Regarding the <cleandatetime> regex, the wiki states: "The string found before will be used as basis string getting cleaned by the cleanstrings expressions."

When I put that regex into a regex tester and run it against a few filenames with a more standard "Kodi format" I get a "full match" and "Group1/2/3" matches...

Image

My understanding is that the <cleanstrings> regex is called after <cleandatetime>

Does <cleanstrings> need to see regex match groups in some kind of order? Does it just take the "Full Match" and work on that? Or does it look for a "Group 1" match and work on it?

Also, does Kodi store away a "Group 2" match from <cleandatetime> when building the CreateSearchUrl?

I want to confirm what I need my modified <cleandatetime> regex to do. Because I have this regex that matches movie names if the filename is kodi compliant or according to my .ts naming scheme.

Image

But I think maybe it isn't exactly right because it doesn't extract out a year from those .mp4 files.. it just gives a "full match" for each movie name, there are no capture groups with years.

Any help gratefully appreciated,

Flex
Reply
#10
Man, I feel so dumb some times.  Of course you could use cleandatetime to discard junk on the left of the title, as well as on the right.

What I realise/remember now is what you can't do is have the title on the right of a year you want to keep, e.g "[2003] 21 Grams.mp4".  That won't work.
That's because the title part always has to be in the first capturing group, and the year has to be in the second.
Since the capturing groups have to run in order, the title has to be before the year... Although maybe you could do some weird positive lookahead trick to switch the order... Hmm...

Anyway, besides the point.  Basically, you need to wrap brackets around the things you want to keep, such that the part that's the title is in the first set, and the part that's the year is in the second.
Whether you can pull that off for both types of filename simultaneously might be tricky.  But almost certainly not impossible.

(The cleanstrings regexes, on the other hand, just have to match the thing to be removed, and then only the stuff before the first match is kept, so there's absolutely no way to use them to remove anything to the left of the title.)

Kodi passes the contents of the first capturing group from cleandatetime through cleanstrings, and then whatever remains is percent-encoded and placed in buffer $$1 in the CreateSearchUrl function for the scraper being used.
The second capturing group, with the year in it, is placed in buffer $$2.
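
The flow just described can be sketched in Python roughly as follows. This is only an approximation: the cleanstrings pass is left out, the handling of a non-matching name is simplified, Python's re stands in for PCRE, and the function name scraper_buffers is hypothetical.

```python
import re
from urllib.parse import quote

# Default cleandatetime expression as quoted earlier in the thread.
CLEANDATETIME = re.compile(
    r"(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-9][0-9])"
    r"([ _\,\.\(\)\[\]\-]|[^0-9]$)?"
)


def scraper_buffers(name):
    """Approximate what would land in $$1 and $$2 for a given filename."""
    m = CLEANDATETIME.search(name)
    if m:
        title, year = m.group(1), m.group(2)   # group 1 = title, group 2 = year
    else:
        title, year = name, ""                 # simplification: no match keeps the name as-is
    # The cleanstrings pass would run on `title` here; it is omitted in this sketch.
    return {"$$1": quote(title), "$$2": year}  # quote() percent-encodes spaces as %20


print(scraper_buffers("21 Grams 2003.mp4"))
```

For "21 Grams 2003.mp4" this gives 21%20Grams for $$1 and 2003 for $$2.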

The cleandatetime regex that is currently being used is here in the source code (just replace all the double backslashes with single backslashes).
I haven't bothered to check if it's different from the one in the wiki or not, although it hasn't been changed in five years, so I'd hope they match.
Reply
#11
Thanks scudlee,

I've got it working now and for anyone who reads this thread in the future here is a bit of info...

What did I get working?
I got Kodi to recognise filenames with two different naming conventions. The standard Kodi convention:
xml:
moviename year.ext

.. and another style like this:
xml:
20150813 2159 - NBC +1 - Prometheus.ts

Here is what I did

I got a terminal window to Kodi (LibreELEC in my case) using PuTTY.

I created an advancedsettings.xml file:
xml:
# nano /storage/.kodi/userdata/advancedsettings.xml

Added these contents:

xml:
<advancedsettings version="1.0">
<video>
  <cleandatetime>(?|(?&lt;= - )([^-]*?)(?=\.ts)|(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?)</cleandatetime>
</video>
</advancedsettings>

When the regex goes in the xml file you have to replace the "<" in the positive look behind with "&lt;" or it breaks the closing </cleandatetime> tag.
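
If you would rather not escape it by hand, a short Python sketch can do the escaping and confirm that an XML parser reads back the original regex. It uses only the standard library; the document string is just the advancedsettings.xml layout from above built inline for the check.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

regex = r"(?<= - )([^-]*?)(?=\.ts)"

# escape() replaces &, < and > with their XML entities.
escaped = escape(regex)
print(escaped)

# Build the settings document and check the parser hands back the original regex.
document = (
    '<advancedsettings version="1.0"><video>'
    f"<cleandatetime>{escaped}</cleandatetime>"
    "</video></advancedsettings>"
)
parsed = ET.fromstring(document).find("video/cleandatetime").text
print(parsed == regex)
```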

Obviously this regex is specific to my use case and will apply to almost no one else. But you could modify it and get some flexibility with how you can get Kodi to recognise your movie collection even if not all your filenames are Kodi compliant.

Basically the regex is an alternation of two regexes...
This one matches my non standard filenames:
xml:
(?&lt;= - )([^-]*?)(?=\.ts)

.. and this is the standard Kodi cleandatetime regex:
xml:
(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+(19[0-9][0-9]|20[0-9][0-9])([ _\,\.\(\)\[\]\-]|[^0-9]$)?

They are placed in a branch reset pattern (?|alternation) so the numbered grouping starts from 1,2.. regardless of which regex matches the filename.
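
Python's built-in re module has no branch reset group, so the sketch below only emulates the alternation: it searches with each branch separately and keeps the match that starts furthest left, preferring the .ts branch on a tie. That selection rule is my reading of how a leftmost-first regex engine treats the alternation, not something taken from Kodi's code, and the names TS_BRANCH, STD_BRANCH and clean are made up for the example.

```python
import re

# Branch 1: title between the last " - " and ".ts".
TS_BRANCH = re.compile(r"(?<= - )([^-]*?)(?=\.ts)")
# Branch 2: the standard cleandatetime expression.
STD_BRANCH = re.compile(
    r"(.*[^ _\,\.\(\)\[\]\-])[ _\.\(\)\[\]\-]+"
    r"(19[0-9][0-9]|20[0-9][0-9])"
    r"([ _\,\.\(\)\[\]\-]|[^0-9]$)?"
)


def clean(name):
    """Return (title, year) from whichever branch matches furthest left, else None."""
    matches = [m for m in (TS_BRANCH.search(name), STD_BRANCH.search(name)) if m]
    if not matches:
        return None
    m = min(matches, key=lambda found: found.start())  # min() keeps the first on ties
    year = m.group(2) if m.re.groups >= 2 else None    # the .ts branch has no year group
    return m.group(1), year


for name in [
    "The Edge Of Seventeen 2016 720p.mp4",
    "21 Grams 2003.mp4",
    "20150813 2159 - NBC +1 - Prometheus.ts",
    "20180305 2059 - Film4 - Carol.ts",
]:
    print(name, "->", clean(name))
```

One thing this emulation suggests is worth checking in Kodi itself: a recording whose time looks like a year, such as 20180305 2059 - Film4 - Carol.ts from earlier in the thread, is still claimed by the standard branch here, because that branch can match from the start of the name.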

For anyone interested here is the regex I used to see and play with.

These are the filenames of my test movies:
xml:
The Edge Of Seventeen 2016 720p.mp4
21 Grams 2003.mp4
The Martian 2015 720p x264.mp4
20140724 0128 - more_movies - Closer.ts
20150813 2159 - NBC +1 - Prometheus.ts
20150831 2159 - CBS+1 - The Bourne Identity.ts

They were all added to the Kodi library with metadata and artwork:
Image

After making changes to (or creating the) advancedsettings.xml file you have to restart kodi.
xml:
# reboot

But actually before I did that, since I am still just testing, I reset my current Kodi library...

# systemctl stop kodi
Go to "Userdata" folder in Windows LibreELEC Samba share
In /Userdata folder rename Thumbnails to Thumbnails.old
In /Userdata/Database folder likewise add a .old extension to the MyVideos.. and Textures.. .db files
# reboot

When LibreELEC restarts…

Open a PuTTY ssh command line terminal session to LibreELEC
Check if the advancedsettings.xml file was correctly loaded by Kodi on start-up
Image

You also need to now set up your source again and set its content to Movies.
Then "Do you want to refresh information for all items in this path" = Yes

As scudlee's response above explains, the <cleandatetime> regular expression creates numbered capture groups. Group 1 must be the movie name. Group 2 doesn't have to exist but it will only be used if it is a year. These two pieces of info are used to construct the CreateSearchUrl. See the tmdb.xml file for more info. I found that this file is in two locations but apparently the one that matters is here:

xml:
/usr/share/kodi/addons/metadata.themoviedb.org/tmdb.xml

This is a useful thread giving further explanation of the <cleandatetime> regex, some of its potential limitations and proposing an improved version.

I think all this info should go in the Wiki because I think this forum gets a lot of questions from confused people hoping to customise the MovieDB scraper to work with different file naming conventions.

While I'm still typing! Some more feedback I have is I think it should be possible to create a different advancedsettings.xml file for each Kodi source. This way people could save their thousands of movies with file names that are not Kodi compliant to a different folder and make a regex just for them, then use the Kodi standard regexes for properly named files that are stored in a folder set up as a different source. This would give flexibility to people who have large collections with a variety of file name standards and also mean less hassle for people supporting Kodi.

Thanks again for your help scudlee,

Flex
Reply
