Identify Incorrectly Scraped Movies?
#1
I scraped a very large movie database using Universal Movie Scraper; for the most part, it worked great.  Recently, however, I've started to notice quite a few movies were incorrectly scraped.  I know how to manually add the .nfo file, which I started doing, and fixed 10 movies that I noticed were incorrect.  But now I'm noticing more.

Is there a script that can search through the database to identify movies that don't match the movie folder or filename?  'Missing Movies' doesn't work because they're not missing, they're just misidentified.  I think some would be very easy to flag ("Mickey's Christmas Carol (1983)" is CLEARLY not "Charles Dickens' A Christmas Carol (2020)"), while others are just not an EXACT match (A Trip to the Moon (1902) vs A Trip to the Moon (2017)). Even though both titles are exact matches on TMDB, Universal Movie Scraper continues to pull the wrong one.  So I have no idea how many movies (in a very large database) were scraped incorrectly without manually looking at each one.

Any recommendations to speed up fixing this issue would be much appreciated.
Reply
#2
(2021-03-28, 12:37)ethanfox Wrote: Is there a script that can search through the database to identify movies that don't match the movie folder or filename?  'Missing Movies' doesn't work because they're not missing, they're just misidentified.  I think some would be very easy to identify an issue ("Mickey's Christmas Carol (1983)" is CLEARLY not "Charles Dickens' A Christmas Carol (2020)") and others, just not an EXACT match (A Trip to the Moon (1902) vs A Trip to the Moon (2017)) ... even both titles are exact matches with TMDB, Universal Movie Scraper continues to pull the wrong one.  So, I have no idea how many movies (of a very large database) are incorrectly pulled without manually looking at each one.
No, there is no script that will mass auto-correct scraping. It is surprising that some movies don't even closely match, such as A Christmas Carol.

Apart from the parsing nfo method, the other option is searching by the IMDb tt ID. Hit Refresh on an incorrect movie; if the correct title is not in the search results that pop up, press Manual and type in the tt... number.
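For anyone who would rather script the nfo route than create each file by hand, here is a minimal sketch. It assumes the scraper accepts an .nfo that contains nothing but the IMDb URL and that the file is named after the video file; `write_parsing_nfo` is a made-up helper name and `tt0000000` is a placeholder, not a real ID.

```python
import os
import re
import tempfile

# An IMDb title ID is "tt" followed by digits.
IMDB_ID_RE = re.compile(r'^tt\d+$')

def write_parsing_nfo(movie_dir, file_stem, imdb_id):
    """Write '<file_stem>.nfo' holding only the IMDb URL; return its path."""
    imdb_id = imdb_id.strip().lower()
    if not IMDB_ID_RE.match(imdb_id):
        raise ValueError('expected an ID like tt0000000, got {!r}'.format(imdb_id))
    nfo_path = os.path.join(movie_dir, file_stem + '.nfo')
    with open(nfo_path, 'w') as f:
        f.write('https://www.imdb.com/title/{}/'.format(imdb_id))
    return nfo_path

# Demo in a throwaway directory; tt0000000 is a placeholder, not a real ID.
with tempfile.TemporaryDirectory() as movie_dir:
    path = write_parsing_nfo(movie_dir, 'A Trip to the Moon (1902)', 'tt0000000')
    with open(path) as f:
        print(f.read())
```

After the file is in place, the movie still has to be refreshed or re-scanned so the scraper picks it up.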
My Signature
Links to : Official:Forum rules (wiki) | Official:Forum rules/Banned add-ons (wiki) | Debug Log (wiki)
Links to : HOW-TO:Create Music Library (wiki) | HOW-TO:Create_Video_Library (wiki)  ||  Artwork (wiki) | Basic controls (wiki) | Import-export library (wiki) | Movie sets (wiki) | Movie universe (wiki) | NFO files (wiki) | Quick start guide (wiki)
Reply
#3
I wrote a Python script that grabs the movie name, year and filename from the Kodi db.

It uses Python's SequenceMatcher to score how similar "Movie Name (Year)" is to the filename (without extension).
If the score is too low, it displays the difference and asks for a new IMDb ID.
Leaving it blank skips the movie; entering an ID creates a movie.nfo and puts the IMDb URL in it.
Then it deletes that movie's entry from the Kodi db so it's ready for a correct re-scrape.

The script is very heavily integrated with my use-case, but that's how you could do it.
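To give a feel for the scoring step, here is a rough sketch using the two examples from the first post. It is not the script described above, just an illustration; the helper names and the extra year check are assumptions added here. The year check is there because two names that differ only in the year still share most of their characters, so the similarity score alone can stay at or above 0.8.

```python
import re
from difflib import SequenceMatcher

# Matches a trailing "(YYYY)" such as "(1902)".
YEAR_RE = re.compile(r'\((\d{4})\)\s*$')

def similarity(scanned_name, file_name):
    """Return a 0..1 similarity score between the two names."""
    return SequenceMatcher(None, scanned_name, file_name).ratio()

def year_of(name):
    """Return the trailing '(YYYY)' year as a string, or None if absent."""
    match = YEAR_RE.search(name)
    return match.group(1) if match else None

def looks_wrong(scanned_name, file_name, min_ratio=0.8):
    """Flag a pair when the score is low or both names carry different years."""
    years = (year_of(scanned_name), year_of(file_name))
    years_differ = None not in years and years[0] != years[1]
    return years_differ or similarity(scanned_name, file_name) < min_ratio

# (scraped name, file name without extension), taken from the first post
pairs = [
    ("Charles Dickens' A Christmas Carol (2020)", "Mickey's Christmas Carol (1983)"),
    ("A Trip to the Moon (2017)", "A Trip to the Moon (1902)"),
]
for scanned, file_name in pairs:
    print('{:.2f}  flagged={}  {} vs {}'.format(
        similarity(scanned, file_name), looks_wrong(scanned, file_name),
        file_name, scanned))
```

The 0.8 threshold is only a starting point; raising it catches more near-misses at the cost of more false alarms.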
Reply
#4
(2021-03-28, 22:50)matthuisman Wrote: i wrote a Python script that grabs the movie name, year and filename from the kodi db.
Sounds great. Is it available for download?
Reply
#5
I did a quick script that should be more universal. Here it is:

Code:
import os
from difflib import SequenceMatcher

## MySQL
# import pymysql.cursors
# connection = pymysql.connect(host='192.168.20.3', user='kodi', password='kodi', db='MyVideos119', charset='utf8')
# placeholder = '%s'
##########

## sqlite
import sqlite3
connection = sqlite3.connect('MyVideos119.db')
placeholder = '?'
##########

MIN_RATIO = 0.8  # pairs scoring below this are shown for review

cursor = connection.cursor()
cursor.execute("SELECT c00, strPath, premiered, strFileName, idMovie, idFile, uniqueid_type, uniqueid_value FROM movie_view")

try:
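    for row in cursor.fetchall():
        title = (row[0] or '').strip()
        # premiered can be NULL or empty; in that case compare on the title alone
        year = (row[2] or '')[:4]

        scanned_name = u'{} ({})'.format(title, year) if year else title
        # compare against the file name without its extension
        file_name = os.path.splitext(row[3])[0]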

        ratio = SequenceMatcher(None, scanned_name, file_name).ratio()
        if ratio >= MIN_RATIO:
            continue

        # blank = skip, tt... = IMDb ID, http... = full URL
        value = input('{} vs {} ({}) {} {}: '.format(file_name, scanned_name, ratio, row[6], row[7])).strip()
        url = None

        if value.lower().startswith('tt'):
            url = 'https://www.imdb.com/title/{}/'.format(value.lower())
        elif value.lower().startswith('http'):
            url = value
        elif value.lower() == 'y':
            # placeholder: renaming is not implemented
            print('do rename')

        if url:
            with open(os.path.join(row[1], file_name+'.nfo'), 'w') as f:
                f.write(url)

            sql = "DELETE FROM movie WHERE idMovie = {}".format(placeholder)
            cursor.execute(sql, (row[4],))
            connection.commit()
finally:
    cursor.close()
    connection.close()
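Since the script above deletes database rows as it goes, it may be worth doing a read-only pass first to see how many movies would be flagged. A possible sketch, assuming the same movie_view columns as the query above; `find_suspects` is a made-up name, and the in-memory table with invented dates only stands in for a real (backed-up) MyVideos database.

```python
import os
import sqlite3
from difflib import SequenceMatcher

def find_suspects(connection, min_ratio=0.8):
    """Return (ratio, file_name, scanned_name) tuples below min_ratio, worst first."""
    cursor = connection.cursor()
    cursor.execute("SELECT c00, premiered, strFileName FROM movie_view")
    suspects = []
    for title, premiered, file_name in cursor.fetchall():
        year = (premiered or '')[:4]
        scanned = '{} ({})'.format(title, year) if year else title
        stem = os.path.splitext(file_name)[0]
        ratio = SequenceMatcher(None, scanned, stem).ratio()
        if ratio < min_ratio:
            suspects.append((ratio, stem, scanned))
    cursor.close()
    return sorted(suspects)

# Stand-in for a real Kodi database: an in-memory table with the same
# column names. Titles come from this thread; the dates are made up.
demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE movie_view (c00 TEXT, premiered TEXT, strFileName TEXT)")
demo.executemany("INSERT INTO movie_view VALUES (?, ?, ?)", [
    ("Charles Dickens' A Christmas Carol", "2020-01-01", "Mickey's Christmas Carol (1983).mkv"),
    ("Alien", "1979-01-01", "Alien (1979).mkv"),
])
for ratio, stem, scanned in find_suspects(demo):
    print('{:.2f}  {}  ->  {}'.format(ratio, stem, scanned))
```

Nothing is written or deleted here, so it is safe to run repeatedly while tuning the threshold.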
Reply
#6
Thank you very much for sharing this! Sounds like just what I was looking for. I'll give it a go!
Reply
#7
Amazing... where should I run it?
In the folder that contains the MyVideosxxx.db file?
Reply
