Identify files by checksum
#1
This seems so obvious that it was probably already considered at some point, but in case it was not: wouldn't it make sense to identify all files in the library by MD5 and/or SHA1 checksums instead of relying on filenames and paths? The basic idea is that, if XBMC notices a new or modified file in one of the source directories, it would first generate a fingerprint and check whether it's already in the database. That way, you could rename or move files around without losing the metadata, thumbs or fanart.
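For illustration, a minimal sketch of that flow in shell; the hashes.txt lookup file is just a hypothetical stand-in for whatever the library database would actually store:

Code:
#!/bin/bash
# Sketch only: identify a media file by a content fingerprint instead of its path.
# "hashes.txt" is a hypothetical stand-in for the library database.

FILE="$1"

# Fingerprint the content (full-file hash here; see the later posts about
# hashing only part of the file for performance reasons).
HASH=$(md5sum "$FILE" | awk '{ print $1 }')

if grep -q "^$HASH " hashes.txt 2>/dev/null
then
    # Known content: the file was merely renamed or moved, so keep its
    # metadata and just update the stored path.
    echo "known file, updating path for $HASH to $FILE"
else
    # Unknown content: scrape metadata as usual and remember the fingerprint.
    echo "new file, scraping metadata and storing fingerprint"
    echo "$HASH $FILE" >> hashes.txt
fi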
#2
I had to kill this after 12 minutes because it was choking my network:

Code:
/net/server.local/tank/Movies/The A-Team (2010)# time md5sum The\ A-Team.mkv
^C
real    12m55.520s
user    0m33.266s
sys    0m14.593s

That was ONE movie.
#3
darkscout Wrote:I had to kill this after 12 minutes because it was choking my network:

Code:
/net/server.local/tank/Movies/The A-Team (2010)# time md5sum The\ A-Team.mkv
^C
real    12m55.520s
user    0m33.266s
sys    0m14.593s

That was ONE movie.
I was afraid performance would be an issue, but that's actually a lot worse than I expected. I wonder if it's possible to generate a sufficiently unique fingerprint based on only a small part of a file, e.g. "head -c 1048576 $FILE | md5".
#4
This has been discussed and suggested many times before, and yeah, the solution is to use a part of the file. I don't think anyone has done any empirical tests of how unique an MD5 of the HEAD is, nor how large the head needs to be for it to be sufficiently unique. Tests and data on this would be much appreciated, as it's very time-consuming to do them.
#5
topfs2 Wrote:This has been discussed and suggested many times before, and yeah, the solution is to use a part of the file. I don't think anyone has done any empirical tests of how unique an MD5 of the HEAD is, nor how large the head needs to be for it to be sufficiently unique. Tests and data on this would be much appreciated, as it's very time-consuming to do them.

Ooo, batch testing. Something I do rather well. When I get home I'll script something up to test uniqueness of my movies.
#6
I have quite a bit of experience in this domain from a past project. Basically:

- Don't use MD5 but something bigger like SHA-1, RIPEMD-160 or, better, a SHA-2 (increase the hash space).
- Uniqueness is assured by the crypto hash; it's more a matter of making sure you include a "meaningful" part of the file (i.e. hashing only the first 10 bytes would obviously be bad).
- In my experience, I only started considering a partial hash for files larger than 10MB, and I hashed up to 50MB. Frankly, 10MB was enough, but I had a special requirement.
- Also, if you are super paranoid, you can store the file size alongside the hash (it can speed up matching too, since comparing hashes is expensive). Statistically speaking, two different files with the same hash and the same size are not really probable.

I don't have stats anymore, and even if I did they would be obsolete (this was in 2004), but you get the idea, and someone can build a quick test bed; a rough sketch of the fingerprint idea is below.
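A minimal sketch of that recipe in shell (the SHA-1 choice and the 10MB cut-off are only examples, not a recommendation for what should be hardcoded):

Code:
#!/bin/bash
# Sketch: fingerprint = SHA-1 of the first 10MB + the file size in bytes.

FILE="$1"
SIZE=$(( $(wc -c < "$FILE") ))                                   # file size in bytes
HASH=$(head -c 10485760 "$FILE" | sha1sum | awk '{ print $1 }')  # hash of the first 10MB

# Combined identifier: a simultaneous size collision and hash collision
# between two different files is extremely improbable.
echo "${HASH}-${SIZE}  $FILE"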
#7
I'll try to hack something together later to verify all ~1000 files on my XBMC box. As performance seems to be the main concern, it's probably preferable to use MD5 over SHA-1 (that needs to be benchmarked), but it's definitely a good idea to also check the file size.
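A rough way to benchmark that (sketch only; the 100MB sample size is arbitrary):

Code:
#!/bin/bash
# Rough benchmark sketch: compare MD5 vs SHA-1 speed on the same data.
# Run it twice against a local file and look at the second run, so the file
# is already in the page cache and you time the hash rather than the disk.

FILE="$1"
CHUNK=104857600            # 100MB sample; adjust to taste

echo "MD5:"
time head -c $CHUNK "$FILE" | md5sum > /dev/null

echo "SHA-1:"
time head -c $CHUNK "$FILE" | sha1sum > /dev/null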
#8
Hi,

topfs2 Wrote:This has been discussed and suggested many times before, and yeah, the solution is to use a part of the file. I don't think anyone has done any empirical tests of how unique an MD5 of the HEAD is, nor how large the head needs to be for it to be sufficiently unique. Tests and data on this would be much appreciated, as it's very time-consuming to do them.

What about taking a part from the beginning and a part from the end? Wouldn't that be unique enough?
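For example (just a sketch; the 1MB chunk size is arbitrary):

Code:
#!/bin/bash
# Sketch: fingerprint built from the first and the last 1MB of the file,
# to cover both the header area and the index/seek data at the end.

FILE="$1"
CHUNK=1048576

{ head -c $CHUNK "$FILE"; tail -c $CHUNK "$FILE"; } | md5sum | awk '{ print $1 }'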


Greetz X23
#9
x23piracy Wrote:Hi,

What about taking a part from the beginning and a part from the end? Wouldn't that be unique enough?

Greetz X23

What do you base that on? If you can give me a proof of that, then fine. You might be interested to know that most hash functions don't even have a proper proof (only empirical ones).

It may well be unique enough, I'm not arguing against that, but we can't just guess. That's why we need someone to create a script and actually test it (empirical proof).
#10
Extremely rudimentary shell script to batch-generate MD5 checksums and look for duplicates. It takes one or more directories as arguments and recurses into them; files below a size threshold (the du check) are skipped. To increase the sample size, simply change the SAMPLE_SIZE variable:

Code:
#!/bin/bash

NUMBER_OF_FILES=0
SAMPLE_SIZE=1024
TMPDIR="${TMPDIR:-/tmp}"        # fall back to /tmp if TMPDIR is not set

list_files()
{
    if ! test -d "$1"
    then return
    fi

    cd "$1" || return

    for i in *
    do
        if test -d "$i"          # if directory, recurse into it and come back up
        then
            list_files "$i"
            cd ..
        else
            # only hash files above the size threshold
            if [ "$(du "$i" | awk '{ print $1 }')" -ge 1048576 ]
            then
                head -c $SAMPLE_SIZE "$i" | md5 >> "$TMPDIR/verify_log"
                NUMBER_OF_FILES=$((NUMBER_OF_FILES+1))
            fi
        fi
    done
}

rm -f "$TMPDIR/verify_log"

if [ $# -eq 0 ]
then
    list_files .
else
    for i in "$@"          # process each directory given on the command line
    do
        list_files "$i"
    done
fi

echo "$NUMBER_OF_FILES checksums generated, listing duplicates:"

sort "$TMPDIR/verify_log" | uniq -d
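Assuming you save it as, say, checkdupes.sh (the name and paths below are just examples), run it against one or more media directories like this:

Code:
chmod +x checkdupes.sh
./checkdupes.sh /mnt/movies /mnt/tvshows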
#11
@topfs2: Crypto hashes such as SHA-1, MD5 and so on all have very detailed proofs regarding collisions, which is why they are retained in cryptography. Anything else is a waste of time. Knowing that, all you have to care about is making sure the part you hash is a meaningful one, i.e. one that varies between files.

Hashing the end might have some value, since in most formats that is where the seek table sits (it would differ a lot between files), so it might add good entropy. That being said, I think the beginning of the file is more important for entropy, but you have to make sure you read well past the most static part (the header).

Other than that, you actually make me laugh with the "unique enough" part. A crypto hash will generate a totally different value from a single bit of difference. Of course, collisions are not entirely impossible (2^51 operations if you force one to happen for SHA-1... theoretically, with a customized attack), and this is why you take a large number of bits to push it further. Now, MD5 is vulnerable to collisions, SHA-1 is not exactly proven... RIPEMD-160 and SHA-2 are fine. Of course, we are talking here about crypto that is secure against collisions for transaction authentication... not exactly the scope here.

Now, frankly, I don't care about this feature in particular. But using a crypto hash on 5-10MB would identify any file for sure, especially if you associate the file size with it (reducing the chance of a collision even further). I think it'd be better to implement this as an addon, or as an option when one adds a source, to facilitate a migration. You don't want to query the DB on this every time.
#12
Calvados Wrote:Other than that, you actually make me laugh with the "unique enough" part. A crypto hash will generate a totally different value from a single bit of difference. Of course collisions are always possible (2^51 if you force it to happen for SHA-1... I'll let you decide what that implies statistically), and this is why you take a big number of bits (MD5 is more prone to collisions than SHA-1... however not that likely either).

There's a reason why MD5 is not considered "secure" anymore and should be replaced by SHA-1 when security is important. An example can be found here where two strings that only differ in a few places produce the exact same MD5 hash. Obviously this is a much bigger issue when it comes to security, because there people actually try to find/force such collisions, which I doubt anyone will do with his/her media files Wink
#13
Correct - this is why I edited my initial post. If you go back to my first post in this thread, I actually mentioned not to use MD5 Smile. I do crypto for a living, actually; I don't use anything below SHA-2 512 for that work.

Now, realistically speaking, for the task at hand I'd not go that far.
#14
Improved script:

Code:
#!/bin/bash

NUMBER_OF_FILES=0
SAMPLE_SIZE=512
TMPDIR="${TMPDIR:-/tmp}"        # fall back to /tmp if TMPDIR is not set

list_files()
{
    if ! test -d "$1"
    then return
    fi

    cd "$1" || return

    for i in *
    do
        if test -d "$i"              # directory: recurse into it, then come back up
        then
            list_files "$i"
            cd ..
        elif [ "$i" != "*" ]         # skip the literal "*" left over from empty directories
        then
            # only hash files above the size threshold
            if [ "$(du "$i" | awk '{ print $1 }')" -ge 131072 ]
            then
                head -c $SAMPLE_SIZE "$i" | md5 >> "$TMPDIR/verify_log"
                NUMBER_OF_FILES=$((NUMBER_OF_FILES+1))
            fi
        fi
    done
}

rm -f "$TMPDIR/verify_log"

if [ $# -eq 0 ]
then
    list_files .
else
    for i in "$@"          # process each directory given on the command line
    do
        list_files "$i"
    done
fi

echo "$NUMBER_OF_FILES checksums generated, listing duplicates:"
sort "$TMPDIR/verify_log" | uniq -d > "$TMPDIR/verify_log2"
NUMBER_OF_DUPLICATES=$(wc -l "$TMPDIR/verify_log2" | awk '{ print $1 }')
cat "$TMPDIR/verify_log2"
echo "$NUMBER_OF_DUPLICATES duplicates found."
Tested 998 files:

0 duplicates @ 2048B
0 duplicates @ 1024B
0 duplicates @ 512B
1 duplicate @ 256B
2 duplicates @ 128B
2 duplicates @ 64B
6 duplicates @ 32B

Planted duplicates were successfully identified. Roughly half the files tested were MP4, the other half used a variety of different formats. TV shows with identical intros were also included.


EDIT: Using SHA-1 instead of MD5 led to the exact same results but took much longer.
#15
Hi,

Montellese Wrote:There's a reason why MD5 is not considered "secure" anymore and should be replaced by SHA-1 when security is important.

Yes, you're right, unsalted MD5 hashes are massively insecure, mostly in the case of dictionary passwords.

But the topic here sounds like it's about identifying film titles by checksum.

Am I wrong?


Greetz X23
