Posts: 22
Joined: Dec 2010
Reputation: 0
This seems so obvious that it was probably already considered at some point, but in case it was not: wouldn't it make sense to identify all files in the library by MD5 and/or SHA-1 checksums instead of relying on filenames and paths? The basic idea is that, if XBMC notices a new or modified file in one of the source directories, it would first generate a fingerprint and check whether it's already in the database. That way, you could rename or move files around without losing the metadata, thumbnails or fanart.
topfs2
Team-Kodi Developer
Posts: 4,549
Joined: Dec 2007
Reputation: 17
This has been discussed and suggested many times before, and yeah, the solution is to hash a part of the file. I don't think anyone has done any empirical tests on how unique an MD5 of the head is, nor on how large the head needs to be for it to be sufficiently unique. Tests and data on this would be much appreciated, as gathering them is very time-consuming.
Posts: 37
Joined: Apr 2011
Reputation: 0
I have quite a bit of experience in this domain from a past project. Basically:
- Don't use MD5; use something bigger like SHA-1, RIPEMD-160 or, better, a SHA-2 variant (to increase the hash space).
- Uniqueness is assured by the crypto hash; it's more a matter of making sure you hash a "meaningful" part of the file (the first 10 bytes alone would obviously be bad).
- In my experience, I started to consider a partial hash for files larger than 10 MB, and was checking up to 50 MB. Frankly, 10 MB was enough, but I had a special requirement.
- Also, if you are extra paranoid, you can store the file size alongside the hash (it can speed up matching too, since comparing hashes is expensive). Statistically speaking, the same hash with the same size is very improbable.
I don't have the stats anymore, and even if I did they would be obsolete (this was in 2004), but you get the idea, and someone could build a quick test bed.
Posts: 22
Joined: Dec 2010
Reputation: 0
I'll try to hack something together later to verify all ~1000 files on my XBMC box. As performance seems to be the main concern, it's probably desirable to use MD5 over SHA-1 (this needs to be benchmarked), but it's definitely a good idea to also check the file size.
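That benchmark is a few lines of Python with the standard library (the payload size and iteration count here are arbitrary; real media files and disk I/O would change the picture):

```python
import hashlib
import timeit


def bench(algo, payload, number=20):
    """Seconds to hash `payload` `number` times with the named algorithm."""
    return timeit.timeit(lambda: hashlib.new(algo, payload).digest(),
                         number=number)


payload = b"\x00" * (4 * 1024 * 1024)  # 4 MB of dummy data
for algo in ("md5", "sha1", "sha256"):
    print(f"{algo:8s} {bench(algo, payload):.3f}s")
```

Note this measures only hashing throughput; on a real library the disk read usually dominates, which narrows the practical gap between MD5 and SHA-1.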
Posts: 184
Joined: Feb 2011
Hi,
topfs2 Wrote:This has been discussed and suggested many times before, and yeah, the solution is to hash a part of the file. I don't think anyone has done any empirical tests on how unique an MD5 of the head is, nor on how large the head needs to be for it to be sufficiently unique. Tests and data on this would be much appreciated, as gathering them is very time-consuming.
What about taking a part from the beginning and a part from the end? Wouldn't that be unique enough?
Greetz X23
Posts: 37
Joined: Apr 2011
Reputation: 0
2011-05-09, 14:29
(This post was last modified: 2011-05-09, 14:51 by Calvados.)
@topfs2: Crypto hashes such as SHA-1, MD5 and so on all have very detailed collision analyses, which is why they are used in cryptography. Anything else is a waste of time. Knowing that, all you have to care about is making sure the part you hash is a meaningful one, i.e. one that varies between files.
Hashing the end might have some value, since in most formats that is where the seek table is (it would differ greatly between files), so it might add good entropy. That said, I think the beginning of the file matters more for entropy, but you have to make sure you read well past the most static part (the header).
Other than that, you actually made me laugh with the "unique enough" part. A crypto hash produces a totally different value from a single bit of difference. Of course, collisions are not entirely impossible (around 2^51 operations to force one for SHA-1, theoretically, with a tailored attack), and this is why you use a large number of bits to push the odds further down. MD5 is known to be vulnerable to collisions, SHA-1 is not exactly proven either way, and RIPEMD-160 and SHA-2 are fine. Of course, that is about being cryptographically secure against collisions for transaction authentication, which is not exactly the scope here.
Frankly, I don't care much about this feature in particular. But using a crypto hash on 5-10 MB would identify any file for sure, especially if you associate the file size with it (reducing the chance of a collision even further). I think it would be better to implement this as an addon, or as an option when adding a source, to facilitate a migration; you don't want to query the database for it on every scan.
Posts: 184
Joined: Feb 2011
Hi,
Montellese Wrote:There's a reason why MD5 is not considered "secure" anymore and should be replaced by SHA-1 when security is important.
Yes, you're right, unsalted MD5 hashes are massively insecure, mostly in the case of dictionary passwords.
But the topic sounds like it's about identifying film titles by checksum.
Am I wrong?
Greetz X23