2011-05-09, 16:34
No, you're absolutely correct. Security isn't important at all, speed and uniqueness are. And if my findings so far are anything to go by, performance and network traffic are non-issues.
Calvados Wrote:@topfs2: Crypto hashes such as SHA-1, MD5 and so on all have well-studied collision behaviour, which is why they are used in cryptography. Anything else is a waste of time. Knowing that, all you have to care about is making sure the part you hash is a meaningful one, i.e. one that varies.
Hashing the end might have some value, since in most formats that's where the seek table is (it would differ greatly between files), so it might add good entropy. That said, I think the beginning of the file is more important for entropy, but you have to make sure you read well past the most static part (the header).
Other than that, you actually make me laugh with the "unique enough" part. A crypto hash will generate a totally different value from a single bit of difference. Of course, collisions are not entirely impossible (2^51 operations for a forced SHA-1 collision, theoretically, with a customized attack), which is why you take a large number of bits to push the odds further down. Now, MD5 is vulnerable to collisions, SHA-1 is not exactly proven, and RIPEMD-160 and SHA-2 are fine. Of course, that's crypto-grade collision resistance for transaction authentication, which is not exactly the scope here.
Now, frankly, I don't care about this feature in particular. But using a crypto hash on 5-10MB would identify any file for sure, especially if you associate the file size with it (which reduces the chance of a collision even further). I think it'd be better to offer this as an addon, or as an option when one adds a source, to facilitate a migration. You don't want to query a DB on every file for that.
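For illustration, here is a minimal Python sketch of what Calvados describes: a crypto hash over a large head sample plus the tail, with the file size mixed in. The chunk sizes, the choice of SHA-1, and the function name are assumptions for the example, not anything specified in this thread.
Code:
import hashlib
import os

def sample_hash(path, head_bytes=5 * 1024 * 1024, tail_bytes=64 * 1024):
    """Hash the file size plus a large head sample and a small tail sample."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(str(size).encode())            # mix in the file size
    with open(path, 'rb') as f:
        h.update(f.read(head_bytes))        # head: reads well past the static header
        f.seek(max(0, size - tail_bytes))   # tail: often holds the seek table
        h.update(f.read(tail_bytes))
    return h.hexdigest()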
Quote:Here are the results. The times are listed in seconds and indicate how long it took the given algorithm to compute a hash 1000 times for each file size.
Code:Algo    50 kB   100 kB  500 kB  1 MB    5 MB
MD5     0.359   0.578   2.422   5.13    21.875
SHA1    0.781   1.469   7.115   15.427  70.078
SHA256  1.109   2.109   9.891   21.807  96.98
SHA384  2.688   5.36    26.224  57.656  256.641
SHA512  2.703   5.391   26.719  57.691  263.625
Quote:Please keep in mind that this was not a rigorous experiment. It is simply something I threw together out of curiosity. In other words, draw your own conclusions before accepting the results as universal law.
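The original benchmark script isn't posted, but a table like the one above could be produced along these lines. This is a hypothetical reconstruction: the buffer sizes mirror the table's columns, and hashing one in-memory buffer 1000 times is an assumption about the methodology.
Code:
import hashlib
import os
import time

ALGOS = ('md5', 'sha1', 'sha256', 'sha384', 'sha512')
SIZES = (50_000, 100_000, 500_000, 1_000_000, 5_000_000)

def bench(algo, size, runs=1000):
    data = os.urandom(size)               # one random buffer of the given size
    start = time.perf_counter()
    for _ in range(runs):
        hashlib.new(algo, data).digest()  # hash it 1000 times, as in the table
    return time.perf_counter() - start

for algo in ALGOS:
    row = ['%8.3f' % bench(algo, size) for size in SIZES]
    print(algo.upper().ljust(8), *row)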
wsippel Wrote:Tested 998 files:
0 duplicates @ 2048B
0 duplicates @ 1024B
0 duplicates @ 512B
1 duplicate @ 256B
2 duplicates @ 128B
2 duplicates @ 64B
6 duplicates @ 32B
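wsippel's actual script isn't shown, so as a sketch of how such a duplicate count could be produced, something like the following would work. The choice of MD5 and the exact counting rule (each extra file sharing a digest counts as one duplicate) are assumptions.
Code:
import hashlib
from collections import Counter

def count_duplicates(paths, sample_size):
    """Hash the first sample_size bytes of each file and count extra files
    that share a digest with at least one other file."""
    seen = Counter()
    for path in paths:
        with open(path, 'rb') as f:
            seen[hashlib.md5(f.read(sample_size)).hexdigest()] += 1
    return sum(n - 1 for n in seen.values() if n > 1)

# e.g.: for size in (2048, 1024, 512, 256, 128, 64, 32):
#           print(size, count_duplicates(paths, size))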
topfs2 Wrote:Interesting, would be nice if you could add a filesize check as well and see if that helps. But it seems like we can get away with rather small sizes.
What exactly would you propose? Checking just the filesizes for comparison purposes, or even smaller samples? With so few duplicates even at tiny sample sizes, I don't have much to work with, I'm afraid.
EDIT: I can't access my library at the moment, so I could only use the very small selection of 26 files I had on my notebook. Using just the hash based on the first 32B, there's one duplicate; if I add the filesize, there are none. Also, verifying those 26 files using 32B samples took 2.376 seconds, while a 4kB sample size takes 2.399 seconds. The script itself is very inefficient, of course, so the sample size has very little impact. Judging by that tiny selection of files, tail also seems to give more unique results than head, but it's a bit slower.
wsippel Wrote:What exactly would you propose? Checking just the filesizes for comparison purposes, or even smaller samples? With so few duplicates even at tiny sample sizes, I don't have much to work with, I'm afraid.
tbh I wrote that response when I thought you meant KB rather than B; I agree that at such small sizes there's really no need. Obviously storing the file size is still a good idea for XBMC, but it may not really need to be tested.
topfs2 Wrote:tbh I wrote that response when I thought you meant KB rather than B; I agree that at such small sizes there's really no need. Obviously storing the file size is still a good idea for XBMC, but it may not really need to be tested.
It probably makes sense to store the sizes either way. It's small, basically free (from a performance standpoint), and adds another layer of protection. Other than that, from what I've seen, 4kB should be sufficiently safe and reasonably fast, especially considering that the scan is usually only done once per file unless it's moved or renamed.
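The conclusion above (file size paired with a 4kB sample hash) could look like this in Python. The function name and the use of MD5 here are illustrative assumptions.
Code:
import hashlib
import os

def library_key(path, sample_size=4096):
    """Pair the file size with a hash of the first 4 kB; the pair collides
    far less often than either value alone, and both are cheap to read."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read(sample_size)).hexdigest()
    return (size, digest)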
jmarshall Wrote:We should just use the same hash that's already used by various online databases (e.g. thetvdb and themoviedb I think both implement the opensubtitles hash, which is based on an MD5 of 64k from the start and end + filesize, IIRC). That way we can use it to look up subs and also get more accurate matches from online sites.
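For reference, the opensubtitles hash as published isn't actually an MD5: it is the file size plus the first and last 64 kB of the file summed as little-endian unsigned 64-bit words, all wrapped to 64 bits. A sketch of that algorithm:
Code:
import os
import struct

def opensubtitles_hash(path, chunk=64 * 1024):
    size = os.path.getsize(path)
    if size < 2 * chunk:
        raise ValueError('file too small for the opensubtitles hash')
    h = size
    with open(path, 'rb') as f:
        for offset in (0, size - chunk):
            f.seek(offset)
            # sum the 64 kB chunk as little-endian unsigned 64-bit words
            for (word,) in struct.iter_unpack('<Q', f.read(chunk)):
                h = (h + word) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return '%016x' % h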