Identify files by checksum
#16
No, you're absolutely correct. Security isn't important at all, speed and uniqueness are. And if my findings so far are anything to go by, performance and network traffic are non-issues.
#17
Calvados Wrote:@topfs2: Crypto hashes such as SHA-1, MD5 and so on all have very detailed analyses of their collision behaviour, and this is why they are retained in cryptography. Anything else is a waste of time. Knowing that, all you have to care about is making sure the part you hash is a meaningful one, i.e. one that varies.

Hashing the end might have some value since in most formats that is where the seek table is (it would be highly different between files); it might add good entropy. That being said, I think the beginning of the file is more important for entropy, but you have to make sure you get much more than the most static part (the header).

Other than that, you actually make me laugh with the "unique enough" part. A crypto hash will generate a totally different value from a single bit of difference. Of course, collisions are not entirely impossible (around 2^51 operations to force one for SHA-1... theoretically, with a tailored attack), and this is why you take a large number of bits to push it further. Now, MD5 is vulnerable to collisions, SHA-1 is not exactly proven... RIPEMD-160 and SHA-2 are fine. Of course, we are talking here about crypto that is secure against collisions for transaction authentication... not exactly the scope here.

Now, frankly, I don't care about this feature in particular. But using a crypto hash on 5-10 MB would identify any file for sure, especially if you also associate the file size (which further reduces the chance of a collision). I think it'd be better to use this as an addon, or an option when one adds a source, to facilitate a migration. You don't want to query a DB every time for that.

I have never said SHA-1 and the like don't provide unique enough values; in fact I said it's proven (empirically) that they do on full files. What I was questioning was:
How big a chunk of any given file is "enough" as input to the hash so that it stays unique? We cannot hash a full file, even if that would be unique, so what is the minimum we need?

And I suggested finding this out by having people here try different values so that the devs don't have to, i.e. we do a "proof" that X bytes from the beginning is enough when used as input. Frankly I bet 5-10 MB is more than enough, so what is the smallest amount we need? The smaller the value we can find, the less of an impact it has on using XBMC.
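A minimal sketch of such a test in Python (a hypothetical script, not anything from XBMC): it hashes the first N bytes of every file under a directory and reports how many digests collide for each sample size.

Code:
# Hypothetical test script: hash the first N bytes of every file under a
# directory and report how many digests collide for each sample size.
import hashlib
import os
import sys
from collections import Counter

SAMPLE_SIZES = [32, 64, 128, 256, 512, 1024, 2048, 4096]  # bytes from the head

def head_digest(path, size):
    with open(path, "rb") as f:
        return hashlib.md5(f.read(size)).hexdigest()

def main(root):
    files = [os.path.join(directory, name)
             for directory, _, names in os.walk(root) for name in names]
    for size in SAMPLE_SIZES:
        counts = Counter(head_digest(path, size) for path in files)
        duplicates = sum(n - 1 for n in counts.values() if n > 1)
        print("%5d B: %d duplicates out of %d files" % (size, duplicates, len(files)))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")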
#18
Hi,

FYI:
Quote:Here are the results. The times are listed in seconds. This indicates how many seconds it took for the given algorithm to compute a hash 1000 times for the given file size.

Code:
Algo      50 K      100 K     500 K     1 MB      5 MB
MD5       0.359     0.578     2.422     5.13      21.875
SHA1      0.781     1.469     7.115     15.427    70.078
SHA256    1.109     2.109     9.891     21.807    96.98
SHA384    2.688     5.36      26.224    57.656    256.641
SHA512    2.703     5.391     26.719    57.691    263.625

Source: http://jeffbarnes.net/blog/post/2007/01/...rison.aspx

Quote:Please keep in mind that this was not a certified experiment. It is simply something that I threw together for the sake of curiosity. In other words, draw your own conclusions before accepting the results as a universal law.
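For anyone who wants to reproduce that kind of measurement, here is a rough sketch of a similar micro-benchmark in Python using hashlib (the quoted figures come from a .NET test, so absolute numbers will differ):

Code:
# Hypothetical micro-benchmark along the lines of the quoted table: time 1000
# hashes of an in-memory buffer for each algorithm and input size.
import hashlib
import os
import time

ALGORITHMS = ["md5", "sha1", "sha256", "sha384", "sha512"]
SIZES = [50 * 1024, 100 * 1024, 500 * 1024, 1024 * 1024, 5 * 1024 * 1024]
ITERATIONS = 1000

print("Algo    " + "".join("%10d" % size for size in SIZES))
for algo in ALGORITHMS:
    row = []
    for size in SIZES:
        data = os.urandom(size)
        start = time.time()
        for _ in range(ITERATIONS):
            hashlib.new(algo, data).digest()
        row.append(time.time() - start)
    print(algo.upper().ljust(8) + "".join("%10.3f" % t for t in row))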

This should be a bit more interesting: ftp://ftp.cis.upenn.edu/pub/mbgreen/papers/ton98.pdf
Performance of Checksums and CRCs over Real Data

If I think about a working solution: scraping could be really fast in the future if XBMC first computes all the checksums and submits them all together to a compatible info DB; it could then resubmit all the NFOs, posters and fanart in one go.


Greetz X23
#19
wsippel Wrote:Tested 998 files:

0 duplicates @ 2048B
0 duplicates @ 1024B
0 duplicates @ 512B
1 duplicate @ 256B
2 duplicates @ 128B
2 duplicates @ 64B
6 duplicates @ 32B

Interesting, it would be nice if you could add a filesize check as well and see if that helps. But it seems like we can get away with rather small sizes :)
#20
topfs2 Wrote:Interesting, it would be nice if you could add a filesize check as well and see if that helps. But it seems like we can get away with rather small sizes :)
What exactly would you propose? Checking just the filesizes for comparison purposes or even smaller samples? With so few duplicates even at tiny samples, I don't have much to work with I'm afraid.

EDIT: I can't access my library at the moment, so I could only use the very small selection of 26 files I had on my notebook. Using just the hash based on the first 32B, there's one duplicate. If I add the filesize, there are no duplicates. Also, verifying those 26 files using 32B samples took 2.376 seconds; with a sample size of 4kB it took 2.399 seconds. The script itself is very inefficient of course, so the size of the sample has very little impact. Also, judging by that tiny selection of files, tail seems to give more unique results compared to head, but it's also a bit slower.
#21
wsippel Wrote:What exactly would you propose? Checking just the filesizes for comparison purposes or even smaller samples? With so few duplicates even at tiny samples, I don't have much to work with I'm afraid.

EDIT: I can't access my library at the moment, so I could only use the very small selection of 26 files I had on my notebook. Using just the hash based on the first 32B, there's one duplicate. If I add the filesize, there are no duplicates. Also, verifying those 26 files using 32B samples took 2.376 seconds; with a sample size of 4kB it took 2.399 seconds. The script itself is very inefficient of course, so the size of the sample has very little impact. Also, judging by that tiny selection of files, tail seems to give more unique results compared to head, but it's also a bit slower.

tbh I wrote that response when I thought you wrote KB and not B; I agree that with such small sizes there's kind of no need. Obviously storing the file size is still a good idea for XBMC, but it may not really need to be tested.

EDIT: I think we always need to do this on the head of the file, since if you're moving/copying the data, the tail will keep changing throughout the copy. XBMC can play a file while it is being copied, so it would be nice if the metadata is correct despite that :)
#22
topfs2 Wrote:tbh I wrote that response when I thought you wrote KB and not B; I agree that with such small sizes there's kind of no need. Obviously storing the file size is still a good idea for XBMC, but it may not really need to be tested.
It probably makes sense to store the sizes either way. It's small, basically free (from a performance standpoint), and adds another layer of protection. Other than that, from what I've seen, 4kB should be sufficiently safe and reasonably fast, especially considering that the scan is usually only done once per file unless it's moved or renamed.
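As a rough sketch of what such an identifier could look like, assuming a 4 kB head sample plus the file size (hypothetical helper, not XBMC code):

Code:
# Hypothetical identifier: file size paired with an MD5 of the first 4 kB.
# The size lookup is essentially free and the small read keeps hashing cheap;
# two files only collide if both the length and the head sample match.
import hashlib
import os

def file_identifier(path, sample_size=4096):
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read(sample_size)).hexdigest()
    return "%d-%s" % (size, digest)

# Example usage: file_identifier("movie.mkv") -> "<size in bytes>-<md5 of first 4 kB>"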
#23
We should just use the same hash as is already used in various online databases (e.g. thetvdb and themoviedb; I think both implement the OpenSubtitles hash, which is based on an MD5 of 64k from the start and end + filesize IIRC). That way we can use it to look up subs and also get more accurate matches from online sites.
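For reference: as commonly documented, the OpenSubtitles hash is actually not an MD5 but a 64-bit sum of the file size plus the little-endian 64-bit words of the first and last 64 KiB. A sketch of that algorithm (my reconstruction from the published description, not XBMC code):

Code:
# Sketch of the OpenSubtitles-style hash: file size plus the little-endian
# 64-bit words of the first and last 64 KiB, truncated to 64 bits.
import os
import struct

def opensubtitles_hash(path, chunk_size=64 * 1024):
    size = os.path.getsize(path)
    if size < 2 * chunk_size:
        raise ValueError("file too small for this hash")
    value = size
    with open(path, "rb") as f:
        for offset in (0, size - chunk_size):
            f.seek(offset)
            for (word,) in struct.iter_unpack("<Q", f.read(chunk_size)):
                value = (value + word) & 0xFFFFFFFFFFFFFFFF
    return "%016x" % value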
#24
jmarshall Wrote:We should just use the same hash as is already used in various online databases (e.g. thetvdb and themoviedb; I think both implement the OpenSubtitles hash, which is based on an MD5 of 64k from the start and end + filesize IIRC). That way we can use it to look up subs and also get more accurate matches from online sites.

While I like the idea of being compatible with thetvdb and themoviedb, one must note the inherent downside of that approach: you can only ever identify a "finished" file. I know most people probably use XBMC to stream from other devices, so what we fetch is "finished". I currently push my videos to my HTPC, and this would never work with that; I am mostly bringing the downside up because it shows a problem for perhaps a minority of users.

Another use case where it may be more interesting to only hash the beginning is if we hook up to inotify, where we get a notification as soon as a file starts to be copied; with a head-and-tail hash we would have to wait until the copy has finished before actually identifying it.
