Feature Request: Save hash to Report DB. #25
Comments
Well eventually I've done it with some hack job.

```diff
diff -rw dupd_latest/dupd dupd --exclude=tests --exclude=*.git
Only in dupd: build
Only in dupd: dupd
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/dbops.c dupd/src/dbops.c
112c112
<     "each_size INTEGER, paths TEXT)");
---
>     "each_size INTEGER, paths TEXT, hash TEXT )");
420c420
< void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths)
---
> void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths, char * hash)
422c422,425
<   const char * sql = "INSERT INTO duplicates (count, each_size, paths) "
---
>
>   const char * sqly = "INSERT INTO duplicates (count, each_size, paths, hash) "
>     "VALUES(?, ?, ?, ?)";
>   const char * sqlx = "INSERT INTO duplicates (count, each_size, paths) "
424a428,429
>   int hash_len = strlen(hash);
>   const char * sql = ( hash == 0 ? sqlx : sqly );
440a446,451
>
>   if( hash != 0 ) {
>     // printf("++++++++++++++ Hash %d -> %s\n", hash_len, hash);
>     rv = sqlite3_bind_text(stmt_duplicate_to_db, 4, hash, -1, SQLITE_STATIC);
>     rvchk(rv, SQLITE_OK, "Can't bind file hash: %s\n", dbh);
>   }
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/dbops.h dupd/src/dbops.h
135c135
< void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths);
---
> void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths, char * hash);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/filecompare.c dupd/src/filecompare.c
76c76
<       duplicate_to_db(dbh, 2, size, paths);
---
>       duplicate_to_db(dbh, 2, size, paths, 0);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/hashlist.c dupd/src/hashlist.c
326a327,332
>   char hash_out[HASH_MAX_BUFSIZE];
>   char * strhash;
>   char * strp ;
>   char * hashp = hash_out;
>   int hsize = hash_get_bufsize(hash_function);
>
372,373d377
<   int hsize = hash_get_bufsize(hash_function);
<   char hash_out[HASH_MAX_BUFSIZE];
382a387
>   strp = memstring("hash", p->hash, hsize);
389,390c394,395
<       duplicate_to_db(dbh, p->next_index, size, pbi->buf);
<
---
>       duplicate_to_db(dbh, p->next_index, size, pbi->buf, strp);
>       free(strp);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/refresh.c dupd/src/refresh.c
132c132
<     duplicate_to_db(dbh, new_entry_count, entry_each_size, new_list);
---
>     duplicate_to_db(dbh, new_entry_count, entry_each_size, new_list, 0);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/utils.c dupd/src/utils.c
300a301
>
307c308
<   printf("%s: ", text);
---
>   printf("%s: %d: ", text, bytes);
314a316,332
> }
>
> char * memstring(char * text, char * ptr, int bytes)
> {
>   int i;
>   unsigned char * p = (unsigned char *)ptr;
>   int space = ( strlen(ptr)*3 + 2 );
>   char * optr = (char *) malloc((1024) * sizeof(char));
>   char * xptr = optr ;
>
>   for (i=0; i<bytes; i++) {
>     xptr += sprintf(xptr, "%02x ", *p++);
>   }
>   //printf("\n-----------> memstring >> %s <-------------\n", optr);
>   //memdump(text, ptr, bytes);
>   //printf("~~~~~~~~~~~~\n");
>   return optr;
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/utils.h dupd/src/utils.h
239a240,241
> char * memstring(char * text, char * ptr, int bytes);
>
```

So, not much changed, but then, I'm not sure about the hidden (if any) side effects.
Thanks for using dupd! Saving the hashes of duplicates is easy enough, but I'm not sure it is useful. Hashes are computed only for files known to be duplicates (if a file can be rejected earlier, the full file is not read, so its hash isn't computed).

If you compare the known-duplicate hashes from two different systems, there is no guarantee you will find any duplicates even if they exist. That's because files which are duplicates across the two systems won't have a hash present unless each of them also has duplicates on its local system. So comparing across systems that way will only match a somewhat random subset of files, if any. (If the two external drives are mounted on the same system, run dupd with multiple -p options pointing at both paths, which solves that use case.)

In general, finding duplicates across separate systems requires computing hashes for all files. That's easy enough with just find & sha1sum, but it'll be very slow.
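For what it's worth, here is a minimal sketch of that brute-force approach in Python rather than find/sha1sum (the output format and the choice of SHA-1 are purely illustrative, not anything dupd itself produces):

```python
# Walk a tree and print "sha1  path" for every file: the slow brute-force
# equivalent of `find ... | xargs sha1sum`. Run once per system, then
# compare or merge the resulting lists.
import hashlib
import os
import sys

def hash_tree(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha1()
            try:
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
            except OSError:
                continue  # unreadable file, skip it
            print(f"{h.hexdigest()}  {path}")

if __name__ == "__main__":
    hash_tree(sys.argv[1] if len(sys.argv) > 1 else ".")
```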
If I may butt in, one use-case for storing hashes for all files: checking for duplicates on completely separate systems, especially with completely different paths, with the intent of keeping certain subsets on chosen machines (i.e., keeping some parts duplicated and others not). Admittedly this is an uncommon use-case one would not expect dupd to solve. Still, it is a use-case which the non-profit I volunteer for has been facing for some time.
Hi, yes, of course you are correct that comparing individual dupd runs by hash will only catch duplicates that exist on both drives. However, I was considering creating a separate file list with xxhash output to compare against the original too (and also to pump into pandas). As you say, something like:
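Roughly the following, perhaps (a sketch only: the drive list file names are made up, and the lists are assumed to be in the usual "hash  path" format that sha1sum/xxhsum-style tools print; paths containing whitespace would need more careful parsing):

```python
# Load one "hash  path" list per external drive and inner-join on the hash
# column to get candidate duplicates present on both drives.
import pandas as pd

cols = ["hash", "path"]
drive_a = pd.read_csv("drive_a_hashes.txt", sep=r"\s+", names=cols, engine="python")
drive_b = pd.read_csv("drive_b_hashes.txt", sep=r"\s+", names=cols, engine="python")

cross_dupes = drive_a.merge(drive_b, on="hash", suffixes=("_a", "_b"))
print(cross_dupes[["hash", "path_a", "path_b"]])
```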
Bit of trivia: dupd is named like a daemon (ends in 'd') even though it is not, because during initial implementation my plan was for it to be a daemon which coordinates duplicate finding across systems. That turned out to be too slow to be interesting, so I focused on the local disk case but didn't change the name. I'd still love to solve the multiple-systems problem if there is an efficient way that is much better than simply using find | sort | uniq.

@rosyth - dupd currently does save the hash of some files, but only large ones. You could get these from the .dupd_cache db with something like:
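(A sketch only: the table and column names below are placeholders rather than the actual cache schema, and the cache location is assumed to be ~/.dupd_cache. Check the real layout with .schema in the sqlite3 shell first.)

```python
# Dump path/hash pairs from the dupd cache db. Table and column names here
# are guesses; inspect the actual schema with `sqlite3 ~/.dupd_cache .schema`
# and adjust the query accordingly.
import os
import sqlite3

con = sqlite3.connect(os.path.expanduser("~/.dupd_cache"))
for path, file_hash in con.execute("SELECT path, hash FROM files"):
    if isinstance(file_hash, bytes):
        file_hash = file_hash.hex()  # hashes may be stored as raw blobs
    print(file_hash, path)
con.close()
```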
There's a performance cost to saving these hashes, though, so they're only saved for large files.
The manpage covers this, but if there's anything the manpage doesn't make clear, please let me know so I can add more clarity.
This is inspiring to hear, as it was the same direction I was heading. Would it be fine to open a new ticket for your consideration, presenting our use case, or shall I clarify here?
Feel free to file another ticket with specific use-case details. I'm not entirely convinced it's possible, though. Trying to coordinate partial file matches over the network (particularly if more than two systems are involved) would likely introduce so much delay that it's just faster to hash everything and compare later. At that point dupd doesn't add any value, since it can be done in a trivial shell script. But I'd love to be proved wrong.
Yes, I see that now, thanks; RTFM always applies.
Since hashing is already being done, why not save the hash to the report database?
This would allow me to merge, by hash, two separate dupd runs on different external drives.
I can import the SQLite DBs into Python/pandas (since I'm not familiar with SQL), merge them, and get a new list of possible duplicates, e.g.:
```python
import pandas as pd
import sqlite3
con = sqlite3.connect("dupd.db3")
dupx = pd.read_sql('SELECT * FROM duplicates WHERE each_size > 10000;', con)
```
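If the report DB did gain a hash column as requested, merging two runs would be just one more step. A sketch, assuming the proposed hash column exists in the duplicates table and two report files named as below:

```python
# Merge two dupd report databases on the (proposed) hash column to find
# duplicate groups that appear on both external drives. Assumes the feature
# requested here, i.e. a "hash" column in the duplicates table.
import sqlite3
import pandas as pd

query = "SELECT * FROM duplicates WHERE each_size > 10000;"
dup_a = pd.read_sql(query, sqlite3.connect("drive_a.db3"))
dup_b = pd.read_sql(query, sqlite3.connect("drive_b.db3"))

merged = dup_a.merge(dup_b, on="hash", suffixes=("_a", "_b"))
print(merged[["hash", "paths_a", "paths_b"]])
```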
I've tried to modify the code myself to add hashes, but not having used C for 20 years, it hasn't been very successful.
I suspect it would not be difficult, and possibly quite useful to other users too.