Consider ways to make the listing operations (ls, dups, uniques) more efficient #46
Nice to hear it has worked satisfactorily on a 150TB data set. The largest I've run it on was about 4TB. Are you running a development branch build or a released version? The SQLite db only stores known duplicates; it does not store info about files which were unique at the time of the scan. So when you run `uniques`, dupd has to work out the unique files again by checking files against the db, rather than simply reading back a stored list.
The use case: a group of students in the humanities needs to work on several collections of digital documents: audio, image, text and video. Some collections are copies of each other, some are not. They need to assign taxonomies of their choice. They can add files to the collections, and rename or move files and directories according to their own chosen taxonomy principles. In the end, the collections are "frozen": no more changes. Then we need to ascertain which taxonomy choice better served the initial criteria guidelines. Often, a collection will be almost 100% a duplicate of another. Then we need to copy only the files that are unique to it, which is why the `uniques` listing matters so much to us. If it may be of any use:
I initially thought that the SQLite database was holding all entries, so that such computation, in case of frozen collections, could be done without rescanning. Thank you for considering this!
We need to use our university server, which is running FreeBSD 13.2, with the latest release of `dupd` from the ports tree. If necessary, I can ask the sysadmin to install a development version. Thank you again.
I am keeping the `dupd uniques` command running in the meantime. It seems `dupd` also warns that the database may be stale.
(The stale warning in this case should be ignored, as I know no folder has been changed.)
Yes, you can ignore the database staleness warning if you know the data set is static. I would not recommend updating the dupd build on the server to the development version, as that one is a work in progress that may or may not work. The development version (what will eventually become the 2.0 release) will reduce memory consumption by quite a bit in many scenarios, but if you're not running out of RAM on your 150TB data set you should be good with the latest release version. I assume the files are all very large, if the data set only contains 2.4M files but the total size is 150TB.
I had forgotten that the 1.7 release has the `--uniques` scan flag, so you could try it.
This will cause it to save all unique files in the db as well. Given your use case where nearly all files are duplicates, this may be useful.
Thank you, I will not use the development version then, and will wait for release 2.0. The server is a Dell PowerEdge with 128GB of RAM, and I did not notice excessive RAM usage.
Yes, most files are large, being archival-quality digitisation of film and audio reels.
Thank you, that is very good to know. I will wait for the current run to finish. Thank you again!
I just made a 1.7.2 release which makes the `uniques` report use the list of unique files already stored in the db, so it should be fast. Since I just made the release, it won't be available in the FreeBSD ports until someday. However, you can pretty much replicate the behavior with the previous release by first running the scan with the `--uniques` flag and then querying the db directly with sqlite3.
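A query of roughly this shape should do it; the table and column names below are assumptions for illustration, not confirmed against dupd's actual schema:

```sh
# Export the stored list of unique files straight from the dupd database.
# "files" and "path" are assumed names -- adjust them to the real schema.
sqlite3 /root/.dupd_sqlite "SELECT path FROM files;" > /tmp/unique_files.txt
```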
(Replace /root/.dupd_sqlite with the location of the db, if it is elsewhere.)
> `dupd uniques` taking a very long time?
The reason these listing operations are slow is that they do an SQLite query for every file, and that's just slow. For the case of uniques, it is easiest to work around by simply showing the list of previously identified unique files as-is. As long as no new files have been added since the scan, that should be the list. The dups operation is more complex because it needs to validate whether a duplicate is still a duplicate, which requires a db query, so it can't be skipped. And the ls operation is just dups+uniques, so it also needs to do that. These will be trickier to make more efficient. I will leave this ticket open for 2.0 to consider ways to improve these.
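In rough shell terms, the difference looks something like this. This is only a conceptual sketch, not dupd's actual code: the table and column names are made up, and SQL quoting of unusual filenames is ignored.

```sh
DB=/root/.dupd_sqlite            # db location used elsewhere in this issue

# Slow pattern: one SQLite query per file on disk (N queries for N files,
# each paying per-query overhead). /tmp/all_files_on_disk is a placeholder
# list of paths, one per line.
while IFS= read -r f; do
  sqlite3 "$DB" "SELECT 1 FROM files WHERE path = '$f' LIMIT 1;" > /dev/null
done < /tmp/all_files_on_disk

# Fast pattern (feasible for uniques): return the whole stored list in one query.
sqlite3 "$DB" "SELECT path FROM files;"
```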
Thank you for releasing version 1.7.2; as soon as I see it on FreshPorts I will have it installed, redo the scan and report back. By that time we shall probably have added a major extra collection and reached 180TB, so it will be interesting to see. Thank you also for the details on the SQLite operations, very instructive for me. Is the SQLite DISTINCT keyword somehow related to it? Sometimes I wish I were in programming and not the humanities. Fascinating field! Thank you very much for all your help. Much appreciated!
It may take a long while to show up in ports (not sure), but as noted you can get the same outcome with the current version, just a bit more cumbersome. So no need to wait. Re-run the scan with the `--uniques` flag and then export the list from the db. (I know your system has the sqlite3 libraries installed, because dupd requires them, but it may or may not have the sqlite3 command installed. If it doesn't, ask for it to be installed in order to run the query to export the list of files.) (This is unrelated to the SQLite DISTINCT link you mentioned. dupd just stores a list of all unique files in the database, but only when the scan is run with `--uniques`.)
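Concretely, something like the following, with /path1 and /path2 as used elsewhere in this issue, and with `--path` assumed to be the scan option in use:

```sh
# Re-run the scan so unique files are recorded in the db as well
# (the --uniques scan flag is the one from the 1.7 release mentioned above),
# then export them with the query shown earlier in this thread.
dupd scan --path /path1 --path /path2 --uniques
```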
Thank you for the heads up. I have just launched the new scan with the `--uniques` flag. I will report back when it is done. Am I correct in assuming this will store only the unique paths, and not the duplicates?
Duplicates are always stored in the db, so this run will store the pathnames of both duplicates and unique files.
Understood, thanks.
Thanks! I built it for my workflow since the existing tools didn't really match how I wanted to work. I'm glad it's useful to others.
Interim report:
I will be traveling over the week-end for a seminar. I will report back on Tuesday, hopefully it will be completed by then. |
Sorry for the delay. The scan finished a few days ago, but I wanted to make sure of some peculiarities. Doing it with 3 paths took about 144 hours. The results so far:
There are 323429 uniques in all 3 paths.
That is not very useful, since we need to know the uniques from only one path at a time.
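A per-path list can be pulled out of the exported uniques with something along these lines; the file names here are placeholders, not the exact commands I ran:

```sh
# Keep only the unique files that live under /path2, and count them.
grep '^/path2/' /tmp/unique_files.txt | sort > /tmp/uniques_path2
wc -l /tmp/uniques_path2
```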
So I could now move those 195853 files to a new directory. BUT: if I run the former command (from the top of this issue), it has already printed a few files that are not in the list above. So now I am unsure what to do next. Shall I simply follow the normal way and remove all files from /path2 which are duplicates? Is there a way to debug this issue?
That seems odd, and it is hard to tell without seeing the files and paths. Some thoughts... I know you said the file set is static, but any chance new files were added after the scan started? Since the scan was run with --uniques, the db should contain an entry for every file in the scanned filesystems (except for new files added after the scan started). Try this:
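A sketch of the kind of command meant here (GNU/BSD find syntax; /path1 /path2 /path3 stand in for the actual scanned paths):

```sh
# List every regular file under the scanned paths, skipping empty files and
# hidden files/directories since dupd skips those, and sort for comparison.
find /path1 /path2 /path3 -type f ! -size 0 ! -path '*/.*' | sort > /tmp/found_by_find
```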
Now /tmp/found_by_find should have all the files scanned by dupd (I excluded zero-sized files and hidden files since dupd skips those). So there should be 3814032 file paths there. Now compare with the set of paths in the db:
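And for the db side, something of this shape; the table and column names are assumptions about the schema, and the awk step only keeps the first space-delimited field in order to drop leading whitespace in the query output:

```sh
# Dump every stored path from the dupd database, strip leading whitespace by
# taking the first awk field, and sort for comparison with /tmp/found_by_find.
# NOTE: "files"/"path" are assumed names, and this awk step truncates any
# filename containing spaces (see the correction further down the thread).
sqlite3 /root/.dupd_sqlite "SELECT path FROM files;" | awk '{print $1}' | sort > /tmp/found_by_dupd
```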
So now /tmp/found_by_dupd should have the same 3814032 file paths as in /tmp/found_by_find. If not, try to identify what's different about the files missing in one or the other list.
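One way to see the differences, assuming both files are sorted as in the sketches above:

```sh
# Paths in the find output but missing from the dupd export:
comm -23 /tmp/found_by_find /tmp/found_by_dupd > /tmp/missing_from_dupd
# Paths in the dupd export but not found on disk:
comm -13 /tmp/found_by_find /tmp/found_by_dupd > /tmp/missing_from_find
```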
Thank you for taking so much time to assist. Your thoughts, suggestions and command line examples are much appreciated.
I thought of sharing the SQLite db and paths, but they contain personal data which belongs to others.
That is why I can only report the raw numbers here. The two lists differ in length:
3814032 − 3810453 = 3579
Also, some of the entries in /tmp/found_by_dupd look truncated: something like a path being cut off at the first space inside a filename.
Would that be relevant?
The truncated filenames with spaces in /tmp/found_by_dupd are just due to the awk command picking the first string bounded by spaces, so it's not a dupd issue. I put that in the example above to remove the leading whitespace, but didn't think of files with spaces in them. The full paths should be in the db correctly. So run this instead:
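For instance, trimming leading whitespace with sed instead of taking the first awk field keeps paths with spaces intact (table and column names still assumed, as before):

```sh
# Same export as before, but sed only strips leading whitespace, so filenames
# containing spaces are preserved in full.
sqlite3 /root/.dupd_sqlite "SELECT path FROM files;" | sed 's/^[[:space:]]*//' | sort > /tmp/found_by_dupd
```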
With that correction, try to identify what is different about the files not included in /tmp/found_by_dupd. I'm curious to find out what might be the cause.
Sorry for the delay. I was quite confused by the remark above, and could not figure out why some files are reported as unique by `dupd uniques` but do not show up in the exported list.
It turns out many of the files reported by the "slow" method are empty. Could it be that zero-length files are skipped by the scan and never stored in the db, so that the former method considers zero-length files unique as well?
Yes, that's it! Thanks for identifying it. I filed a separate bug, #47, for this.
Excellent, thank you for explaining it. As a side note, in our admittedly very unusual corner case, filenames are considered an integral part of a file, not mere "metadata" about its content. So, for us, files with the same size but different filenames are to be considered different. I know this clashes with the Unix Weltanschauung. Now that I know of this aspect, I can always treat zero-length files with a separate script. Thank you again so much for considering this and for the tremendous help!
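For example, a minimal sketch of such a separate pass over zero-length files (the path is a placeholder):

```sh
# Collect all empty files under /path2 so they can be reviewed by filename.
find /path2 -type f -size 0 | sort > /tmp/zero_length_path2
```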
Hello,
I am very glad to have found dupd, as it offers the best workflow for my use case. I have run the following command on about 150TB of data; it took about 70 hours:
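It was roughly of this shape, with /path1 and /path2 as placeholder paths and assuming the standard `dupd scan --path` form:

```sh
# Initial scan of both collections into the default dupd database.
dupd scan --path /path1 --path /path2
```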
Then, I did:
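Something along these lines; it is an assumption here that the report can be limited to one of the scanned paths by running it from inside that directory:

```sh
# List files that exist only under /path2 (i.e. not duplicated anywhere else).
# Assumption: running the report from inside /path2 restricts it to that subtree.
cd /path2 && dupd uniques
```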
dupd started listing the files which are unique to /path2, but it is taking a very long time, with the CPU sitting at about 50%. Is this normal? I thought that, since the files have been listed in an SQLite db, printing such a list would have been fast?