My ongoing devel branch w/ rebased branches. #33
Open: chadnetzer wants to merge 130 commits into akaihola:master from chadnetzer:devel
Conversation
While there is a queue of pending merges and development, including possible option parsing changes, it is best to change the version string to an interim value until the next release.
Make verbosity checks consistently use "greater than", rather than "greater than or equal".
Since it's a counter, rather than a boolean, 'verbosity' is a more appropriate word.
Even though verbosity resets the value to zero on the first '-v' encountered, if no '-v' is given the variable is initialized to None (at least in Python 3).
There is a rare corner case when transitioning from the old-style verbose option arguments (-v num) to the new style (-v -v ...) that the OptionParser cannot catch: when the given 'num' matches an existing directory name. Add a safeguard to prevent inadvertently scanning a 'num' directory when old-style verbose arguments are still being used. This *could* block intentional cases where someone legitimately wants to scan a directory named as a number. This seems a rare case, though, and can be worked around by not specifying the directory name directly after a '-v' argument (or by not using '-v'). An error message is printed to indicate the ambiguous argument, and the check can be disabled by changing a default global boolean. Eventually, this safeguard could be removed, once it is safe to assume that users have transitioned to the new-style verbosity arguments.
Addresses issue GH akaihola#1 and incorporates @wolfospealain verbosity changes. Changes the internal variable name to the more appropriate 'verbosity'. Adds a safeguard against certain rare cases where an old-style verbose numeric argument might be misinterpreted as an existing directory. Should help better justify the transition to the new style.
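The safeguard described above might look something like the following minimal sketch. The function and global names here are illustrative, not the PR's actual identifiers; the idea is simply to flag a digit-string argument after '-v' that also names an existing directory.

```python
import os

# Hypothetical names: the real code's safeguard and its disabling global
# may differ. True by default; set False to allow numeric dir names after -v.
CHECK_OLD_STYLE_VERBOSE = True

def check_ambiguous_verbose(args):
    """Return the ambiguous argument if an old-style '-v num' value could
    be mistaken for a directory to scan, else None."""
    for i, arg in enumerate(args[:-1]):
        nxt = args[i + 1]
        if (CHECK_OLD_STYLE_VERBOSE and arg == "-v"
                and nxt.isdigit() and os.path.isdir(nxt)):
            return nxt
    return None
```

A caller would print an error naming the returned argument and exit, rather than silently treating it as a scan target.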
Memory can be a problem over a large number of files. Added file-size and max-file-size options to check only files within a minimum, maximum, or range of sizes. It is ultimately more important to check all identical files in a directory than to check all files and run short on memory. It also provides a quick way to save space by hardlinking only the larger files.
A more specific name and consistent w/ options.max_file_size
Separate from the regular file check, allowing further tests without a long boolean chain, and use the stat_info attributes for consistency with other parts of the code.
Simple tests to specify file sizes out of range of the test data, and ensure that no changes were made (similar to a dry-run).
Adds the file size options implemented by @wolfospealain, with argument range checking and test cases.
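The size filtering boils down to a small predicate along these lines (a sketch; the option names follow the commit messages above, not necessarily the exact implementation):

```python
def size_in_range(size, min_size=0, max_size=None):
    """Return True if a file's size passes the min-size/max-size filters.

    A max_size of None means "no upper bound", matching the default when
    the option is not given. (Illustrative helper, not the PR's code.)
    """
    if size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    return True
```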
With samename option enabled, possible matching hardlinks could be skipped depending on the directory iteration order and existing hardlink status of files. This was causing actual failure of test_hardlink_tree_filenames_equal() on some systems.
Adds a test in case dir2 is found before dir1. With the filenames-equal option enabled, the order of iteration could cause a false negative (i.e. an improperly passed test). By changing the initial link relationship, we can cause the test to fail if the directory iteration order is reversed.
Fixes directory order dependency for 'filenames-equal' option. With samename option enabled, possible matching hardlinks could be skipped depending on the directory iteration order and existing hardlink status of files. This was causing actual failure of test_hardlink_tree_filenames_equal() on some systems. Changed to test if new file matches one in file_hashes only if filenames also match.
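The fix described above amounts to filtering the cached candidates by filename before comparison, so the result no longer depends on which directory was walked first. A minimal sketch, with hypothetical names (`file_hashes` maps a hash key to a list of previously seen paths):

```python
import os

def candidate_paths(file_hashes, file_hash, pathname, samename):
    """Yield previously seen paths worth comparing against `pathname`.

    With the samename option, only cached paths whose basename matches are
    considered, regardless of directory iteration order.
    (Illustrative sketch, not the PR's actual code.)
    """
    filename = os.path.basename(pathname)
    for cached in file_hashes.get(file_hash, []):
        if samename and os.path.basename(cached) != filename:
            continue
        yield cached
```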
Add tearDown(), and keep track of all dirs and files created in the setUp(), so that we can safely delete them during teardown. Previously, the tempdir created in setUp() was being left behind and polluting the tmp filesystem.
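The setUp()/tearDown() bookkeeping might look like this sketch (class and attribute names are illustrative):

```python
import os
import shutil
import tempfile
import unittest

class HardlinkTestCase(unittest.TestCase):
    """Sketch of tracking created paths so teardown can clean them up."""

    def setUp(self):
        self.root = tempfile.mkdtemp()  # remember the tempdir for cleanup
        self.created = []               # track every file we make
        path = os.path.join(self.root, "a.txt")
        with open(path, "w") as f:
            f.write("contents")
        self.created.append(path)

    def tearDown(self):
        # Remove everything created in setUp() so /tmp isn't polluted.
        shutil.rmtree(self.root)
```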
Hoist the directory/file exclusions and matching out of hardlink_identical_files(), and group them all together in main(). This also sets up possible replacement of os.listdir() w/ os.scandir() in Python 3, which should reduce the number of file stats(), and potentially improve performance.
Other than the required "if True:" indentation marker, this patch only removes the additional indentation whitespace left after hoisting the file matching logic into the calling function. A 'git diff --ignore-space-change' of this commit against its parent should verify that it is only an indentation change.
The match option implies matching against filenames, but the actual check was being done against full paths, which meant that to match a file one always had to begin the match string with an asterisk, i.e. "*foo" was required to match a file called "foo". This adds a test to check for that bug.
Move directory tree walking logic out of hardlink_identical_files() and put it all in one place (under main()). Create helper functions for the file/directory exclusion and matching logic. Fixes a bug with previous file matching code performing the match against the whole pathname, rather than just the filename (and adds a test to demonstrate).
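The filename-only matching helper could be sketched as follows, assuming shell-style patterns via the stdlib `fnmatch` module (the helper name is hypothetical):

```python
import fnmatch
import os

def matches_pattern(pathname, patterns):
    """Match against the filename only, so a pattern 'foo' matches
    /any/dir/foo without needing a leading '*'. An empty pattern list
    matches everything. (Illustrative helper, not the PR's code.)"""
    if not patterns:
        return True
    filename = os.path.basename(pathname)
    return any(fnmatch.fnmatch(filename, pat) for pat in patterns)
```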
Reduces the amount of directory walking code to maintain, while retaining Python 2 and 3 compatibility. Some of the exclude/match code that is now spread across multiple functions can be grouped closer together. In Python 3, os.walk() uses os.scandir() which may improve performance over Python 2.
Replaces the tree walk algorithm with Python's os.walk(). This gives us the benefit of Python 2/3 compatibility while still taking advantage of improvements in the tree walk implementation (i.e. generators in Python 2 and os.scandir() in Python 3). Also reduces the code we must maintain.
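An os.walk()-based traversal with directory pruning might look like this sketch (function name and the regex-based exclusion are assumptions for illustration; pruning `dirnames` in place is the documented way to stop os.walk() from descending into excluded directories):

```python
import os
import re

def walk_files(topdirs, exclude_patterns=()):
    """Yield (dirpath, filename) pairs under each top directory, pruning
    excluded directory names in place so they are never descended into."""
    excludes = [re.compile(p) for p in exclude_patterns]
    for top in topdirs:
        for dirpath, dirnames, filenames in os.walk(top):
            # In-place slice assignment is required: os.walk() re-reads
            # this very list to decide which subdirectories to visit.
            dirnames[:] = [d for d in dirnames
                           if not any(rx.search(d) for rx in excludes)]
            for filename in filenames:
                yield dirpath, filename
```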
If the link succeeds, but the temporary file is not deleted, then we could end up trying to link to it again (since it is equal in content to the linked files). The unlink() shouldn't fail, but if it does, we shouldn't proceed.
If the link() succeeds, but the unlink() of the temporary file fails, log a message and don't continue, since the program may attempt to hardlink the temporary file (possibly endlessly). Removing the temp file should almost certainly succeed anyway, so this is already a panic situation, just made more explicit.
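The sequence described above could be sketched like this (names, the temp-file suffix, and the exact recovery steps are illustrative, not the PR's actual code):

```python
import logging
import os

def hardlink_files(source, dest):
    """Move dest aside, hardlink source over it, then remove the temporary.

    If the final unlink() fails, log and report failure rather than leave
    behind a content-equal temp file that could be re-linked endlessly.
    (Sketch only; the real program's error handling may differ.)
    """
    temp = dest + ".$$$___cleanit___$$$"
    try:
        os.rename(dest, temp)
    except OSError as exc:
        logging.error("Failed to rename %s: %s", dest, exc)
        return False
    try:
        os.link(source, dest)
    except OSError as exc:
        logging.error("Failed to link %s -> %s: %s", source, dest, exc)
        os.rename(temp, dest)  # restore the original file
        return False
    try:
        os.unlink(temp)
    except OSError as exc:
        # Panic case: a stray temp file could be matched and linked again.
        logging.error("Failed to remove temporary %s: %s", temp, exc)
        return False
    return True
```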
If the program verbosity is left at zero, and the program is doing a lot of comparisons, it's helpful to see a dry-run message at startup to know that nothing bad is happening.
This stops the dry-run message from printing during testing. The normal unittest module doesn't seem to easily allow toggling the buffering back on at the commandline, which is a shame. Nose may allow it, I haven't checked.
Print dry-run message on program start
By specifying --debug, print stats on internal software statistics, such as how often the hashes match, and how many linear search iterations occur.
Track some additional statistics for debugging
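The debug counters could be as simple as this sketch (class and attribute names are illustrative):

```python
class Statistics:
    """Counters surfaced by --debug: how often hash lookups hit, and how
    many linear-search comparisons occur within a hash bucket.
    (Hypothetical names, not the PR's actual code.)"""

    def __init__(self):
        self.hash_hits = 0
        self.hash_misses = 0
        self.search_iterations = 0

    def found_hash(self):
        self.hash_hits += 1

    def missed_hash(self):
        self.hash_misses += 1

    def did_comparison(self):
        self.search_iterations += 1
```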
The MAX_HASHES size limit is likely an artifact of the program's origins in C/C++, where a fixed size hash table may have been used (C++ had no standard hash table for the longest time). With Python, there is no need to limit the hash value range, as a dynamic dictionary mapping is being used to store the file hash values. Limiting the hash value range only increases the probability of overlapping hash values, which lengthens the lists for storing pathnames of files with equivalent hash values. Since these must be searched linearly, there is a net slowdown. Space savings are moderate or non-existent, since we store all the key values anyway.
The hash values are used as dictionary keys, so they don't need to be limited in range by a modulus.
Rather than re-extracting the filename from pathname (per file), just pass along the provided filename.
os.walk() provides the filename, avoiding the need for os.path.basename().
Accepting stat_info and options gives more freedom as to how the hash could be calculated. What is called the "hash" is really a "key", and is itself hashed by Python when used as a dict key.
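The key-building function could be sketched as below. The field choices and option name are assumptions for illustration; the point is that a plain tuple works directly as a dict key, with Python hashing it internally, so no modulus is needed:

```python
def file_hash(stat_info, options):
    """Build the dictionary key ("hash") from a file's stat info.

    Which fields participate depends on the options; here we assume a
    hypothetical `ignore_timestamp` flag for illustration.
    """
    key = [stat_info.st_size]
    if not getattr(options, "ignore_timestamp", False):
        key.append(stat_info.st_mtime)
    return tuple(key)  # tuples are hashable, so usable as dict keys
```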
Adds tests that ensure that files with different sizes don't get linked, and also that max_size restrictions allow smaller files to become linked.
Also perform simultaneous min-size/max-size testing (ie. both args at once)
Update testing with different sized data, and verification of min-size/max-size options in different combinations and circumstances.
In order to capture the exception argument before Python 2.6, in a way that is also compatible with Python 3, we use the sys.exc_info method, rather than "except Exception as".
The logging package was added in Python 2.3. Python 2.3 support tested on RHEL/Centos 5.9
The basicConfig() call only started taking keyword setup in Python 2.4.
There are significant changes to the 2.7 unittests module, so we don't bother with supporting older versions.
After testing on Python 2.3, include changes to exception handling and logging to allow the program to still run on Python 2.3+ systems.
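The version-neutral exception capture mentioned above works like this sketch ('except OSError as e' needs Python 2.6+, while the old 'except OSError, e' form is a syntax error on Python 3; sys.exc_info() is valid on both):

```python
import os
import sys

def try_unlink(pathname):
    """Attempt to remove a file, returning the error message on failure
    or None on success. (Illustrative example of the compat idiom.)"""
    try:
        os.unlink(pathname)
    except OSError:
        exc = sys.exc_info()[1]  # the exception instance, version-neutral
        return str(exc)
    return None
```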
This just clarifies that we should attempt to update to the newest inode time only when ignoring the time value in our hash calculations, otherwise the hash in file_hashes would become outdated.
Code clarification as to when inode time/ownership can be updated.
The change that causes linking to always use the file with the greater nlink as the source doesn't work, because it can orphan already-seen files; and since only a single pass is made, those files aren't considered again for linking to the larger cluster. The problem can be masked with two clusters, since there are enough remaining files to allow the unlinked but already-seen files to be relinked. These tests demonstrate the issue by using 3 clusters.
Go back to always using cached pathname as link() source. This reverts commit 73d2764. This reverts commit 5e9d961. This reverts commit 591be1c. Since this is a single pass algorithm, the link() source should always be a file that was previously seen and is in the file_hashes cache. In this way, the cached file's inode count can only increase, and the algorithm can link to all equal files (at the cost of doing more link() calls.) By reversing the link order when the newly discovered file had a larger nlink value, files that had already been processed could become unlinked from the new cluster, and possibly never re-linked to it. The test case added previously demonstrates this effect.
The "fix" for the "clustering issue" doesn't work, as it allows files that have already been processed by the one-pass algorithm to become unlinked from the larger cluster. So in cases where all equal files could be hardlinked together by the original algorithm, the clustering "fix" would leave stragglers. Test case added to demonstrate this bug.
This was referenced Jul 21, 2018
Incorporates many of @wolfospealain's changes, and closes issues #1, #4, #5, #15, #17, #18, #20, #21, #24, #26, #27, and #28.
I did a fresh series of merge/rebases to create a less intimidating merge history, which should be easier to review as a whole than 20 separate stale PRs.
The software still works on Python 2, but now also works with Python 3, and I've tripled the number of tests (with more to come).