8000 My ongoing devel branch w/ rebased branches. by chadnetzer · Pull Request #33 · akaihola/hardlinkpy · GitHub

My ongoing devel branch w/ rebased branches. #33


Open
wants to merge 130 commits into base: master
Conversation

chadnetzer
Collaborator

Incorporates many of @wolfospealain's changes, and closes issues #1, #4, #5, #15, #17, #18, #20, #21, #24, #26, #27, and #28.

I did a fresh series of merge/rebases to create a less intimidating merge history, which should be easier to review as a whole than 20 separate stale PRs.

The software still works on Python 2, but now also works with Python 3, and I've tripled the number of tests (with more to come).

chadnetzer and others added 30 commits July 4, 2018 14:13
While there is a queue of pending merges and development, including
possible option-parsing changes, it is best to change the version string
to an interim value until the next release.
Make verbosity checks consistently use "greater than", rather than
"greater than or equal".
Since it's a counter, rather than a boolean, 'verbosity' is a more
appropriate word.
Even though the verbosity counter starts from zero on the first '-v'
encountered, the variable is initialized to None if no '-v' is given (at
least in Python 3).
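This behaviour can be seen directly with optparse's "count" action (a minimal sketch; the option wiring here is illustrative, not the program's actual code):

```python
# With action="count" and no explicit default, optparse leaves the option
# value as None when the flag never appears, and counts from zero otherwise.
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-v", action="count", dest="verbosity",
                  help="increase verbosity (repeatable)")

opts_none, _ = parser.parse_args([])           # no -v given
opts_two, _ = parser.parse_args(["-v", "-v"])  # two -v flags

# verbosity is None when -v is absent, so guard comparisons with a default:
verbosity = opts_none.verbosity or 0
```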
There is a rare corner case when transitioning from the old verbose option
arguments (-v num) to the new style (-v -v ...) that the OptionParser
cannot catch: when the given 'num' matches an existing directory name.

Add a safeguard to prevent scanning 'num' directories inadvertently when
old-style verbose arguments are still being used.  This *could* block
intentional cases where someone wants to legitimately scan a directory
named as a number.  This seems a rare case, though, and can be worked
around by not specifying the directory name directly after a '-v'
argument (or not using '-v').

An error message is printed to indicate the ambiguous argument, and the
code can be disabled by changing the default global boolean.
Eventually, this safeguard could be removed, when it's safe to assume
that users have transitioned to the new style verbosity arguments.
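A minimal sketch of such a safeguard (the function name and structure are illustrative, not the program's actual code):

```python
import os

def ambiguous_verbose_arg(args):
    """Return an argument that is both a number following '-v' and an
    existing directory name, or None if the arguments look unambiguous."""
    for i, arg in enumerate(args[:-1]):
        nxt = args[i + 1]
        if arg == "-v" and nxt.isdigit() and os.path.isdir(nxt):
            return nxt
    return None
```

When this returns a value, the program can print an error naming the ambiguous argument instead of silently scanning the directory.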
Addresses issue GH akaihola#1 and incorporates @wolfospealain's verbosity changes.
Changes the internal variable name to the more appropriate 'verbosity'.

Adds a safeguard against certain rare cases where an old-style verbose
numeric argument might be misinterpreted as an existing directory.
Should help better justify the transition to the new style.
Memory can become a problem when scanning a large number of files. Added file-size and max-file-size options to check only files within a given minimum, maximum, or range of sizes. It is ultimately more important to check all identical files in a directory than to check all files and run short of memory. It also provides a quick way to save space by hardlinking only the larger files.
A more specific name, and consistent with options.max_file_size
Separate from the regular file check, allowing further tests without a
long boolean chain, and use the stat_info attributes for consistency
with other parts of the code.
Simple tests to specify file sizes out of range of the test data, and
ensure that no changes were made (similar to a dry-run).
Adds the file size options implemented by @wolfospealain, with argument
range checking and test cases.
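The size filter can be sketched as a small predicate over the stat info (parameter names are illustrative; the actual options are the file-size and max-file-size arguments described above):

```python
def size_in_range(stat_info, min_size=0, max_size=None):
    """True if the file's size falls within [min_size, max_size]."""
    size = stat_info.st_size
    if size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    return True
```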
With samename option enabled, possible matching hardlinks could be
skipped depending on the directory iteration order and existing hardlink
status of files.  This was causing actual failure of
test_hardlink_tree_filenames_equal() on some systems.
Adds a test in case dir2 is found before dir1.  With the filenames-equal
option enabled, the order of iteration could cause a false negative (i.e.
an improperly passed test).  By changing the initial link relationship,
we can cause the test to fail if the directory iteration order is
reversed.
Fixes directory order dependency for 'filenames-equal' option.

With samename option enabled, possible matching hardlinks could be
skipped depending on the directory iteration order and existing hardlink
status of files. This was causing actual failure of
test_hardlink_tree_filenames_equal() on some systems.

Changed to test if new file matches one in file_hashes only if filenames
also match.
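The corrected check can be sketched like this (function and names are illustrative, not the program's code): a cached file is a linking candidate under the samename option only when the basenames match.

```python
import os.path

def samename_candidate(cached_path, new_path, samename=True):
    """Under the samename option, only files whose basenames match may
    be considered for hardlinking."""
    if samename:
        return os.path.basename(cached_path) == os.path.basename(new_path)
    return True
```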
Add tearDown(), and keep track of all dirs and files created in the
setUp(), so that we can safely delete them during teardown.

Previously, the tempdir created in setUp() was being left behind and
polluting the tmp filesystem.
Hoist the directory/file exclusions and matching out of
hardlink_identical_files(), and group them all together in main().

This also sets up possible replacement of os.listdir() w/ os.scandir()
in Python 3, which should reduce the number of file stats(), and
potentially improve performance.
Other than the required "if True:" indentation marker, this patch only
removes the additional indentation whitespace left after hoisting the
file matching logic into the calling function.

A 'git diff --ignore-space-change' of this commit against its parent
should verify that it is only an indentation change.
The match option implies matching against filenames, but the actual
check was being done against full paths, which meant that to match a file
one always had to begin the match string with an asterisk; i.e. "*foo" was
required to match a file called "foo".  This adds a test to check for
that bug.
Move directory tree walking logic out of hardlink_identical_files() and
put it all in one place (under main()).  Create helper functions for the
file/directory exclusion and matching logic.

Fixes a bug with the previous file matching code performing the match
against the whole pathname, rather than just the filename (and adds a
test to demonstrate).
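The bug can be illustrated with fnmatch-style matching (a sketch, not the program's code): the pattern "foo" fails against a full path but succeeds against the basename.

```python
import fnmatch
import os.path

path = "/some/dir/foo"

# Matching against the whole path requires a leading "*" to succeed:
matches_full_path = fnmatch.fnmatch(path, "foo")
# Matching against just the filename behaves the way users expect:
matches_basename = fnmatch.fnmatch(os.path.basename(path), "foo")
```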
Reduces the amount of directory walking code to maintain, while
retaining Python 2 and 3 compatibility.  Some of the exclude/match code
that is now spread across multiple functions can be grouped closer
together.  In Python 3, os.walk() uses os.scandir() which may improve
performance over Python 2.
Replaces tree walk algorithm with Python's os.walk().  This gives us the
benefit of Python 2/3 compatibility while still taking advantage of
improvements in the tree walk implementation (ie. generators in Python 2
and os.scandir() in Python 3).  Also reduces the code we must maintain.
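An os.walk()-based traversal with in-place pruning of excluded directories can be sketched as follows (names are illustrative):

```python
import os

def walk_files(top, excluded_dirs=()):
    """Yield (dirpath, filename) pairs, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(top):
        # Pruning dirnames in place stops os.walk() from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            yield dirpath, name
```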
chadnetzer added 12 commits July 8, 2018 04:27
If the link succeeds, but the temporary file is not deleted, then we
could end up trying to link to it again (since it is equal in content to
the linked files).  The unlink() shouldn't fail, but if it does, we
shouldn't proceed.
If the link() succeeds, but the unlink() of the temporary file fails,
log a message and don't continue, since the program may attempt to
hardlink the temporary file (possibly endlessly).

Removing the temp file should almost certainly succeed anyway, so this
is already a panic situation, just made more explicit.
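The sequence being described can be sketched as a minimal model (assuming a rename-aside approach; names and structure are illustrative, not the program's code):

```python
import os

def hardlink_file(source, dest, tmp_suffix=".hardlink_tmp"):
    """Replace dest with a hardlink to source, via a temporary rename."""
    tmp = dest + tmp_suffix
    os.rename(dest, tmp)        # move dest aside
    try:
        os.link(source, dest)   # put a link to source in dest's place
    except OSError:
        os.rename(tmp, dest)    # restore dest on failure
        raise
    # The unlink must succeed: a lingering temp file is content-equal to
    # source and could be picked up (endlessly) as a link candidate.
    os.unlink(tmp)
    return True
```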
If the program verbosity is left at zero, and the program is doing a lot
of comparisons, it's helpful to see a dry-run message at startup to know
that nothing bad is happening.
This stops the dry-run message from printing during testing.  The normal
unittest module doesn't seem to easily allow toggling the buffering back
on at the command line, which is a shame.  Nose may allow it; I haven't
checked.
Print dry-run message on program start
By specifying --debug, print stats on internal software statistics, such
as how often the hashes match, and how many linear search iterations
occur.
Track some additional statistics for debugging
The MAX_HASHES size limit is likely an artifact of the program's origins
in C/C++, where a fixed size hash table may have been used (C++ had no
standard hash table for the longest time).  With Python, there is no
need to limit the hash value range, as a dynamic dictionary mapping is
being used to store the file hash values.

Limiting the hash value range only increases the probability of
overlapping hash values, which lengthen the lists for storing pathnames
of files with equivalent hash values.  Since these must be searched
linearly, there is a net slowdown.  Space savings are moderate or
non-existent, since we store all the key values anyway.
The hash values are used as dictionary keys, so they don't need to be
limited in range by a modulus.
Rather than re-extracting the filename from the pathname (per file), just
pass along the provided filename.
os.walk() provides the filename, avoiding the need for basename.
Accepting stat_info and options gives more freedom as to how the hash
could be calculated.  What is called the "hash" is really a "key", and
is itself hashed by Python when used as a dict key.
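A sketch of such a key function (the exact stat fields used here are illustrative; the real code may consult other options as well):

```python
def file_hash_key(stat_info, ignore_time=False):
    """Build a dict key ("hash") from stat fields; Python hashes the
    tuple itself when it is used as a dictionary key."""
    if ignore_time:
        return (stat_info.st_size,)
    return (stat_info.st_size, stat_info.st_mtime)

file_hashes = {}  # maps key -> list of pathnames sharing that key
```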
Adds tests that ensure that files with different sizes don't get linked,
and also that max_size restrictions allow smaller files to become
linked.
Also perform simultaneous min-size/max-size testing (i.e. both args at
once).
Update testing with different sized data, and verification of
min-size/max-size options in different combinations and circumstances.
chadnetzer added 10 commits July 8, 2018 18:25
In order to capture the exception argument before Python 2.6, in a way
that is also compatible with Python 3, we use sys.exc_info(), rather
than "except Exception as".
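The idiom looks like this (a sketch; the handler body is illustrative):

```python
import os
import sys

def try_unlink(path):
    """Unlink a path, capturing the exception portably across
    Python 2.3+ and Python 3 via sys.exc_info()."""
    try:
        os.unlink(path)
    except OSError:
        # "except OSError as e" is a syntax error before Python 2.6;
        # sys.exc_info()[1] retrieves the exception on any version.
        error = sys.exc_info()[1]
        return "unlink failed: %s" % error
    return "ok"
```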
The logging package was added in Python 2.3.

Python 2.3 support tested on RHEL/Centos 5.9
The basicConfig() call only started taking keyword arguments in Python 2.4.
There are significant changes to the 2.7 unittest module, so we don't
bother with supporting older versions.
After testing on Python 2.3, include changes to exception handling and
logging to allow the program to still run on Python 2.3+ systems.
This just clarifies that we should attempt to update to the newest inode
time only when ignoring the time value in our hash calculations,
otherwise the hash in file_hashes would become outdated.
Code clarification as to when inode time/ownership can be updated.
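The guard can be sketched as follows (names are illustrative, not the program's code): updating the kept inode's timestamp is only safe when time is excluded from the hash key, otherwise the key stored in file_hashes would no longer describe the inode.

```python
import os

def maybe_update_times(kept_path, other_stat, ignore_time):
    """Advance kept_path's times to the newer file's, but only when
    time is not part of the hash key."""
    if not ignore_time:
        return False  # time is in the hash key; don't invalidate it
    kept_stat = os.stat(kept_path)
    if other_stat.st_mtime > kept_stat.st_mtime:
        os.utime(kept_path, (other_stat.st_atime, other_stat.st_mtime))
        return True
    return False
```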
The change that causes linking to always use the file with the greater
nlink as the source doesn't work, because it can orphan already-seen
files; since only a single pass is made, those files aren't considered
again for linking to the larger cluster.  The problem can be masked with
two clusters, since there are enough remaining files to allow the
unlinked but already-seen files to be relinked.  These tests demonstrate
the issue by using 3 clusters.
Go back to always using cached pathname as link() source.

This reverts commit 73d2764.
This reverts commit 5e9d961.
This reverts commit 591be1c.

Since this is a single pass algorithm, the link() source should always
be a file that was previously seen and is in the file_hashes cache.  In
this way, the cached file's inode count can only increase, and the
algorithm can link to all equal files (at the cost of doing more link()
calls.)

By reversing the link order when the newly discovered file had a larger
nlink value, files that had already been processed could become unlinked
from the new cluster, and possibly never re-linked to it.  The test case
added previously demonstrates this effect.
The "fix" for the "clustering issue" doesn't work, as it allows files
that have already been processed by the one-pass algorithm to become
unlinked from the larger cluster.  So in cases where all equal files could
be hardlinked together by the original algorithm, the clustering "fix"
would leave stragglers.  Test case added to demonstrate this bug.
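The invariant can be shown with a toy model of the single pass (purely schematic, not the program's code): linking every newly seen file to the cached pathname means the cached inode's link count only ever grows.

```python
def plan_links(paths):
    """Return (source, target) link operations for one pass over equal files."""
    cached = None
    ops = []
    for path in paths:
        if cached is None:
            cached = path               # first file seen becomes the source
        else:
            ops.append((cached, path))  # link(cached, path)
    return ops
```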