8000 My ongoing devel branch w/ rebased branches. by chadnetzer · Pull Request #33 · akaihola/hardlinkpy · GitHub

My ongoing devel branch w/ rebased branches. #33


Open
wants to merge 130 commits into base: master
Conversation

chadnetzer
Collaborator

Incorporates many of @wolfospealain's changes, and closes issues #1, #4, #5, #15, #17, #18, #20, #21, #24, #26, #27, and #28.

I did a fresh series of merge/rebases to create a less intimidating merge history, which should be easier to review as a whole than 20 separate stale PRs.

The software still works on Python 2, but now also works with Python 3, and I've tripled the number of tests (with more to come).

chadnetzer and others added 30 commits July 4, 2018 14:13
While there is a queue of pending merges and development, including
possible option-parsing changes, it is best to change the version string
to an interim value until the next release.
Make verbosity checks consistently use "greater than", rather than
"greater than or equal".
Since it's a counter, rather than a boolean, 'verbosity' is a more
appropriate word.
Even though the verbosity counter starts from zero on the first '-v'
encountered, the variable is initialized to None if no '-v' is given (at
least in Python 3).
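This behaviour can be seen directly with optparse's "count" action (a minimal sketch; the option wiring here is illustrative, not the program's actual code):

```python
# With action="count" and no explicit default, optparse leaves the option
# value as None when the flag never appears, and counts from zero otherwise.
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-v", action="count", dest="verbosity",
                  help="increase verbosity (repeatable)")

opts_none, _ = parser.parse_args([])           # no -v given
opts_two, _ = parser.parse_args(["-v", "-v"])  # two -v flags

# verbosity is None when -v is absent, so guard comparisons with a default:
verbosity = opts_none.verbosity or 0
```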
There is a rare corner case when transitioning from the old verbose option
arguments (-v num) to the new style (-v -v ...) that the OptionParser
cannot catch: when the given 'num' matches an existing directory name.

Add a safeguard to prevent scanning 'num' directories inadvertently when
old-style verbose arguments are still being used.  This *could* block
intentional cases where someone wants to legitimately scan a directory
named as a number.  This seems a rare case, though, and can be worked
around by not specifying the directory name directly after a '-v'
argument (or not using '-v').

An error message is printed to indicate the ambiguous argument, and the
code can be disabled by changing the default global boolean.
Eventually, this safeguard could be removed, when it's safe to assume
that users have transitioned to the new style verbosity arguments.
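A minimal sketch of such a safeguard (the function name and structure are illustrative, not the program's actual code):

```python
import os

def ambiguous_verbose_arg(args):
    """Return an argument that is both a number following '-v' and an
    existing directory name, or None if the arguments look unambiguous."""
    for i, arg in enumerate(args[:-1]):
        nxt = args[i + 1]
        if arg == "-v" and nxt.isdigit() and os.path.isdir(nxt):
            return nxt
    return None
```

When this returns a value, the program can print an error naming the ambiguous argument instead of silently scanning the directory.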
Addresses issue GH akaihola#1 and incorporates @wolfospealain's verbosity changes.
Changes the internal variable name to the more appropriate 'verbosity'.

Adds a safeguard against certain rare cases where an old-style verbose
numeric argument might be misinterpreted as an existing directory.
Should help better justify the transition to the new style.
Memory can become a problem when scanning a large number of files. Added file-size and max-file-size options to check only files within a given minimum, maximum, or range of sizes. It is ultimately more important to check all identical files in a directory than to check all files and run short of memory. It also provides a quick way to save space by hardlinking only the larger files.
A more specific name, and consistent with options.max_file_size
Separate from the regular file check, allowing further tests without a
long boolean chain, and use the stat_info attributes for consistency
with other parts of the code.
Simple tests to specify file sizes out of range of the test data, and
ensure that no changes were made (similar to a dry-run).
Adds the file size options implemented by @wolfospealain, with argument
range checking and test cases.
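The size filter can be sketched as a small predicate over the stat info (parameter names are illustrative; the actual options are the file-size and max-file-size arguments described above):

```python
def size_in_range(stat_info, min_size=0, max_size=None):
    """True if the file's size falls within [min_size, max_size]."""
    size = stat_info.st_size
    if size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    return True
```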
With samename option enabled, possible matching hardlinks could be
skipped depending on the directory iteration order and existing hardlink
status of files.  This was causing actual failure of
test_hardlink_tree_filenames_equal() on some systems.
Adds a test in case dir2 is found before dir1.  With the filenames-equal
option enabled, the order of iteration could cause a false negative (i.e.
an improperly passed test).  By changing the initial link relationship,
we can cause the test to fail if the directory iteration order is
reversed.
Fixes directory order dependency for 'filenames-equal' option.

With samename option enabled, possible matching hardlinks could be
skipped depending on the directory iteration order and existing hardlink
status of files. This was causing actual failure of
test_hardlink_tree_filenames_equal() on some systems.

Changed to test if new file matches one in file_hashes only if filenames
also match.
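The corrected check can be sketched like this (function and names are illustrative, not the program's code): a cached file is a linking candidate under the samename option only when the basenames match.

```python
import os.path

def samename_candidate(cached_path, new_path, samename=True):
    """Under the samename option, only files whose basenames match may
    be considered for hardlinking."""
    if samename:
        return os.path.basename(cached_path) == os.path.basename(new_path)
    return True
```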
Add tearDown(), and keep track of all dirs and files created in the
setUp(), so that we can safely delete them during teardown.

Previously, the tempdir created in setUp() was being left behind and
polluting the tmp filesystem.
Hoist the directory/file exclusions and matching out of
hardlink_identical_files(), and group them all together in main().

This also sets up possible replacement of os.listdir() w/ os.scandir()
in Python 3, which should reduce the number of file stats(), and
potentially improve performance.
Other than the required "if True:" indentation marker, this patch only
removes the additional indentation whitespace left after hoisting the
file matching logic into the calling function.

A 'git diff --ignore-space-change' of this commit against its parent
should verify that it is only an indentation change.
The match option implies matching against filenames, but the actual
check was being done against full paths, which meant that to match a file
one always had to begin the match string with an asterisk; i.e. "*foo" was
required to match a file called "foo".  This adds a test to check for
that bug.
Move directory tree walking logic out of hardlink_identical_files() and
put it all in one place (under main()).  Create helper functions for the
file/directory exclusion and matching logic.

Fixes a bug with the previous file matching code performing the match
against the whole pathname, rather than just the filename (and adds a
test to demonstrate).
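The bug can be illustrated with fnmatch-style matching (a sketch, not the program's code): the pattern "foo" fails against a full path but succeeds against the basename.

```python
import fnmatch
import os.path

path = "/some/dir/foo"

# Matching against the whole path requires a leading "*" to succeed:
matches_full_path = fnmatch.fnmatch(path, "foo")
# Matching against just the filename behaves the way users expect:
matches_basename = fnmatch.fnmatch(os.path.basename(path), "foo")
```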
Reduces the amount of directory walking code to maintain, while
retaining Python 2 and 3 compatibility.  Some of the exclude/match code
that is now spread across multiple functions can be grouped closer
together.  In Python 3, os.walk() uses os.scandir() which may improve
performance over Python 2.
Replaces tree walk algorithm with Python's os.walk().  This gives us the
benefit of Python 2/3 compatibility while still taking advantage of
improvements in the tree walk implementation (ie. generators in Python 2
and os.scandir() in Python 3).  Also reduces the code we must maintain.
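An os.walk()-based traversal with in-place pruning of excluded directories can be sketched as follows (names are illustrative):

```python
import os

def walk_files(top, excluded_dirs=()):
    """Yield (dirpath, filename) pairs, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(top):
        # Pruning dirnames in place stops os.walk() from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            yield dirpath, name
```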
chadnetzer added 12 commits July 8, 2018 04:27
If the link succeeds, but the temporary file is not deleted, then we
could end up trying to link to it again (since it is equal in content to
the linked files).  The unlink() shouldn't fail, but if it does, we
shouldn't proceed.
If the link() succeeds, but the unlink() of the temporary file fails,
log a message and don't continue, since the program may attempt to
hardlink the temporary file (possibly endlessly).

Removing the temp file should almost certainly succeed anyway, so this
is already a panic situation, just made more explicit.
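The sequence being described can be sketched as a minimal model (assuming a rename-aside approach; names and structure are illustrative, not the program's code):

```python
import os

def hardlink_file(source, dest, tmp_suffix=".hardlink_tmp"):
    """Replace dest with a hardlink to source, via a temporary rename."""
    tmp = dest + tmp_suffix
    os.rename(dest, tmp)        # move dest aside
    try:
        os.link(source, dest)   # put a link to source in dest's place
    except OSError:
        os.rename(tmp, dest)    # restore dest on failure
        raise
    # The unlink must succeed: a lingering temp file is content-equal to
    # source and could be picked up (endlessly) as a link candidate.
    os.unlink(tmp)
    return True
```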
If the program verbosity is left at zero, and the program is doing a lot
of comparisons, it's helpful to see a dry-run message at startup to know
that nothing bad is happening.
This stops the dry-run message from printing during testing.  The normal
unittest module doesn't seem to easily allow toggling the buffering back
on at the command line, which is a shame.  Nose may allow it; I haven't
checked.
Print dry-run message on program start
By specifying --debug, print stats on internal software statistics, such
as how often the hashes match, and how many linear search iterations
occur.
Track some additional statistics for debugging
The MAX_HASHES size limit is likely an artifact of the program's origins
in C/C++, where a fixed size hash table may have been used (C++ had no
standard hash table for the longest time).  With Python, there is no
need to limit the hash value range, as a dynamic dictionary mapping is
being used to store the file hash values.

Limiting the hash value range only increases the probability of
overlapping hash values, which lengthen the lists for storing pathnames
of files with equivalent hash values.  Since these must be searched
linearly, there is a net slowdown.  Space savings are moderate or
non-existent, since we store all the key values anyway.
The hash values are used as dictionary keys, so they don't need to be
limited in range by a modulus.
Rather than re-extracting the filename from the pathname (per file), just
pass along the provided filename.
os.walk() provides the filename, avoiding the need for basename.
Accepting stat_info and options gives more freedom as to how the hash
could be calculated.  What is called the "hash" is really a "key", and
is itself hashed by Python when used as a dict key.
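A sketch of such a key function (the exact stat fields used here are illustrative; the real code may consult other options as well):

```python
def file_hash_key(stat_info, ignore_time=False):
    """Build a dict key ("hash") from stat fields; Python hashes the
    tuple itself when it is used as a dictionary key."""
    if ignore_time:
        return (stat_info.st_size,)
    return (stat_info.st_size, stat_info.st_mtime)

file_hashes = {}  # maps key -> list of pathnames sharing that key
```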
Adds tests that ensure that files with different sizes don't get linked,
and also that max_size restrictions allow smaller files to become
linked.
Also perform simultaneous min-size/max-size testing (i.e. both args at
once).
Update testing with different sized data, and verification of
min-size/max-size options in different combinations and circumstances.
chadnetzer added 10 commits July 8, 2018 18:25
In order to capture the exception argument before Python 2.6, in a way
that is also compatible with Python 3, we use sys.exc_info(), rather
than "except Exception as".
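The idiom looks like this (a sketch; the handler body is illustrative):

```python
import os
import sys

def try_unlink(path):
    """Unlink a path, capturing the exception portably across
    Python 2.3+ and Python 3 via sys.exc_info()."""
    try:
        os.unlink(path)
    except OSError:
        # "except OSError as e" is a syntax error before Python 2.6;
        # sys.exc_info()[1] retrieves the exception on any version.
        error = sys.exc_info()[1]
        return "unlink failed: %s" % error
    return "ok"
```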
The logging package was added in Python 2.3.

Python 2.3 support tested on RHEL/Centos 5.9
The basicConfig() call only started taking keyword arguments in Python 2.4.
There are significant changes to the 2.7 unittest module, so we don't
bother with supporting older versions.
After testing on Python 2.3, include changes to exception handling and
logging to allow the program to still run on Python 2.3+ systems.
This just clarifies that we should attempt to update to the newest inode
time only when ignoring the time value in our hash calculations,
otherwise the hash in file_hashes would become outdated.
Code clarification as to when inode time/ownership can be updated.
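The guard can be sketched as follows (names are illustrative, not the program's code): updating the kept inode's timestamp is only safe when time is excluded from the hash key, otherwise the key stored in file_hashes would no longer describe the inode.

```python
import os

def maybe_update_times(kept_path, other_stat, ignore_time):
    """Advance kept_path's times to the newer file's, but only when
    time is not part of the hash key."""
    if not ignore_time:
        return False  # time is in the hash key; don't invalidate it
    kept_stat = os.stat(kept_path)
    if other_stat.st_mtime > kept_stat.st_mtime:
        os.utime(kept_path, (other_stat.st_atime, other_stat.st_mtime))
        return True
    return False
```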
The change that causes linking to always use the file with the greater
nlink as the source doesn't work, because it can orphan already-seen
files; since only a single pass is made, those files aren't considered
again for linking to the larger cluster.  The problem can be masked with
two clusters, since there are enough remaining files to allow the
unlinked but already-seen files to be relinked.  These tests demonstrate
the issue by using 3 clusters.
Go back to always using cached pathname as link() source.

This reverts commit 73d2764.
This reverts commit 5e9d961.
This reverts commit 591be1c.

Since this is a single pass algorithm, the link() source should always
be a file that was previously seen and is in the file_hashes cache.  In
this way, the cached file's inode count can only increase, and the
algorithm can link to all equal files (at the cost of doing more link()
calls.)

By reversing the link order when the newly discovered file had a larger
nlink value, files that had already been processed could become unlinked
from the new cluster, and possibly never re-linked to it.  The test case
added previously demonstrates this effect.
The "fix" for the "clustering issue" doesn't work, as it allows files
that have already been processed by the one-pass algorithm to become
unlinked from the larger cluster.  So in cases where all equal files could
be hardlinked together by the original algorithm, the clustering "fix"
would leave stragglers.  Test case added to demonstrate this bug.
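The invariant can be shown with a toy model of the single pass (purely schematic, not the program's code): linking every newly seen file to the cached pathname means the cached inode's link count only ever grows.

```python
def plan_links(paths):
    """Return (source, target) link operations for one pass over equal files."""
    cached = None
    ops = []
    for path in paths:
        if cached is None:
            cached = path               # first file seen becomes the source
        else:
            ops.append((cached, path))  # link(cached, path)
    return ops
```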