Hardlinks support by rezib · Pull Request #633 · hpc/mpifileutils · GitHub

Hardlinks support #633


Open
rezib wants to merge 9 commits into base: main

rezib (Contributor) commented Mar 21, 2025

Dear mpifileutils developers,

This is my proposal to add support for hardlinks in several mpifileutils commands: dwalk, dcp, dcmp, dsync and dtar.

During a tree walk with details, regular files with more than one link (nlink > 1) are temporarily placed in a hardlinks flist. This flist is then globally sorted by name and ranked to select one reference path per inode, and all other paths to that inode are flagged as hardlinks. The sorted hardlinks flist is finally merged into the global flist with all other items. Sorting by path name ensures reproducibility between two similar trees, which minimizes the differences eventually reported by dcmp and dsync.
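The selection of a reference path can be illustrated with a small serial sketch. It is not the code from this pull request, which works on distributed flists: it only shows the idea of grouping multiply-linked regular files by inode and sorting their paths. Picking the first path in name order as the reference is an assumption, and `resolve_hardlinks` is a hypothetical name.

```python
import os
import stat
from collections import defaultdict


def resolve_hardlinks(root):
    """Map multiply-linked regular files under root to a reference path.

    Returns {relative_path: ref}. ref is None for the reference path and
    the reference path for every other link to the same inode.
    """
    by_inode = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Only regular files with more than one link are candidates.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                by_inode[key].append(os.path.relpath(path, root))

    result = {}
    for paths in by_inode.values():
        paths.sort()  # name ordering makes the choice reproducible
        ref = paths[0]  # assumption: lowest name becomes the reference
        result[ref] = None
        for other in paths[1:]:
            result[other] = ref
    return result
```

Because the choice depends only on path names, two trees with the same layout yield the same reference paths regardless of the order in which the walk visits entries.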

Note

You may find more implementation details in the respective commit messages.

The pull request introduces a cache format v5 to support encoding of file nlink values and hardlink reference paths.

This pull request also includes a functional test suite that relies on Python's standard unittest library. This suite is designed to be easy to execute:

  • Set two environment variables to define, respectively, the path to the mpifileutils binaries and the arguments provided to mpirun, e.g.:
$ export MFU_BIN=~/dev/bin
$ export MFU_MPIRUN_ARGS="--bind-to none --oversubscribe -N 4"
  • And run all the tests:
$ python3 -m unittest discover -v test
  • Or:
$ pytest  # requires pytest
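As an illustration of how a test helper might consume these two variables, here is a minimal sketch. The function `mfu_command` is hypothetical and is not taken from the test suite in this pull request. It only assumes that `MFU_BIN` points to the directory containing the binaries and that `MFU_MPIRUN_ARGS` is a shell-style argument string.

```python
import os
import shlex


def mfu_command(tool, *args, environ=None):
    """Build an mpirun command line for one mpifileutils tool.

    Hypothetical helper: reads MFU_BIN and MFU_MPIRUN_ARGS from environ
    (os.environ by default) and returns an argv list for subprocess.run().
    """
    env = os.environ if environ is None else environ
    # Split the argument string the same way a POSIX shell would.
    mpirun_args = shlex.split(env.get("MFU_MPIRUN_ARGS", ""))
    binary = os.path.join(os.path.expanduser(env["MFU_BIN"]), tool)
    return ["mpirun", *mpirun_args, binary, *args]
```

A test could then pass the returned list to `subprocess.run(..., check=True)` and inspect the resulting tree.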

It is also designed to be easy to integrate in continuous integration systems. The pull request even provides a GitHub Actions workflow to execute this test suite on every pull request and merge into the main branch (example run).

For the record, this test suite has already helped detect and fix the following bugs:

Please let me know what you think! I can also remove the tests and GitHub actions workflow if you don't like the technical approach.

Important

Note this feature does not work properly without this fix for a bug in DTCMP: LLNL/dtcmp#20

Important

There is one limitation with dcp/dsync --dereference when a symlink points to a path with more than one link. In this specific case, mpifileutils considers the symlink as one additional path to the same inode and creates one more hardlink to this inode in the destination directory. For reference, this case is covered by the test test_dsync_symlink_dereference_target_nlinks.

Note

I would like to emphasize that this work is sponsored by @cea-hpc.

fix #417 #336

rezib added 9 commits March 21, 2025 16:37
During tree walk with details, regular files with more than one nlink
are temporarily placed in a hardlinks flist. This flist is then globally
ordered by names and ranked to select one reference path per inode, and
flag all other paths to this inode as hardlinks. The sorted hardlinks
flist is finally merged into the global flist with all other items.

The path name ordering is performed to ensure reproducibility between
two similar trees, thus minimizing the differences eventually reported
by dcmp and dsync.

This commit introduces a new structure inodes_hardlink_map_t used to
temporarily associate paths to inodes in the reference/hardlink
resolution logic.

The type elem_t receives 2 new members: nlink, the number of links on an
inode, and ref, the reference path to this inode. The ref is NULL except
on hardlinks.

This commit also introduces a new filetype MFU_TYPE_HARDLINK, which is
used to distinguish hardlinks to inodes from reference paths which have
MFU_TYPE_FILE type.

The packed flist element now contains the filetype, even when details
are enabled, as there is no way to determine if an element is a regular
file or a hardlink based on the stat result.

New functions mfu_[un]pack_sized_str() are introduced to manage packing
and unpacking of optional strings with maximum length.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
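To make the idea of packing an optional string with a maximum length concrete, here is a rough sketch. The byte layout is an assumption chosen for illustration (a fixed-width, NUL-padded field where an all-NUL field means "no string"). It is not the layout used by mfu_[un]pack_sized_str(), and the Python function names are hypothetical.

```python
def pack_sized_str(value, max_len):
    """Encode an optional string into exactly max_len bytes, NUL padded.

    Assumed layout: None is stored as an all-NUL field.
    """
    data = b"" if value is None else value.encode("utf-8")
    # Keep at least one trailing NUL so the field is always terminated.
    if len(data) >= max_len:
        raise ValueError("string does not fit in %d bytes" % max_len)
    return data.ljust(max_len, b"\0")


def unpack_sized_str(buf, max_len, offset=0):
    """Decode one field; returns (value_or_None, offset_after_field)."""
    field = bytes(buf[offset:offset + max_len])
    data = field.split(b"\0", 1)[0]
    value = data.decode("utf-8") if data else None
    return value, offset + max_len
```

A fixed-width field keeps every packed element the same size, at the cost of not distinguishing an empty string from a missing one.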
Add support for hardlinks in dcp.

This commit renames existing functions mfu_create_hardlink[s]() to
mfu_create_hardlink[s]_dest() to reflect their purpose, which is
related to the --link-dest option.

Two new functions mfu_create_hardlink[s]() are introduced to create all
hardlinks in destination directory with the appropriate link
destination. The summary at the end of copy is modified to mention
hardlinks operations.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
Add support for hardlinks in dcmp. The reference paths of hardlinks in
source and destination are compared. If not equal, strmap is updated to
flag them as different.

The branches in the item comparison logic are now based on the filetype
recorded in flist rather than the file mode, as there is no way to
distinguish reference paths and hardlinks with just the mode: both are
regular files.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
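The comparison described in the dcmp commit above can be sketched as follows. This is a simplified model rather than the actual dcmp code: each list is represented as a dictionary from relative path to a `(filetype, ref)` tuple, and the names `FILE`, `HARDLINK` and `compare_items` are hypothetical.

```python
FILE = "file"          # stands in for MFU_TYPE_FILE
HARDLINK = "hardlink"  # stands in for MFU_TYPE_HARDLINK


def compare_items(src, dst):
    """Return the sorted paths present in both lists that differ.

    src and dst map a relative path to (filetype, ref), where ref is the
    reference path for hardlinks and None otherwise.
    """
    different = set()
    for path in src.keys() & dst.keys():
        src_type, src_ref = src[path]
        dst_type, dst_ref = dst[path]
        if src_type != dst_type:
            # The recorded filetype differs even though both entries
            # would show up as regular files in their stat mode.
            different.add(path)
        elif src_type == HARDLINK and src_ref != dst_ref:
            different.add(path)
    return sorted(different)
```

Branching on the recorded type is what lets `d` in the usage below be flagged: a reference path on one side and a hardlink on the other are indistinguishable by mode alone.

```python
src = {"a": (FILE, None), "b": (HARDLINK, "a"), "c": (HARDLINK, "a"), "d": (FILE, None)}
dst = {"a": (FILE, None), "b": (HARDLINK, "a"), "c": (HARDLINK, "d"), "d": (HARDLINK, "a")}
print(compare_items(src, dst))  # ['c', 'd']
```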
Add support for hardlinks in dsync. The reference paths of hardlinks in
source and destination are compared. If not equal, strmap is updated to
flag them as different.

The branches in the item comparison logic are now based on the filetype
recorded in flist rather than the file mode, as there is no way to
distinguish reference paths and hardlinks with just the mode: both are
regular files.

Additional logic is added with the
dsync_remove_hardlinks_with_removed_ref() function to detect hardlinks
whose reference paths are marked for deletion in destination. In this
case, all the hardlinks pointing to this reference are also marked for
replacement to avoid residual links pointing to wrong inodes.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
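The extra dsync step described in the commit above can be illustrated with a short sketch. It is a simplified model under assumed data structures (a dictionary of destination hardlinks and a set of paths marked for deletion), not the body of dsync_remove_hardlinks_with_removed_ref(). The name `hardlinks_to_replace` is hypothetical.

```python
def hardlinks_to_replace(dst_hardlinks, removed):
    """Return destination hardlinks whose reference path is being removed.

    dst_hardlinks maps each hardlink path to its reference path.
    removed is the set of destination paths already marked for deletion.
    """
    return sorted(
        path for path, ref in dst_hardlinks.items() if ref in removed
    )
```

For example, if `a` is marked for deletion and `b` and `c` are hardlinks to `a`, both must be recreated, otherwise they would keep pointing to the old inode:

```python
print(hardlinks_to_replace({"b": "a", "c": "a", "e": "d"}, {"a"}))  # ['b', 'c']
```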
Add support for hardlinks in dtar, in all supported create and extract
algorithms.

A new structure entry_list_t is introduced. It is used in some extract
algorithms to fill a temporary list of hardlink entries to create in
a second pass, after all other files are created.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
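The two-pass extraction idea from the dtar commit above can be sketched with Python's standard tarfile module. This only illustrates the ordering problem (a hardlink entry can only be created once its target exists). It is not how dtar is implemented, it restores no metadata, and it trusts the member names in the archive. The function name `extract_two_pass` is hypothetical.

```python
import os
import shutil
import tarfile


def extract_two_pass(archive, dest):
    """Extract directories and regular files first, hardlinks afterwards.

    Returns the number of hardlinks created in the second pass.
    """
    deferred = []  # (link path, target path) pairs for the second pass
    with tarfile.open(archive) as tar:
        for member in tar:
            target = os.path.join(dest, member.name)
            if member.islnk():
                # Hardlink entry: remember it, create it later.
                source = os.path.join(dest, member.linkname)
                deferred.append((target, source))
            elif member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isreg():
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
    # Second pass: every link target has been written by now.
    for target, source in deferred:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(source, target)
    return len(deferred)
```

Deferring the links matters most when entries are extracted out of order or in parallel, where a hardlink entry may be reached before the file it refers to.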
Introduce cache format v5 which supports hardlinks encoding with nlink
and reference paths.

New read_cache_v5() is basically similar to read_cache_v4() except the
calls to list_elem_pack_size[_le4]() and list_insert_ptr[_le4]().

Signed-off-by: Rémi Palancher <remi@rackslab.io>
When dcp reads its input list from cache, place files with more than one
link in a temporary list and resolve hardlinks, similarly to the logic
implemented in walk with details.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
This commit adds many functional tests of dcmp, dcp, dsync, dtar and
dwalk, executed and automatically validated with Python's standard
unittest library. This is designed to be easy to execute and integrate
in continuous integration systems.

Set two environment variables to define, respectively, the path to the
mpifileutils binaries and the arguments provided to mpirun, e.g.:

  $ export MFU_BIN=~/dev/bin
  $ export MFU_MPIRUN_ARGS="--bind-to none --oversubscribe -N 4"

And run all the tests:

  $ python3 -m unittest discover -v test

Or:

  $ pytest  # requires pytest

The suite has utilities to check similarity between two trees, with the
possibility to specify paths and attributes (e.g. mtime). It is also
possible to assert specific command outputs.

Most tests are run against a specific testing file tree to cover many
cases. Other tests are run with a file tree generated by dfilemaker.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
Add continuous integration workflow to build and install lwgrp,
libcircle, dtcmp and mpifileutils and execute the Python test suite in
GitHub Actions for all pull requests and merges into the main branch.

Signed-off-by: Rémi Palancher <remi@rackslab.io>
carbonneau1 (Collaborator) commented

Going through your changes.

rezib (Contributor, Author) commented Jun 13, 2025

Important

There is one limitation with dcp/dsync --dereference when a symlink points to a path with more than one link. In this specific case, mpifileutils considers the symlink as one additional path to the same inode and creates one more hardlink to this inode in the destination directory. For reference, this case is covered by the test test_dsync_symlink_dereference_target_nlinks.

FWIW, I checked the behavior of GNU cp and rsync with similar options, and they produce the same results.

For reference, here is my setup-test.sh script for this test:

#!/bin/sh

# Purge previous testing data
if [ -d orig ]; then
  rm orig/*
  rmdir orig
fi

if [ -d dest ]; then
  rm dest/*
  rmdir dest
fi

# Create fresh testing data
mkdir orig
echo foo > orig/foo
ln orig/foo orig/bar
ln -s foo orig/baz

#cp -R --dereference --preserve=all orig dest
#rsync -a --copy-links --hard-links orig/ dest
  • With GNU cp:
$ sh setup-test.sh
$ cp -R --dereference --preserve=all orig dest && ls -lirn orig dest
orig:
total 8
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:53 foo
28201168 lrwxrwxrwx 1 1000 1000 3 Jun 13 16:53 baz -> foo
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:53 bar

dest:
total 12
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:53 foo
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:53 baz
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:53 bar
  • With rsync:
$ sh setup-test.sh
$ rsync -a --copy-links --hard-links orig/ dest && ls -lirn orig dest
orig:
total 8
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:54 foo
28201168 lrwxrwxrwx 1 1000 1000 3 Jun 13 16:54 baz -> foo
28201162 -rw-rw-r-- 2 1000 1000 4 Jun 13 16:54 bar

dest:
total 12
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:54 foo
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:54 baz
28201217 -rw-rw-r-- 3 1000 1000 4 Jun 13 16:54 bar

In all cases, the dereferenced symlink baz is transformed into an additional hardlink to the inode pointed to by foo and bar.
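The reason a dereferencing copy sees baz as one more link can be reproduced without any copy tool. The sketch below rebuilds the same orig tree in a temporary directory and groups the entries by the inode that stat reports, first without and then with symlink dereferencing. The helper names are hypothetical, and this only models what a tool observes when it follows symlinks, not the mpifileutils code path.

```python
import os
import tempfile


def paths_per_inode(directory, follow_symlinks):
    """Group entry names in directory by the (device, inode) stat reports."""
    groups = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        st = os.stat(path, follow_symlinks=follow_symlinks)
        groups.setdefault((st.st_dev, st.st_ino), []).append(name)
    return sorted(groups.values())


def demo():
    """Build orig/{foo,bar,baz} and return both groupings."""
    with tempfile.TemporaryDirectory() as tmp:
        orig = os.path.join(tmp, "orig")
        os.mkdir(orig)
        with open(os.path.join(orig, "foo"), "w") as f:
            f.write("foo\n")
        os.link(os.path.join(orig, "foo"), os.path.join(orig, "bar"))
        os.symlink("foo", os.path.join(orig, "baz"))
        return (
            paths_per_inode(orig, follow_symlinks=False),
            paths_per_inode(orig, follow_symlinks=True),
        )


if __name__ == "__main__":
    no_follow, follow = demo()
    print(no_follow)  # [['bar', 'foo'], ['baz']]
    print(follow)     # [['bar', 'baz', 'foo']]
```

Once symlinks are followed, baz reports the same inode as foo and bar, so a tool that preserves hardlinks has no way to tell it apart from a third link.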

Successfully merging this pull request may close these issues.

dsync and probably dcp doesn't handle hardlinks