[RFC] src/aiori-CEPHFS: New libcephfs backend by markhpc · Pull Request #217 · hpc/ior

[RFC] src/aiori-CEPHFS: New libcephfs backend #217


Merged
merged 1 commit into hpc:master from markhpc:wip-aiori-cephfs on Mar 10, 2020

Conversation

@markhpc (Contributor) commented Mar 10, 2020

This is a new aiori backend using libcephfs, loosely based on the existing POSIX and RADOS backends. It also borrows the "prefix" concept from the DFS backend, pointing at an existing POSIX mount point (necessary for ior/mdtest to function properly even when using a library for direct filesystem access). A slight change to libcephfs.h is needed for IOR to compile properly (this does not appear to be necessary for C++ clients using libcephfs, however):

#include <sys/time.h>
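As for the borrowed prefix concept: ior and mdtest still generate paths under an existing POSIX mount point, and the backend maps each of those paths onto the CephFS namespace before calling into libcephfs. A minimal sketch of that mapping, where the helper name and exact behavior are assumptions rather than code from this patch:

#include <string.h>

/* Hypothetical helper: turn an ior/mdtest path such as
 * "/mnt/cephfs/ior-test/file.0" (prefix "/mnt/cephfs") into the
 * path "ior-test/file.0" relative to the libcephfs mount root. */
static const char *strip_mount_prefix(const char *path, const char *prefix)
{
        size_t len = strlen(prefix);
        if (strncmp(path, prefix, len) == 0) {
                path += len;
                while (*path == '/')    /* skip the separator after the prefix */
                        path++;
        }
        return path;
}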

IO500 tests on a 10-node in-house test cluster with 2X replication and co-located clients appeared to function properly, with scores similar to (though much better for sequential reads than) the POSIX backend on kernel-based CephFS mount points. In the following results, the mdtest easy directories are round-robin pinned across MDS ranks prior to the test, though in the near future Ceph will do ephemeral pinning across MDSes automatically via a single top-level xattr.

[RESULT] BW   phase 1            ior_easy_write               30.328 GB/s : time 630.68 seconds
[RESULT] IOPS phase 1         mdtest_easy_write              240.573 kiops : time 374.11 seconds
[RESULT] BW   phase 2            ior_hard_write                7.225 GB/s : time 525.99 seconds
[RESULT] IOPS phase 2         mdtest_hard_write               23.795 kiops : time 516.84 seconds
[RESULT] IOPS phase 3                      find              574.220 kiops : time 178.15 seconds
[RESULT] BW   phase 3             ior_easy_read               79.416 GB/s : time 240.79 seconds
[RESULT] IOPS phase 4          mdtest_easy_stat             1057.850 kiops : time  85.08 seconds
[RESULT] BW   phase 4             ior_hard_read               24.591 GB/s : time 154.39 seconds
[RESULT] IOPS phase 5          mdtest_hard_stat              100.794 kiops : time 122.02 seconds
[RESULT] IOPS phase 6        mdtest_easy_delete              191.729 kiops : time 469.41 seconds
[RESULT] IOPS phase 7          mdtest_hard_read               56.874 kiops : time 216.24 seconds
[RESULT] IOPS phase 8        mdtest_hard_delete               14.824 kiops : time 831.99 seconds
[SCORE] Bandwidth 25.5761 GB/s : IOPS 124.21 kiops : TOTAL 56.3632

2020-03-06-RedHatLibCephFS-10-30.zip
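For reference, round-robin pinning of the mdtest-easy directories means assigning each per-rank directory to a fixed MDS rank via the ceph.dir.pin virtual xattr (the runs here may simply have used setfattr on a kernel mount; the helper below is a hypothetical libcephfs-based sketch):

#include <cephfs/libcephfs.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pin one directory to a specific MDS rank by
 * setting the ceph.dir.pin vxattr (the rank is passed as a string).
 * The upcoming ephemeral pinning mentioned above would instead set
 * ceph.dir.pin.distributed once on the single top-level directory. */
static int pin_dir_to_mds(struct ceph_mount_info *cmount,
                          const char *dir, int mds_rank)
{
        char rank[16];
        snprintf(rank, sizeof(rank), "%d", mds_rank);
        return ceph_setxattr(cmount, dir, "ceph.dir.pin",
                             rank, strlen(rank), 0);
}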

Generally, lower scores in unaligned reads/writes and the build-up time for dynamic subtree partitioning in the ior and mdtest hard test cases held us back (we actually see higher scores with longer run times!). Given how the scores are calculated, these will be prime targets for future optimization.
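For context on why those cases weigh so heavily: the IO500 bandwidth score is the geometric mean of the four ior phases, the IOPS score is the geometric mean of the eight metadata phases (including find), and the total is the square root of their product, so a single slow phase such as mdtest_hard_delete (~14.8 kiops) drags the whole score down. With the numbers above:

\[
\mathrm{BW} = \Big(\prod_{i=1}^{4} \mathrm{bw}_i\Big)^{1/4} \approx 25.58\ \mathrm{GB/s},\qquad
\mathrm{IOPS} = \Big(\prod_{j=1}^{8} \mathrm{iops}_j\Big)^{1/8} \approx 124.2\ \mathrm{kiops},
\]
\[
\mathrm{TOTAL} = \sqrt{\mathrm{BW}\times\mathrm{IOPS}} \approx \sqrt{25.58\times 124.2} \approx 56.36 .
\]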

Signed-off-by: Mark Nelson <mnelson@redhat.com>

@JulianKunkel (Collaborator) left a comment

Great patch.
Delightful to see that you tested it with the IO500 benchmark.

.prefix = NULL,
};

static option_help options [] = {

@JulianKunkel (Collaborator):
great that you use the new options ^-^
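For readers unfamiliar with the options plugin: each backend can expose its own flags through a static option_help table that IOR's generic option parser walks. A rough sketch of what the CephFS table might look like, in the style of the RADOS/DFS backends (the specific option names and help strings here are assumptions, not copied from the patch):

#include "option.h"   /* option_help, OPTION_OPTIONAL_ARGUMENT, LAST_OPTION */

/* Hypothetical per-backend option state, initialised with defaults. */
static struct {
        char *user;    /* cephx user id */
        char *conf;    /* path to ceph.conf */
        char *prefix;  /* POSIX mount prefix to strip from paths */
} o = {
        .user   = NULL,
        .conf   = NULL,
        .prefix = NULL,
};

static option_help options [] = {
        {0, "cephfs.user",   "Username for the Ceph cluster",    OPTION_OPTIONAL_ARGUMENT, 's', &o.user},
        {0, "cephfs.conf",   "Config file for the Ceph cluster", OPTION_OPTIONAL_ARGUMENT, 's', &o.conf},
        {0, "cephfs.prefix", "Mount prefix stripped from paths", OPTION_OPTIONAL_ARGUMENT, 's', &o.prefix},
        LAST_OPTION
};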

"cannot total data moved");
if (tmpMin != tmpMax) {
if (rank == 0) {
WARN("inconsistent file size by different tasks");

@JulianKunkel (Collaborator):
Nice check, albeit it costs a little performance.
Since it is assumed to be a collective operation, it may lead to unexpected behavior (in terms of AIORI semantics) if not all processes invoke the same function. I'm not 100% sure we should generally allow that, but I'm also not too worried, given how IO500 runs...
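For context, the check under discussion follows the pattern used by IOR's other GetFileSize implementations: every rank stats the file, then two reductions compare the minimum and maximum sizes, which is why all processes must reach the call. A condensed sketch using the usual IOR variable names (aggFileSizeFromStat, testComm, rank), not the exact patch contents:

#include <mpi.h>
#include "ior.h"        /* IOR_offset_t, MPI_CHECK, WARN, testComm, rank
                           (IOR internals; exact headers may differ) */

/* Sketch: verify that every rank observed the same file size (collective). */
static IOR_offset_t check_file_size(IOR_offset_t aggFileSizeFromStat)
{
        IOR_offset_t tmpMin, tmpMax;

        /* Both reductions are collective over testComm: every rank must call this. */
        MPI_CHECK(MPI_Allreduce(&aggFileSizeFromStat, &tmpMin, 1,
                                MPI_LONG_LONG_INT, MPI_MIN, testComm),
                  "cannot total data moved");
        MPI_CHECK(MPI_Allreduce(&aggFileSizeFromStat, &tmpMax, 1,
                                MPI_LONG_LONG_INT, MPI_MAX, testComm),
                  "cannot total data moved");

        if (tmpMin != tmpMax) {
                if (rank == 0)
                        WARN("inconsistent file size by different tasks");
                /* conservatively report the smallest size any rank saw */
                aggFileSizeFromStat = tmpMin;
        }
        return aggFileSizeFromStat;
}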

@@ -197,6 +197,20 @@ AM_COND_IF([USE_RADOS_AIORI],[
AC_DEFINE([USE_RADOS_AIORI], [], [Build RADOS backend AIORI])
])

# CEPHFS support
AC_ARG_WITH([cephfs],

@JulianKunkel (Collaborator):
In the future, the code might benefit from automatic detection of the include/library files. It is reasonable to keep it as it is for now, though.
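A hedged sketch of what that detection could look like in configure.ac, using the standard Autoconf probes and the public libcephfs entry point ceph_mount (the USE_CEPHFS_AIORI symbol is assumed, following the RADOS example above; this is a suggestion, not what the patch does):

# CEPHFS support (sketch): probe for the header and library instead of
# relying only on paths supplied via --with-cephfs.
AC_ARG_WITH([cephfs],
        [AS_HELP_STRING([--with-cephfs], [support IO with libcephfs backend @<:@default=no@:>@])],
        [], [with_cephfs=no])
AS_IF([test "x$with_cephfs" != xno], [
        AC_CHECK_HEADERS([cephfs/libcephfs.h], [],
                [AC_MSG_ERROR([cephfs/libcephfs.h not found; install the libcephfs development package])])
        AC_CHECK_LIB([cephfs], [ceph_mount], [],
                [AC_MSG_ERROR([libcephfs not found or not usable])])
        AC_DEFINE([USE_CEPHFS_AIORI], [], [Build CEPHFS backend AIORI])
])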

@glennklockwood merged commit 657ff8a into hpc:master on Mar 10, 2020
@markhpc deleted the wip-aiori-cephfs branch on March 10, 2020 20:22