Git clone fails with large repositories · Issue #334 · moosefs/moosefs · GitHub

Git clone fails with large repositories #334

Closed

jSML4ThWwBID69YC opened this issue Jan 28, 2020 · 19 comments

@jSML4ThWwBID69YC

Have you read through available documentation and open Github issues?

Yes

Is this a BUG report, FEATURE request, or a QUESTION? Who is the intended audience?

BUG report

System information

FreeBSD 12.1
MooseFS 3.0.109 installed through ports.

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.

1x master server
4x chunk server
2x 1GbE LACP network.

All servers are dedicated hardware. The underlying file system is ZFS.

How much data is tracked by moosefs master (order of magnitude)?

None. This is a testing lab, not production.

Describe the problem you observed.

Git clone fails on large repositories. Using git version 2.25.0.

Can you reproduce it? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes

Tests showing output with different repositories.

1: Cloning MooseFS works
git clone https://github.com/moosefs/moosefs.git
Cloning into 'moosefs'...
remote: Enumerating objects: 696, done.
remote: Counting objects: 100% (696/696), done.
remote: Compressing objects: 100% (424/424), done.
remote: Total 9288 (delta 447), reused 413 (delta 263), pack-reused 8592
Receiving objects: 100% (9288/9288), 4.95 MiB | 10.50 MiB/s, done.
Resolving deltas: 100% (7811/7811), done.
Updating files: 100% (429/429), done.

2: Cloning FreeBSD fails
git clone https://github.com/freebsd/freebsd.git
Cloning into 'freebsd'...
remote: Enumerating objects: 190, done.
remote: Counting objects: 100% (190/190), done.
remote: Compressing objects: 100% (151/151), done.
remote: Total 3834965 (delta 59), reused 75 (delta 35), pack-reused 3834775
Receiving objects: 100% (3834965/3834965), 2.35 GiB | 19.43 MiB/s, done.
fatal: premature end of pack file, 96 bytes missing
fatal: index-pack failed

3: Cloning Linux fails
git clone https://github.com/torvalds/linux.git
Cloning into 'linux'...
remote: Enumerating objects: 7131368, done.
remote: Total 7131368 (delta 0), reused 0 (delta 0), pack-reused 7131368
Receiving objects: 100% (7131368/7131368), 2.65 GiB | 21.43 MiB/s, done.
fatal: premature end of pack file, 1011 bytes missing
fatal: index-pack failed

Tests 2 and 3 fail at the 'Resolving deltas' stage. A possibly related bug: s3fs-fuse/s3fs-fuse#839

@borkd
Collaborator
borkd commented Jan 28, 2020

Out of curiosity - how much free memory do you have on the client? (Re)packing tends to be greedy and can fail for larger repos on memory-constrained systems.

@jSML4ThWwBID69YC
Author

The system I'm testing on had 110GB free at the time. I can clone those repositories onto a normal disk. It's only on the MooseFS mount that the bigger repositories fail.

@borkd
Collaborator
borkd commented Jan 28, 2020

No problems on a v4 mountpoint:

git clone https://github.com/freebsd/freebsd.git
Cloning into 'freebsd'...
remote: Enumerating objects: 404, done.
remote: Counting objects: 100% (404/404), done.
remote: Compressing objects: 100% (318/318), done.
remote: Total 3835115 (delta 149), reused 155 (delta 82), pack-reused 3834711
Receiving objects: 100% (3835115/3835115), 1.50 GiB | 7.56 MiB/s, done.
Resolving deltas: 100% (2661721/2661721), done.
Checking out files: 100% (79356/79356), done.

mfsdirinfo -h /mnt/mfs4/test/freebsd
/mnt/mfs4/test/freebsd:
 inodes:         85Ki
  directories:  7.5Ki
  files:         78Ki
 chunks:         77Ki
 length:       2.7GiB
 size:         7.6GiB
 realsize:      10GiB

Also no problems on a fresh v3.0.109 instance made for this test using default settings:

git clone https://github.com/torvalds/linux.git
Cloning into 'linux'...
remote: Enumerating objects: 116, done.
remote: Counting objects: 100% (116/116), done.
remote: Compressing objects: 100% (82/82), done.
remote: Total 7135009 (delta 71), reused 48 (delta 34), pack-reused 7134893
Receiving objects: 100% (7135009/7135009), 2.65 GiB | 15.39 MiB/s, done.
Resolving deltas: 100% (5906539/5906539), done.
Checking out files: 100% (66617/66617), done.

mfsdirinfo -h /mnt/mfs3/linux
/mnt/mfs3/linux:
 inodes:         69Ki
  directories:  4.3Ki
  files:         65Ki
 chunks:         65Ki
 length:       3.7GiB
 size:         7.7GiB
 realsize:      15GiB

Cloning started when the PoC had only one chunkserver; it was then expanded to two chunkservers (ZFS-backed). The only indication of unhappiness is the reference to

syslog(LOG_NOTICE,"chunk %016"PRIX64"_%08"PRIX32": can't replicate chunk - locked to: %"PRIu32,c->chunkid,c->version,c->lockedto);
in the master's log.

loading metadata ...
loading sessions data ... ok (0.0000)
loading storage classes data ... ok (0.0000)
loading objects (files,directories,etc.) ... ok (0.0375)
loading names ... ok (0.0000)
loading deletion timestamps ... ok (0.0000)
loading quota definitions ... ok (0.0000)
loading xattr data ... ok (0.0000)
loading posix_acl data ... ok (0.0000)
loading open files data ... ok (0.0000)
loading flock_locks data ... ok (0.0000)
loading posix_locks data ... ok (0.0000)
loading chunkservers data ... ok (0.0000)
loading chunks data ... ok (0.0000)
checking filesystem consistency ... ok
connecting files and chunks ... ok
all inodes: 1
directory inodes: 1
file inodes: 0
chunks: 0
progress: current change: 0 (first:4 - last:0 - 100% - ETA:finished)
metadata file has been loaded
mfsmaster[51324]: config: using default value for option 'METADATA_SAVE_FREQ' - '1'
config: using default value for option 'METADATA_SAVE_FREQ' - '1'
mfsmaster[51324]: config: using default value for option 'BACK_LOGS' - '50'
config: using default value for option 'BACK_LOGS' - '50'
mfsmaster[51324]: config: using default value for option 'BACK_META_KEEP_PREVIOUS' - '1'
config: using default value for option 'BACK_META_KEEP_PREVIOUS' - '1'
mfsmaster[51324]: stats file has been loaded
stats file has been loaded
mfsmaster[51324]: config: using default value for option 'BACK_META_KEEP_PREVIOUS' - '1'
config: using default value for option 'BACK_META_KEEP_PREVIOUS' - '1'
mfsmaster[51324]: config: using default value for option 'MATOML_LISTEN_PORT' - '9419'
config: using default value for option 'MATOML_LISTEN_PORT' - '9419'
mfsmaster[51324]: master <-> metaloggers module: listen on 192.168.254.254:9419
master <-> metaloggers module: listen on 192.168.254.254:9419
mfsmaster[51324]: config: using default value for option 'MATOCS_TIMEOUT' - '10'
config: using default value for option 'MATOCS_TIMEOUT' - '10'
mfsmaster[51324]: config: using default value for option 'RESERVE_SPACE' - '0'
config: using default value for option 'RESERVE_SPACE' - '0'
mfsmaster[51324]: config: using default value for option 'MATOCS_LISTEN_PORT' - '9420'
config: using default value for option 'MATOCS_LISTEN_PORT' - '9420'
mfsmaster[51324]: master <-> chunkservers module: listen on 192.168.254.254:9420
master <-> chunkservers module: listen on 192.168.254.254:9420
mfsmaster[51324]: config: using default value for option 'MATOCL_LISTEN_PORT' - '9421'
config: using default value for option 'MATOCL_LISTEN_PORT' - '9421'
mfsmaster[51324]: main master server module: listen on 192.168.254.254:9421
main master server module: listen on 192.168.254.254:9421
mfsmaster daemon initialized properly
mfsmaster[51324]: csdb: found cs using ip:port and csid (192.168.254.254:9422,1)
mfsmaster[51324]: chunkserver register begin (packet version: 6) - ip: 192.168.254.254 / port: 9422, usedspace: 3591634944 (3.34 GiB), totalspace: 15316153794560 (14264.28 GiB)
mfsmaster[51324]: chunkserver register end (packet version: 6) - ip: 192.168.254.254 / port: 9422
mfsmaster[51324]: created new sessionid:1
mfsmaster[51324]: remove session: 1
mfsmaster[51324]: created new sessionid:2
mfsmaster[51324]: chunk 000000000000001C_00000001: can't replicate chunk - locked to: 1580248878
mfsmaster[51324]: no metaloggers connected !!!
mfsmaster[51324]: child finished
mfsmaster[51324]: store process has finished - store time: 0.116
mfsmaster[51324]: csdb: server not found (192.168.254.254:9442,0), add it to database
mfsmaster[51324]: chunkserver register begin (packet version: 6) - ip: 192.168.254.254 / port: 9442, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
mfsmaster[51324]: csdb: generate new server id for (192.168.254.254:9442): 2
mfsmaster[51324]: chunkserver register end (packet version: 6) - ip: 192.168.254.254 / port: 9442

Working with big git repos can be a PITA in general. Maybe try a shallow clone first, and then fetch more data to see where it breaks for you?
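
Something along these lines, for example (a sketch - the depth values are arbitrary):

git clone --depth 1 https://github.com/freebsd/freebsd.git
cd freebsd
git fetch --deepen=100000    # pull in more history in chunks
git fetch --unshallow        # or convert to a full clone in one go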

It is worth noting that the above tests used client and servers running Linux, not FreeBSD.

@borkd borkd added the need more info Please let us know more about your issue! label Jan 28, 2020
@borkd
Collaborator
borkd commented Jan 28, 2020

On my hardware, clone tests against v3 and v4 mountpoints resulted in noticeable CPU load on the respective masters, due to the metadata operations needed to support 1K+ chunk creations per second, but a healthy margin was always present, so housekeeping tasks were never starved. It would be helpful if you edited the issue to include all relevant information.

@jSML4ThWwBID69YC
Author
jSML4ThWwBID69YC commented Jan 29, 2020

It appears to be an issue related to mfscachemode.

Using mfscachemode=DIRECT works as expected, though slower.
Using mfscachemode=AUTO causes large git clones to fail with errors.

mfscachemode=DIRECT

git clone https://github.com/freebsd/freebsd.git
Cloning into 'freebsd'...
remote: Enumerating objects: 152, done.
remote: Counting objects: 100% (152/152), done.
remote: Compressing objects: 100% (126/126), done.
remote: Total 3835178 (delta 55), reused 39 (delta 24), pack-reused 3835026
Receiving objects: 100% (3835178/3835178), 1.50 GiB | 15.11 MiB/s, done.
Resolving deltas: 100% (2662345/2662345), done.
Updating files: 100% (79357/79357), done.

mfscachemode=AUTO

git clone https://github.com/freebsd/freebsd.git
Cloning into 'freebsd'...
remote: Enumerating objects: 97, done.
remote: Counting objects: 100% (97/97), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 3835187 (delta 35), reused 35 (delta 19), pack-reused 3835090
Receiving objects: 100% (3835187/3835187), 2.35 GiB | 18.07 MiB/s, done.
fatal: premature end of pack file, 2009 bytes missing
fatal: index-pack failed

I ran each test several times with both settings. The results and errors were similar each time. The mfsmount.cfg file looks like this:

mfsmaster=IP address
mfscachemode=DIRECT
mfssubfolder=/web
mfsmkdircopysgid=1
/storage/chunk

or

mfsmaster=IP address
mfscachemode=AUTO
mfssubfolder=/web
mfsmkdircopysgid=1
/storage/chunk
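
For quick A/B testing, the same options can also be passed on the mfsmount command line instead of mfsmount.cfg (a sketch; the master address is a placeholder and the option names mirror the config above):

mfsmount /storage/chunk -H <master IP> -S /web -o mfsmkdircopysgid=1,mfscachemode=DIRECT
# or, to reproduce the failure:
mfsmount /storage/chunk -H <master IP> -S /web -o mfsmkdircopysgid=1,mfscachemode=AUTO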

This was tested on FreeBSD 12.1 with the new FUSE stack. The sysctl vfs.fusefs.data_cache_mode is set to 1. I suspect the issue is related to mfscachemode=AUTO and the new FreeBSD FUSE stack. I don't think it's Git-specific, but it seems to be a good test case.
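
The FUSE data cache mode can be inspected or changed with sysctl; the meanings noted in the comment are my reading of the FreeBSD fusefs defaults, so please verify against your system:

sysctl vfs.fusefs.data_cache_mode      # 0 = no data cache, 1 = write-through (default), 2 = write-back
sysctl vfs.fusefs.data_cache_mode=1    # value used during these tests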

Any chance somebody running FreeBSD 12.1+ can verify these results?

@chogata
Member
chogata commented Jan 29, 2020

There is a bug in FreeBSD 12.1's behaviour. Basically, when you append data to a file, the system opens it write-only and should only write data to the end. It absolutely should not read the file (or even part of it) first - that is not POSIX-compliant behaviour. But FreeBSD does.

Yesterday, while testing something else, I ran a script on FreeBSD and discovered this bug. In DIRECT mode, FreeBSD does not attempt to read anything while appending; in a cached mode, it does.

Now, I don't know exactly what operations git clone performs, or whether it reports write errors, but a too-short file while cloning makes me think that at some point it writes something to a file, then tries to append, doesn't report a write error to the user, but does report that the file is too short and that the clone operation failed.

Anyway, as of right now, we recommend going back to only using DIRECT mode with FreeBSD. I will run some more tests and file a bug report with FreeBSD, but then the ball is in their court to fix this.
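
If anyone wants to observe this directly, a rough repro sketch (assumptions: /mnt/mfs is a cache-enabled MooseFS mount, and -d is mfsmount's debug flag that keeps it in the foreground and prints incoming FUSE requests - please double-check the flag on your mfsmount version):

mfsmount /mnt/mfs -H <master IP> -o mfscachemode=AUTO -d
dd if=/dev/zero of=/mnt/mfs/appendtest bs=1m count=4    # create a file with some existing data
sh -c 'echo tail >> /mnt/mfs/appendtest'                # append: the file is opened O_WRONLY|O_APPEND
# a POSIX-compliant append should trigger no READ requests for this open;
# on FreeBSD 12.1 with caching enabled, READs show up before the WRITE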

@borkd borkd added platform specific and removed need more info Please let us know more about your issue! labels Jan 29, 2020
@jSML4ThWwBID69YC
Author

Thank you @chogata

Please link to the FreeBSD bug report once filed. I'll follow up on it.

@jSML4ThWwBID69YC
Author

Hello @chogata

Has a bug report been filed with FreeBSD? If not, is there any further information you can provide to assist in replicating the issue?

@chogata
Member
chogata commented Feb 12, 2020

It turns out FUSE allows reading from files opened with the O_WRONLY flag. We will accommodate this behaviour in the next release of MooseFS 3.0.x. Until then, use the DIRECT mode.

@jSML4ThWwBID69YC
Author

Hello @chogata

I've patched mfs_fuse.c on the client using https://github.com/moosefs/moosefs/commit/fb38c809f78a734678c90894d09a18783dcb26b4.patch.
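
Roughly the steps used to apply it (directory and file names below are examples, not exact paths):

fetch https://github.com/moosefs/moosefs/commit/fb38c809f78a734678c90894d09a18783dcb26b4.patch
cd moosefs-3.0.109            # extracted client source tree (example path)
patch -p1 < ../fb38c809f78a734678c90894d09a18783dcb26b4.patch
# rebuild and reinstall the client, then remount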

Unfortunately, using AUTO still causes large git clones to fail. Using DIRECT works as expected. Are there any other patches I should be testing with?

@acid-maker
Member

Strange. Yesterday I was able to clone the whole "freebsd.git" on FreeBSD 12.1 using the new client in AUTO mode without any issues. I'll check again.

@acid-maker
Member

Just cloned linux.git:

# git clone https://github.com/torvalds/linux.git
Cloning into 'linux'...
remote: Enumerating objects: 7219575, done.
remote: Total 7219575 (delta 0), reused 0 (delta 0), pack-reused 7219575
Receiving objects: 100% (7219575/7219575), 2.68 GiB | 2.29 MiB/s, done.
Resolving deltas: 100% (5978191/5978191), done.
Updating files: 100% (67316/67316), done.
# uname -a
FreeBSD freebsd12.tt.lan 12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC  amd64
# cat /mnt/mfs/.params | grep working_keep_cache
working_keep_cache_mode: FBSDAUTO

@jSML4ThWwBID69YC
Author

I must have messed up the patch job. I'll wait until 3.0.110 is out and retest.

Thank you.

@jSML4ThWwBID69YC
Author
jSML4ThWwBID69YC commented Feb 23, 2020

EDIT: This is working with 3.0.111. My last test still had a connected client running 3.0.109, which was causing the issue.

@chogata
Member
chogata commented Feb 26, 2020

MooseFS: 3.0.111_1 from the ports tree on all servers.

For a moment there I thought I'd found your problem... MooseFS from ports still uses the "old" (version 2) FUSE as a dependency. We will change it ASAP.

But I installed a fresh FreeBSD 12.1, updated it to p2, compiled 3.0.111 from ports, and tried cloning Linux from git - it finished successfully... with the "old" FUSE, in FBSDAUTO mode.

Nevertheless, you can try recompiling and reinstalling with the "new" (version 3) FUSE. Please manually install the port /usr/ports/sysutils/fusefs-libs3, then re-compile and re-install the MooseFS client and re-mount your filesystem. Let me know if this helps with git clone.
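
Roughly (a sketch - the client port origin below is an assumption; check it with 'make search name=moosefs' run from /usr/ports):

cd /usr/ports/sysutils/fusefs-libs3 && make install clean
cd /usr/ports/sysutils/moosefs3-client && make reinstall clean    # port origin assumed; rebuild so configure picks up fusefs-libs3
umount /storage/chunk && mfsmount /storage/chunk                  # remount; remaining options come from mfsmount.cfg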

@chogata
Member
chogata commented Feb 26, 2020

One additional question - are you cloning into an empty directory every time?

@jSML4ThWwBID69YC
Author

Hello @chogata

The git clones are working with 3.0.111. I had a 3.0.109 client still connected that was causing the issue.

The moosefs3-client port has a hard requirement on fusefs-libs. How do I change that to allow for fusefs-libs3?

@chogata
Member

As I wrote above - we will change this requirement ASAP.

In the meantime, if you compile and install fusefs-libs3 BEFORE you start compiling the MooseFS client, it will link to fusefs-libs3 (it will still compile fusefs-libs as a dependency, but it will use the newer version, as the configuration script always tries that first).

@jSML4ThWwBID69YC
Author

Thank you for the instructions. I've built fusefs-libs3 and rebuilt the moosefs3-client. Everything seems to be working as normal.

The original issue is resolved so I'll close this. Thank you.

@pkonopelko pkonopelko added the resolved Issue resolved label Feb 26, 2020