Fix Copy-On-Write causing memory waste by zhuangel · Pull Request #608 · google/gvisor · GitHub

Fix Copy-On-Write causing memory waste #608


Closed
wants to merge 1 commit

Conversation

zhuangel
Contributor

The Sentry implements COW in pma, but not at 4K page granularity:
the current COW process in pma copies the faulted range at
HugePage granularity, which wastes a lot of memory.
Revise the COW process in pma to work on 4K pages, so that
Sentry COW consumes the same amount of memory as the host kernel.

@googlebot added the cla: yes (CLA has been signed) label Jul 29, 2019
@zhuangel
Contributor Author
zhuangel commented Jul 29, 2019

The memory waste can be reproduced with the following test case.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

#define TEST_SIZE (1024 * 4096)

int main(void)
{
    void *ptr;
    pid_t pid;
    int idx;

    /* Map 4MB of private anonymous memory and touch every page. */
    ptr = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANON, -1, 0);
    if (ptr == MAP_FAILED) {
        printf("[parent] mmap failed, (%s)\n", strerror(errno));
        return 0;
    }
    printf("[parent] mmap address %p size 0x%x\n", ptr, TEST_SIZE);
    memset(ptr, 0, TEST_SIZE);
    sleep(10);
    printf("[parent] start to create child processes\n");

    /* Fork 50 children; each writes only 4 bytes into the mapping,
     * which should break COW on a single page. */
    for (idx = 0; idx < 50; idx++) {
        pid = fork();
        if (pid == 0) {
            printf("[child] pid %d forked\n", getpid());
            sleep(2);
            printf("[child] pid %d write mmap range\n", getpid());
            memset((char *)ptr + (4096 * 500), 0, 4);
            printf("[child] pid %d write mmap range done\n", getpid());
            sleep(30);
            printf("[child] pid %d exit\n", getpid());
            _exit(0);
        }
    }

    printf("[parent] pid %d sleep start\n", getpid());
    sleep(60);
    munmap(ptr, TEST_SIZE);
    printf("[parent] pid %d exit\n", getpid());
    return 0;
}

Inside the runsc container, these processes use about 123M of memory in total; inside the runc container, only about 7M is used.
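
A rough back-of-the-envelope check (my own arithmetic, not from the PR) is consistent with those numbers: each of the 50 children writes 4 bytes, so each breaks COW on exactly one unit. At the 2MB COW-break granularity discussed below that is roughly 50 × 2MiB ≈ 100MiB of copied anonymous memory, versus roughly 50 × 4KiB ≈ 200KiB at host-page granularity. A small Go sketch of the estimate (constants are assumptions for illustration):

// Back-of-the-envelope estimate of the COW copying done by the
// reproducer above under two break granularities. The 2MiB value is
// the expanded COW-break unit discussed below; both constants are
// stated here only for illustration.
package main

import "fmt"

func main() {
	const (
		children      = 50      // children forked by the reproducer
		sentryCowUnit = 2 << 20 // assumed sentry COW-break unit (2MiB)
		kernelCowUnit = 4 << 10 // host kernel COW granularity (4KiB)
	)
	// Each child writes 4 bytes, breaking COW on exactly one unit.
	fmt.Printf("2MiB-unit break: ~%d MiB copied across all children\n", children*sentryCowUnit>>20)
	fmt.Printf("4KiB-unit break: ~%d KiB copied across all children\n", children*kernelCowUnit>>10)
}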

Execute free inside the runsc container:

#docker exec -it test free
total used free shared buff/cache available
Mem: 2097152 123024 1924884 0 49244 1924884
Swap: 0 0 0

Execute free inside the runc container:

#docker exec -it test free
total used free shared buff/cache available
Mem: 48876436 7060 48848712 0 20664 48848712
Swap: 2097148 0 2097148

@hbhasker requested a review from nixprime July 29, 2019 17:28
@nixprime
Member

Breaking copy-on-write on a granularity greater than a single page is intentional. Sentry-handled page faults can be quite expensive; expanding COW-break significantly reduces their frequency in many cases. In fact, we previously switched from per-page COW-break to 2MB COW-break to fix a user-observed performance regression (from switching from whole-pma COW-break to per-page COW-break).

Can you give more details about the workload you have that is affected by this?
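
For readers unfamiliar with the mechanism, the sketch below (my illustration, not gVisor's actual code) shows how a single write fault expands to a larger COW-break range: the faulting address is rounded down to a unit-aligned boundary and the whole unit is copied, so later writes inside that unit take no further sentry faults. All names and addresses here are assumptions:

// Minimal sketch (not gVisor's actual code) of expanding a write fault
// to a larger COW-break range: the faulting address is rounded down to
// a unit boundary and the whole unit is copied, so subsequent writes
// inside that unit take no further sentry faults.
package main

import "fmt"

// cowBreakRange returns the [start, end) range to copy for a write
// fault at addr, for a power-of-two break unit.
func cowBreakRange(addr, unit uint64) (start, end uint64) {
	start = addr &^ (unit - 1) // round down to a unit boundary
	return start, start + unit
}

func main() {
	const (
		pageSize = 4 << 10 // 4KiB
		mapUnit  = 2 << 20 // 2MiB expanded COW-break unit
	)
	addr := uint64(0x7f0000123456) // hypothetical faulting address
	s, e := cowBreakRange(addr, pageSize)
	fmt.Printf("per-page break: copy [%#x, %#x), %d KiB per fault\n", s, e, (e-s)>>10)
	s, e = cowBreakRange(addr, mapUnit)
	fmt.Printf("2MiB break:     copy [%#x, %#x), %d KiB per fault\n", s, e, (e-s)>>10)
}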

@zhuangel
Contributor Author
zhuangel commented Jul 30, 2019

@nixprime I just ran two of our containers in a test environment (not all processes had started), one with runc and one with runsc. After a while, once the containers were running stably (about 17 processes running inside each container), I collected memory information about the containers from cgroup and proc.

1. Drop all caches

echo 3 > /proc/sys/vm/drop_caches

2. cgroup memory.usage_in_bytes and memory.stat for runc

###################
cgroup memory.usage_in_bytes : 104333312
###################

###################
cgroup memory.stat:
cache 23810048
rss 80523264
rss_huge 0
mapped_file 21766144
dirty 69632
writeback 0
swap 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
pgpgin 1156621
pgpgout 1131149
pgfault 2000838
pgmajfault 1317
pgoutrun 0
allocstall 0
kswapd_steal 0
pg_pgsteal 0
kswapd_pgscan 0
pg_pgscan 0
pgrefill 0
ppgrefill 0
ppgscan 0
ppgsteal 0
inactive_anon 1052672
active_anon 80523264
inactive_file 8650752
active_file 14106624
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 8589934592
total_cache 23810048
total_rss 80523264
total_rss_huge 0
total_mapped_file 21766144
total_dirty 69632
total_writeback 0
total_swap 0
total_workingset_refault 0
total_workingset_activate 0
total_workingset_restore 0
total_pgpgin 1156621
total_pgpgout 1131149
total_pgfault 2000838
total_pgmajfault 1317
total_pgoutrun 0
total_allocstall 0
total_kswapd_steal 0
total_pg_pgsteal 0
total_kswapd_pgscan 0
total_pg_pgscan 0
total_pgrefill 0
total_inactive_anon 1052672
total_active_anon 80523264
total_inactive_file 8650752
total_active_file 14106624
total_unevictable 0
alloc_speed_max 0
nr_alloc_throttled 0
###################

3. cgroup memory.usage_in_bytes and memory.stat for runsc

###################
cgroup memory.usage_in_bytes : 312250368
###################

###################
cgroup memory.stat:
cache 168493056
rss 143757312
rss_huge 0
mapped_file 168382464
dirty 110592
writeback 0
swap 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
pgpgin 8447500
pgpgout 8371267
pgfault 8464712
pgmajfault 235
pgoutrun 0
allocstall 0
kswapd_steal 0
pg_pgsteal 0
kswapd_pgscan 0
pg_pgscan 0
pgrefill 0
ppgrefill 0
ppgscan 0
ppgsteal 0
inactive_anon 168239104
active_anon 143757312
inactive_file 233472
active_file 20480
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 8589934592
total_cache 168493056
total_rss 143757312
total_rss_huge 0
total_mapped_file 168382464
total_dirty 110592
total_writeback 0
total_swap 0
total_workingset_refault 0
total_workingset_activate 0
total_workingset_restore 0
total_pgpgin 8447500
total_pgpgout 8371267
total_pgfault 8464712
total_pgmajfault 235
total_pgoutrun 0
total_allocstall 0
total_kswapd_steal 0
total_pg_pgsteal 0
total_kswapd_pgscan 0
total_pg_pgscan 0
total_pgrefill 0
total_inactive_anon 168239104
total_active_anon 143757312
total_inactive_file 233472
total_active_file 20480
total_unevictable 0
alloc_speed_max 0
nr_alloc_throttled 0
###################

4. status of runsc sandbox and gofer processes
###################
sandbox detail:
Name: exe
Umask: 0022
State: S (sleeping)
Tgid: 76342
Ngid: 0
Pid: 76342
PPid: 76313
TracerPid: 0
Uid: 65534 65534 65534 65534
Gid: 65534 65534 65534 65534
FDSize: 1024
Groups:
NStgid: 76342 1
NSpid: 76342 1
NSpgid: 76342 1
NSsid: 76342 1
VmPeak: 68727000532 kB
VmSize: 68726877908 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 503200 kB
VmRSS: 328036 kB
RssAnon: 127872 kB
RssFile: 36664 kB
RssShmem: 163500 kB
VmData: 440576 kB
VmStk: 132 kB
VmExe: 19308 kB
VmLib: 8 kB
VmPTE: 2168 kB
VmPMD: 32 kB
VmSwap: 0 kB
...
###################
###################
gofer detail:
Name: exe
Umask: 0000
State: S (sleeping)
Tgid: 76336
Ngid: 0
Pid: 76336
PPid: 76313
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 4096
Groups:
NStgid: 76336 1
NSpid: 76336 1
NSpgid: 76313 0
NSsid: 127050 0
VmPeak: 533068 kB
VmSize: 475728 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 26372 kB
VmRSS: 24804 kB
RssAnon: 11576 kB
RssFile: 13228 kB
RssShmem: 0 kB
VmData: 194612 kB
VmStk: 132 kB
VmExe: 19308 kB
VmLib: 8 kB
VmPTE: 160 kB
VmPMD: 16 kB
VmSwap: 0 kB
...
###################

@zhuangel
Contributor Author

I also tried it another way.
Run a simple centos image and use free to check the initial memory status, then start ten bash shells one after another and use free again. The runc container used about 3M+ more memory, while the runsc container used 15M+ more.

runc centos container

[root@6870c697c6d5 /]# free
total used free shared buff/cache available
Mem: 4194304 924 4171276 0 22104 4171276
Swap: 2097148 0 2097148
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# free
total used free shared buff/cache available
Mem: 4194304 4756 4167444 0 22104 4167444
Swap: 2097148 0 2097148

runsc centos container

[root@058437aebf07 /]# free
total used free shared buff/cache available
Mem: 4194304 1408 4139348 0 53548 4139348
Swap: 0 0 0
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# free
total used free shared buff/cache available
Mem: 4194304 16752 4124004 0 53548 4124004
Swap: 0 0 0

@amscanne
Contributor

I think we may be stalled here. There are lots of reasons to avoid doing per-page copy-on-write (excessive performance overhead) but also good reasons to avoid wasting memory by doing large regions.

Here is my proposal:

What if each PMA tracked the total number of COW faults, and used (1 << min(16, p.cowFaults+12)) as the amount to fault? This will turn the first 64k region into ~4 faults, and subsequent faults will do the MapUnit size.

I think this should capture the simple bash use cases (not wasting such large regions) while avoiding a big performance cost. If there's still a lot of waste, there are pretty easy tweaks to experiment with here, e.g.

(1 << min(16, max(12, p.cowFaults))) // This provides 12 single page faults before growing.
(1 << min(16, p.cowFaults/2 + 12)) // This doubles the size every other fault.

I think we could probably come up with some good compromises here that will avoid high overheads due to faulting but also avoid wasting memory.
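
For illustration only, here is a minimal Go sketch of the adaptive sizing idea above, using a hypothetical cowFaults counter on a stand-in pma struct (names and types are assumptions, not gVisor's actual pma implementation), with growth capped at 64k as in the first formula:

// Minimal sketch of the adaptive COW-break sizing proposed above.
// "pma" and "cowFaults" are stand-ins, not gVisor's actual types.
package main

import "fmt"

type pma struct {
	cowFaults uint64 // COW faults taken on this pma so far
}

// cowBreakSize implements 1 << min(16, cowFaults+12): 4K, 8K, 16K, 32K
// for the first few faults, then 64K for every fault after that.
func (p *pma) cowBreakSize() uint64 {
	shift := p.cowFaults + 12
	if shift > 16 {
		shift = 16
	}
	return 1 << shift
}

func main() {
	p := &pma{}
	for i := 0; i < 6; i++ {
		fmt.Printf("fault %d: break %d KiB\n", i, p.cowBreakSize()>>10)
		p.cowFaults++
	}
}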

@amscanne
Contributor
amscanne commented Sep 6, 2019

Checking in on this. Is the proposal of interest? @nixprime

@nixprime
Member
nixprime commented Oct 1, 2019

It is not clear that linking COW-break granularity to PMAs (or VMAs) would be sufficient to avoid regressing workloads that are sensitive to this (the example we saw was a particular application's startup time); it would be the responsibility of someone proposing such a change to at least prove that it does not affect any of our benchmarks.

@amscanne
Contributor
amscanne commented Oct 2, 2019

Can we construct appropriate definitions for the benchmarks we care about here?

@prattmic added the area: mm (Issue related to memory management) and area: performance (Issue related to performance & benchmarks) labels Jan 22, 2020
@amscanne
Contributor

Is this still active?

Alternative proposal #2:

We could just make COWUnitSize a parameter of the platform. Some platforms can handle faults much more cheaply than others (e.g. KVM). The KVM platform could just use 4k as the COWUnitSize, and the others can use the same MapUnitSize as currently defined.
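
For illustration only, a hypothetical sketch of what a per-platform COW-break unit could look like; the interface name, methods, and values below are assumptions, not gVisor's actual platform API:

// Hypothetical sketch of a per-platform COW-break unit; the interface
// and values are illustrative assumptions, not gVisor's platform API.
package main

import "fmt"

// cowPlatform is a stand-in for a platform that advertises its
// preferred COW-break unit based on how cheaply it handles faults.
type cowPlatform interface {
	Name() string
	CowUnitSize() uint64
}

type kvmPlatform struct{}

// KVM handles faults relatively cheaply, so break COW per 4KiB page.
func (kvmPlatform) Name() string        { return "kvm" }
func (kvmPlatform) CowUnitSize() uint64 { return 4 << 10 }

type ptracePlatform struct{}

// ptrace faults are more expensive, so keep the 2MiB map-unit break.
func (ptracePlatform) Name() string        { return "ptrace" }
func (ptracePlatform) CowUnitSize() uint64 { return 2 << 20 }

func main() {
	for _, p := range []cowPlatform{kvmPlatform{}, ptracePlatform{}} {
		fmt.Printf("%s: COW-break unit %d KiB\n", p.Name(), p.CowUnitSize()>>10)
	}
}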

@github-actions

This pull request is stale because it has been open 90 days with no activity. Remove the stale label or comment or this will be closed in 30 days.

Labels: area: mm (Issue related to memory management), area: performance (Issue related to performance & benchmarks), cla: yes (CLA has been signed), stale (The Issue or PR is stale)