Fix Copy-On-Write causing memory waste by zhuangel · Pull Request #608 · google/gvisor · GitHub

Fix Copy-On-Write causing memory waste #608


Closed
wants to merge 1 commit

Conversation

zhuangel
Contributor

The Sentry implements COW in pma, but not at 4K page granularity:
the current COW process in pma copies the faulted range at
HugePage granularity, which wastes a lot of memory.
Revise the COW process in pma to work on 4K pages, so that
Sentry COW consumes the same amount of memory as the host kernel.

@googlebot added the cla: yes (CLA has been signed) label Jul 29, 2019
@zhuangel
Contributor Author
zhuangel commented Jul 29, 2019

The memory waste can be reproduced with the following test case.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

#define TEST_SIZE (1024 * 4096)

int main(void)
{
    void *ptr;
    pid_t pid;
    int idx;

    /* Map 4MB of private anonymous memory and touch every page. */
    ptr = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANON, -1, 0);
    if (ptr == MAP_FAILED) {
        printf("[parent] mmap failed, (%s)\n", strerror(errno));
        return 0;
    }
    printf("[parent] mmap address %p size 0x%x\n", ptr, TEST_SIZE);
    memset(ptr, 0, TEST_SIZE);
    sleep(10);
    printf("[parent] start to create child processes\n");

    /* Fork 50 children; each writes only 4 bytes into the mapping,
     * which should break COW on a single page. */
    for (idx = 0; idx < 50; idx++) {
        pid = fork();
        if (pid == 0) {
            printf("[child] pid %d forked\n", getpid());
            sleep(2);
            printf("[child] pid %d write mmap range\n", getpid());
            memset((char *)ptr + (4096 * 500), 0, 4);
            printf("[child] pid %d write mmap range done\n", getpid());
            sleep(30);
            printf("[child] pid %d exit\n", getpid());
            _exit(0);
        }
    }

    printf("[parent] pid %d sleep start\n", getpid());
    sleep(60);
    munmap(ptr, TEST_SIZE);
    printf("[parent] pid %d exit\n", getpid());
    return 0;
}

Inside the runsc container, these processes use about 123M of memory in total; inside the runc container, only about 7M is used.
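
A rough back-of-the-envelope check (my own arithmetic, not from the PR) is consistent with those numbers: each of the 50 children writes 4 bytes, so each breaks COW on exactly one unit. At the 2MB COW-break granularity discussed below that is roughly 50 × 2MiB ≈ 100MiB of copied anonymous memory, versus roughly 50 × 4KiB ≈ 200KiB at host-page granularity. A small Go sketch of the estimate (constants are assumptions for illustration):

// Back-of-the-envelope estimate of the COW copying done by the
// reproducer above under two break granularities. The 2MiB value is
// the expanded COW-break unit discussed below; both constants are
// stated here only for illustration.
package main

import "fmt"

func main() {
	const (
		children      = 50      // children forked by the reproducer
		sentryCowUnit = 2 << 20 // assumed sentry COW-break unit (2MiB)
		kernelCowUnit = 4 << 10 // host kernel COW granularity (4KiB)
	)
	// Each child writes 4 bytes, breaking COW on exactly one unit.
	fmt.Printf("2MiB-unit break: ~%d MiB copied across all children\n", children*sentryCowUnit>>20)
	fmt.Printf("4KiB-unit break: ~%d KiB copied across all children\n", children*kernelCowUnit>>10)
}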

Execute free inside the runsc container:

#docker exec -it test free
total used free shared buff/cache available
Mem: 2097152 123024 1924884 0 49244 1924884
Swap: 0 0 0

Execute free inside the runc container:

#docker exec -it test free
total used free shared buff/cache available
Mem: 48876436 7060 48848712 0 20664 48848712
Swap: 2097148 0 2097148

@hbhasker requested a review from nixprime July 29, 2019 17:28
@nixprime
Member

Breaking copy-on-write on a granularity greater than a single page is intentional. Sentry-handled page faults can be quite expensive; expanding COW-break significantly reduces their frequency in many cases. In fact, we previously switched from per-page COW-break to 2MB COW-break to fix a user-observed performance regression (from switching from whole-pma COW-break to per-page COW-break).

Can you give more details about the workload you have that is affected by this?
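
For readers unfamiliar with the mechanism, the sketch below (my illustration, not gVisor's actual code) shows how a single write fault expands to a larger COW-break range: the faulting address is rounded down to a unit-aligned boundary and the whole unit is copied, so later writes inside that unit take no further sentry faults. All names and addresses here are assumptions:

// Minimal sketch (not gVisor's actual code) of expanding a write fault
// to a larger COW-break range: the faulting address is rounded down to
// a unit boundary and the whole unit is copied, so subsequent writes
// inside that unit take no further sentry faults.
package main

import "fmt"

// cowBreakRange returns the [start, end) range to copy for a write
// fault at addr, for a power-of-two break unit.
func cowBreakRange(addr, unit uint64) (start, end uint64) {
	start = addr &^ (unit - 1) // round down to a unit boundary
	return start, start + unit
}

func main() {
	const (
		pageSize = 4 << 10 // 4KiB
		mapUnit  = 2 << 20 // 2MiB expanded COW-break unit
	)
	addr := uint64(0x7f0000123456) // hypothetical faulting address
	s, e := cowBreakRange(addr, pageSize)
	fmt.Printf("per-page break: copy [%#x, %#x), %d KiB per fault\n", s, e, (e-s)>>10)
	s, e = cowBreakRange(addr, mapUnit)
	fmt.Printf("2MiB break:     copy [%#x, %#x), %d KiB per fault\n", s, e, (e-s)>>10)
}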

@zhuangel
Contributor Author
zhuangel commented Jul 30, 2019

@nixprime I just ran two of our containers in a test environment (not all processes had started), one with runc and one with runsc. After a while, once the containers were running stably (about 17 processes running inside each container), I collected memory information about the containers from cgroup and proc.

1. Drop all caches

echo 3 > /proc/sys/vm/drop_caches

2. cgroup memory.usage_in_bytes and memory.stat for runc

###################
cgroup memory.usage_in_bytes : 104333312
###################

###################
cgroup memory.stat:
cache 23810048
rss 80523264
rss_huge 0
mapped_file 21766144
dirty 69632
writeback 0
swap 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
pgpgin 1156621
pgpgout 1131149
pgfault 2000838
pgmajfault 1317
pgoutrun 0
allocstall 0
kswapd_steal 0
pg_pgsteal 0
kswapd_pgscan 0
pg_pgscan 0
pgrefill 0
ppgrefill 0
ppgscan 0
ppgsteal 0
inactive_anon 1052672
active_anon 80523264
inactive_file 8650752
active_file 14106624
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 8589934592
total_cache 23810048
total_rss 80523264
total_rss_huge 0
total_mapped_file 21766144
total_dirty 69632
total_writeback 0
total_swap 0
total_workingset_refault 0
total_workingset_activate 0
total_workingset_restore 0
total_pgpgin 1156621
total_pgpgout 1131149
total_pgfault 2000838
total_pgmajfault 1317
total_pgoutrun 0
total_allocstall 0
total_kswapd_steal 0
total_pg_pgsteal 0
total_kswapd_pgscan 0
total_pg_pgscan 0
total_pgrefill 0
total_inactive_anon 1052672
total_active_anon 80523264
total_inactive_file 8650752
total_active_file 14106624
total_unevictable 0
alloc_speed_max 0
nr_alloc_throttled 0
###################

3. cgroup memory.usage_in_bytes and memory.stat for runsc

###################
cgroup memory.usage_in_bytes : 312250368
###################

###################
cgroup memory.stat:
cache 168493056
rss 143757312
rss_huge 0
mapped_file 168382464
dirty 110592
writeback 0
swap 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
pgpgin 8447500
pgpgout 8371267
pgfault 8464712
pgmajfault 235
pgoutrun 0
allocstall 0
kswapd_steal 0
pg_pgsteal 0
kswapd_pgscan 0
pg_pgscan 0
pgrefill 0
ppgrefill 0
ppgscan 0
ppgsteal 0
inactive_anon 168239104
active_anon 143757312
inactive_file 233472
active_file 20480
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 8589934592
total_cache 168493056
total_rss 143757312
total_rss_huge 0
total_mapped_file 168382464
total_dirty 110592
total_writeback 0
total_swap 0
total_workingset_refault 0
total_workingset_activate 0
total_workingset_restore 0
total_pgpgin 8447500
total_pgpgout 8371267
total_pgfault 8464712
total_pgmajfault 235
total_pgoutrun 0
total_allocstall 0
total_kswapd_steal 0
total_pg_pgsteal 0
total_kswapd_pgscan 0
total_pg_pgscan 0
total_pgrefill 0
total_inactive_anon 168239104
total_active_anon 143757312
total_inactive_file 233472
total_active_file 20480
total_unevictable 0
alloc_speed_max 0
nr_alloc_throttled 0
###################

4. status of runsc sandbox and gofer processes
###################
sandbox detail:
Name: exe
Umask: 0022
State: S (sleeping)
Tgid: 76342
Ngid: 0
Pid: 76342
PPid: 76313
TracerPid: 0
Uid: 65534 65534 65534 65534
Gid: 65534 65534 65534 65534
FDSize: 1024
Groups:
NStgid: 76342 1
NSpid: 76342 1
NSpgid: 76342 1
NSsid: 76342 1
VmPeak: 68727000532 kB
VmSize: 68726877908 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 503200 kB
VmRSS: 328036 kB
RssAnon: 127872 kB
RssFile: 36664 kB
RssShmem: 163500 kB
VmData: 440576 kB
VmStk: 132 kB
VmExe: 19308 kB
VmLib: 8 kB
VmPTE: 2168 kB
VmPMD: 32 kB
VmSwap: 0 kB
...
###################
###################
gofer detail:
Name: exe
Umask: 0000
State: S (sleeping)
Tgid: 76336
Ngid: 0
Pid: 76336
PPid: 76313
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 4096
Groups:
NStgid: 76336 1
NSpid: 76336 1
NSpgid: 76313 0
NSsid: 127050 0
VmPeak: 533068 kB
VmSize: 475728 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 26372 kB
VmRSS: 24804 kB
RssAnon: 11576 kB
RssFile: 13228 kB
RssShmem: 0 kB
VmData: 194612 kB
VmStk: 132 kB
VmExe: 19308 kB
VmLib: 8 kB
VmPTE: 160 kB
VmPMD: 16 kB
VmSwap: 0 kB
...
###################

@zhuangel
Contributor Author

I also tried it another way.
Run a simple centos image and use free to check the initial memory status, then start ten bash shells one after another and use free again. The runc container used about 3M+ more memory, while the runsc container used 15M+ more.

runc centos container

[root@6870c697c6d5 /]# free
total used free shared buff/cache available
Mem: 4194304 924 4171276 0 22104 4171276
Swap: 2097148 0 2097148
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# bash
[root@6870c697c6d5 /]# free
total used free shared buff/cache available
Mem: 4194304 4756 4167444 0 22104 4167444
Swap: 2097148 0 2097148

runsc centos container

[root@058437aebf07 /]# free
total used free shared buff/cache available
Mem: 4194304 1408 4139348 0 53548 4139348
Swap: 0 0 0
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# bash
[root@058437aebf07 /]# free
total used free shared buff/cache available
Mem: 4194304 16752 4124004 0 53548 4124004
Swap: 0 0 0

@amscanne
Contributor

I think we may be stalled here. There are lots of reasons to avoid doing per-page copy-on-write (excessive performance overhead) but also good reasons to avoid wasting memory by doing large regions.

Here is my proposal:

What if each PMA tracked the total number of COW faults, and used (1 << min(16, p.cowFaults+12)) as the amount to fault? This will turn the first 64k region into ~4 faults, and subsequent faults will do the MapUnit size.

I think this should capture the simple bash use cases (not wasting such large regions) while avoiding a big performance cost. If there's still a lot of waste, there are pretty easy tweaks to experiment with here, e.g.

(1 << min(16, max(12, p.cowFaults))) // This provides 12 single page faults before growing.
(1 << min(16, p.cowFaults/2 + 12)) // This doubles the size every other fault.

I think we could probably come up with some good compromises here that will avoid high overheads due to faulting but also avoid wasting memory.
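
For illustration only, here is a minimal Go sketch of the adaptive sizing idea above, using a hypothetical cowFaults counter on a stand-in pma struct (names and types are assumptions, not gVisor's actual pma implementation), with growth capped at 64k as in the first formula:

// Minimal sketch of the adaptive COW-break sizing proposed above.
// "pma" and "cowFaults" are stand-ins, not gVisor's actual types.
package main

import "fmt"

type pma struct {
	cowFaults uint64 // COW faults taken on this pma so far
}

// cowBreakSize implements 1 << min(16, cowFaults+12): 4K, 8K, 16K, 32K
// for the first few faults, then 64K for every fault after that.
func (p *pma) cowBreakSize() uint64 {
	shift := p.cowFaults + 12
	if shift > 16 {
		shift = 16
	}
	return 1 << shift
}

func main() {
	p := &pma{}
	for i := 0; i < 6; i++ {
		fmt.Printf("fault %d: break %d KiB\n", i, p.cowBreakSize()>>10)
		p.cowFaults++
	}
}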

@amscanne
Contributor
amscanne commented Sep 6, 2019

Checking in on this. Is the proposal of interest? @nixprime

@nixprime
Member
nixprime commented Oct 1, 2019

It is not clear that linking COW-break granularity to PMAs (or VMAs) would be sufficient to avoid regressing workloads that are sensitive to this (the example we saw was a particular application's startup time); it would be the responsibility of someone proposing such a change to at least prove that it does not affect any of our benchmarks.

@amscanne
Contributor
amscanne commented Oct 2, 2019

Can we construct appropriate definitions for the benchmarks we care about here?

@prattmic added the area: mm (Issue related to memory management) and area: performance (Issue related to performance & benchmarks) labels Jan 22, 2020
@amscanne
Contributor

Is this still active?

Alternative proposal #2:

We could just make COWUnitSize a parameter of the platform. Some platforms can handle faults much more cheaply than others (e.g. KVM). The KVM platform could just use 4k as the COWUnitSize, and the others can use the same MapUnitSize as currently defined.
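
For illustration only, a hypothetical sketch of what a per-platform COW-break unit could look like; the interface name, methods, and values below are assumptions, not gVisor's actual platform API:

// Hypothetical sketch of a per-platform COW-break unit; the interface
// and values are illustrative assumptions, not gVisor's platform API.
package main

import "fmt"

// cowPlatform is a stand-in for a platform that advertises its
// preferred COW-break unit based on how cheaply it handles faults.
type cowPlatform interface {
	Name() string
	CowUnitSize() uint64
}

type kvmPlatform struct{}

// KVM handles faults relatively cheaply, so break COW per 4KiB page.
func (kvmPlatform) Name() string        { return "kvm" }
func (kvmPlatform) CowUnitSize() uint64 { return 4 << 10 }

type ptracePlatform struct{}

// ptrace faults are more expensive, so keep the 2MiB map-unit break.
func (ptracePlatform) Name() string        { return "ptrace" }
func (ptracePlatform) CowUnitSize() uint64 { return 2 << 20 }

func main() {
	for _, p := range []cowPlatform{kvmPlatform{}, ptracePlatform{}} {
		fmt.Printf("%s: COW-break unit %d KiB\n", p.Name(), p.CowUnitSize()>>10)
	}
}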

@github-actions

This pull request is stale because it has been open 90 days with no activity. Remove the stale label or comment or this will be closed in 30 days.

Labels: area: mm (Issue related to memory management), area: performance (Issue related to performance & benchmarks), cla: yes (CLA has been signed), stale (The Issue or PR is stale)