Potential memory leak on rbd image copy · Issue #904 · ceph/go-ceph · GitHub

Open

@shell-skrimp

Description


The tool I have written uses < 70 MB RSS to perform the exact same operations on qcow2 images (basically raw-copying them). I can do this in rapid succession and it never exceeds 100 MB. However, if I do the same thing with Ceph (copying an image within a pool), at the exact same throughput as the qcow2-based implementation, my tool ends up using 1127 MB RSS after all is said and done, and it is never freed. The ramp-up to this amount takes a while; initially the tool only uses about 40 MB more than it would if I weren't talking to Ceph. So I think there is some kind of memory leak, I'm just not entirely sure where to find it or how to track it down with go-ceph/cgo.

I ran memleak-bpfcc -p <pid> on the process, then started a copy, and saw the following:

[20:07:34] Top 10 stacks with outstanding allocations:
        addr = 7f0a54038b00 size = 4
        // various omitted addresses from 4-1024
        addr = 7f0a44b3eb20 size = 1024
        addr = 7f0a54038500 size = 1232
        addr = 7f0a55712d90 size = 1280
        addr = 7f0a54003880 size = 1280
        addr = 7f0a557152b0 size = 1336
        addr = 7f0a54b10490 size = 1336
        addr = 7f0a54b6c570 size = 1336
        addr = 7f0a54002d90 size = 1432
        addr = 7f0a680099a0 size = 1792
        addr = 7f0a5c00fa30 size = 1792
        addr = 7f0a5403ead0 size = 1800
        addr = 7f0a556eb9b0 size = 1856
        addr = 7f0a54bcf990 size = 1856
        addr = 7f0a54b11dc0 size = 2304
        addr = 7f0a54ae2960 size = 3416
        addr = 7f0a54bda240 size = 3712
        addr = 7f0a54b67180 size = 4032
        addr = 7f0a54001d80 size = 4096
        addr = 7f0a4803f040 size = 4096
        addr = 7f0a4c01f040 size = 4096
        addr = 7f0a4cb63040 size = 4096
        addr = 7f0a496f3040 size = 4096
        addr = 7f0a4cb71a20 size = 4464
        addr = 7f0a44020810 size = 4464
        addr = 7f0a48bd2f80 size = 4464
        addr = 7f0a54b6a560 size = 8192
        addr = 7f0a54b1fd80 size = 8192
        addr = 7f0a54bd7bb0 size = 8192
        addr = 7f0a557132a0 size = 8192
        addr = 7f0a54b8ac90 size = 15320
        addr = 7f0a55690860 size = 33264
        addr = 7f0a54062ef0 size = 33264
        addr = 7f0a54b74ce0 size = 60000
        addr = 7f0a54bb8520 size = 60000
        addr = 7f0a54b25310 size = 60000
        addr = 7f0a5563f3a0 size = 65536
        addr = 7f0a54b90b60 size = 120000
        addr = 7f0a54aee750 size = 120000
        addr = 7f0a5403fde0 size = 120000
        addr = 7f0a54bdef90 size = 10880000
        6488 bytes in 38 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]
        13392 bytes in 3 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]
                ProtocolV2::send_auth_request(std::vector<unsigned int, std::allocator<unsigned int> >&)+0xe8 [libceph-common.so.2]
                [unknown]
                ProtocolV2::~ProtocolV2()+0x0 [libceph-common.so.2]
                [unknown]
        15320 bytes in 1 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]
                [unknown] [librados.so.2.0.0]
        19336 bytes in 15 allocations from stack
                posix_memalign+0x1a8 [libc.so.6]
                [unknown]
        39408 bytes in 21 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]
                [unknown]
                AsyncConnection::~AsyncConnection()+0x0 [libceph-common.so.2]
                [unknown]
        113285 bytes in 1653 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]
        180000 bytes in 3 allocations from stack
                EpollDriver::init(EventCenter*, int)+0x35 [libceph-common.so.2]
        267232 bytes in 3586 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]
                [unknown]
        467277 bytes in 681 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]
        10945592 bytes in 3 allocations from stack
                operator new(unsigned long)+0x1c [libstdc++.so.6.0.30]

There is also a smaller set of outstanding allocations:

[20:10:45] Top 10 stacks with outstanding allocations:
        addr = 7f0a54b20890 size = 448
        448 bytes in 1 allocations from stack
                _dl_allocate_tls+0x3a [ld-linux-x86-64.so.2]

So, what I'm trying to understand is whether these allocations are reaped at some point, or whether this is indeed a memory leak.

I have also checked the pprof dumps, and from what I can tell I am releasing memory properly (closing connections, ioctxs, images, etc.); the tool uses about 512 KiB on the Go heap after testing. So it appears that all the memory being used is outside the Go runtime, which leads me to believe it's librados/go-ceph (especially since this started occurring once I added Ceph support).

The code I use to do the copy:

func GetCephConn() (*rados.Conn, error) {
	conn, err := rados.NewConnWithClusterAndUser(cc.Cluster, cc.ID)
	if err != nil {
		return nil, err
	}
	if err := conn.ReadConfigFile(cc.Conf); err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("client_mount_timeout", ClientMountTimeout); err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("rados_mon_op_timeout", RADOSMONOPTimeout); err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("rados_osd_op_timeout", RADOSOSDOPTimeout); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}

	return conn, nil
}


func CopyImage(pool, srcImg, dstImg string) error {
	conn, err := GetCephConn()
	if err != nil {
		return err
	}
	defer conn.Shutdown()

	srcIoCtx, err := conn.OpenIOContext(pool)
	if err != nil {
		return err
	}
	defer srcIoCtx.Destroy()

	rbdImg, err := rbd.OpenImageReadOnly(srcIoCtx, srcImg, rbd.NoSnapshot)
	if err != nil {
		return err
	}
	defer rbdImg.Close()

	return rbdImg.Copy(srcIoCtx, dstImg)
}

To show that it's not anything I'm doing in the Go runtime, here is the top of a pprof heap profile:

Showing top 10 nodes out of 28
      flat  flat%   sum%        cum   cum%
         0     0%     0%  1227.97kB 54.04%  github.com/valyala/fasthttp.init
 1227.97kB 54.04% 54.04%  1227.97kB 54.04%  github.com/valyala/fasthttp/stackless.NewFunc
         0     0% 54.04%  1227.97kB 54.04%  runtime.doInit
         0     0% 54.04%  1227.97kB 54.04%  runtime.main

If this is a memory leak, what more could I do to help get more info?

Using Go 1.20 with go-ceph v0.22.0 against a local Quincy test cluster.
