graphic: 0.849
device: 0.824
semantic: 0.788
instruction: 0.784
mistranslation: 0.772
other: 0.721
KVM: 0.710
assembly: 0.646
socket: 0.604
boot: 0.593
vnc: 0.484
network: 0.469

high memory usage when using rbd with client caching

Hi,
we are experiencing quite high memory usage of a single qemu (used with KVM) process when using RBD with client caching as the disk backend. We are testing with 3GB-memory qemu virtual machines and a 128MB RBD client cache. When running 'fio' inside the virtual machine you can see that after some time the machine uses a lot more memory (RSS) on the hypervisor than it should. We have seen overheads of 250% (on real production machines, not artificial fio tests). I reproduced this with qemu version 2.9 as well.

Here the contents of our ceph.conf on the hypervisor:
"""
[client]
rbd cache writethrough until flush = False
rbd cache max dirty = 100663296
rbd cache size = 134217728
rbd cache target dirty = 50331648
"""

How to reproduce:
* create a virtual machine with an RBD-backed disk (100GB or so)
* install a linux distribution on it (we are using Ubuntu)
* install fio (apt-get install fio)
* run fio multiple times with (e.g.) the following test file:
"""
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
randrepeat=0
filename=/root/test.dat
# IOMeter defines the server loads as the following:
# iodepth=1     Linear
# iodepth=4     Very Light
# iodepth=8     Light
# iodepth=64    Moderate
# iodepth=256   Heavy
iodepth=8
size=80g
direct=0
ioengine=libaio

[iometer]
stonewall
bs=4M
rw=randrw

[iometer_just_write]
stonewall
bs=4M
rw=write

[iometer_just_read]
stonewall
bs=4M
rw=read
"""

You can measure the virtual machine RSS usage on the hypervisor with:
  virsh dommemstat <machine name> | grep rss
or if you are not using libvirt:
  grep RSS /proc/<PID of qemu process>/status
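
To watch the growth during a fio run, a minimal sampling loop on the hypervisor could look like this (libvirt assumed; replace <machine name> with your domain):
"""
# print the RSS (in KiB) once a minute
while true; do
    date +%T
    virsh dommemstat <machine name> | grep rss
    sleep 60
done
"""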

When the RBD client cache is switched off, everything is fine again: the process no longer uses excessive memory.

There is already a ticket on the ceph bug tracker for this ([1]). However, I can reproduce this memory behaviour only when using qemu (maybe it uses librbd in a special way?). Running 'fio' directly with the rbd engine does not result in such high memory usage.

[1] http://tracker.ceph.com/issues/20054
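
Running fio directly against librbd, for reference, looks something like this (pool and image names are placeholders, and the rbd ioengine requires a fio build with librbd support):
"""
fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=<pool> --rbdname=<image> \
    --rw=randrw --bs=4M --iodepth=8 --size=80g
"""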

We are seeing pretty much the same issue: even small (1GB-memory) virtual instances use 2-3GB of RSS after running I/O-intensive applications. Live migrating the instance to another machine brings the memory usage back down, but it grows again once I/O resumes.

Any update on this?

Linking back to bug 1674481, which I think is the same issue seen in Ubuntu.

Is there any progress on solving this, or does anyone have an idea how to debug this further? I think we are somewhat stuck on the ceph bug tracker issue as well [1].

[1] http://tracker.ceph.com/issues/20054

Is there any reason we are keeping this bug and #1674481 separate? We are not sure there is one.

@Nick: if you can recreate the librbd memory growth, any chance you can help test a potential fix [1]?

[1] https://github.com/ceph/ceph/pull/24297