Analyse Ceph object directory mapping on disk
Useful to understand benchmark results and Ceph's second write penalty (this phenomenon is explained here, in section I.1).
I. Use an RBD image and locate the objects
Let's start with a simple 40 MB RBD image and get some statistics about this image:
```bash
$ sudo rbd info volumes/2578a6ed-2bab-4f71-910d-d42f18c80d11_disk
rbd image '2578a6ed-2bab-4f71-910d-d42f18c80d11_disk':
        size 40162 kB in 10 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.97ab74b0dc51
        format: 2
        features: layering
```
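A quick sanity check on those numbers: order 22 means objects of 2^22 bytes, so a 40162 kB image needs ten of them:

```bash
# order 22 -> object size of 2^22 bytes = 4 MB
$ echo $(( 1 << 22 ))
4194304
# 40162 kB of data split into 4096 kB objects, rounded up
$ echo $(( (40162 + 4095) / 4096 ))
10
```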
Now let's use my script to validate the placement of each object. Please note that all the blocks must be allocated; if they are not, simply map the device and run `dd` over it.
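A minimal sketch of that pre-allocation step; the `/dev/rbd0` device path is an assumption, and writing zeroes destroys the image content, so only do this on a scratch image:

```bash
# Map the image and write every block once so the filestore allocates them
# (destructive: this zeroes the whole image)
$ sudo rbd map volumes/2578a6ed-2bab-4f71-910d-d42f18c80d11_disk
$ sudo dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct
$ sudo rbd unmap /dev/rbd0
```

With every block allocated, the script reports: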
```bash
$ sudo ./rbd-placement volumes 2578a6ed-2bab-4f71-910d-d42f18c80d11_disk
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000000' -> pg 28.b52329a6 (28.6) -> up ([0,1], p0) acting ([0,1], p0)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000009' -> pg 28.7ac71fc6 (28.6) -> up ([0,1], p0) acting ([0,1], p0)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000002' -> pg 28.f9256dc8 (28.8) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000005' -> pg 28.141bf9ca (28.a) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000003' -> pg 28.58c5376b (28.b) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000008' -> pg 28.a310d3d0 (28.10) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000001' -> pg 28.88755b97 (28.17) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000004' -> pg 28.e52ce538 (28.18) -> up ([1,0], p1) acting ([1,0], p1)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000006' -> pg 28.80a6755a (28.1a) -> up ([0,1], p0) acting ([0,1], p0)
osdmap e2518 pool 'volumes' (28) object 'rbd_data.97ab74b0dc51.0000000000000007' -> pg 28.9c45d2fa (28.1a) -> up ([0,1], p0) acting ([0,1], p0)
```
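For the record, the script is essentially a wrapper around `ceph osd map`; a hypothetical minimal reimplementation (not the actual script) could look like this:

```bash
#!/bin/bash
# rbd-placement sketch: map every object of an RBD image to its PG and OSDs
POOL=$1
IMAGE=$2
# Grab the block_name_prefix (e.g. rbd_data.97ab74b0dc51) from rbd info
PREFIX=$(rbd info "$POOL/$IMAGE" | awk '/block_name_prefix/ {print $2}')
# Ask the cluster where each object carrying that prefix is placed
for OBJ in $(rados -p "$POOL" ls | grep "$PREFIX"); do
    ceph osd map "$POOL" "$OBJ"
done
```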
This image is stored on OSD 0 and OSD 1. Then I just picked up all the PGs and the rbd prefix. We can reflect the placement in our directory hierarchy using the tree command:
```bash
$ sudo tree -Ph '97ab74b0dc51' /var/lib/ceph/osd/ceph-0/current/{28.6,28.8,28.a,28.b,28.10,28.17,28.18,28.1a}_head/
/var/lib/ceph/osd/ceph-0/current/28.6_head/
├── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000000__head_B52329A6__1c
└── [3.2M]  rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c
/var/lib/ceph/osd/ceph-0/current/28.8_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000002__head_F9256DC8__1c
/var/lib/ceph/osd/ceph-0/current/28.a_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000005__head_141BF9CA__1c
/var/lib/ceph/osd/ceph-0/current/28.b_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000003__head_58C5376B__1c
/var/lib/ceph/osd/ceph-0/current/28.10_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000008__head_A310D3D0__1c
/var/lib/ceph/osd/ceph-0/current/28.17_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000001__head_88755B97__1c
/var/lib/ceph/osd/ceph-0/current/28.18_head/
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000004__head_E52CE538__1c
/var/lib/ceph/osd/ceph-0/current/28.1a_head/
├── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000006__head_80A6755A__1c
└── [4.0M]  rbd\udata.97ab74b0dc51.0000000000000007__head_9C45D2FA__1c

0 directories, 10 files
```
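Note how the hex suffix in each filename is the object's placement hash from the rbd-placement output, and the PG directory is simply that hash modulo the pool's pg_num. All PG ids above fit in 0x0..0x1f, which suggests pg_num = 32 for this pool (an inference, not shown in the output):

```bash
# hash 0xb52329a6 modulo an assumed pg_num of 32 gives PG 28.6,
# matching the 28.6_head directory above
$ printf '28.%x\n' $(( 0xb52329a6 % 32 ))
28.6
```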
II. Analyse your disk geometry
For the sake of simplicity I used a virtual hard drive attached to my virtual machine. The disk is 10 GB.
```bash
root@ceph:~# fdisk -l /dev/sdb1

Disk /dev/sdb1: 10.5 GB, 10484711424 bytes
255 heads, 63 sectors/track, 1274 cylinders, total 20477952 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb1 doesn't contain a valid partition table
```
So I have 20477952 sectors/blocks of 512 bytes in total: (20477952*512)/1024/1024/1024 ≈ 10 GB.
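These figures can be cross-checked directly from the shell:

```bash
# Total number of 512-byte sectors, matching the fdisk output above
$ sudo blockdev --getsz /dev/sdb1
20477952
# Convert to GB: 20477952 * 512 / 1024^3
$ echo 'scale=2; 20477952 * 512 / 1024^3' | bc
9.76
```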
III. Print block mapping for each object
From now on I will assume that the underlying filesystem of your OSD data is XFS; otherwise the following will not be possible.
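If in doubt, checking is easy:

```bash
# Confirm the OSD data directory sits on XFS before using the XFS tools
$ df -T /var/lib/ceph/osd/ceph-0
```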
```bash
$ sudo bash -c 'for i in /var/lib/ceph/osd/ceph-0/current/{28.6,28.8,28.a,28.b,28.10,28.17,28.18,28.1a}_head/*97ab74b0dc51*; do xfs_bmap -v $i; done'
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000000__head_B52329A6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..2943]:       1992544..1995487    0 (1992544..1995487)    2944
   1: [2944..8191]:    1987296..1992543    0 (1987296..1992543)    5248
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..255]:        1987040..1987295    0 (1987040..1987295)     256
   1: [256..1279]:     1986016..1987039    0 (1986016..1987039)    1024
   2: [1280..6599]:    1978848..1984167    0 (1978848..1984167)    5320
/var/lib/ceph/osd/ceph-0/current/28.8_head/rbd\udata.97ab74b0dc51.0000000000000002__head_F9256DC8__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       19057336..19065527  3 (3698872..3707063)    8192
/var/lib/ceph/osd/ceph-0/current/28.a_head/rbd\udata.97ab74b0dc51.0000000000000005__head_141BF9CA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       13909496..13917687  2 (3670520..3678711)    8192
/var/lib/ceph/osd/ceph-0/current/28.b_head/rbd\udata.97ab74b0dc51.0000000000000003__head_58C5376B__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..639]:        7303544..7304183    1 (2184056..2184695)     640
   1: [640..8191]:     10090000..10097551  1 (4970512..4978063)    7552
/var/lib/ceph/osd/ceph-0/current/28.10_head/rbd\udata.97ab74b0dc51.0000000000000008__head_A310D3D0__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..639]:        12289352..12289991  2 (2050376..2051015)     640
   1: [640..8191]:     13934072..13941623  2 (3695096..3702647)    7552
/var/lib/ceph/osd/ceph-0/current/28.17_head/rbd\udata.97ab74b0dc51.0000000000000001__head_88755B97__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       19049144..19057335  3 (3690680..3698871)    8192
/var/lib/ceph/osd/ceph-0/current/28.18_head/rbd\udata.97ab74b0dc51.0000000000000004__head_E52CE538__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       13901304..13909495  2 (3662328..3670519)    8192
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000006__head_80A6755A__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..6911]:       13917688..13924599  2 (3678712..3685623)    6912
   1: [6912..8191]:    13932792..13934071  2 (3693816..3695095)    1280
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000007__head_9C45D2FA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       13924600..13932791  2 (3685624..3693815)    8192
```
It seems that I have a bit of fragmentation on my filesystem, since some files map to more than one extent. Thus, before going further, I am going to defragment some files. Example for one file:
```bash
$ sudo xfs_bmap -v /var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..255]:        1987040..1987295    0 (1987040..1987295)     256
   1: [256..1279]:     1986016..1987039    0 (1986016..1987039)    1024
   2: [1280..6599]:    1978848..1984167    0 (1978848..1984167)    5320

$ sudo xfs_fsr /var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c

$ sudo xfs_bmap -v /var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..6599]:       1860632..1867231    0 (1860632..1867231)    6600
```
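Defragmenting file by file is tedious; xfs_db can also report the overall fragmentation of the whole OSD filesystem (a read-only check, using the device from section II):

```bash
# Print actual extents, ideal extents and the fragmentation factor
$ sudo xfs_db -c frag -r /dev/sdb1
```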
After the operation, we have the following distribution:
```bash
$ sudo bash -c 'for i in /var/lib/ceph/osd/ceph-0/current/{28.6,28.8,28.a,28.b,28.10,28.17,28.18,28.1a}_head/*97ab74b0dc51*; do xfs_bmap -v $i; done'
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000000__head_B52329A6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       1852440..1860631    0 (1852440..1860631)    8192
/var/lib/ceph/osd/ceph-0/current/28.6_head/rbd\udata.97ab74b0dc51.0000000000000009__head_7AC71FC6__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..6599]:       1860632..1867231    0 (1860632..1867231)    6600
/var/lib/ceph/osd/ceph-0/current/28.8_head/rbd\udata.97ab74b0dc51.0000000000000002__head_F9256DC8__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       19057336..19065527  3 (3698872..3707063)    8192
/var/lib/ceph/osd/ceph-0/current/28.a_head/rbd\udata.97ab74b0dc51.0000000000000005__head_141BF9CA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       13909496..13917687  2 (3670520..3678711)    8192
/var/lib/ceph/osd/ceph-0/current/28.b_head/rbd\udata.97ab74b0dc51.0000000000000003__head_58C5376B__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       13932792..13940983  2 (3693816..3702007)    8192
/var/lib/ceph/osd/ceph-0/current/28.10_head/rbd\udata.97ab74b0dc51.0000000000000008__head_A310D3D0__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       14201728..14209919  2 (3962752..3970943)    8192
/var/lib/ceph/osd/ceph-0/current/28.17_head/rbd\udata.97ab74b0dc51.0000000000000001__head_88755B97__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       19049144..19057335  3 (3690680..3698871)    8192
/var/lib/ceph/osd/ceph-0/current/28.18_head/rbd\udata.97ab74b0dc51.0000000000000004__head_E52CE538__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       13901304..13909495  2 (3662328..3670519)    8192
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000006__head_80A6755A__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       14209920..14218111  2 (3970944..3979135)    8192
/var/lib/ceph/osd/ceph-0/current/28.1a_head/rbd\udata.97ab74b0dc51.0000000000000007__head_9C45D2FA__1c:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL
   0: [0..8191]:       13924600..13932791  2 (3685624..3693815)    8192
```
IV. Get an idea of your object mapping
As mentioned earlier, we have 20477952 blocks of 512 bytes in total, and the objects have the following mappings:
- 1852440..1860631, a range of 8192 blocks of 512 bytes, (8192*512/1024/1024) = 4M
- 1860632..1867231
- 19057336..19065527
- 13909496..13917687
- 13932792..13940983
- 14201728..14209919
- 19049144..19057335
- 13901304..13909495
- 14209920..14218111
- 13924600..13932791
The average block positions based on these ranges are:
- 1856535
- 1863135
- 13905399
- 13913591
- 13928695
- 13936887
- 14205823
- 14214015
- 19053239
- 19061431
We can now calculate the standard deviation (sample, n-1) of these positions: 6020910.93405966.
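A quick way to reproduce this number:

```bash
# Sample standard deviation of the ten average block positions
$ echo "1856535 1863135 13905399 13913591 13928695 13936887 14205823 14214015 19053239 19061431" \
  | awk '{ for (i = 1; i <= NF; i++) { s += $i; ss += $i * $i }
           m = s / NF
           printf "%.2f\n", sqrt((ss - NF * m * m) / (NF - 1)) }'
6020910.93
```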
The purpose of this article was to demonstrate and justify the second write penalty in Ceph. The second write is performed by `syncfs`, which writes all the objects to their respective PG directories. Understanding the PG placement of the objects, in addition to the physical mapping of each object on the block device, can be a great help while debugging performance issues. Unfortunately this problem is hard to solve because of concurrent client writes and the distributed nature of Ceph. Obviously what was written here remains pure theory (it's likely true though :p), given that determining the real placement of data on a disk is difficult. One more thing about the block placement returned by xfs_bmap: it gives us block ranges, but we don't know what the mapping of these ranges really looks like physically on the device.