Ceph – PHPor's Blog

Why ceph-osd occasionally receives a SIGHUP signal

November 27, 2018

The ceph-osd log shows that the daemon occasionally receives a signal sent by the following process:

killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw

Such an entry is usually the very first line of the log file.

Cause:

This command is configured in /etc/logrotate.d/ceph. logrotate runs it after rotating the log files so that the daemons reopen them; it has no other effect.
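To make the mechanism concrete, here is a minimal sketch of how a daemon can react to SIGHUP by reopening its log file. It is not Ceph's code: the ReopeningLog class, the rotate_demo function, the file name and the log messages are all made up for illustration. It also shows why the SIGHUP notice tends to be the first line of the fresh log file.

```python
import os
import signal
import tempfile


class ReopeningLog:
    """Append-only log that reopens its file on request (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "a")

    def write(self, line):
        self.handle.write(line + "\n")
        self.handle.flush()

    def reopen(self, signum=None, frame=None):
        # After rotation the old handle still points at the renamed file,
        # so close it and open the configured path again.
        self.handle.close()
        self.handle = open(self.path, "a")
        self.write("received SIGHUP, log reopened")


def rotate_demo():
    """Simulate one rotation cycle and return (old_lines, new_lines)."""
    path = os.path.join(tempfile.mkdtemp(), "daemon.log")
    log = ReopeningLog(path)
    previous = signal.signal(signal.SIGHUP, log.reopen)
    try:
        log.write("before rotation")
        os.rename(path, path + ".1")           # logrotate moves the file away
        os.kill(os.getpid(), signal.SIGHUP)    # same signal as `killall -q -1`
        log.write("after rotation")
    finally:
        signal.signal(signal.SIGHUP, previous)
        log.handle.close()

    with open(path + ".1") as old, open(path) as new:
        return old.read().splitlines(), new.read().splitlines()


if __name__ == "__main__":
    old_lines, new_lines = rotate_demo()
    print("rotated file:", old_lines)
    print("new file:    ", new_lines)
```

Without the reopen step the process would keep writing to the renamed file, which is why the rotation script has to signal the daemons at all.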

Original link: https://phpor.net/blog/post/10714

Configuring an RBD storage pool in libvirtd

June 26, 2018

When setting up OpenStack, no storage pool shows up in virsh pool-list even after RBD has been configured. It is still possible to define one by hand, though.

Write the configuration file rbd-volumes.xml:

<pool type='rbd'>
<name>rbd-volumes</name>
<source>
<host name='10.88.12.4' port='6789'/>
<name>volumes</name>
<auth username='cinder' type='ceph'>
<secret uuid='9dd5c6f0-ffc2-476b-b89c-071998ad8462'/>
</auth>
</source>
</pool>

Here:

rbd-volumes is the name given to this storage pool; choose whatever you like.

10.88.12.4 is the address of the Ceph monitor node.

volumes is the Ceph pool that holds the RBD images.

The auth element carries the user name cinder and a predefined secret (the secret is defined with virsh secret-define).
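If several such pools have to be defined, the XML can be generated instead of written by hand. The following is a small sketch using only the Python standard library; the function name build_rbd_pool_xml and its parameters are my own invention, and the values are simply the ones from the example above.

```python
import xml.etree.ElementTree as ET


def build_rbd_pool_xml(pool_name, mon_host, ceph_pool, username,
                       secret_uuid, mon_port=6789):
    """Return a libvirt <pool type='rbd'> definition as a string."""
    pool = ET.Element("pool", {"type": "rbd"})
    ET.SubElement(pool, "name").text = pool_name          # libvirt pool name
    source = ET.SubElement(pool, "source")
    ET.SubElement(source, "host", {"name": mon_host, "port": str(mon_port)})
    ET.SubElement(source, "name").text = ceph_pool        # pool inside Ceph
    auth = ET.SubElement(source, "auth",
                         {"username": username, "type": "ceph"})
    ET.SubElement(auth, "secret", {"uuid": secret_uuid})  # libvirt secret
    return ET.tostring(pool, encoding="unicode")


if __name__ == "__main__":
    print(build_rbd_pool_xml(
        "rbd-volumes", "10.88.12.4", "volumes", "cinder",
        "9dd5c6f0-ffc2-476b-b89c-071998ad8462"))
```

The output can be saved to a file and passed to virsh pool-define exactly like the hand-written version.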

Then run:

virsh pool-define rbd-volumes.xml

This automatically generates the file /etc/libvirt/storage/rbd-volumes.xml:

<pool type='rbd'>
  <name>rbd-volumes</name>
  <uuid>09ec5b59-509d-40b9-9c8a-e03e8de60b1d</uuid>
  <capacity unit='bytes'>0</capacity>
  <allocation unit='bytes'>0</allocation>
  <available unit='bytes'>0</available>
  <source>
    <host name='10.88.12.4' port='6789'/>
    <name>volumes</name>
    <auth type='ceph' username='cinder'>
      <secret uuid='9dd5c6f0-ffc2-476b-b89c-071998ad8462'/>
    </auth>
  </source>
</pool>

The newly defined storage pool now shows up in virsh pool-list:

virsh pool-list

Then start the pool:

virsh pool-start rbd-volumes

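Now the RBD images in the pool can be listed (the pool called volumes on my machine is the rbd-volumes pool described above).

In principle these RBD images can now be used to boot virtual machines. However, creating a machine from them with virt-manager fails with an error.

What does /root/volumes have to do with any of this?

A search shows that others have hit the same problem: https://bugzilla.redhat.com/show_bug.cgi?id=1074169#c14

The problem seems to lie in virt-manager. The affected version is 1.4.1, so I tried a newer one:

https://github.com/virt-manager/virt-manager

After updating to 1.5.0 the problem is still there; I will look into it later. The latest version is harder to build because it depends on Python 3 packages, some of which are missing from my package repositories.

Original link: https://phpor.net/blog/post/8884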

ceph health: status vs. overall_status

June 13, 2018

Background:

ceph health always reports HEALTH_OK when I run it, yet after setting up Prometheus + Grafana (see https://www.2cto.com/net/201801/712794.html) the dashboard shows HEALTH_WARN. Why?

Analysis:

We use ceph_exporter (github.com/digitalocean/ceph_exporter). Its source code shows that it fetches the health report in JSON format and reads the overall_status field. Checking on the command line:

# ceph health -f json
{"checks":{},"status":"HEALTH_OK","overall_status":"HEALTH_WARN"}

Sure enough, overall_status is HEALTH_WARN.

Option 1:

Read status instead of overall_status.

Drawbacks:

The exporter collects only overall_status, not status, so using status means modifying the exporter.
HEALTH_WARN does indicate a problem after all; finding the cause is the real fix.

Option 2:

Find out why overall_status is HEALTH_WARN; presumably there is a genuine problem.

Investigating the cause:

My Ceph version: ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

Before Luminous, Ceph reported overall_status. Starting with Luminous it reports status, but still emits overall_status for compatibility with older consumers. To make users aware that overall_status is deprecated, its value is forced to HEALTH_WARN. Because this behaviour is sometimes unhelpful, an option was added in 12.2.2:

mon_health_preluminous_compat_warning

Setting this option suppresses the warning.

But I am running 12.2.1, so what now? Either modify the exporter or simply upgrade Ceph.

A safer approach is to start a ceph-mon of version 12.2.5 on a test machine and set:

mon_health_preluminous_compat_warning=false

Then point ceph.conf at that ceph-mon. The result:

# ceph health -f json
{"checks":{},"status":"HEALTH_OK"}

overall_status is gone. So since ceph_exporter insists on overall_status, it really does have to be modified. I forked it and made the change:

https://github.com/phpor/ceph_exporter
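The idea behind such a change can be sketched in a few lines. This is not the code from the fork above (the exporter itself is written in Go), only an illustration of the fallback logic: prefer status, and use overall_status only when status is absent. The function name effective_health, the "UNKNOWN" fallback value and the pre-Luminous sample are assumptions of mine.

```python
import json


def effective_health(raw_json):
    """Return the health string, preferring `status` over `overall_status`."""
    report = json.loads(raw_json)
    if "status" in report:
        # Luminous and later: the real health state.
        return report["status"]
    # Older releases: only overall_status exists ("UNKNOWN" is a made-up default).
    return report.get("overall_status", "UNKNOWN")


if __name__ == "__main__":
    samples = [
        # Output of 12.2.1 as shown above.
        '{"checks":{},"status":"HEALTH_OK","overall_status":"HEALTH_WARN"}',
        # Output of 12.2.5 with the compat warning disabled, as shown above.
        '{"checks":{},"status":"HEALTH_OK"}',
        # Assumed shape of a pre-Luminous report.
        '{"overall_status":"HEALTH_WARN"}',
    ]
    for sample in samples:
        print(effective_health(sample))
```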

References:

https://github.com/ceph/ceph/pull/17930

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021031.html

Original link: https://phpor.net/blog/post/8768

Ceph: choose and chooseleaf

March 9, 2018

Syntax

step choose firstn $n type bucket

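That is: select $n distinct buckets of the given type. $n = 0 means select as many buckets as the pool has replicas; a negative $n means the replica count minus |$n| (so $n = -1 selects one bucket fewer than the number of replicas). The result is a set of buckets; if the bucket type is osd, the results are naturally leaf nodes.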
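The meaning of $n can be written down as a tiny function. This is a simplified sketch of the counting rule only, not of CRUSH itself; resolve_firstn is a name I made up, and the case where a positive $n exceeds the replica count is deliberately ignored.

```python
def resolve_firstn(n, pool_size):
    """How many buckets a `firstn n` step selects for a pool with pool_size replicas."""
    if n > 0:
        return n                    # explicit count
    return max(pool_size + n, 0)    # 0 -> pool_size, -1 -> pool_size - 1, ...


if __name__ == "__main__":
    for n in (0, -1, 2):
        print(f"firstn {n:>2} with 3 replicas selects {resolve_firstn(n, 3)} bucket(s)")
```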

step chooseleaf firstn $n type bucket

That is: select $n distinct buckets of the given type, then pick one OSD from within each of them. It is equivalent to:

step choose firstn $n type bucket
step choose firstn 1 type osd

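In this case chooseleaf is clearly much simpler than choose. Both express the same thing: use the bucket type as the failure domain and pick $n OSDs, one per bucket. This is also the most common situation.

Case 2:

Suppose there are several rows, and all 3 copies should be stored under the same row but in 3 different racks. That is expressed as: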

step choose firstn 1 type row
step chooseleaf firstn 0 type rack

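This is equivalent to expanding chooseleaf as described above: choose one row, choose the racks within it, then choose one OSD within each rack.

Case 3:

Suppose there are 5 rows, but the two replicas should be stored in row1 and row2 specifically (every bucket has a name) rather than in two arbitrarily chosen rows, with rack as the failure domain when selecting the OSDs. How is that written?

choose and chooseleaf alone cannot express this. Remember the take step? It goes like this: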

step take row1
step chooseleaf firstn 1 type rack
step emit

step take row2
step chooseleaf firstn -1 type rack
step emit
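To see what this rule produces, here is a toy model of the two take / chooseleaf / emit blocks. It is emphatically not the CRUSH algorithm: the topology, the OSD numbers and the hash-based ranking are invented, and weights, retries and failed OSDs are ignored. It only shows that the first replica always lands under row1 and the remaining ones under row2, each in a different rack.

```python
import hashlib

# Hypothetical hierarchy: row -> rack -> OSD ids.
TOPOLOGY = {
    "row1": {"rack1": [0, 1], "rack2": [2, 3]},
    "row2": {"rack3": [4, 5], "rack4": [6, 7]},
    "row3": {"rack5": [8, 9]},
}


def _rank(key, candidates):
    """Order candidates pseudo-randomly but deterministically for a given key."""
    return sorted(
        candidates,
        key=lambda item: hashlib.sha256(f"{key}/{item}".encode()).hexdigest(),
    )


def chooseleaf_racks(pg, row, count):
    """Pick `count` distinct racks under `row`, then one OSD inside each rack."""
    racks = _rank(pg, TOPOLOGY[row])[:count]
    return [_rank(pg, TOPOLOGY[row][rack])[0] for rack in racks]


def map_pg(pg, pool_size=2):
    """Apply the rule from the text to one placement group id."""
    result = []
    # step take row1 / step chooseleaf firstn 1 type rack / step emit
    result += chooseleaf_racks(pg, "row1", 1)
    # step take row2 / step chooseleaf firstn -1 type rack / step emit
    result += chooseleaf_racks(pg, "row2", pool_size - 1)
    return result


if __name__ == "__main__":
    for pg in ("1.0", "1.1", "1.2", "1.3"):
        print(pg, "->", map_pg(pg))
```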

References:

http://www.xuxiaopang.com/2016/11/08/easy-ceph-CRUSH/

Original link: https://phpor.net/blog/post/7871

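Ceph: the number of objects per PG

March 7, 2018

Ceph raises a warning when the number of objects per PG exceeds the cluster average by too much. Questions:

Why does this warning exist?
How many objects in a PG count as too many?
How can the number of PGs be changed online?

Original link: https://phpor.net/blog/post/7796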

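Ceph: PG warnings

March 7, 2018

Symptom:

Analysis:

Which pool is affected?
Cause: in some pool, the average number of objects per PG exceeds 10 times the cluster-wide average number of objects per PG. This does not necessarily mean that something is wrong.
How to reproduce: as long as at least one pool in the cluster is heavily used, create a few more pools with fairly large PG counts and the warning will appear.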
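The check can be reproduced with a little arithmetic. The sketch below is a simplified model of the comparison described above, not the monitor's actual code (which applies further conditions); the function name pools_over_skew, the pool names and the object counts are made up.

```python
def pools_over_skew(pools, max_skew=10.0):
    """Return {pool: ratio} for pools whose objects-per-PG exceed the cluster average.

    pools maps a pool name to (object_count, pg_num); pg_num must be positive.
    """
    total_objects = sum(objects for objects, _ in pools.values())
    total_pgs = sum(pg_num for _, pg_num in pools.values())
    if total_objects == 0 or total_pgs == 0:
        return {}
    cluster_average = total_objects / total_pgs
    flagged = {}
    for name, (objects, pg_num) in pools.items():
        ratio = (objects / pg_num) / cluster_average
        if ratio > max_skew:
            flagged[name] = ratio
    return flagged


if __name__ == "__main__":
    # One busy pool plus three large, empty pools: the recipe from the text.
    pools = {
        "busy": (1_000_000, 64),
        "empty1": (0, 1024),
        "empty2": (0, 1024),
        "empty3": (0, 1024),
    }
    print(pools_over_skew(pools))
```

In this example the busy pool ends up 49 times above the average, purely because the empty pools dilute the cluster-wide figure.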

Solutions:

Delete unused pools, or:

Adjust the parameter:

As follows (a restart is required):

# ceph daemon osd.2 config set mon_pg_warn_max_object_skew 20
{
 "success": "mon_pg_warn_max_object_skew = '20.000000' (not observed, change may require restart) rocksdb_separate_wal_dir = 'false' (not observed, change may require restart) "
}

Adjust the number of PGs of the pool that triggers the warning

Thoughts:

If too many objects per PG triggers a warning, why not simply create every pool with a large PG count (say 1024)? That does not work either: too many PGs will in theory hurt performance, and too many PGs on a single OSD (more than mon_pg_warn_max_per_osd) also triggers a warning: http://blog.csdn.net/styshoo/article/details/62722679 To see the number of PGs on each OSD:

# ceph osd df

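However, the PG count on my OSDs already exceeds the configured limit of 300, and no warning appears.
Which option to adjust depends on the release; consult the documentation or code of your version. For example, the 12.2.1 release notes contain this statement: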
The maximum number of PGs per OSD before the monitor issues a warning has been reduced from 300 to 200 PGs. 200 is still twice the generally recommended target of 100 PGs per OSD. This limit can be adjusted via the mon_max_pg_per_osd option on the monitors. The older mon_pg_warn_max_per_osd option has been removed.
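A rough estimate of the PGs per OSD can be made before creating a pool. The sketch below assumes PG copies are spread evenly over all OSDs, which is only an approximation; pgs_per_osd is a name I made up, and for an erasure-coded pool the copy count is k + m.

```python
def pgs_per_osd(pools, osd_count):
    """Average number of PG copies per OSD.

    pools is an iterable of (pg_num, copies) pairs, where copies is the
    replica count of a replicated pool or k + m of an erasure-coded pool.
    """
    return sum(pg_num * copies for pg_num, copies in pools) / osd_count


if __name__ == "__main__":
    # Hypothetical cluster: one 3-replica pool and one k=3, m=2 pool on 12 OSDs.
    estimate = pgs_per_osd([(1024, 3), (256, 5)], osd_count=12)
    print(f"about {estimate:.0f} PGs per OSD, limit 200 exceeded: {estimate > 200}")
```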

Original link: https://phpor.net/blog/post/7794

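Ceph block device read test

March 6, 2018

Background:

Reading a large file with dd reaches 100+ MB/s, but reading it with cat only reaches 30 MB/s. Why?

Because a Ceph block device reads its data over the network, read throughput depends directly on network performance, and also directly on the size of each read: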

# dd if=/data2/bigfile bs=100M count=20 iflag=direct |pv >/dev/null
20+0 records in
20+0 records out
2097152000 bytes (2.1 GB) copied, 16.5394 s, 127 MB/s
1.95GiB 0:00:16 [ 120MiB/s]

With a block size of 100 MB, the read speed reaches about 120 MB/s.

# dd if=/data2/bigfile bs=1M count=3000 iflag=direct |pv >/dev/null
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 42.9703 s, 73.2 MB/s
2.93GiB 0:00:42 [69.8MiB/s]

With a block size of 1 MB, the read speed reaches about 70 MB/s.

# dd if=/data2/bigfile bs=64K count=3000 iflag=direct |pv >/dev/null
3000+0 records in
3000+0 records out
196608000 bytes (197 MB) copied, 6.63725 s, 29.6 MB/s
187MiB 0:00:06 [28.2MiB/s]

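With a block size of 64 KB, the read speed is only about 30 MB/s, and 64 KB happens to be exactly the size of each read() issued by cat.

In this situation, if the machine has plenty of memory, it can help to read the file once with dd using large blocks so that it ends up in the cache, and only then run cat-like operations on it.

In addition: a larger I/O size is split into multiple I/O requests once it reaches the lower layers, so the lower layer sees several requests in flight at once, which effectively increases the queue depth.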
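The same comparison can be scripted. The sketch below reads a file sequentially with a given block size and reports the throughput; read_throughput is a name I made up. Unlike the dd runs above it does not use direct I/O, so on a second pass the numbers mostly reflect the page cache; drop the caches between runs or use a file larger than memory for a fair comparison.

```python
import sys
import time


def read_throughput(path, block_size):
    """Read `path` sequentially in block_size chunks; return (bytes_read, MB_per_s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives one read() system call per chunk, like dd with bs=...
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / elapsed / 1e6


if __name__ == "__main__":
    # Usage: python read_test.py /data2/bigfile
    for size in (64 * 1024, 1024 * 1024, 100 * 1024 * 1024):
        read, rate = read_throughput(sys.argv[1], size)
        print(f"bs={size:>10}  {read} bytes  {rate:.1f} MB/s")
```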

Original link: https://phpor.net/blog/post/7784

Ceph: erasure code operations

March 6, 2018

Creating an erasure-code rule from the command line

First an erasure-code-profile is needed; the default profile can also be used. List the existing profiles:

# ceph osd erasure-code-profile ls
default

Show the details of a given erasure-code-profile:

# ceph osd erasure-code-profile get default
k=2
m=1
plugin=jerasure
technique=reed_sol_van

Define a custom erasure-code-profile, here one that uses only HDDs:

# ceph osd erasure-code-profile set hdd-3-2 k=3 m=2 crush-device-class=hdd

The available options are:
- crush-root: the name of the CRUSH node to place data under [default: default].
- crush-failure-domain (failure domain): the CRUSH type to separate erasure-coded shards across [default: host].
- crush-device-class (device class): the device class to place data on [default: none, meaning all devices are used].
- k and m (and, for the lrc plugin, l): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
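What k and m mean for capacity and fault tolerance can be worked out directly: an object is split into k data shards plus m coding shards, so k + m shards are stored and any m of them may be lost. A small sketch (erasure_profile_summary is a name I made up):

```python
def erasure_profile_summary(k, m):
    """Basic properties of an erasure-code profile with k data and m coding shards."""
    return {
        "shards": k + m,                   # shards stored per object
        "tolerated_failures": m,           # shards that may be lost
        "usable_fraction": k / (k + m),    # usable share of raw capacity
        "raw_per_usable": (k + m) / k,     # raw bytes needed per usable byte
    }


if __name__ == "__main__":
    print("hdd-3-2:", erasure_profile_summary(3, 2))
    print("default:", erasure_profile_summary(2, 1))
```

For the hdd-3-2 profile this gives 5 shards, 60% usable capacity and tolerance of 2 lost shards, compared with 33% usable capacity for 3-way replication.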
Create a CRUSH rule from the erasure-code-profile:

# ceph osd crush rule create-erasure erasure_hdd hdd-3-2
created rule erasure_hdd at 5

Inspect the CRUSH rule:

# ceph osd crush rule dump erasure_hdd
{
    "rule_id": 5,
    "rule_name": "erasure_hdd",
    "ruleset": 5,
    "type": 3,
    "min_size": 3,
    "max_size": 5,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

Create a pool that uses the erasure-code rule:

# ceph osd pool create test-bigdata 256 256 erasure hdd-3-2 erasure_hdd
pool 'test-bigdata' created

Syntax: osd pool create <poolname> <int[0-]> {<int[0-]>} {replicated|erasure} [<erasure_code_profile>] {<rule>} {<int>}
Although the CRUSH rule was itself created from the erasure_code_profile, the profile still has to be given explicitly when creating the erasure-coded pool.
Reference: http://docs.ceph.com/docs/master/rados/operations/pools/
Tuning:

# ceph osd pool set test-bigdata fast_read 1
set pool 24 fast_read to 1

At present fast_read only takes effect for erasure-coded pools.

To create RBD images in this pool, the following is required:

Reference: http://docs.ceph.com/docs/master/rados/operations/erasure-code/

# ceph osd pool set test-bigdata allow_ec_overwrites true
set pool 24 allow_ec_overwrites to true

Create a replicated pool to act as the cache tier

# ceph osd pool create test-bigdata-cache-tier 128
pool 'test-bigdata-cache-tier' created

# ceph osd tier add test-bigdata test-bigdata-cache-tier
pool 'test-bigdata-cache-tier' is now (or already was) a tier of 'test-bigdata'

# ceph osd tier cache-mode test-bigdata-cache-tier writeback
set cache-mode for pool 'test-bigdata-cache-tier' to writeback

# ceph osd tier set-overlay test-bigdata test-bigdata-cache-tier
overlay for 'test-bigdata' is now (or already was) 'test-bigdata-cache-tier'

In fact, a cache tier can be placed not only in front of an erasure-coded pool but also in front of a replicated pool. Given a batch of SSDs, for instance, a pool created on the SSDs can serve as the cache tier for the SAS disks to improve performance. By combining erasure coding, replication, SAS and SSD, storage with different performance characteristics can be built for different scenarios.

Ceph will then report: 1 cache pools are missing hit_sets. hit_set_count and hit_set_type still have to be set:

# ceph osd pool set test-bigdata-cache-tier hit_set_count 1
set pool 29 hit_set_count to 1

# ceph osd pool set test-bigdata-cache-tier hit_set_type bloom
set pool 29 hit_set_type to bloom
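The whole cache tier setup is six commands in a fixed order, so it is easy to script. The sketch below only builds the command lines shown above and prints them; nothing is executed. The function name cache_tier_commands is made up, and the PG count of 128 and the hit set values are simply the ones used in this post.

```python
import shlex


def cache_tier_commands(base_pool, cache_pool, pg_num=128):
    """Return the ceph command lines that put cache_pool in front of base_pool."""
    steps = [
        ["ceph", "osd", "pool", "create", cache_pool, str(pg_num)],
        ["ceph", "osd", "tier", "add", base_pool, cache_pool],
        ["ceph", "osd", "tier", "cache-mode", cache_pool, "writeback"],
        ["ceph", "osd", "tier", "set-overlay", base_pool, cache_pool],
        # Without hit sets ceph warns "cache pools are missing hit_sets".
        ["ceph", "osd", "pool", "set", cache_pool, "hit_set_count", "1"],
        ["ceph", "osd", "pool", "set", cache_pool, "hit_set_type", "bloom"],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    for line in cache_tier_commands("test-bigdata", "test-bigdata-cache-tier"):
        print(line)
```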

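Adding rules by editing the crushmap

Reference: https://phpor.net/blog/post/7080

Problems encountered in practice:

With 12 SAS disks evicting at 60 MB/s, the disks became very slow: each reached around 100 tps with reads and writes of about 20 MB/s. The painful part is that I have essentially no control over the evict rate and could only wait quietly for the eviction to finish.
While evicting, the cache tier was also promoting. The promote rate is at least controllable, but osd_tier_promote_max_bytes_sec defaults to 5242880 bytes (not particularly large). Question: the pool was no longer receiving writes, so why was it still evicting and promoting?
Try changing the cache-mode: in principle, once the cache-mode is set to proxy there should be no more evicts or promotes.

Sure enough, after the change ceph -s immediately stopped showing evicts and promotes 🙂
To check the cache-mode: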

ceph osd pool ls detail|grep cache_mode

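Original link: https://phpor.net/blog/post/7778

kvm ceph rbd

February 7, 2018

For each KVM virtual machine process, attaching N RBD devices results in N fn-radosclient threads, and each fn-radosclient thread holds only one connection to any given OSD. Consequently, when several pieces of data on one RBD device land on the same OSD, they cannot be written in parallel (my guess is that the RADOS protocol does not run multiple transactions concurrently over a single connection). Network latency therefore directly affects storage efficiency, especially for random reads and writes, and multithreading inside the guest does not help, because the KVM process has only one fn-radosclient thread.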

# top -p 2795 -b -n 1 -H|grep rados
 2816 qemu      20   0 5259308 3.403g   7356 S  0.0  3.6  15:01.80 fn-radosclient

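Original link: https://phpor.net/blog/post/7620

File systems: inodes

January 5, 2018

Background:

If 100 million small files totalling 20 TB are stored on an XFS file system, can it run out of inodes while plenty of storage space is still free?

Test:

df -i shows the number of free and used inodes. Generally, mkfs sets aside x% of the space for inodes. The number of available inodes limits the number of files, regardless of how much space those files occupy. For example:

A 500 GB disk formatted as XFS offers about 260 million inodes. Does a 1 GB disk formatted as XFS then offer 260 million / 500 ≈ 500,000 inodes? The test:

Indeed, 1 GB can hold about 520,000 files by default. Note that directories consume inodes too, and not all files can go into a single directory, so a real inode estimate has to include the directories. At 100 files per directory, 500,000 files need 5,000 directories (negligible). That is not quite the whole picture: putting 5,000 directories under one parent is not sensible either, so to keep every directory at 100 entries or fewer they have to be spread over 50 parent directories. Still, that is not many directories.

By the same formula: 20 TB / 1 GB × 500,000 ≈ 10 billion files, which is plenty.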
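The estimate is easy to check in code. The sketch below assumes, as above, roughly 500,000 inodes per GB on a default XFS format and at most 100 entries per directory; both function names are made up. On those assumptions a 20 TB volume offers on the order of 10 billion inodes, and the directories add only about 1% to the inode demand.

```python
import math


def estimated_inodes(capacity_gb, inodes_per_gb=500_000):
    """Rough inode capacity, extrapolated linearly from the 1 GB measurement."""
    return capacity_gb * inodes_per_gb


def inodes_needed(file_count, entries_per_dir=100):
    """Inodes for the files plus every directory level up to a single root.

    entries_per_dir must be greater than 1.
    """
    total = file_count
    entries = file_count
    while entries > 1:
        entries = math.ceil(entries / entries_per_dir)  # directories on this level
        total += entries
    return total


if __name__ == "__main__":
    print("inodes available on 20 TB:", estimated_inodes(20 * 1024))
    print("inodes needed for 500,000 files:", inodes_needed(500_000))
    print("inodes needed for 100 million files:", inodes_needed(100_000_000))
```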

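Of course, if that is still not enough, a larger inode count can be specified when formatting.

Calculation:

With 1 billion files spread over N directories such that no directory has more than 100 children (files or subdirectories), how many levels of subdirectories are needed?

100^x > 1 billion requires x >= 5, so 5 levels are enough.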
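The same calculation as a loop, for other file counts or fan-outs (levels_needed is a name I made up; it returns the smallest x for which entries_per_dir^x is at least the file count):

```python
def levels_needed(file_count, entries_per_dir=100):
    """Smallest x such that entries_per_dir ** x >= file_count."""
    levels = 0
    capacity = 1
    while capacity < file_count:
        capacity *= entries_per_dir
        levels += 1
    return levels


if __name__ == "__main__":
    print(levels_needed(10**9))        # 1 billion files, 100 entries per directory
    print(levels_needed(10**9, 1000))  # same files with 1000 entries per directory
```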

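In fact XFS is a fairly smart file system: it has no fixed-size inode area, and the total inode count changes as the disk fills up. Running out of inodes while plenty of storage space remains free basically does not happen.

Original link: https://phpor.net/blog/post/7379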