PHPor's Blog

Ceph: choose and chooseleaf

Syntax

step choose firstn $n type bucket

This selects $n distinct buckets of the given type. $n = 0 means as many buckets as the pool has replicas; a negative $n means the replica count minus |$n|, so $n = -1 selects one fewer than the replica count. The result is a set of buckets; if the bucket type is osd, the results are leaf nodes.

step chooseleaf firstn $n type bucket

This selects $n distinct buckets and then picks one osd from within each of them, which is equivalent to:

step choose firstn $n type bucket
step choose firstn 1 type osd

In this case chooseleaf is clearly much simpler than choose. Both forms say the same thing: use bucket as the failure domain and pick $n osds, one per bucket. This is also the most common case.

Case 2:

Suppose there are multiple rows, but all 3 copies should be stored under the same row, each in a different rack. That is written as:

step choose firstn 1 type row
step chooseleaf firstn 0 type rack

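By the expansion of chooseleaf given above, this is equivalent to:

step choose firstn 1 type row
step choose firstn 0 type rack
step choose firstn 1 type osd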

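Case 3:

Suppose there are 5 rows, but the two replicas should go specifically to row1 and row2 (every bucket has a name) rather than to any two rows, with rack as the failure domain when picking the osds. How is that written?

choose and chooseleaf alone cannot express this. Recall the take syntax: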

step take row1
step chooseleaf firstn 1 type rack
step emit

step take row2
step chooseleaf firstn -1 type rack
step emit
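
The replica arithmetic in this rule can be checked with a small shell sketch. The helper name resolve_firstn is hypothetical and not part of Ceph; it only mirrors how a firstn count is resolved against the pool's replica count (a positive value is used as-is, zero and negative values are added to the replica count).

```shell
#!/bin/bash
# resolve_firstn N REPLICAS
# Hypothetical helper (not a Ceph command): prints how many items a
# "firstn N" step selects for a pool with REPLICAS copies.
resolve_firstn() {
    local n=$1 replicas=$2
    if [ "$n" -gt 0 ]; then
        echo "$n"                    # positive: exactly N
    else
        echo $(( replicas + n ))     # 0: all replicas; -1: replicas - 1
    fi
}

# Case 3 with a 2-replica pool: row1 gets 1 copy, row2 gets the rest.
echo "row1: $(resolve_firstn 1 2)"
echo "row2: $(resolve_firstn -1 2)"
```

With a 3-replica pool the same rule would place 1 copy under row1 and 2 copies under row2.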

Reference:

http://www.xuxiaopang.com/2016/11/08/easy-ceph-CRUSH/

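Ceph: the number of objects per PG

Ceph raises a warning when the number of objects in a PG is far above the cluster average. Questions:

Why does this warning exist?
How many objects in a PG count as too many?
How can the PG count be changed online?

Ceph: PG warnings

Symptom:

Analysis:

Which pool is affected?
Cause: the number of objects per PG in some pool exceeds 10 times the cluster-wide average number of objects per PG. This does not necessarily mean that anything is wrong.
How to reproduce: as long as at least one PG in the cluster is heavily used, create a few more pools with large pg counts and the warning will appear.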
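
A worked example of that ratio, as a sketch only: the function name pg_object_skew and the sample numbers are made up for illustration, and it assumes the check compares a pool's average objects per PG with the cluster-wide average.

```shell
#!/bin/bash
# pg_object_skew POOL_OBJECTS POOL_PGS CLUSTER_OBJECTS CLUSTER_PGS
# Hypothetical helper: ratio of the pool's objects per PG to the
# cluster-wide objects per PG, printed with one decimal.
pg_object_skew() {
    awk -v po="$1" -v pp="$2" -v co="$3" -v cp="$4" \
        'BEGIN { printf "%.1f\n", (po / pp) / (co / cp) }'
}

# One busy pool: 1,000,000 objects in 64 PGs.
# Whole cluster: the same 1,000,000 objects, but 4160 PGs in total,
# because several nearly empty pools were created with large pg counts.
skew=$(pg_object_skew 1000000 64 1000000 4160)
echo "skew = $skew"

# Compare against the threshold of 10 mentioned in the text.
if awk -v s="$skew" 'BEGIN { exit !(s > 10) }'; then
    echo "warning would fire"
fi
```

The nearly empty pools pull the cluster average down, which is exactly the reproduction recipe described above.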

Fix:

Delete unused pools, or:

Adjust the parameter:

As follows (a restart is required):

# ceph daemon osd.2 config set mon_pg_warn_max_object_skew 20
{
 "success": "mon_pg_warn_max_object_skew = '20.000000' (not observed, change may require restart) rocksdb_separate_wal_dir = 'false' (not observed, change may require restart) "
}

Adjust the PG count of the pool that triggers the warning

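Thoughts:

If too many objects per PG triggers a warning, why not simply create every pool with a large pg count (say, 1024)? That does not work either: too many PGs will, in theory, hurt performance, and too many PGs on a single osd (more than mon_pg_warn_max_per_osd) triggers a warning of its own: http://blog.csdn.net/styshoo/article/details/62722679

To see the number of PGs on each osd: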

# ceph osd df

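However, the PG count on my osds already exceeds the configured limit of 300, and yet there is no warning.
Which option actually applies depends on the version, so check the documentation or code for your release. For example, the 12.2.1 release notes state: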
The maximum number of PGs per OSD before the monitor issues a warning has been reduced from 300 to 200 PGs. 200 is still twice the generally recommended target of 100 PGs per OSD. This limit can be adjusted via the mon_max_pg_per_osd option on the monitors. The older mon_pg_warn_max_per_osd option has been removed.

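Background:

Reading a large file with dd reaches 100+ MB/s, but reading it with cat only reaches 30 MB/s. Why?

A Ceph block device reads its data over the network, so read throughput depends directly on network performance, and just as directly on the size of each read:

# dd if=/data2/bigfile bs=100M count=20 iflag=direct |pv >/dev/null
20+0 records in 136MiB/s] [ <=> ]
20+0 records out
2097152000 bytes (2.1 GB) copied, 16.5394 s, 127 MB/s
1.95GiB 0:00:16 [ 120MiB/s]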

With a 100 MB block size, reads reach about 120 MB/s.

# dd if=/data2/bigfile bs=1M count=3000 iflag=direct |pv >/dev/null
3000+0 records in.2MiB/s] [ <=> ]
3000+0 records out
3145728000 bytes (3.1 GB) copied, 42.9703 s, 73.2 MB/s
2.93GiB 0:00:42 [69.8MiB/s]

With a 1 MB block size, reads reach about 70 MB/s.

# dd if=/data2/bigfile bs=64K count=3000 iflag=direct |pv >/dev/null
3000+0 records in.8MiB/s] [ <=> ]
3000+0 records out
196608000 bytes (197 MB) copied, 6.63725 s, 29.6 MB/s
 187MiB 0:00:06 [28.2MiB/s]

With a 64 KB block size, reads reach only about 30 MB/s, and cat happens to read exactly 64 KB per read call.

In this situation, if the machine has plenty of memory, one option is to read the file once with dd using a large block size so that it gets cached, and then run cat or similar operations on it.
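
A minimal sketch of that warm-up step. The helper name warm_cache is hypothetical, and the demo uses a small scratch file instead of the real /data2/bigfile. Note that the read must be buffered: the measurements above use iflag=direct, which bypasses the page cache and would not warm anything.

```shell
#!/bin/bash
# warm_cache FILE [BLOCK_SIZE]
# Hypothetical helper: read FILE once with large buffered reads so that
# its contents end up in the page cache. No iflag=direct here, since
# direct I/O bypasses the cache.
warm_cache() {
    local file=$1 bs=${2:-100M}
    dd if="$file" of=/dev/null bs="$bs" 2>/dev/null
}

# Demo on a small scratch file standing in for /data2/bigfile.
demo=$(mktemp)
dd if=/dev/zero of="$demo" bs=1M count=4 2>/dev/null
warm_cache "$demo" 4M
bytes=$(cat "$demo" | wc -c)   # later small reads can be served from memory
echo "read $bytes bytes"
rm -f "$demo"
```

This only helps when the file fits in free memory; otherwise the cached pages are evicted before cat gets to them.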

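Also: a larger I/O size is split into multiple I/O requests once it reaches the lower layers, so several requests are in flight at the same time, which effectively increases the queue depth.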

Ceph: erasure code operations

Creating an erasure-code rule from the command line

First, an erasure-code-profile is needed; the default profile can also be used. List the existing profiles:

# ceph osd erasure-code-profile ls
default

Show the details of a specific erasure-code-profile:

# ceph osd erasure-code-profile get default
k=2
m=1
plugin=jerasure
technique=reed_sol_van

Define a custom erasure-code-profile, here one that uses only HDDs:

# ceph osd erasure-code-profile set hdd-3-2 k=3 m=2 crush-device-class=hdd

Available options:
- crush-root: the name of the CRUSH node to place data under [default: default].
- crush-failure-domain (the failure domain): the CRUSH type to separate erasure-coded shards across [default: host].
- crush-device-class (the device class): the device class to place data on [default: none, meaning all devices are used].
- k and m (and, for the lrc plugin, l): these determine the number of erasure code shards, affecting the resulting CRUSH rule.

Create a CRUSH rule from the erasure-code-profile:

# ceph osd crush rule create-erasure erasure_hdd hdd-3-2
created rule erasure_hdd at 5

Inspect the CRUSH rule:

# ceph osd crush rule dump erasure_hdd
{
    "rule_id": 5,
    "rule_name": "erasure_hdd",
    "ruleset": 5,
    "type": 3,
    "min_size": 3,
    "max_size": 5,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

Create a pool that uses the erasure-code rule

# ceph osd pool create test-bigdata 256 256 erasure hdd-3-2 erasure_hdd
pool 'test-bigdata' created

Syntax: osd pool create <poolname> <int[0-]> {<int[0-]>} {replicated|erasure} [<erasure_code_profile>] {<rule>} {<int>}
Although the CRUSH rule was itself created from the erasure_code_profile, the profile still has to be given explicitly when creating the erasure-coded pool.
Reference: http://docs.ceph.com/docs/master/rados/operations/pools/

Tuning:

# ceph osd pool set test-bigdata fast_read 1
set pool 24 fast_read to 1

Currently fast_read only takes effect on erasure-coded pools.

To create rbd images in this pool, the following is also required:

Reference: http://docs.ceph.com/docs/master/rados/operations/erasure-code/

# ceph osd pool set test-bigdata allow_ec_overwrites true
set pool 24 allow_ec_overwrites to true

Create a replicated pool to act as the cache tier

# ceph osd pool create test-bigdata-cache-tier 128
pool 'test-bigdata-cache-tier' created

# ceph osd tier add test-bigdata test-bigdata-cache-tier
pool 'test-bigdata-cache-tier' is now (or already was) a tier of 'test-bigdata'

# ceph osd tier cache-mode test-bigdata-cache-tier writeback
set cache-mode for pool 'test-bigdata-cache-tier' to writeback

# ceph osd tier set-overlay test-bigdata test-bigdata-cache-tier
overlay for 'test-bigdata' is now (or already was) 'test-bigdata-cache-tier'

In fact, a cache tier is not only for erasure-coded pools; a replicated pool can have a cache tier as well. For example, with a batch of SSDs, a pool created on the SSDs can serve as a cache tier for the SAS disks to improve performance. Combining erasure coding, replication, SAS and SSD yields storage with a range of performance profiles for different scenarios.

Ceph then reports: 1 cache pools are missing hit_sets, so hit_set_count and hit_set_type also need to be set:

# ceph osd pool set test-bigdata-cache-tier hit_set_count 1
set pool 29 hit_set_count to 1

# ceph osd pool set test-bigdata-cache-tier hit_set_type bloom
set pool 29 hit_set_type to bloom

Adding a rule by editing the crushmap

Reference: https://phpor.net/blog/post/7080

Problems encountered in practice:

With 12 SAS disks evicting at 60 MB/s, the disks were already slow: each disk reached about 100 tps and about 20 MB/s of reads and writes. The annoying part is that there is essentially no way to throttle eviction, so all I could do was wait quietly for it to finish.
Promotion runs at the same time as eviction. The promote rate is controllable, although osd_tier_promote_max_bytes_sec defaults to 5242880 bytes (not especially large). Question: nothing is writing to the pool anymore, so why is it still evicting and promoting?
Try changing the cache-mode: in principle, once cache-mode is set to proxy, evict and promote should no longer occur.

Sure enough, right after the change ceph -s no longer showed any evict or promote activity 🙂
Check the cache-mode:

ceph osd pool ls detail|grep cache_mode

vsftp in docker

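When vsftpd runs inside a container whose IP is a private address internal to the host, can a client still download data in passive mode? Yes.

In passive mode vsftpd listens on ephemeral ports for data transfer, so the container must expose not only port 21 but also every ephemeral port vsftpd might listen on. To avoid mapping too many ports, restrict the range of passive ports in the vsftpd configuration file.
In passive mode vsftpd tells the client, inside the protocol, which port and IP address to connect to. The container's interface address is obviously not reachable by the client. Fortunately the vsftpd configuration has an option for the IP address to advertise to clients, and that address does not have to exist on the local machine; the option is clearly meant for situations like this.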
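
The two settings can be sketched as follows. pasv_enable, pasv_min_port, pasv_max_port and pasv_address are vsftpd.conf options; the helper name render_pasv_conf, the port range, the address and the image name are all assumptions chosen for illustration.

```shell
#!/bin/bash
# render_pasv_conf MIN_PORT MAX_PORT PUBLIC_IP
# Hypothetical helper: prints the passive-mode lines for vsftpd.conf.
render_pasv_conf() {
    printf 'pasv_enable=YES\n'
    printf 'pasv_min_port=%s\n' "$1"
    printf 'pasv_max_port=%s\n' "$2"
    printf 'pasv_address=%s\n' "$3"
}

# Assumed values: ten passive ports and a documentation-range address
# standing in for the host IP that clients can actually reach.
min=30000 max=30009 host_ip=203.0.113.10
render_pasv_conf "$min" "$max" "$host_ip"

# The matching port mappings; printed rather than executed, and the
# image name my-vsftpd-image is a placeholder.
echo "docker run -d -p 21:21 -p $min-$max:$min-$max my-vsftpd-image"
```

Keeping the passive range small keeps the number of published ports small, at the cost of limiting how many data connections can be open at once.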

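Each KVM virtual machine process that attaches N rbd devices has N fn-radosclient threads, and each fn-radosclient thread holds only one connection to any given osd. So if several pieces of data on one rbd device land on the same osd, they cannot be written in parallel (my guess is that the rados protocol does not run multiple transactions at once on a single connection). As a result, especially for random reads and writes, network latency directly affects storage efficiency, and multithreading inside the virtual machine does not help, because there is only one fn-radosclient in the kvm process.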

# top -p 2795 -b -n 1 -H|grep rados
 2816 qemu      20   0 5259308 3.403g   7356 S  0.0  3.6  15:01.80 fn-radosclient

Bash: <($cmd)

<($cmd) can produce a coroutine-like effect, for example:

# cat <(while :; do echo 1; sleep 1; done )
1
1
1
1
1
1
^C

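The effect is similar to using a pipe, but when a program cannot read from standard input this is a good alternative. It can also be used to redirect standard input:

# cat < <(while :; do echo 1; sleep 1; done )

The two forms differ by a single <. In the second, the shell itself redirects standard input. In the first there is no redirection of standard input; cat simply opens, as it would an ordinary file, the path that bash generates for the substitution (a /dev/fd entry or a named pipe rather than a real temporary file).

This syntax is called Process Substitution.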
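
Two further uses, as a small sketch that requires bash (process substitution is not available in plain POSIX sh). The sample inputs are made up for illustration.

```shell
#!/bin/bash
# The substitution expands to a path that the command opens like a file.
echo <(true)

# diff only accepts file arguments, so feed it two command outputs.
if diff <(printf '%s\n' b a | sort) <(printf '%s\n' a b) >/dev/null; then
    verdict=same
else
    verdict=different
fi
echo "$verdict"

# Redirecting stdin from a substitution keeps the loop in the current
# shell, so count is still set after the loop (a pipe would lose it).
count=0
while read -r _; do
    count=$((count + 1))
done < <(printf '1\n2\n3\n')
echo "$count"
```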

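OSS file analysis

Distribution of file sizes
Last-modified time per directory (to tell whether it has long been unused)
Total capacity per directory

Script:

#!/bin/bash
# Analyze every subdirectory under the target

target=$1

time_start=$(date +%s)
# Keep only the oss:// entries from the directory listing
entry=$(ossutil64 ls -d "$target" | grep "^oss://")
printf "%-32s%16s%16s%16s%16s%16s%10s%9s%20s\n" "DIR" "<100KB" "<300KB" "<1MB" "<5MB" ">5MB" "All" "Capcity" "lastmodify"
for e in $entry; do
	# Summarize one subdirectory with fenxi.awk
	ossutil64 ls "$e" | awk -v e="$e" -f fenxi.awk
done

time_end=$(date +%s)
time_use=$((time_end - time_start))
echo
echo "elapsed time: $(( time_use / 3600 )) hours $(( time_use % 3600 / 60 )) mins $(( time_use % 60 )) s"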

# fenxi.awk: size histogram, total capacity and newest mtime of one directory

# Format a byte count with a binary unit suffix
function format_capcity(num) {
    KB=1024;
    MB=KB*1024;
    GB=MB*1024;
    TB=GB*1024;
    if(num > TB) return sprintf("%6.2fTB", num/TB);
    if(num > GB) return sprintf("%6.2fGB", num/GB);
    if(num > MB) return sprintf("%6.2fMB", num/MB);
    if(num > KB) return sprintf("%6.2fKB", num/KB);
    return sprintf("%6d", num);
}

BEGIN{
    k100=0;k300=0;m1=0;m5=0;large=0;all=0;lastmodify="";capcity_all=0;
}

# Object lines only: $5 is the size in bytes, $1 and $2 the date and time
NR>1 && $8 ~ /^oss:/ {
    all++;
    capcity_all+=$5;
    if($5 < 102400) k100++;
    else if($5 < 1024*300) k300++;
    else if($5 < 1024*1024) {m1++;}
    else if ($5 < 1024 * 1024 * 5) {m5++;}
    else large++;
    # ISO-style timestamps compare correctly as strings
    if (lastmodify < $1"T"$2) lastmodify=$1"T"$2;
}

END{
    if (all == 0) exit;   # empty directory: avoid dividing by zero
    gsub("oss://[^/]+", "", e)
    printf("%-32s%9s(%4.1f%%)%9s(%4.1f%%)%9s(%4.1f%%)%9s(%4.1f%%)%9s(%4.1f%%)%10s%9s%20s\n", e, k100, k100*100/all,k300, k300*100/all,m1, m1*100/all,m5, m5*100/all,large, large*100/all, all, format_capcity(capcity_all), lastmodify);
}

Output of the command:

DIR                                       <100KB          <300KB            <1MB            <5MB            >5MB       All  Capcity          lastmodify
/dir1/                               2602(16.8%)     7584(49.0%)     3752(24.2%)     1548(10.0%)        2( 0.0%)     15488  11.34GB 2016-06-07T16:15:10
/dir2/                                 26( 1.2%)        0( 0.0%)        1( 0.0%)       61( 2.9%)     2034(95.9%)      2122  26.09TB 2018-01-26T18:08:38
/dir3/                                  1(10.0%)        1(10.0%)        2(20.0%)        2(20.0%)        4(40.0%)        10 235.13MB 2016-02-22T15:03:58
/dir4/                                  0( 0.0%)        0( 0.0%)        0( 0.0%)        0( 0.0%)        2(100.0%)         2 396.21MB 2017-03-09T10:17:27

elapsed time: 0 hours 0 mins 3 s

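Note:

To paste the result into Excel with space as the field delimiter, first remove the space that may follow the opening parenthesis of a percentage:

sed 's/( /(/g'

awk points covered:

User-defined functions
Using the built-in gsub function
The printf formatting function and the sprintf function
The syntax for string concatenation in awk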
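
The four points can be seen together in a compact sketch. The function name to_kb and the sample input line are made up for illustration and are not part of fenxi.awk.

```shell
#!/bin/bash
result=$(printf '%s\n' 'oss://bucket/dir1/ 2048' | awk '
# User-defined function built on sprintf
function to_kb(bytes) {
    return sprintf("%.1fKB", bytes / 1024)
}
{
    path = $1
    gsub("oss://[^/]+", "", path)   # strip the scheme and bucket name
    print path ":" to_kb($2)        # juxtaposition concatenates strings
}')
echo "$result"
```

gsub modifies its third argument in place and returns the number of replacements, which is why the path is copied into a variable first.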

Bash: redirecting standard input, standard output and standard error

In general:

cat a.txt |wc -l

We all know what this means, or:

cat a.txt >/dev/null

But:

<a.txt >/dev/null cat

What on earth is this?

It simply puts the redirections in front of the command, and is equivalent to:

cat <a.txt >/dev/null

In either position, a redirection affects only a single command:

>/tmp/c echo a && echo b
b

This redirects only the output of echo a; the output of echo b is not affected.
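
To redirect both commands with a single redirection, they can be grouped. The sketch below contrasts the two forms; the variable names are arbitrary and scratch files from mktemp stand in for /tmp/c.

```shell
#!/bin/bash
first=$(mktemp)
both=$(mktemp)

# Prefix redirection applies to "echo a" only; "b" still goes to stdout.
>"$first" echo a && echo b

# Grouping the list makes one redirection cover both commands.
{ echo a && echo b; } >"$both"

first_content=$(cat "$first")
both_content=$(cat "$both")
rm -f "$first" "$both"

printf 'first: %s\n' "$first_content"
printf 'both: %s\n' "$both_content"
```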