PHPor's Blog

About nsenter

# nsenter  -h

Usage:
 nsenter [options] <program> [<argument>...]

Run a program with namespaces of other processes.

Options:
 -t, --target <pid>     target process to get namespaces from
 -m, --mount[=<file>]   enter mount namespace
 -u, --uts[=<file>]     enter UTS namespace (hostname etc)
 -i, --ipc[=<file>]     enter System V IPC namespace
 -n, --net[=<file>]     enter network namespace
 -p, --pid[=<file>]     enter pid namespace
 -U, --user[=<file>]    enter user namespace
 -S, --setuid <uid>     set uid in entered namespace
 -G, --setgid <gid>     set gid in entered namespace
     --preserve-credentials do not touch uids or gids
 -r, --root[=<dir>]     set the root directory
 -w, --wd[=<dir>]       set the working directory
 -F, --no-fork          do not fork before exec'ing <program>
 -Z, --follow-context   set SELinux context according to --target PID

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see nsenter(1).

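Judging from the help text, the -p option alone should be enough to enter the target process's PID namespace, in other words, to see only the processes in the target's namespace. The usage would be: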

nsenter -p -t $pid $cmd

In practice, however,

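nsenter -p -t 6806 top -n 1

shows all the processes of the namespace nsenter itself was started in (strictly speaking, even that description is not quite accurate). Why?

Because top gets its data from the /proc filesystem, entering the corresponding mount namespace matters as well. The correct invocation is: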

nsenter -m -p -t 6806 top -n 1
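
To check which PID namespace a process belongs to without relying on what top displays, one can compare the namespace links under /proc. The following is only a sketch: the helper names pid_ns_of and same_pid_ns are made up for this example, and it assumes a Linux /proc that exposes /proc/<pid>/ns/pid (reading another user's process usually requires root).

```shell
#!/bin/bash
# Print the PID namespace identifier of a process, e.g. "pid:[4026531836]".
# Assumes Linux with /proc/<pid>/ns/pid available.
pid_ns_of() {
    readlink "/proc/$1/ns/pid"
}

# Succeed when the two given processes share the same PID namespace.
same_pid_ns() {
    [ "$(pid_ns_of "$1")" = "$(pid_ns_of "$2")" ]
}

# Show the namespace of the current shell.
pid_ns_of "$$" || echo "no /proc namespace links available here"
```

For example, same_pid_ns $$ 6806 reports whether the current shell and process 6806 share a PID namespace.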

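Parsing the PRI part of the syslog protocol

A syslog message begins with a number enclosed in angle brackets, for example: <182>

The number ranges from 0 to 255, so it fits in one byte (with the 24 facilities defined by RFC 3164, the largest valid value is 191). It contains two parts:

The low three bits (0 to 7) are called the Severity: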

        Numerical         Severity
          Code

           0       Emergency: system is unusable
           1       Alert: action must be taken immediately
           2       Critical: critical conditions
           3       Error: error conditions
           4       Warning: warning conditions
           5       Notice: normal but significant condition
           6       Informational: informational messages
           7       Debug: debug-level messages

The high five bits (after shifting right by 3; 0 to 31) are called the Facility:

       Numerical             Facility
          Code

           0             kernel messages
           1             user-level messages
           2             mail system
           3             system daemons
           4             security/authorization messages (note 1)
           5             messages generated internally by syslogd
           6             line printer subsystem
           7             network news subsystem
           8             UUCP subsystem
           9             clock daemon (note 2)
          10             security/authorization messages (note 1)
          11             FTP daemon
          12             NTP subsystem
          13             log audit (note 1)
          14             log alert (note 1)
          15             clock daemon (note 2)
          16             local use 0  (local0)
          17             local use 1  (local1)
          18             local use 2  (local2)
          19             local use 3  (local3)
          20             local use 4  (local4)
          21             local use 5  (local5)
          22             local use 6  (local6)
          23             local use 7  (local7)

To recover the two parts from the number in the angle brackets, taking 182 as an example:

lijunjiedeMacBook-Pro:~ phpor$ php -r 'echo 182 & 0x07;'
6
lijunjiedeMacBook-Pro:~ phpor$ php -r 'echo (182 >> 3) & 0x1f;'
22

That is: facility local6, severity Informational.
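
The same decoding can be done with shell arithmetic alone. A minimal sketch, where decode_pri and encode_pri are names invented for this example:

```shell
#!/bin/bash
# Split a syslog PRI value into "facility severity" (RFC 3164, section 4.1.1).
decode_pri() {
    local pri=$1
    local facility=$(( pri >> 3 ))   # high five bits
    local severity=$(( pri & 7 ))    # low three bits
    echo "$facility $severity"
}

# Inverse operation: PRI = facility * 8 + severity.
encode_pri() {
    echo $(( ($1 << 3) | $2 ))
}

decode_pri 182      # facility 22 (local6), severity 6 (Informational)
encode_pri 22 6
```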

Reference: https://tools.ietf.org/html/rfc3164#section-4.1.1

Storage protocol stack

Reference: http://brasstacksblog.typepad.com/brass-tacks/

http://brasstacksblog.typepad.com/brass-tacks/2016/02/index.html

diff by size

# diff <(cd /data1/weedfs/volume4.2; find . -type f -printf "%p %s\n" ) <(cd /mnt/weedfs/volume4.2; find . -type f -printf "%p %s\n" )

Reference: http://stackoverflow.com/questions/11087244/compare-2-folders-and-find-files-with-differing-byte-counts
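
One caveat with the one-liner: find does not guarantee the same output order in two different trees, so identical trees can still produce a diff. The sketch below sorts each listing first. The helper names size_listing and diff_by_size are hypothetical, and the code assumes bash (for process substitution) and GNU find (for -printf).

```shell
#!/bin/bash
# Print "path size" for every regular file under a directory, sorted by path.
size_listing() {
    (cd "$1" && find . -type f -printf '%p %s\n' | sort)
}

# Compare two directory trees by file name and byte count only.
# Exit status follows diff: 0 when the listings match, 1 when they differ.
diff_by_size() {
    diff <(size_listing "$1") <(size_listing "$2")
}

# Small demonstration on throwaway directories.
a=$(mktemp -d); b=$(mktemp -d)
printf 'abc' > "$a/f"
printf 'abcdef' > "$b/f"
diff_by_size "$a" "$b" || echo "sizes differ"
rm -rf "$a" "$b"
```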

The process list is run asynchronously, and its input or output appears as a filename. This filename is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list. Note that no space may appear between the < or > and the left parenthesis, otherwise the construct would be interpreted as a redirection. Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files.

When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

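In zsh the corresponding construct is written =(list), which substitutes the name of a temporary file rather than a pipe; zsh also accepts <(list).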

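Note 1:

The file produced here is a pipe, and a pipe can be read only once; it cannot be used like a regular file. For example:

Here cat is called twice, but hello is printed only once.

Why put such a simple test inside a function? Because it does not work outside one: at the top level bash apparently closes the substitution's file descriptor as soon as the assignment finishes, whereas inside a function it stays open until the function returns.

Script: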

#!/bin/bash
function hello() {
	f=<(echo hello)
	cat $f
	cat $f
}
hello
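
If the output has to be read more than once, one workaround is to store it in a temporary regular file instead of a pipe. A sketch, with read_twice as a hypothetical helper name:

```shell
#!/bin/bash
# Run a command, keep its output in a temporary regular file, and read that
# file twice. Unlike the pipe behind <(...), a regular file can be re-read.
read_twice() {
    local tmp
    tmp=$(mktemp) || return 1
    "$@" > "$tmp"
    cat "$tmp"
    cat "$tmp"
    rm -f "$tmp"
}

read_twice echo hello   # prints hello two times
```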

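Note 2:

Process substitution works in bash but not in sh. At least with bash 4.2.46, if the bash binary is renamed (mv) to sh, it no longer understands process substitution; likewise, if sh is a symlink to bash, running the script above with sh fails with an error. Bash invoked under the name sh runs in POSIX mode, and in this version POSIX mode disables process substitution.

References: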

https://unix.stackexchange.com/questions/62140/filesize-difference-of-same-name-folders

http://tiswww.case.edu/php/chet/bash/bashref.html#Process-Substitution

About the page cache

http://mp.weixin.qq.com/s/qIYbLiOpi8PuLGU0w709Cg

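ceph osd operations

ceph osd down $id: marks osd $id as down so that it is no longer accessed; the process is not actually stopped (does it still take part in hashing?). A daemon that is still running reports itself up again, so ceph osd tree may still show the OSD as up.

ceph osd out $id: sets the OSD's weight (the reweight value) to 0 so that it is no longer accessed (it no longer takes part in hashing?)

ceph osd lost $id: marks the OSD as permanently lost, so the cluster gives up on any data that existed only on it. This operation is dangerous and requires an explicit --yes-i-really-mean-it, for example:

ceph osd rm $id: removes the OSD from the cluster. Before an OSD can be removed its process must be stopped; merely marking it down (ceph osd down $id) is not enough, for example: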

# /bin/ceph osd rm 3
Error EBUSY: osd.3 is still up; must be down before removal.

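Stop the process of the given OSD:

After rm alone, the OSD still shows up in ceph osd tree, as follows:

It has to be removed from the CRUSH map:

Even then the removal is not clean; auth, for instance, still holds related information:

Delete it:

List all OSDs: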
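
The commands for the steps above did not survive in this post. As a rough sketch only, the sequence could look like the one printed by the hypothetical helper osd_removal_plan below. It only prints commands and runs nothing. The systemd unit name ceph-osd@<id> is an assumption about how the OSD daemon is managed; the ceph subcommands shown (osd rm, osd crush remove, auth del, osd ls) are standard ones, but they are not necessarily the exact commands the post used.

```shell
#!/bin/bash
# Print (do not execute) a possible command sequence for removing one OSD,
# in the same order as the steps described in the text.
# Assumption: the OSD daemon runs as the systemd unit ceph-osd@<id>.
osd_removal_plan() {
    local id=$1
    cat <<EOF
systemctl stop ceph-osd@${id}
ceph osd rm ${id}
ceph osd crush remove osd.${id}
ceph auth del osd.${id}
ceph osd ls
EOF
}

osd_removal_plan 3
```

Printing the plan first makes it easy to review each step before running anything against a live cluster.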

ControlGroupInterface

https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

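docker info: cgroupdriver and runc

About cgroupdriver:

Docker's default cgroupdriver is cgroupfs; systemd can be selected manually (through an environment variable, a command-line argument, or daemon.json). Some docker RPM packages set the cgroupdriver to systemd on the command line in /usr/lib/systemd/system/docker.service. The currently supported cgroupdrivers are cgroupfs and systemd. To set it on the command line: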

--exec-opt native.cgroupdriver=systemd

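Although there is no option for choosing the cgroupdriver when creating a container, changing the cgroupdriver and restarting the daemon makes it possible for containers under the same daemon to use different cgroupdrivers (this is only a thought experiment for understanding the implementation; there seems to be no practical need for it).

Whether a container uses cgroupfs or systemd as its native.cgroupdriver is not configured on the container; it depends on how dockerd was configured when the container was started. If dockerd is configured with cgroupfs, containers start with cgroupfs. If several containers were started while dockerd was configured with cgroupfs and the configuration is then changed to systemd, containers started from then on use systemd, and containers of both cgroupdriver types coexist. To switch a container's cgroupdriver, change the dockerd configuration and then restart the container.

Some articles have wondered why some docker containers have their cgroups under a common docker directory while others appear as docker-xxxx.scope. The reason is that the former are implemented through cgroupfs and the latter through systemd. In addition, the docker daemon accepts a --cgroup-parent option that specifies a parent cgroup (cgroups are hierarchical); the default is /docker with cgroupfs and system.slice with systemd. The cgroup-parent takes a different form depending on the cgroupdriver: with cgroupfs it is a parent directory, with systemd it is a slice, and the scope name inside it carries the docker- prefix (see https://docs.docker.com/engine/reference/commandline/dockerd/#default-cgroup-parent ). Each container can also have its own --cgroup-parent, which seems handy for grouping containers.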
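
The two layouts can be summarised in a few lines of shell. This is only a sketch: container_cgroup_path is a hypothetical helper, it assumes the default cgroup parents (/docker for cgroupfs, system.slice for systemd), and the path it prints is relative to a controller mount such as /sys/fs/cgroup/memory.

```shell
#!/bin/bash
# Print where a container's cgroup is expected to live for a given driver,
# assuming the default --cgroup-parent of each driver.
container_cgroup_path() {
    local driver=$1 id=$2
    case "$driver" in
        cgroupfs) echo "/docker/$id" ;;
        systemd)  echo "/system.slice/docker-$id.scope" ;;
        *) echo "unknown cgroup driver: $driver" >&2; return 1 ;;
    esac
}

container_cgroup_path cgroupfs 0123abcd
container_cgroup_path systemd 0123abcd
```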

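About runc:

The default runtime here is docker-runc. A runtime is simply an executable binary; what docker info shows is only the name given in the configuration, and the configuration specifies which binary that name maps to. In other words, without touching the configuration you can change the runtime by replacing the corresponding binary, and the change takes effect once the container is restarted. A runtime can also be specified when creating a container, using the name defined in the daemon configuration, so each container can have a different runtime.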

# docker info
Containers: 162
 Running: 56
 Paused: 0
 Stopped: 106
Images: 161
Server Version: 1.12.5
Storage Driver: devicemapper
 Pool Name: data-docker_thinpool
 Pool Blocksize: 524.3 kB
 Base Device Size: 32.21 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 469.4 GB
 Data Space Total: 644.2 GB
 Data Space Available: 174.9 GB
 Metadata Space Used: 73.79 MB
 Metadata Space Total: 16.98 GB
 Metadata Space Available: 16.9 GB
 Thin Pool Minimum Free Space: 64.42 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: null overlay host bridge
Swarm: inactive
Runtimes: runc docker-runc
Default Runtime: docker-runc
Security Options: seccomp
Kernel Version: 3.10.0-514.6.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 40
Total Memory: 94.14 GiB
Name: VM-2-10-12
ID: NYBE:NZML:4KQQ:PF2J:RXCB:IPPI:Y3BI:CY7E:RVAC:WVWV:VDM2:3EEK
Docker Root Dir: /data1/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 docker-registry.i.bbtfax.com:5000
 127.0.0.0/8
Registries: docker.io (secure)
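
To pull a single value such as the cgroup driver or the default runtime out of this output, a small filter is enough. A sketch, where docker_info_field is a name invented for this example; it reads docker info text on standard input and matches only top-level (unindented) fields:

```shell
#!/bin/bash
# Print the value of one top-level "Name: value" line from `docker info`
# output supplied on standard input.
docker_info_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# Demonstration with two lines taken from the output above.
printf 'Cgroup Driver: systemd\nDefault Runtime: docker-runc\n' |
    docker_info_field 'Cgroup Driver'
```

On a host with docker installed the same helper could be used as: docker info | docker_info_field 'Default Runtime'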

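Disk I/O scheduling algorithms

Reference: http://blog.csdn.net/theorytree/article/details/6259104

Note: this shows all supported schedulers; the one in square brackets is the one currently active.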
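
The bracket convention is easy to parse. A sketch, assuming the input is the content of a file such as /sys/block/<device>/queue/scheduler; the helper name active_scheduler and the sample strings are illustrative only:

```shell
#!/bin/bash
# Given the content of a queue/scheduler file, print the active scheduler:
# the bracketed entry when there is one, otherwise the content itself
# (for example a bare "none").
active_scheduler() {
    case "$1" in
        *\[*\]*) echo "$1" | sed 's/.*\[\(.*\)\].*/\1/' ;;
        *)       echo "$1" ;;
    esac
}

active_scheduler 'noop [deadline] cfq'
active_scheduler 'none'
```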

List the scheduling algorithms supported by the current system:

# dmesg | grep -i scheduler
[ 0.605910] io scheduler noop registered
[ 0.605916] io scheduler deadline registered (default)
[ 0.605974] io scheduler cfq registered

Could the scheduler also depend on the device itself? Judging from the following, Alibaba Cloud disks support no scheduler at all:

# cat /sys/block/xvda/queue/scheduler
none

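However:

It is worth noting that the Anticipatory scheduler was removed as of Linux 2.6.33, because CFQ can be configured to achieve the same effect.

Further reading showed that a scheduler of 'none' has nothing to do with Alibaba Cloud virtual machines, nothing to do with Alibaba Cloud disks, and nothing (directly) to do with the OS version; it depends only on the kernel version. The Linux kernel introduced the blk-mq queueing mechanism in 3.13 and completed it in 3.16. The non-'none' cases above were all on kernels 3.10 or earlier, while the 'none' case is on a kernel that I upgraded manually to 4.4.61.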

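How can one verify whether blk-mq is enabled? Check whether an mq directory exists, as follows:

The directory does not exist, so the mechanism is not enabled.

The mq directory exists, so blk-mq is in use.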
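
The check can be wrapped in a small helper. This sketch assumes the mq directory sits at /sys/block/<device>/mq; the function name uses_blk_mq and its optional second argument (an alternative sysfs root, useful for testing) are made up for this example.

```shell
#!/bin/bash
# Succeed when the given block device has an mq directory in sysfs,
# which indicates that it is driven through blk-mq.
# $1: device name such as xvda; $2: sysfs root (defaults to /sys).
uses_blk_mq() {
    local dev=$1
    local sysfs=${2:-/sys}
    [ -d "$sysfs/block/$dev/mq" ]
}

if uses_blk_mq xvda; then
    echo "xvda uses blk-mq"
else
    echo "xvda does not use blk-mq (or does not exist)"
fi
```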

Reference: https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)

Reference: http://www.cnblogs.com/cobbliu/p/5389556.html