Install xl2tpd and its dependencies with opkg (a minimal sketch follows), then add the configuration below to /etc/config/network, replacing *** with the username and password of the campus unified-authentication account. Once done, an interface named nju appears under Network → Interfaces and connects automatically at boot.
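A minimal install sketch, assuming the default OpenWrt package feeds are reachable:
opkg update
opkg install xl2tpd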
config interface 'nju'
    option proto 'l2tp'
    option server '202.119.36.101'
    option username '***'
    option password '***'
    option ipv6 '0'
    option mtu '1452'
    option defaultroute '0'
    option delegate '0'
    option ip4table 'main'

config route
    option interface 'nju'
    option target '114.212.0.0/16'
    option mtu '1452'

config route
    option interface 'nju'
    option target '180.209.0.0/20'
    option mtu '1452'

config route
    option interface 'nju'
    option target '202.119.32.0/19'
    option mtu '1452'

config route
    option interface 'nju'
    option target '202.127.247.0/24'
    option mtu '1452'

config route
    option interface 'nju'
    option target '202.38.126.160/28'
    option mtu '1452'

config route
    option interface 'nju'
    option target '202.38.2.0/23'
    option mtu '1452'

config route
    option interface 'nju'
    option target '210.28.128.0/20'
    option mtu '1452'

config route
    option interface 'nju'
    option target '210.29.240.0/20'
    option mtu '1452'

config route
    option interface 'nju'
    option target '219.219.112.0/20'
    option mtu '1452'

config route
    option interface 'nju'
    option target '58.192.32.0/20'
    option mtu '1452'

config route
    option interface 'nju'
    option target '58.192.48.0/21'
    option mtu '1452'

config route
    option interface 'nju'
    option target '58.193.224.0/19'
    option mtu '1452'

config route
    option interface 'nju'
    option target '58.195.80.0/20'
    option mtu '1452'
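After saving the configuration, applying it without a reboot should work like this (a sketch):
/etc/init.d/network reload   # or bring up just this interface: ifup nju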
The upgrade to the X20 was worth it; at least cleaning is much more convenient now. The wash capacity feels smaller, but that just means washing every day, and since I bought the extended warranty I might as well work it hard.
Insert an additional nginx configuration file by editing gitlab.rb (or via an environment variable):
nginx['custom_nginx_config'] = "include /etc/gitlab/nginx-default.conf;"
The nginx-default.conf file contains the following:
server {
    listen 80 default_server;
    listen [::]:80 default_server;
    listen 443 default_server ssl;
    listen [::]:443 default_server ssl;
    http2 on;
    server_name _;
    server_tokens off;
    ssl_reject_handshake on;
    return 444;
}
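After writing /etc/gitlab/nginx-default.conf, a reconfigure applies the change; a sketch (run on the GitLab host itself):
sudo gitlab-ctl reconfigure
# requests that only match the catch-all now get no response at all, so curl reports an empty reply:
curl -sI http://127.0.0.1/ || echo "connection closed, as intended"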
DSS-G uses a component database (compDB) to describe hardware components such as the rack, the DSS-G servers, and the JBODs. The compDB is not required to use DSS-G, but health monitoring needs it.
Step 1: generate the component database
Always start with a dry run when generating the compDB; here dssg01 is the node class containing the two DSS-G servers (created during dssgmkstorage).
[root@dss01 ~]# dssgmkcompdb --racktype RACK42U --verbose --dryrun -N dssg01
DSS-G 5.0c
Parsing options: --racktype RACK42U --verbose --dryrun -N dssg01 --
Entering verbose mode
Entering dry run mode
Using provided nodeclass or node list
dss01
dss02
Checking whether all nodes belong to the same cluster...
Using rack type "RACK42U" of height 42U
Checking server model...
…………
# DSS-G210-1
Setting componentDB...
Note: all commands below support the --dry-run option
DRYRUN: mmaddcomp -F /root/yaoge123gpfs.io01-comp.2025-03-15.233334.851103.stanza --replace
DRYRUN: mmchcomploc -F /root/yaoge123gpfs.io01-comploc.2025-03-15.233334.851103.stanza
DRYRUN: /root/yaoge123gpfs.io01-dispid.2025-03-15.233334.851103.sh
All done
Step 2: check the contents of the two stanza files; the exact positions of the components in the rack are in the -comploc..stanza file.
Step 3: import the component database by running, verbatim, the three commands after DRYRUN at the end of the dssgmkcompdb output, then verify the result with mmlscomp and mmlscomploc.
[root@dss01 ~]# mmlscomp
Rack Components
Comp ID Part Number Serial Number Name
------- ----------- ------------- ----
6 RACK42U ******* A1
Server Components
Comp ID Part Number Serial Number Name Node Number
------- ----------- ------------- ---------------- -----------
9 7D9ECTOLWW ******** SR655V3-******** 101
10 7D9ECTOLWW ******** SR655V3-******** 102
Storage Enclosure Components
Comp ID Part Number Serial Number Name Display ID
------- ----------- ------------- -------------- ----------
8 7DAHCT0LWW ******** D4390-********
Storage Server Components
Comp ID Part Number Serial Number Name
------- ----------- ------------- ----------
7 DSS-G210 DSS-G210-1
[root@dss01 ~]# mmlscomploc
Rack Location Component
---- -------- ----------------
A1 U35-36 SR655V3-********
A1 U33-34 SR655V3-********
A1 U01-04 D4390-********
Storage Server Index Component
-------------- ----- ----------------
DSS-G210-1 3 SR655V3-********
DSS-G210-1 2 SR655V3-********
DSS-G210-1 1 D4390-********
Because of the built-in redundancy of DSS-G (node redundancy, path redundancy, and so on), a failure does not necessarily take the whole file system down, which often means failures go unnoticed. But such failures erode the redundancy or degrade performance, so administrators should know about them and fix them promptly. dssghealthmon periodically (every hour by default) checks the health of the entire DSS-G and sends an email notification when it finds a fault. dssghealthmon can check hardware health out-of-band through the Confluent node by talking to the XCC (the servers' BMC), so the DSS-G servers need passwordless SSH to the Confluent node.
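A sketch of setting up that passwordless SSH, run on each DSS-G server (confluent is the management host name used in the examples below):
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519   # skip if a key already exists
ssh-copy-id root@confluent
ssh root@confluent hostname   # should return without prompting for a password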
Step 1: edit the configuration file /etc/dssg/dssghealthmon.conf
Step 2: look up the DSS-G node class name, then start the monitor
[root@dss01 ~]# mmlsnodeclass
Node Class Name Members
--------------------- -----------------------------------------------------------
dssg01 dss01,dss02
[root@dss01 ~]# dssghealthmon_startup dssg01 confluent
Obtaining Confluent version from management server confluent
3.11.1
Processing nodeclass...
Node Class Name Members
--------------------- -----------------------------------------------------------
dssg01 dss01,dss02
Parsing configuration file...
Copying configuration file...
Creating tuple file...
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********
Copying tuple file...
Setting or replacing the dssghealthmon_erflist cronjob...
Warning: dssghealthmon_erflist cronjob has NOT been specified
Creating and copying daemon environment file...
Starting dssghealthmond...
The dssghealthmon system has been successfully started
Step 3: check the monitor status and the configuration files on both servers
[root@dss01 ~]# dssghealthmon_status dssg01
Processing nodeclass...
Node Class Name Members
--------------------- -----------------------------------------------------------
dssg01 dss01,dss02
Obtaining status of the DSS-G health monitor...
dss01: active
dss02: active
[root@dss01 ~]# systemctl status dssghealthmond.service
● dssghealthmond.service - DSS-G Health Monitor
Loaded: loaded (/etc/systemd/system/dssghealthmond.service; disabled; preset: disabled)
Active: active (running) since Sat 2025-04-05 18:52:43 CST; 1min 57s ago
Process: 1361859 ExecStart=/opt/lenovo/dss/dssghealthmon/dssghealthmond $NODECLASS $MGMT (code=exited, status=0/SUCCESS)
Main PID: 1361946 (dssghealthmond)
Tasks: 2 (limit: 2468168)
Memory: 2.3M
CPU: 34.410s
CGroup: /system.slice/dssghealthmond.service
├─1361946 /bin/bash /opt/lenovo/dss/dssghealthmon/dssghealthmond dssg01 confluent 1361859
└─1362172 sleep 3600
Apr 05 18:52:43 dss01 systemd[1]: Starting DSS-G Health Monitor...
Apr 05 18:52:43 dss01 systemd[1]: Started DSS-G Health Monitor.
[root@dss01 ~]# cat /etc/dssg/dssghealthmon.env
NODECLASS=dssg01
MGMT=confluent
[root@dss01 ~]# cat /etc/dssg/dssghealthmon.hosts
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********
[root@dss02 ~]# systemctl status dssghealthmond.service
● dssghealthmond.service - DSS-G Health Monitor
Loaded: loaded (/etc/systemd/system/dssghealthmond.service; disabled; preset: disabled)
Active: active (running) since Sat 2025-04-05 18:52:43 CST; 4min 52s ago
Process: 87519 ExecStart=/opt/lenovo/dss/dssghealthmon/dssghealthmond $NODECLASS $MGMT (code=exited, status=0/SUCCESS)
Main PID: 87601 (dssghealthmond)
Tasks: 2 (limit: 2468168)
Memory: 2.2M
CPU: 1.344s
CGroup: /system.slice/dssghealthmond.service
├─87601 /bin/bash /opt/lenovo/dss/dssghealthmon/dssghealthmond dssg01 confluent 87519
└─87823 sleep 3600
Apr 05 18:52:43 dss02 systemd[1]: Starting DSS-G Health Monitor...
Apr 05 18:52:43 dss02 systemd[1]: Started DSS-G Health Monitor.
[root@dss02 ~]# cat /etc/dssg/dssghealthmon.env
NODECLASS=dssg01
MGMT=confluent
[root@dss02 ~]# cat /etc/dssg/dssghealthmon.hosts
dss01:dss01:7D9ECTOLWW:********
dss02:dss02:7D9ECTOLWW:********
Step 4: configure and test outgoing mail on both servers, here using Tencent Exmail (QQ enterprise mail) as an example; note that the credentials in the mta= URL must be URL-encoded (see the sketch after the commands below).
[root@dss01 ~]# cat ~/.mailrc
set v15-compat
set mta=smtps://yaoge%40yaoge123.com:************@smtp.exmail.qq.com:465 smtp-auth=login
set from=cicam@nju.edu.cn
[root@dss01 ~]# echo "$HOSTNAME" | mailx -s "TEST" yaoge@yaoge123.com
[root@dss01 ~]# scp ~/.mailrc dss02:/root/
.mailrc
[root@dss02 ~]# echo "$HOSTNAME" | mailx -s "TEST" yaoge@yaoge123.com
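As noted above, special characters in the SMTP user name and password must be URL-encoded inside the mta= URL. One way to produce the encoding (a sketch; assumes python3 is available, and the address is only an example):
python3 -c 'import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1], safe=""))' 'yaoge@yaoge123.com'
# prints yaoge%40yaoge123.com, the form used in the mta= URL above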
Step 5: simulate a failure
[root@dss01 ~]# mmvdisk rg list
[root@dss01 ~]# mmvdisk pdisk list --rg dss02
# pick a pdisk and simulate its death
[root@dss01 ~]# mmvdisk pdisk change --rg dss02 --pdisk e1s44 --simulate-dead
[root@dss01 ~]# mmvdisk pdisk list --rg dss02 --not-ok
declustered
recovery group pdisk array paths capacity free space FRU (type) state
-------------- ------------ ----------- ----- -------- ---------- --------------- -----
dss02 e1s44 DA1 0 20 TiB 1024 GiB 03LC215 simulatedDead/draining/replace
# as shown below, rebuilding has already started
[root@dss01 ~]# mmvdisk rg list --rg dss02 --all
needs user
recovery group node class active current or master server service vdisks remarks
-------------- ---------- ------- -------------------------------- ------- ------ -------
dss02 dssg01 yes dss02 yes 2
……
declustered needs vdisks pdisks capacity
array service type BER trim user log total spare rt total raw free raw background task
----------- ------- ---- ------- ---- ---- --- ----- ----- -- --------- -------- ---------------
NVR no NVR enable - 0 1 2 0 1 - - scrub 14d (66%)
SSD no SSD enable - 0 1 1 0 1 - - scrub 14d (20%)
DA1 yes HDD enable no 2 1 44 2 2 834 TiB 144 GiB rebuild-1r (8%)
……
vdisk RAID code disk group fault tolerance remarks
------------------ --------------- --------------------------------- -------
RG002LOGHOME 4WayReplication - rebuilding
RG002LOGTIP 2WayReplication 1 pdisk
RG002LOGTIPBACKUP Unreplicated 0 pdisk
RG002VS001 3WayReplication - rebuilding
RG002VS002 8+2p - rebuilding
Step 6: once the alert email arrives, bring the pdisk back
[root@dss01 ~]# mmvdisk pdisk change --rg dss02 --pdisk e1s44 --revive
[root@dss01 ~]# mmvdisk pdisk list --rg dss02 --not-ok
mmvdisk: All pdisks of recovery group 'dss02' are ok.
Step 7: check the logs
Monitoring systems gather data in one of two broad ways, pull or push; Prometheus is in the pull camp, scraping metrics that exporters expose over HTTP.
Node Exporter is the official Prometheus exporter for host (bare-metal) metrics; remember to change the version number to the latest release.
wget -q https://mirror.nju.edu.cn/github-release/prometheus/node_exporter/LatestRelease/node_exporter-1.9.0.linux-amd64.tar.gz
tar xf node_exporter-*
sudo mv node_exporter-*/node_exporter /usr/local/sbin/
sudo chown root:root /usr/local/sbin/node_exporter
rm -rf ./node_exporter-*
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Restart=always
ExecStart=/usr/local/sbin/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now node_exporter.service
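A quick sanity check that the exporter is up (node_exporter serves metrics on port 9100 by default):
curl -s http://localhost:9100/metrics | head -n 5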
DCGM-Exporter is NVIDIA's official GPU monitoring exporter, built on top of DCGM.
Install the NVIDIA Data Center GPU Manager (DCGM); adjust the OS version, architecture, and keyring package version as needed.
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
rm -f cuda-keyring_1.1-1_all.deb
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt-get update && sudo apt-get install datacenter-gpu-manager-4-cuda-all
systemctl enable --now nvidia-dcgm
Install Go and then build DCGM-Exporter; adjust the Go version as needed.
wget https://mirror.nju.edu.cn/golang/go1.24.1.linux-amd64.tar.gz
tar -C /usr/local -xzf go*.tar.gz
rm -f go*.tar.gz
export PATH=$PATH:/usr/local/go/bin
go env -w GO111MODULE=on
go env -w GOPROXY="https://repo.nju.edu.cn/go/,direct"
git clone https://github.com/NVIDIA/dcgm-exporter
cd dcgm-exporter
make binary
cp cmd/dcgm-exporter/dcgm-exporter /usr/local/sbin/
mkdir /usr/local/etc/dcgm-exporter
cp etc/* /usr/local/etc/dcgm-exporter/
cd ..
rm -rf dcgm-exporter
cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service
[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter --collectors /usr/local/etc/dcgm-exporter/default-counters.csv
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now dcgm-exporter.service
Note that the DCGM and DCGM-Exporter versions must match: the DCGM-Exporter version string has the form <DCGM version>-<exporter version>. Below, 4.1.1-4.0.4 means DCGM-Exporter itself is version 4.0.4 and it targets DCGM 4.1.1, which matches the installed dcgmi 4.1.1.
/usr/local/sbin/dcgm-exporter --version
2025/03/23 10:46:13 maxprocs: Leaving GOMAXPROCS=112: CPU quota undefined
DCGM Exporter version 4.1.1-4.0.4
(base) root@ubuntu:~# dcgmi --version
dcgmi version: 4.1.1
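A similar scrape check against the exporter itself (a sketch; dcgm-exporter listens on port 9400 by default, and DCGM_FI_DEV_GPU_UTIL is one of the default counters):
curl -s http://localhost:9400/metrics | grep -m 1 DCGM_FI_DEV_GPU_UTIL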
Get Docker and Docker Compose V2 installed first, then deploy Grafana and Prometheus as containers. For simplicity this skips putting a reverse proxy in front, adding authentication to Prometheus, and automatic service discovery/registration with Consul.
Create the directories first and change their owner/group, otherwise the containers will fail to start with permission errors.
mkdir grafana
mkdir prometheus
mkdir prometheus-conf
sudo chown 472:0 grafana/
sudo chown 65534:65534 prometheus
Edit docker-compose.yml:
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    restart: always
    volumes:
      - ./prometheus:/prometheus
      - ./prometheus-conf:/etc/prometheus/conf
    command:
      - --config.file=/etc/prometheus/conf/prometheus.yml
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --web.listen-address=0.0.0.0:9090
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.size=100GB # storage size limit; the oldest data is deleted automatically once it is exceeded
      - --storage.tsdb.wal-compression
  grafana:
    image: grafana/grafana
    container_name: grafana
    restart: always
    ports:
      - 3000:3000
    volumes:
      - ./grafana:/var/lib/grafana
    environment:
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
      - GF_SECURITY_ADMIN_PASSWORD=yaoge123 # initial admin password
      - GF_SERVER_ENABLE_GZIP=true
      - GF_SERVER_DOMAIN=192.168.1.10 # change if behind a reverse proxy
      - GF_SERVER_ROOT_URL=http://192.168.1.10 # change if behind a reverse proxy
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_NAME=yaoge123
      #- GF_SERVER_SERVE_FROM_SUB_PATH=true # when served under a sub-path behind a reverse proxy
      #- GF_SERVER_ROOT_URL=%(protocol)s://%(domain)s:%(http_port)s/grafana # when served under a sub-path behind a reverse proxy
      #- GF_SECURITY_COOKIE_SECURE=true # when HTTPS terminates at the reverse proxy
    depends_on:
      - prometheus
Fetch the default prometheus.yml:
cd prometheus-conf
wget https://raw.githubusercontent.com/prometheus/prometheus/refs/heads/main/documentation/examples/prometheus.yml
In prometheus.yml, the main thing to edit is the list of scrape targets.
# my global config
global:
  scrape_interval: 30s # scrape interval; the default is 1 minute
  evaluation_interval: 30s # evaluate rules every 30 seconds; the default is 1 minute
  scrape_timeout: 30s # scrape timeout; some exporters respond slowly and need more than the 10s global default

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
        # The label name is added as a label `label_name=<label_value>` to any timeseries scraped from this config.
        labels:
          app: "prometheus"
  - job_name: node_exporter
    static_configs:
      - targets:
          - '192.168.1.101:9100'
          - '192.168.1.102:9100'
  - job_name: dcgm-exporter
    static_configs:
      - targets:
          - '192.168.1.101:9400'
          - '192.168.1.102:9400'
Bring the containers up: cd .. && docker compose up -d
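To confirm the stack is healthy (a sketch; /-/healthy is Prometheus' built-in health endpoint):
docker compose ps
curl -s http://localhost:9090/-/healthy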
What remains is configuring Grafana: add Prometheus (reachable as http://prometheus:9090 inside the compose network) as a data source and import or build dashboards.
At this point a simple monitoring platform is up and running. In a real production deployment, managing the monitored targets with Consul is pretty much indispensable, Prometheus should be protected with authentication, and Grafana is best put behind an NGINX reverse proxy that terminates HTTPS.
GPFS data redundancy protection mainly comes in the following three forms. In (1), the most traditional approach, an external device provides the protection; in (2) and (3), GPFS itself provides it. The difference between the latter two is that (2) resembles centralized storage, where parity is not transferred over the network, while (3) resembles distributed storage, where parity is transferred over the network.
A GPFS file system is built on Network Shared Disks (NSDs). NSDs can be divided into different storage pools: the system pool always exists, additional pools can be added, and different pools can use different types of NSDs together with automated data migration and placement rules. Within a pool, data is spread evenly across the NSDs; with the metadata and data replica count set to 1 (the default) this resembles RAID0, and with replicas set to 2 it resembles RAID1. The metadata and data replica counts can be set independently and may differ, but neither may exceed the maximum replica count, which is fixed when the file system is created (default 2, can be set to 3). Besides the file-system-level setting, replica counts can also be controlled at a finer granularity through placement rules. There are four NSD usage types, of which three are commonly used: dataAndMetadata (the default for the system pool), dataOnly (the default for non-system pools), and metadataOnly.
A block is the largest contiguous amount of disk space that can be allocated to a file (on one NSD) and also the largest size handed down in a single I/O operation. A block consists of a fixed number of subblocks, the smallest unit of space that can be allocated to a file. A file larger than one block is stored in one or more full blocks plus one or more subblocks holding the remainder; a file smaller than one block is stored in one or more subblocks. When streaming a large file, once one NSD has received a full block the write moves on to the next NSD, balancing performance and space consumption across NSDs, so a larger block size clearly helps the throughput of the storage system. Block size and subblock size correspond as follows: 64 KiB blocks use 2 KiB subblocks, 128 KiB blocks use 4 KiB subblocks, 256 KiB-4 MiB blocks use 8 KiB subblocks, and 8-16 MiB blocks use 16 KiB subblocks.
Block size is an important GPFS parameter. The file system's block size, subblock size, and number of subblocks per block are set at file system creation and cannot be changed afterwards; changing them means creating a new file system and migrating the data yourself, an expense that is usually unacceptable. The file system block size also cannot exceed the global maxblocksize setting, and changing maxblocksize requires taking the whole GPFS cluster down. All data in a file system shares one block size and subblock size; the metadata block size can be set separately, but data and metadata use the same number of subblocks per block. For example, with a 16 MiB block size and a 1 MiB metadata block size, the data subblock size is 128 KiB and the metadata subblock size is 8 KiB, with 128 subblocks per block. Note that the data subblock here is larger than the standard 16 KiB, because the metadata subblock count dictates the data subblock count.
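To make the arithmetic of that example explicit (just an illustration of the rule above, not a command from the setup):
# 1 MiB metadata block / 8 KiB metadata subblock = 128 subblocks per block;
# data must use the same subblock count, so the data subblock is 16 MiB / 128:
echo $(( 16 * 1024 / 128 ))   # -> 128 (KiB per data subblock)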
Lenovo DSS-G is Lenovo's integrated GPFS appliance: hardware, firmware, and software are tightly coupled, and data protection comes from GNR or erasure coding instead of a traditional RAID architecture. The DSS-G2xy models use the GNR architecture, while DSS-G100 ECE uses the erasure-code (ECE) architecture.
A DSS-G2xy consists of two identically configured Lenovo x86 servers and one or more JBODs. In the model number, x and y describe the JBOD type and count: x is the number of 4U/5U high-density 3.5-inch HDD enclosures and y the number of 2U 2.5-inch SSD enclosures, so DSS-G210 has a single high-density 3.5-inch enclosure.
The server configuration is fixed; you can only choose the memory capacity (384/768) and the type of IB adapter (none / single-port NDR / dual-port NDR200). The 25Gb Ethernet ports are onboard, the four HBAs go into fixed slots, and if IB adapters are chosen they also go into two fixed slots, so every expansion slot is predetermined. For the JBODs you can only choose the drive capacity; every slot is populated with identical drives, except for two fixed slots in the first high-density HDD enclosure, which must hold 800 GB SSDs. The cabling between servers and JBODs is also fixed: every JBOD is directly attached to both servers redundantly. During installation, the install scripts check the hardware configuration and the entire cabling topology and upgrade all firmware. So DSS-G really is a tightly integrated hardware/firmware/software system, and an upgrade means upgrading all software and firmware, operating system included.
Because Lenovo ships a customized Confluent dedicated to DSS-G deployment, and the subsequent monitoring also requires the DSS-G servers to log into the Confluent node without a password, it is strongly recommended to dedicate the Confluent node to installing and maintaining DSS-G, completely separate from the cluster's management node.
Step 1: install the operating system on both servers with the Lenovo DSS-customized Confluent (dssg-install); this automatically installs the OS, IB drivers, GPFS, and the other required software.
Step 2: use dsslsadapters to check the PCIe adapter locations, dsschmod-drive to change the HDD configuration, dssgcktopology to verify the cabling topology, and dssgckdisks to test disk performance.
Step 3: create a cluster or join an existing one, verify with mmlsconfig that nsdRAIDFirmwareDirectory is /opt/lenovo/dss/firmware, then check firmware versions with mmlsfirmware.
For the three steps above, just follow the DSS installation documentation; apart from the two servers' host names and IP addresses there is essentially nothing to customize.
Once the systems are installed, some tuning can be done: configure LACP on the Ethernet ports, add IPoIB on the IB network, and install management/monitoring software on the servers (lldpd, node_exporter, and so on); IPoIB is not required by GPFS.
Step 4: use dssgmkstorage to create the storage. This step takes the disks in the JBODs and servers and creates pdisks, Recovery Groups, and Declustered Arrays: pdisks are assigned to RGs, each pdisk has a primary and a backup server, and the pdisks together form DAs.
If you compare DSS to a conventional storage array: the JBOD is the expansion enclosure, the servers are the controllers, a pdisk is a physical disk, the RG defines each physical disk's primary and backup controller, a DA is a disk pool, and a vdisk is both storage pool and LUN. The three log vdisks are internal space used by the storage itself, much like a vault disk. But DSS has no battery, so every cached write must be flushed to disk immediately. As in an active/standby dual-controller array, each physical disk and each pool/LUN belongs to exactly one controller at any given moment.
After the storage has been created, list it; the two RGs are visible:
[root@dss01 ~]# mmvdisk recoverygroup list --declustered-array
declustered needs capacity pdisks
recovery group array service type BER trim total raw free raw free% total spare background task
-------------- ----------- ------- ---- ------- ---- --------- -------- ----- ----- ----- ---------------
dss01 NVR no NVR enable - - - - 2 0 scrub (16%)
dss01 SSD no SSD enable - - - - 1 0 scrub (8%)
dss01 DA1 no HDD enable no 834 TiB 834 TiB 100% 44 2 scrub (0%)
dss02 NVR no NVR enable - - - - 2 0 scrub (16%)
dss02 SSD no SSD enable - - - - 1 0 scrub (8%)
dss02 DA1 no HDD enable no 834 TiB 834 TiB 100% 44 2 scrub (0%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
[root@dss01 ~]# mmvdisk recoverygroup list --recovery-group dss01 --all
needs user
recovery group node class active current or master server service vdisks remarks
-------------- ---------- ------- -------------------------------- ------- ------ -------
dss01 dssg01 yes dss01 no 0
recovery group format version
recovery group current allowable mmvdisk version
-------------- ------------- ------------- ---------------
dss01 5.1.5.0 5.1.5.0 5.1.9.2
node
number server active remarks
------ -------------------------------- ------- -------
922 dss01 yes primary, serving dss01
923 dss02 yes backup
declustered needs vdisks pdisks capacity
array service type BER trim user log total spare rt total raw free raw background task
----------- ------- ---- ------- ---- ---- --- ----- ----- -- --------- -------- ---------------
NVR no NVR enable - 0 1 2 0 1 - - scrub 14d (16%)
SSD no SSD enable - 0 1 1 0 1 - - scrub 14d (8%)
DA1 no HDD enable no 0 1 44 2 2 834 TiB 834 TiB scrub 14d (2%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
declustered paths AU
pdisk array active total capacity free space log size state
------------ ----------- ------ ----- -------- ---------- -------- -----
n922v001 NVR 1 1 7992 MiB 7816 MiB 120 MiB ok
n923v001 NVR 1 1 7992 MiB 7816 MiB 120 MiB ok
e1s01ssd SSD 2 4 745 GiB 744 GiB 120 MiB ok
e1s02 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s03 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s04 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s05 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s06 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s07 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s16 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s17 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s18 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s19 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s20 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s21 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s22 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s23 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s31 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s32 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s33 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s34 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s35 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s36 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s37 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s46 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s47 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s48 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s49 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s50 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s51 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s52 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s53 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s61 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s62 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s63 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s64 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s65 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s66 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s67 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s76 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s77 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s78 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s79 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s80 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s81 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s82 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s83 DA1 2 4 20 TiB 19 TiB 40 MiB ok
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss01 DA1 HDD 834 TiB 834 TiB 100% -
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 387 MiB -
declustered block size and
vdisk array activity capacity RAID code checksum granularity remarks
------------------ ----------- -------- -------- --------------- --------- --------- -------
RG001LOGHOME DA1 normal 48 GiB 4WayReplication 2 MiB 4096 log home
RG001LOGTIP NVR normal 48 MiB 2WayReplication 2 MiB 4096 log tip
RG001LOGTIPBACKUP SSD normal 48 MiB Unreplicated 2 MiB 4096 log tip backup
declustered VCD spares
configuration data array configured actual remarks
------------------ ----------- ---------- ------ -------
relocation space DA1 24 28 must contain VCD
configuration data disk group fault tolerance remarks
------------------ --------------------------------- -------
rg descriptor 4 pdisk limiting fault tolerance
system index 4 pdisk limited by rg descriptor
vdisk RAID code disk group fault tolerance remarks
------------------ --------------- --------------------------------- -------
RG001LOGHOME 4WayReplication 3 pdisk
RG001LOGTIP 2WayReplication 1 pdisk
RG001LOGTIPBACKUP Unreplicated 0 pdisk
[root@dss01 ~]# mmvdisk recoverygroup list --recovery-group dss02 --all
needs user
recovery group node class active current or master server service vdisks remarks
-------------- ---------- ------- -------------------------------- ------- ------ -------
dss02 dssg01 yes dss02 no 0
recovery group format version
recovery group current allowable mmvdisk version
-------------- ------------- ------------- ---------------
dss02 5.1.5.0 5.1.5.0 5.1.9.2
node
number server active remarks
------ -------------------------------- ------- -------
922 dss01 yes backup
923 dss02 yes primary, serving dss02
declustered needs vdisks pdisks capacity
array service type BER trim user log total spare rt total raw free raw background task
----------- ------- ---- ------- ---- ---- --- ----- ----- -- --------- -------- ---------------
NVR no NVR enable - 0 1 2 0 1 - - scrub 14d (16%)
SSD no SSD enable - 0 1 1 0 1 - - scrub 14d (8%)
DA1 no HDD enable no 0 1 44 2 2 834 TiB 834 TiB scrub 14d (2%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
declustered paths AU
pdisk array active total capacity free space log size state
------------ ----------- ------ ----- -------- ---------- -------- -----
n922v002 NVR 1 1 7992 MiB 7816 MiB 120 MiB ok
n923v002 NVR 1 1 7992 MiB 7816 MiB 120 MiB ok
e1s12ssd SSD 2 4 745 GiB 744 GiB 120 MiB ok
e1s08 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s09 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s10 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s11 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s13 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s14 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s15 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s24 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s25 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s26 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s27 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s28 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s29 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s30 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s38 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s39 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s40 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s41 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s42 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s43 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s44 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s45 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s54 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s55 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s56 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s57 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s58 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s59 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s60 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s68 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s69 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s70 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s71 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s72 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s73 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s74 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s75 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s84 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s85 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s86 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s87 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s88 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s89 DA1 2 4 20 TiB 19 TiB 40 MiB ok
e1s90 DA1 2 4 20 TiB 19 TiB 40 MiB ok
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss02 DA1 HDD 834 TiB 834 TiB 100% -
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 387 MiB -
declustered block size and
vdisk array activity capacity RAID code checksum granularity remarks
------------------ ----------- -------- -------- --------------- --------- --------- -------
RG002LOGHOME DA1 normal 48 GiB 4WayReplication 2 MiB 4096 log home
RG002LOGTIP NVR normal 48 MiB 2WayReplication 2 MiB 4096 log tip
RG002LOGTIPBACKUP SSD normal 48 MiB Unreplicated 2 MiB 4096 log tip backup
declustered VCD spares
configuration data array configured actual remarks
------------------ ----------- ---------- ------ -------
relocation space DA1 24 28 must contain VCD
configuration data disk group fault tolerance remarks
------------------ --------------------------------- -------
rg descriptor 4 pdisk limiting fault tolerance
system index 4 pdisk limited by rg descriptor
vdisk RAID code disk group fault tolerance remarks
------------------ --------------- --------------------------------- -------
RG002LOGHOME 4WayReplication 3 pdisk
RG002LOGTIP 2WayReplication 1 pdisk
RG002LOGTIPBACKUP Unreplicated 0 pdisk
Step 5: tune the GPFS configuration with dssServerConfig.sh.
Step 6: define and create the vdisks. vdisks are defined and created on top of the DAs and are used as NSDs; the main decisions are the RAID code, the block size, and the capacity. A vdisk lives on all pdisks of its DA. The RAID code protects data with either a Reed-Solomon code (4+2p/4+3p/8+2p/8+3p) or replication (3-way or 4-way); the maximum block size is 16 MiB for Reed-Solomon codes and 1 MiB for replication.
Like most RAID6 implementations, GNR has a write penalty: performance is best when a full block is written, while partial writes must recompute parity and therefore slow down. Replication obviously avoids this problem.
Metadata does not occupy much space but is read and written in very small chunks, so replication is recommended for performance, e.g. 3WayReplication, at a space efficiency of only 1/3. Data occupies a lot of space, so parity improves space efficiency, e.g. 8+2p at 80%. Both 3WayReplication and 8+2p tolerate two simultaneous disk failures; tolerating three requires 4WayReplication and 8+3p, which buys safety at the cost of performance and space efficiency.
Metadata capacity should be at least 1%. Assuming three-way-replicated metadata and 8+2p data, the metadata share works out to:
(0.03/3)/(0.97*0.8)=1.29% — metadata at 3% of raw capacity, data at 97% of raw capacity, so metadata is 1.29% of usable capacity
(0.05/3)/(0.95*0.8)=2.19% — metadata at 5% of raw capacity, data at 95% of raw capacity, so metadata is 2.19% of usable capacity
We choose 5% of raw capacity for metadata, three-way replicated with a 1 MiB block size (a vdisk containing metadata is automatically assigned to the system pool), and 95% of raw capacity for data as 8+2p with a 16 MiB block size, assigned to the data pool. Define the vdisk sets first, and create them only after the definitions are confirmed and the memory requirement is satisfied.
[root@dss01 ~]# mmvdisk vdiskset define --vdisk-set mvs01 --recovery-group dss01,dss02 --code 3WayReplication --block-size 1m --set-size 5% --nsd-usage metadataOnly
mmvdisk: Vdisk set 'mvs01' has been defined.
mmvdisk: Recovery group 'dss01' has been defined in vdisk set 'mvs01'.
mmvdisk: Recovery group 'dss02' has been defined in vdisk set 'mvs01'.
member vdisks
vdisk set count size raw size created file system and attributes
-------------- ----- -------- -------- ------- --------------------------
mvs01 2 13 TiB 41 TiB no -, DA1, 3WayReplication, 1 MiB, metadataOnly, system
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss01 DA1 HDD 834 TiB 793 TiB 95% mvs01
dss02 DA1 HDD 834 TiB 793 TiB 95% mvs01
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 1080 MiB mvs01 (693 MiB)
[root@dss01 ~]# mmvdisk vdiskset define --vdisk-set dvs01 --recovery-group dss01,dss02 --code 8+2p --block-size 16m --set-size 95% --nsd-usage dataOnly --storage-pool data
mmvdisk: Vdisk set 'dvs01' has been defined.
mmvdisk: Recovery group 'dss01' has been defined in vdisk set 'dvs01'.
mmvdisk: Recovery group 'dss02' has been defined in vdisk set 'dvs01'.
member vdisks
vdisk set count size raw size created file system and attributes
-------------- ----- -------- -------- ------- --------------------------
dvs01 2 631 TiB 793 TiB no -, DA1, 8+2p, 16 MiB, dataOnly, data
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss01 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
dss02 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 14 GiB dvs01 (13 GiB), mvs01 (693 MiB)
[root@dss01 ~]# mmvdisk vdiskset list
vdisk set created file system recovery groups
---------------- ------- ----------- ---------------
dvs01 no - dss01, dss02
mvs01 no - dss01, dss02
[root@dss01 ~]# mmvdisk vdiskset list --vdisk-set all
member vdisks
vdisk set count size raw size created file system and attributes
-------------- ----- -------- -------- ------- --------------------------
dvs01 2 631 TiB 793 TiB no -, DA1, 8+2p, 16 MiB, dataOnly, data
mvs01 2 13 TiB 41 TiB no -, DA1, 3WayReplication, 1 MiB, metadataOnly, system
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss01 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
dss02 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 14 GiB dvs01 (13 GiB), mvs01 (693 MiB)
[root@dss01 ~]# mmvdisk vdiskset list --recovery-group all
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss01 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
dss02 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 14 GiB dvs01 (13 GiB), mvs01 (693 MiB)
[root@dss01 ~]# mmvdisk vdiskset create --vdisk-set mvs01,dvs01
mmvdisk: 2 vdisks and 2 NSDs will be created in vdisk set 'mvs01'.
mmvdisk: 2 vdisks and 2 NSDs will be created in vdisk set 'dvs01'.
mmvdisk: (mmcrvdisk) [I] Processing vdisk RG001VS001
mmvdisk: (mmcrvdisk) [I] Processing vdisk RG002VS001
mmvdisk: (mmcrvdisk) [I] Processing vdisk RG002VS002
mmvdisk: (mmcrvdisk) [I] Processing vdisk RG001VS002
mmvdisk: Created all vdisks in vdisk set 'mvs01'.
mmvdisk: Created all vdisks in vdisk set 'dvs01'.
mmvdisk: (mmcrnsd) Processing disk RG001VS001
mmvdisk: (mmcrnsd) Processing disk RG002VS001
mmvdisk: (mmcrnsd) Processing disk RG001VS002
mmvdisk: (mmcrnsd) Processing disk RG002VS002
mmvdisk: Created all NSDs in vdisk set 'mvs01'.
mmvdisk: Created all NSDs in vdisk set 'dvs01'.
Step 7: create the file system. Because NSD usage and block size were already fixed when the vdisks were defined, only the vdisk sets need to be specified here.
[root@dss01 ~]# mmvdisk filesystem create --file-system dssfs --vdisk-set mvs01,dvs01 --mmcrfs -A yes -Q yes -n 1024 -T /dssfs --auto-inode-limit
mmvdisk: Creating file system 'dssfs'.
mmvdisk: The following disks of dssfs will be formatted on node dss01:
mmvdisk: RG001VS001: size 14520704 MB
mmvdisk: RG002VS001: size 14520704 MB
mmvdisk: RG001VS002: size 662657024 MB
mmvdisk: RG002VS002: size 662657024 MB
mmvdisk: Formatting file system …
mmvdisk: Disks up to size 126.40 TB can be added to storage pool system.
mmvdisk: Disks up to size 7.90 PB can be added to storage pool data.
mmvdisk: Creating Inode File
mmvdisk: 97 % complete on Sun Mar 9 19:28:34 2025
mmvdisk: 100 % complete on Sun Mar 9 19:28:34 2025
mmvdisk: Creating Allocation Maps
mmvdisk: Creating Log Files
mmvdisk: 0 % complete on Sun Mar 9 19:28:40 2025
mmvdisk: 18 % complete on Sun Mar 9 19:28:45 2025
mmvdisk: 31 % complete on Sun Mar 9 19:28:50 2025
mmvdisk: 48 % complete on Sun Mar 9 19:28:55 2025
mmvdisk: 63 % complete on Sun Mar 9 19:29:00 2025
mmvdisk: 75 % complete on Sun Mar 9 19:29:05 2025
mmvdisk: 100 % complete on Sun Mar 9 19:29:08 2025
mmvdisk: Clearing Inode Allocation Map
mmvdisk: Clearing Block Allocation Map
mmvdisk: Formatting Allocation Map for storage pool system
mmvdisk: Formatting Allocation Map for storage pool data
mmvdisk: 76 % complete on Sun Mar 9 19:29:16 2025
mmvdisk: 100 % complete on Sun Mar 9 19:29:17 2025
mmvdisk: Completed creation of file system /dev/dssfs.
Now let's look back at the RG configuration: RG dss01's primary server is dss01 and its backup is dss02, and the free space on each pdisk in DA1 forms the hot-spare space.
[root@dss01 ~]# mmvdisk recoverygroup list --recovery-group dss01 --all
needs user
recovery group node class active current or master server service vdisks remarks
-------------- ---------- ------- -------------------------------- ------- ------ -------
dss01 dssg01 yes dss01 no 2
recovery group format version
recovery group current allowable mmvdisk version
-------------- ------------- ------------- ---------------
dss01 5.1.5.0 5.1.5.0 5.1.9.2
node
number server active remarks
------ -------------------------------- ------- -------
922 dss01 yes primary, serving dss01
923 dss02 yes backup
declustered needs vdisks pdisks capacity
array service type BER trim user log total spare rt total raw free raw background task
----------- ------- ---- ------- ---- ---- --- ----- ----- -- --------- -------- ---------------
NVR no NVR enable - 0 1 2 0 1 - - scrub 14d (66%)
SSD no SSD enable - 0 1 1 0 1 - - scrub 14d (33%)
DA1 no HDD enable no 2 1 44 2 2 834 TiB 144 GiB scrub 14d (45%)
mmvdisk: Total capacity is the raw space before any vdisk set definitions.
mmvdisk: Free capacity is what remains for additional vdisk set definitions.
declustered paths AU
pdisk array active total capacity free space log size state
------------ ----------- ------ ----- -------- ---------- -------- -----
n922v001 NVR 1 1 7992 MiB 7816 MiB 120 MiB ok
n923v001 NVR 1 1 7992 MiB 7816 MiB 120 MiB ok
e1s01ssd SSD 2 4 745 GiB 744 GiB 120 MiB ok
e1s02 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s03 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s04 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s05 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s06 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s07 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s16 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s17 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s18 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s19 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s20 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s21 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s22 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s23 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s31 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s32 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s33 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s34 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s35 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s36 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s37 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s46 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s47 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s48 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s49 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s50 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s51 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s52 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s53 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s61 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s62 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s63 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s64 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s65 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s66 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s67 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s76 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s77 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s78 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s79 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s80 DA1 2 4 20 TiB 1040 GiB 40 MiB ok
e1s81 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s82 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
e1s83 DA1 2 4 20 TiB 1024 GiB 40 MiB ok
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss01 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 14 GiB dvs01 (13 GiB), mvs01 (693 MiB)
declustered block size and
vdisk array activity capacity RAID code checksum granularity remarks
------------------ ----------- -------- -------- --------------- --------- --------- -------
RG001LOGHOME DA1 normal 48 GiB 4WayReplication 2 MiB 4096 log home
RG001LOGTIP NVR normal 48 MiB 2WayReplication 2 MiB 4096 log tip
RG001LOGTIPBACKUP SSD normal 48 MiB Unreplicated 2 MiB 4096 log tip backup
RG001VS001 DA1 normal 13 TiB 3WayReplication 1 MiB 32 KiB
RG001VS002 DA1 normal 631 TiB 8+2p 16 MiB 32 KiB
declustered VCD spares
configuration data array configured actual remarks
------------------ ----------- ---------- ------ -------
relocation space DA1 24 28 must contain VCD
configuration data disk group fault tolerance remarks
------------------ --------------------------------- -------
rg descriptor 4 pdisk limiting fault tolerance
system index 4 pdisk limited by rg descriptor
vdisk RAID code disk group fault tolerance remarks
------------------ --------------- --------------------------------- -------
RG001LOGHOME 4WayReplication 3 pdisk
RG001LOGTIP 2WayReplication 1 pdisk
RG001LOGTIPBACKUP Unreplicated 0 pdisk
RG001VS001 3WayReplication 2 pdisk
RG001VS002 8+2p 2 pdisk
Looking at the vdisk configuration again, the block size, pool, and usage are all attributes of the vdisk.
[root@dss01 ~]# mmvdisk vdiskset list --file-system all
member vdisks
vdisk set count size raw size created file system and attributes
-------------- ----- -------- -------- ------- --------------------------
dvs01 2 631 TiB 793 TiB yes fsb, DA1, 8+2p, 16 MiB, dataOnly, data
mvs01 2 13 TiB 41 TiB yes fsb, DA1, 3WayReplication, 1 MiB, metadataOnly, system
declustered capacity all vdisk sets defined
recovery group array type total raw free raw free% in the declustered array
-------------- ----------- ---- --------- -------- ----- ------------------------
dss01 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
dss02 DA1 HDD 834 TiB 144 GiB 0% dvs01, mvs01
vdisk set map memory per server
node class available required required per vdisk set
---------- --------- -------- ----------------------
dssg01 90 GiB 14 GiB dvs01 (13 GiB), mvs01 (693 MiB)
storcli64 /c0 show — check the current versions.
Then flash a batch of images:
storcli64 /c0 download file=HBA_9400-8e_SAS_SATA_Profile.bin
storcli64 /c0 download bios file=mpt35sas_legacy.rom
storcli64 /c2 download efibios file=mpt35sas_x64.rom
storcli64 /c0 show — check the versions after the upgrade, then reboot and you're done.
dcb pfc
#
interface 25GE1/0/10
description Host-01
port link-type trunk
undo port trunk allow-pass vlan 1
port trunk allow-pass vlan 125
stp edged-port enable
lldp tlv-enable dcbx
dcb pfc enable mode manual
#
interface 25GE1/0/12
description Host-02
port link-type trunk
undo port trunk allow-pass vlan 1
port trunk allow-pass vlan 125
stp edged-port enable
lldp tlv-enable dcbx
dcb pfc enable mode manual
#
lldp enable
#
In Host → Configure → Physical adapters, confirm the device addresses and ports of the NICs used for vSAN; in the example below they are port 2 of 0000:8a:00 (vmnic3) and port 1 of 0000:8b:00 (vmnic4). Start by querying the LLDP and DCBX related settings; MFT must be installed before these commands are available.
[root@yaoge123:~] /opt/mellanox/bin/mlxconfig -d 0000:8a:00.1 query|grep -iE "dcb|lldp"
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
LLDP_NB_DCBX_P2 False(0)
LLDP_NB_RX_MODE_P2 OFF(0)
LLDP_NB_TX_MODE_P2 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
[root@yaoge123:~] /opt/mellanox/bin/mlxconfig -d 0000:8b:00.0 query|grep -iE "dcb|lldp"
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
LLDP_NB_DCBX_P2 False(0)
LLDP_NB_RX_MODE_P2 OFF(0)
LLDP_NB_TX_MODE_P2 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
LLDP turns out to be disabled, so enable LLDP and DCBX on the vSAN ports; DCBX IEEE is already enabled.
[root@yaoge123:~] /opt/mellanox/bin/mlxconfig -d 0000:8a:00.1 set LLDP_NB_DCBX_P2=1 LLDP_NB_RX_MODE_P2=2 LLDP_NB_TX_MODE_P2=2 DCBX_WILLING_P2=1 DCBX_IEEE_P2=1
Device #1:
----------
Device type: ConnectX5
Name: MCX512A-ACU_Ax_Bx
Description: ConnectX-5 EN network interface card; 10/25GbE dual-port SFP28; PCIe3.0 x8; UEFI Enabled (x86/ARM)
Device: 0000:8a:00.1
Configurations: Next Boot New
LLDP_NB_DCBX_P2 False(0) True(1)
LLDP_NB_RX_MODE_P2 OFF(0) ALL(2)
LLDP_NB_TX_MODE_P2 OFF(0) ALL(2)
DCBX_WILLING_P2 True(1) True(1)
DCBX_IEEE_P2 True(1) True(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
[root@yaoge123:~] /opt/mellanox/bin/mlxconfig -d 0000:8b:00.0 set LLDP_NB_DCBX_P1=1 LLDP_NB_RX_MODE_P1=2 LLDP_NB_TX_MODE_P1=2 DCBX_WILLING_P1=1 DCBX_IEEE_P1=1
Device #1:
----------
Device type: ConnectX5
Name: MCX512A-ACU_Ax_Bx
Description: ConnectX-5 EN network interface card; 10/25GbE dual-port SFP28; PCIe3.0 x8; UEFI Enabled (x86/ARM)
Device: 0000:8b:00.0
Configurations: Next Boot New
LLDP_NB_DCBX_P1 False(0) True(1)
LLDP_NB_RX_MODE_P1 OFF(0) ALL(2)
LLDP_NB_TX_MODE_P1 OFF(0) ALL(2)
DCBX_WILLING_P1 True(1) True(1)
DCBX_IEEE_P1 True(1) True(1)
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
After rebooting, query the NIC configuration again and confirm that LLDP and DCBX are both enabled.
[root@yaoge123:~] /opt/mellanox/bin/mlxconfig -d 0000:8a:00.1 query|grep -iE "dcb|lldp"
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
LLDP_NB_DCBX_P2 True(1)
LLDP_NB_RX_MODE_P2 ALL(2)
LLDP_NB_TX_MODE_P2 ALL(2)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
[root@yaoge123:~] /opt/mellanox/bin/mlxconfig -d 0000:8b:00.0 query|grep -iE "dcb|lldp"
LLDP_NB_DCBX_P1 True(1)
LLDP_NB_RX_MODE_P1 ALL(2)
LLDP_NB_TX_MODE_P1 ALL(2)
LLDP_NB_DCBX_P2 False(0)
LLDP_NB_RX_MODE_P2 OFF(0)
LLDP_NB_TX_MODE_P2 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
Check the NICs' DCB status: the mode is IEEE, and PFC is enabled for priority 3 (0 0 0 1 0 0 0 0).
[root@2288Hv6-05:~] esxcli network nic dcb status get -n vmnic3
Nic Name: vmnic3
Mode: 3 - IEEE Mode
Enabled: true
Capabilities:
Priority Group: true
Priority Flow Control: true
PG Traffic Classes: 8
PFC Traffic Classes: 8
PFC Enabled: true
PFC Configuration: 0 0 0 1 0 0 0 0
IEEE ETS Configuration:
Willing Bit In ETS Config TLV: 1
Supported Capacity: 8
Credit Based Shaper ETS Algorithm Supported: 0x0
TX Bandwidth Per TC: 13 13 13 13 12 12 12 12
RX Bandwidth Per TC: 13 13 13 13 12 12 12 12
TSA Assignment Table Per TC: 2 2 2 2 2 2 2 2
Priority Assignment Per TC: 1 0 2 3 4 5 6 7
Recommended TC Bandwidth Per TC: 13 13 13 13 12 12 12 12
Recommended TSA Assignment Per TC: 2 2 2 2 2 2 2 2
Recommended Priority Assignment Per TC: 1 0 2 3 4 5 6 7
IEEE PFC Configuration:
Number Of Traffic Classes: 8
PFC Configuration: 0 0 0 1 0 0 0 0
Macsec Bypass Capability Is Enabled: 0
Round Trip Propagation Delay Of Link: 0
Sent PFC Frames: 0 0 0 0 0 0 0 0
Received PFC Frames: 0 0 0 0 0 0 0 0
DCB Apps:
[root@2288Hv6-05:~] esxcli network nic dcb status get -n vmnic4
Nic Name: vmnic4
Mode: 3 - IEEE Mode
Enabled: true
Capabilities:
Priority Group: true
Priority Flow Control: true
PG Traffic Classes: 8
PFC Traffic Classes: 8
PFC Enabled: true
PFC Configuration: 0 0 0 1 0 0 0 0
IEEE ETS Configuration:
Willing Bit In ETS Config TLV: 1
Supported Capacity: 8
Credit Based Shaper ETS Algorithm Supported: 0x0
TX Bandwidth Per TC: 13 13 13 13 12 12 12 12
RX Bandwidth Per TC: 13 13 13 13 12 12 12 12
TSA Assignment Table Per TC: 2 2 2 2 2 2 2 2
Priority Assignment Per TC: 1 0 2 3 4 5 6 7
Recommended TC Bandwidth Per TC: 13 13 13 13 12 12 12 12
Recommended TSA Assignment Per TC: 2 2 2 2 2 2 2 2
Recommended Priority Assignment Per TC: 1 0 2 3 4 5 6 7
IEEE PFC Configuration:
Number Of Traffic Classes: 8
PFC Configuration: 0 0 0 1 0 0 0 0
Macsec Bypass Capability Is Enabled: 0
Round Trip Propagation Delay Of Link: 0
Sent PFC Frames: 0 0 0 0 0 0 0 0
Received PFC Frames: 0 0 0 0 0 0 0 0
DCB Apps:
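Optionally, the RDMA devices and the uplinks they are bound to can be listed on the host as a sanity check (a sketch; the output depends on the ESXi version and driver):
esxcli rdma device list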
proxy_cache is configured on the front-end nginx, which uses the SSD and must be built with ngx_cache_purge so the cache can be purged on demand. The front-end nginx proxies to a back-end nginx, which reads the mechanical disks and serves the actual data. A single nginx could of course use both the SSD and the mechanical disks and proxy to itself, since only proxied responses can be cached. The front-end cache configuration is as follows:
http {
    ……
    proxy_cache_path /cache levels=1:2 use_temp_path=off keys_zone=mirror:64m inactive=24h max_size=2600g;
    ……
}
server {
    listen 80;
    listen [::]:80;
    listen 443 ssl;
    listen [::]:443 ssl;
    listen *:443 quic;
    listen [::]:443 quic;
    server_name mirror.nju.edu.cn mirrors.nju.edu.cn;
    ……
    proxy_cache mirror;
    proxy_cache_key $request_uri;
    proxy_cache_valid 200 24h;
    proxy_cache_valid 301 302 1h;
    proxy_cache_valid any 1m;
    proxy_cache_lock on;
    proxy_cache_lock_age 3s;
    proxy_cache_lock_timeout 3s;
    proxy_cache_use_stale error timeout updating;
    proxy_cache_background_update on;
    proxy_cache_revalidate on;
    cache_purge_response_type text;
    proxy_cache_purge PURGE from 127.0.0.1 192.168.10.10;
    add_header X-Cache-Status $upstream_cache_status;
    ……
    location / {
        proxy_pass http://192.168.10.10:8000;
        ……
    }
    ……
}
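Whether a request was served from the cache can be seen from the X-Cache-Status header added above (a sketch; HIT/MISS/EXPIRED and the other values come from $upstream_cache_status):
curl -sI https://mirror.nju.edu.cn/ | grep -i x-cache-status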
The back-end nginx may need to control the cache lifetime of certain files. For example, the files served to MirrorZ need a different Access-Control-Allow-Origin header depending on the requesting site to make CORS work, but identical URLs produce identical cache keys on the front end, so the front end cannot reply with different headers per requesting site: within the cache lifetime it always replies with the headers cached on the first access. The simplest fix is to disable caching for those files:
server {
    listen 8000;
    root /mirror;
    ……
    location ~ ^/mirrorz/ {
        expires -1;
        ……
    }
    ……
}
Edit the tunasync worker configuration to add a post-sync script, so that whenever a sync job finishes, all cached content of that mirror is purged on the front end:
# tunasync worker configuration file
[global]
……
exec_on_success = [ "/home/mirror/postexec.sh" ]
exec_on_failure = [ "/home/mirror/postexec.sh" ]
……
# contents of the postexec.sh script referenced above
#!/bin/bash
MIRROR_DIR="/mirror/"
DIR="${TUNASYNC_WORKING_DIR/"$MIRROR_DIR"/}"
/usr/bin/curl --silent --output /dev/null --request PURGE "https://mirror.nju.edu.cn/$DIR/*"
Some status files on the mirror site are refreshed frequently on a schedule; after each refresh their cache is purged with curl --silent --output /dev/null --request PURGE "https://mirror.nju.edu.cn/status/*".