Questions asked by Tombart - server

Tombart

Asked: 2024-09-25 17:39:31 +0800 CST

Very slow I/O on an NVMe mdadm RAID array

9

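I have an AMD EPYC 7502P 32-Core Linux server (kernel 6.10.6) with 6 NVMe drives, and its I/O performance has suddenly dropped. Every operation takes far too long. Installing package updates takes hours instead of seconds (or perhaps minutes).

I tried running fio on the filesystem that sits on the RAID5 array. The clat metric shows a huge variation: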

    clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05

The stdev value is extreme.
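To put a number on that spread, here is a rough Python sketch (not part of the original post; the helper names parse_fio_value and clat_spread are made up for illustration). It parses the clat line above and computes the ratio of stdev to avg, plus how far the worst sample sits above the mean. It assumes fio's k suffix means a factor of 1000, which agrees with the lat line in the full output reporting the same maximum in usec.

```python
import re

def parse_fio_value(text):
    """Convert a fio number such as '359716k' or '16112.91' to a float."""
    multipliers = {"k": 1e3, "M": 1e6, "G": 1e9}
    if text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)

def clat_spread(line):
    """Return (stdev/avg, max/avg) for a fio latency summary line."""
    fields = dict(re.findall(r"(\w+)=([\d.]+[kMG]?)", line))
    avg = parse_fio_value(fields["avg"])
    stdev = parse_fio_value(fields["stdev"])
    worst = parse_fio_value(fields["max"])
    return stdev / avg, worst / avg

line = "clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05"
cv, worst_ratio = clat_spread(line)
print(f"stdev/avg = {cv:.1f}, max/avg = {worst_ratio:.0f}")
```

A stdev dozens of times larger than the mean, combined with a 99.99th percentile of only 186 usec, would suggest a handful of very long stalls rather than uniformly slow writes.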

Full output:

$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][w=53.3MiB/s][w=13.6k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=48391: Wed Sep 25 09:17:02 2024
  write: IOPS=45.5k, BW=178MiB/s (186MB/s)(10.6GiB/61165msec); 0 zone resets
    slat (nsec): min=552, max=123137, avg=2016.89, stdev=468.03
    clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05
     lat (usec): min=10, max=359716, avg=18.13, stdev=592.03
    clat percentiles (usec):
     |  1.00th=[   11],  5.00th=[   12], 10.00th=[   14], 20.00th=[   15],
     | 30.00th=[   15], 40.00th=[   15], 50.00th=[   15], 60.00th=[   16],
     | 70.00th=[   16], 80.00th=[   16], 90.00th=[   17], 95.00th=[   18],
     | 99.00th=[   20], 99.50th=[   22], 99.90th=[   42], 99.95th=[  119],
     | 99.99th=[  186]
   bw (  KiB/s): min=42592, max=290232, per=100.00%, avg=209653.41, stdev=46502.99, samples=105
   iops        : min=10648, max=72558, avg=52413.32, stdev=11625.75, samples=105
  lat (nsec)   : 250=0.01%, 500=0.01%, 1000=0.01%
  lat (usec)   : 10=0.01%, 20=99.15%, 50=0.76%, 100=0.03%, 250=0.06%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 500=0.01%
  cpu          : usr=12.62%, sys=30.97%, ctx=2800981, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2784519,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=178MiB/s (186MB/s), 178MiB/s-178MiB/s (186MB/s-186MB/s), io=10.6GiB (11.4GB), run=61165-61165msec

Disk stats (read/write):
    md1: ios=0/710496, merge=0/0, ticks=0/12788992, in_queue=12788992, util=23.31%, aggrios=319833/649980, aggrmerge=0/0, aggrticks=118293/136983, aggrin_queue=255276, aggrutil=14.78%
  nvme1n1: ios=318781/638009, merge=0/0, ticks=118546/131154, in_queue=249701, util=14.71%
  nvme5n1: ios=321508/659460, merge=0/0, ticks=118683/138996, in_queue=257679, util=14.77%
  nvme2n1: ios=320523/647922, merge=0/0, ticks=120634/134284, in_queue=254918, util=14.71%
  nvme3n1: ios=320809/651642, merge=0/0, ticks=118823/135985, in_queue=254808, util=14.73%
  nvme0n1: ios=316267/642934, merge=0/0, ticks=116772/143909, in_queue=260681, util=14.75%
  nvme4n1: ios=321110/659918, merge=0/0, ticks=116300/137570, in_queue=253870, util=14.78%

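Perhaps one of the disks is faulty. Is there a way to identify the slow disk?

The SMART attributes of all disks are similar; nothing stands out. They are Samsung 7 TB drives: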

Model Number:                       SAMSUNG MZQL27T6HBLA-00A07
Firmware Version:                   GDC5902Q
Data Units Read:                    2,121,457,831 [1.08 PB]
Data Units Written:                 939,728,748 [481 TB]
Controller Busy Time:               40,224
Power Cycles:                       5
Power On Hours:                     6,913

Write performance looks very similar across the drives:

iostat -xh
Linux 6.10.6+bpo-amd64 (ts01b)  25/09/24        _x86_64_        (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.0%    0.0%    4.3%    0.6%    0.0%   90.2%

     r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz Device
    0.12      7.3k     0.00   0.0%    0.43    62.9k md0
 6461.73    548.7M     0.00   0.0%    0.22    87.0k md1
 3583.93     99.9M     9.60   0.3%    1.13    28.5k nvme0n1
 3562.77     98.9M     0.80   0.0%    1.15    28.4k nvme1n1
 3584.54     99.8M     9.74   0.3%    1.18    28.5k nvme2n1
 3565.96     98.8M     1.06   0.0%    1.16    28.4k nvme3n1
 3585.04     99.9M     9.78   0.3%    1.16    28.5k nvme4n1
 3577.56     99.0M     0.86   0.0%    1.17    28.3k nvme5n1

     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     4.0k md0
  366.41    146.5M     0.00   0.0%   14.28   409.4k md1
 8369.26     32.7M     1.18   0.0%    3.73     4.0k nvme0n1
 8364.63     32.7M     1.12   0.0%    3.63     4.0k nvme1n1
 8355.48     32.6M     1.10   0.0%    3.56     4.0k nvme2n1
 8365.23     32.7M     1.10   0.0%    3.46     4.0k nvme3n1
 8365.37     32.7M     1.25   0.0%    3.37     4.0k nvme4n1
 8356.70     32.6M     1.06   0.0%    3.29     4.0k nvme5n1

     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz Device
    0.00      0.0k     0.00   0.0%    0.00     0.0k md0
    0.00      0.0k     0.00   0.0%    0.00     0.0k md1
    0.00      0.0k     0.00   0.0%    0.00     0.0k nvme0n1
    0.00      0.0k     0.00   0.0%    0.00     0.0k nvme1n1
    0.00      0.0k     0.00   0.0%    0.00     0.0k nvme2n1
    0.00      0.0k     0.00   0.0%    0.00     0.0k nvme3n1
    0.00      0.0k     0.00   0.0%    0.00     0.0k nvme4n1
    0.00      0.0k     0.00   0.0%    0.00     0.0k nvme5n1

     f/s f_await  aqu-sz  %util Device
    0.00    0.00    0.00   0.0% md0
    0.00    0.00    6.68  46.8% md1
    0.00    0.00   35.24  14.9% nvme0n1
    0.00    0.00   34.50  14.6% nvme1n1
    0.00    0.00   33.98  14.9% nvme2n1
    0.00    0.00   33.06  14.6% nvme3n1
    0.00    0.00   32.33  14.8% nvme4n1
    0.00    0.00   31.72  14.6% nvme5n1
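One way to look for a slow member is to compare each drive's latency against the median of the group. The Python sketch below is only an illustration, not a diagnostic tool: the function name find_outliers and the 1.5x threshold are arbitrary assumptions, and the input is the w_await column copied from the iostat output above.

```python
from statistics import median

def find_outliers(latencies, factor=1.5):
    """Return devices whose latency exceeds factor times the group median."""
    mid = median(latencies.values())
    return sorted(dev for dev, value in latencies.items() if value > factor * mid)

# w_await values (ms) copied from the iostat output above
w_await = {
    "nvme0n1": 3.73, "nvme1n1": 3.63, "nvme2n1": 3.56,
    "nvme3n1": 3.46, "nvme4n1": 3.37, "nvme5n1": 3.29,
}
print(find_outliers(w_await))  # empty list: no drive stands out
```

With these numbers nothing crosses the threshold, which matches the observation that the drives look alike. Because iostat without an interval reports averages since boot, sampling with an interval (for example iostat -xh 5) while the slowdown is happening may be more telling.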

One thing that does look suspicious is the interrupts:

$ dstat -tf --int24 60
----system---- -------------------------------interrupts------------------------------
     time     | 120   128   165   199   213   342   LOC   PMI   IWI   RES   CAL   TLB 
25-09 10:53:45|2602  2620  2688  2695  2649  2725   136k   36  1245  2739   167k  795 
25-09 10:54:45|  64    64    65    64    66    65  2235     1    26    16  2156     3 
25-09 10:55:45|  33    31    32    32    32    30  2050     1    24    10  2162    20 
25-09 10:56:45|  31    31    30    35    30    33  2303     1    26    63  2245     9 
25-09 10:57:45|  36    29    27    34    35    35  2016     1    23    72  2645    10 
25-09 10:58:45|   9     8     9     8     7     8  1766     0    27     4  1892    15 
25-09 10:59:45|  59    62    59    58    60    60  1585     1    22    20  1704     9 
25-09 11:00:45|  25    21    21    26    26    26  1605     0    26    10  1862    10 
25-09 11:01:45|  34    32    32    33    36    31  1515     0    23    24  1948    10 
25-09 11:02:45|  21    23    23    25    22    24  1772     0    27    27  1781     9

The columns with elevated interrupt counts map to 9-edge on queue 9 of every drive (nvme[0-5]q9), for example:

$ cat /proc/interrupts | grep 120:
IR-PCI-MSIX-0000:01:00.0    9-edge      nvme2q9

Edit: 9-edge is probably the metadisk (software RAID) device.

Tombart

Asked: 2021-10-01 00:48:22 +0800 CST

How to generate a certificate for a (secondary) compile puppetserver?

0

I'm trying to scale puppetserver with round-robin DNS for redundancy. The secondary puppetserver (version 7.4.0) is configured to use the CA of the primary puppetserver:

/etc/puppetlabs/puppet/puppet.conf:

[main]
ca_name = Puppet CA: puppet-ca-master.company.com
ca_server = puppet-ca-master.company.com
[agent]
server = puppet-ca-master.company.com
runinterval=1800

On the secondary server I've disabled the CA service, since there can be only one certificate authority, in /etc/puppetlabs/puppetserver/services.d/ca.cfg:

# To enable the CA service, leave the following line uncommented
# puppetlabs.services.ca.certificate-authority-service/certificate-authority-service
# To disable the CA service, comment out the above line and uncomment the line below
puppetlabs.services.ca.certificate-authority-disabled-service/certificate-authority-disabled-service
puppetlabs.trapperkeeper.services.watcher.filesystem-watch-service/filesystem-watch-service

I've removed the certificates from the secondary server so that it obtains a certificate signed by the CA master:

rm -rf /etc/puppetlabs/puppet/ssl && mkdir -p /etc/puppetlabs/puppet/ssl/certs
chmod 0700 /etc/puppetlabs/puppet/ssl
chown -R puppet /etc/puppetlabs/puppet/ssl

However, the puppetserver service refuses to start because the certificate is missing:

2021-09-30T09:06:18.220+02:00 ERROR [async-dispatch-2] [p.t.internal] Error during service start!!!
java.lang.IllegalArgumentException: Unable to open 'ssl-cert' file: /etc/puppetlabs/puppet/ssl/certs/secondary-puppetserver.company.com.pem

When I try to run puppet agent -t on the secondary puppetserver, it fails to get its certificate signed:

Couldn't fetch certificate from CA server; you might still need to sign this agent's certificate (secondary-puppetserver.company.com)

In addition, a private key is generated, but no public key:

ll /etc/puppetlabs/puppet/ssl/public_keys/
total 0

Tombart

Asked: 2021-09-07 13:29:58 +0800 CST

How to debug a PostgreSQL segmentation fault?

3

I have a PostgreSQL 13 instance that keeps crashing:

LOG:  server process (PID 10722) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: COMMIT
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

I've updated /etc/postgresql/13/main/pg_ctl.conf to enable core dumps:

pg_ctl_options = '--core-files'

and restarted the postgresql service. It now appears to allow core dumps:

$ for f in `pgrep postgres`; do cat /proc/$f/limits | grep core; done
Max core file size        unlimited            unlimited            bytes

The gdb backtrace gives the following output:

$ gdb /usr/lib/postgresql/13/bin/postgres 13/main/core.postgres.12264

Program terminated with signal SIGSEGV, Segmentation fault.
#0  slot_deform_heap_tuple (natts=5, offp=0x557cc2e60720, tuple=<optimized out>, slot=0x557cc2e606d8) at ./build/../src/backend/executor/execTuples.c:930
930     ./build/../src/backend/executor/execTuples.c: No such file or directory.
(gdb) bt
#0  slot_deform_heap_tuple (natts=5, offp=0x557cc2e60720, tuple=<optimized out>, slot=0x557cc2e606d8) at ./build/../src/backend/executor/execTuples.c:930
#1  tts_buffer_heap_getsomeattrs (slot=0x557cc2e606d8, natts=5) at ./build/../src/backend/executor/execTuples.c:695
#2  0x0000557cc1d3998c in slot_getsomeattrs_int (slot=slot@entry=0x557cc2e606d8, attnum=5) at ./build/../src/backend/executor/execTuples.c:1912
#3  0x0000557cc1d28fba in slot_getsomeattrs (attnum=<optimized out>, slot=0x557cc2e606d8) at ./build/../src/include/executor/tuptable.h:344
#4  ExecInterpExpr (state=0x557cc2e620a8, econtext=0x557cc2ea1768, isnull=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:482
#5  0x0000557cc1d5548d in ExecEvalExprSwitchContext (isNull=0x7ffdd2599507, econtext=0x557cc2ea1768, state=0x557cc2e620a8) at ./build/../src/include/executor/executor.h:322
#6  ExecQual (econtext=0x557cc2ea1768, state=0x557cc2e620a8) at ./build/../src/include/executor/executor.h:391
#7  MJFillInner (node=0x557cc2ea1558) at ./build/../src/backend/executor/nodeMergejoin.c:494
#8  0x0000557cc1d55ce8 in ExecMergeJoin (pstate=0x557cc2ea1558) at ./build/../src/backend/executor/nodeMergejoin.c:1353
#9  0x0000557cc1d2cc83 in ExecProcNode (node=0x557cc2ea1558) at ./build/../src/include/executor/executor.h:248
#10 ExecutePlan (execute_once=<optimized out>, dest=0x557cc2e1a630, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x557cc2ea1558, 
    estate=0x557cc2ea12f8) at ./build/../src/backend/executor/execMain.c:1632
#11 standard_ExecutorRun (queryDesc=0x557cc2e1a5a0, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:350
#12 0x00007f0ec05ae09d in pgss_ExecutorRun (queryDesc=0x557cc2e1a5a0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1045
#13 0x0000557cc1cdbcd4 in PersistHoldablePortal (portal=portal@entry=0x557cc2d44b78) at ./build/../src/backend/commands/portalcmds.c:407
#14 0x0000557cc1ff95f9 in HoldPortal (portal=portal@entry=0x557cc2d44b78) at ./build/../src/backend/utils/mmgr/portalmem.c:642
#15 0x0000557cc1ff9e7d in PreCommit_Portals (isPrepare=isPrepare@entry=false) at ./build/../src/backend/utils/mmgr/portalmem.c:738
#16 0x0000557cc1c001c4 in CommitTransaction () at ./build/../src/backend/access/transam/xact.c:2087
#17 0x0000557cc1c015d5 in CommitTransactionCommand () at ./build/../src/backend/access/transam/xact.c:3085
#18 0x0000557cc1ea211d in finish_xact_command () at ./build/../src/backend/tcop/postgres.c:2662
#19 0x0000557cc1ea4703 in exec_simple_query (query_string=0x557cc2c9cd28 "COMMIT") at ./build/../src/backend/tcop/postgres.c:1264
#20 0x0000557cc1ea6143 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x557cc2cf6c68, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4339
#21 0x0000557cc1e25bcd in BackendRun (port=0x557cc2ce94d0, port=0x557cc2ce94d0) at ./build/../src/backend/postmaster/postmaster.c:4526
#22 BackendStartup (port=0x557cc2ce94d0) at ./build/../src/backend/postmaster/postmaster.c:4210
#23 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1739
#24 0x0000557cc1e26b41 in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1412
#25 0x0000557cc1b70f4f in main (argc=5, argv=0x557cc2c96c30) at ./build/../src/backend/main/main.c:210

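Adding log_statement = 'all' to /etc/postgresql/13/main/postgresql.conf didn't really help, because the postmaster terminates all processes immediately and the query is never written to the log.

Here is the strace output after running COMMIT: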

[pid 20006] pwrite64(29, "COMMIT", 6, 15936) = 6
[pid 20006] pwrite64(29, "\0", 1, 15942) = 1
[pid 20006] close(29)                   = 0
[pid 20006] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x10} ---
[pid 20006] +++ killed by SIGSEGV (core dumped) +++
<... select resumed> )                  = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=20006, si_uid=108, si_status=SIGSEGV, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WNOHANG, NULL) = 20006
write(2, "2021-09-08 13:38:51.853 UTC [299"..., 198) = 198
write(2, "2021-09-08 13:38:51.853 UTC [299"..., 88) = 88
kill(19324, SIGQUIT)                    = 0
kill(-19324, SIGQUIT)                   = 0
kill(19331, SIGQUIT)                    = 0
kill(-19331, SIGQUIT)                   = 0
kill(19320, SIGQUIT)                    = 0
kill(-19320, SIGQUIT)                   = 0
kill(19319, SIGQUIT)                    = 0
kill(-19319, SIGQUIT)                   = 0
kill(19321, SIGQUIT)                    = 0
kill(-19321, SIGQUIT)                   = 0
kill(19322, SIGQUIT)                    = 0
kill(-19322, SIGQUIT)                   = 0
kill(19323, SIGQUIT)                    = 0
kill(-19323, SIGQUIT)                   = 0
wait4(-1, 0x7ffe90814374, WNOHANG, NULL) = 0
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(7, [5 6], NULL, NULL, {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19320, si_uid=108, si_status=2, si_utime=14, si_stime=3} ---

Is there a way to trace back the exact SQL query that was executed?

Tombart

Asked: 2021-01-07 15:08:12 +0800 CST

ntpd unable to synchronize: TIME_ERROR: 0x41: Clock Unsynchronized

0

On Debian 10, ntpd [email protected] fails to synchronize and reports the following error:

kernel reports TIME_ERROR: 0x41: Clock Unsynchronize
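For readers unfamiliar with the hex value, here is a small Python sketch that decodes it. This is an addition, not part of the original post, and it assumes that 0x41 is the kernel's timex status word; the bit values are the STA_* constants from the Linux header linux/timex.h, and decode_status is a made-up helper name.

```python
# Status bits as defined in the Linux header <linux/timex.h> (low byte only)
STA_FLAGS = {
    0x0001: "STA_PLL",       # PLL updates enabled
    0x0002: "STA_PPSFREQ",   # PPS frequency discipline enabled
    0x0004: "STA_PPSTIME",   # PPS time discipline enabled
    0x0008: "STA_FLL",       # FLL mode selected
    0x0010: "STA_INS",       # insert leap second
    0x0020: "STA_DEL",       # delete leap second
    0x0040: "STA_UNSYNC",    # clock unsynchronized
    0x0080: "STA_FREQHOLD",  # hold frequency
}

def decode_status(status):
    """Return the names of the status bits set in a timex status word."""
    return [name for bit, name in sorted(STA_FLAGS.items()) if status & bit]

print(decode_status(0x41))  # ['STA_PLL', 'STA_UNSYNC']
```

Under that assumption, the value says the kernel PLL is enabled but the kernel still considers the clock unsynchronized.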

Here is ntp.conf:

disable monitor

statsdir /var/log/ntpstats

restrict -4 default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict ::1

server 0.us.pool.ntp.org iburst
server 1.us.pool.ntp.org iburst
server 2.us.pool.ntp.org iburst
server 3.us.pool.ntp.org iburst

server   127.127.1.0
fudge    127.127.1.0 stratum 10
restrict 127.127.1.0

driftfile /var/lib/ntp/drift

ntpq -c sysinfo:

associd=0 status=0614 leap_none, sync_ntp, 1 event, freq_mode,
system peer:        50-205-57-38-static.hfc.comcastbusiness.net:123
system peer mode:   client
leap indicator:     00
stratum:            2
log2 precision:     -23
root delay:         70.634
root dispersion:    3.569
reference ID:       50.205.57.38
reference time:     e3a0c049.c39d770a  Wed, Jan  6 2021 23:03:37.764
system jitter:      0.723169
clock jitter:       1.177
clock wander:       0.000
broadcast delay:    -50.000
symm. auth. delay:  0.000

ntpq -c lpeers:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l  286   64   20    0.000    0.000   0.000
*50-205-57-38-st .GPS.            1 u   19   64   37   70.631    1.618   1.843
-ns1.backplanedn 173.162.192.156  2 u   14   64   37   84.235   -1.575   2.852
+c-73-239-136-18 74.6.168.73      3 u   11   64   37   48.606    1.598   2.522
+time-d.bbnx.net 252.74.143.178   2 u   14   64   37   92.632    0.623   0.799
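The reach column above is an 8-bit shift register printed in octal, in which each set bit marks a poll that received a reply. The following Python sketch (an illustrative addition; decode_reach and successes are made-up names) expands the two values seen in the table.

```python
def decode_reach(reach_octal):
    """Return the last eight poll results, oldest first (True = reply received)."""
    value = int(reach_octal, 8)
    return [bool(value & (1 << bit)) for bit in range(7, -1, -1)]

def successes(reach_octal):
    """Count how many of the last eight polls were answered."""
    return sum(decode_reach(reach_octal))

print(successes("37"), successes("20"))  # 5 1
```

A reach of 37 means the five most recent polls were all answered, while a fully populated register would read 377. That would be consistent with a daemon that was (re)started only a few polls earlier, though the output alone does not prove it.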

timedatectl:

               Local time: Wed 2021-01-06 23:06:44 UTC
           Universal time: Wed 2021-01-06 23:06:44 UTC
                 RTC time: Wed 2021-01-06 23:06:44
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: no
              NTP service: inactive
          RTC in local TZ: no

Any idea what's wrong?

Tombart

Asked: 2019-08-15 04:13:24 +0800 CST

Docker packets are not being masqueraded (despite NAT rules)

0

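On a machine running Debian 9 (Linux kernel 4.9) I have Docker (18.06.1) with some containers in bridge mode. For some strange reason, some packets from Docker manage to bypass the MASQUERADE rule. enp2s0 is the public interface (Docker uses the docker0 interface, 172.17.0.1).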

$ tcpdump -vvlnn -i enp2s0 port 3000 and src net 172.16.0.0/12
tcpdump: listening on enp2s0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:57:49.918655 IP (tos 0x0, ttl 63, id 62271, offset 0, flags [DF], proto TCP (6), length 52)
    172.17.0.2.55664 > x.x.x.x.3000: Flags [F.], cksum 0xe40c (correct), seq 9863202, ack 476959401, win 856, options [nop,nop,TS val 1382910659 ecr 2481487487], length 0
11:57:50.126683 IP (tos 0x0, ttl 63, id 62272, offset 0, flags [DF], proto TCP (6), length 52)
    172.17.0.2.55664 > x.x.x.x.3000: Flags [F.], cksum 0xe3d8 (correct), seq 0, ack 1, win 856, options [nop,nop,TS val 1382910711 ecr 2481487487], length 0
11:57:50.546660 IP (tos 0x0, ttl 63, id 62273, offset 0, flags [DF], proto TCP (6), length 52)
    172.17.0.2.55664 > x.x.x.x.3000: Flags [F.], cksum 0xe36f (correct), seq 0, ack 1, win 856, options [nop,nop,TS val 1382910816 ecr 2481487487], length 0
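The capture filter src net 172.16.0.0/12 selects packets whose source address is still a private one. As a quick sanity check of that range, the Python sketch below (an illustrative addition; is_private_leak is a made-up name) confirms that the container address in the capture falls inside it, so these packets left enp2s0 without source translation.

```python
import ipaddress

def is_private_leak(src_ip, private_net="172.16.0.0/12"):
    """Return True if src_ip lies inside the given private network."""
    return ipaddress.ip_address(src_ip) in ipaddress.ip_network(private_net)

print(is_private_leak("172.17.0.2"))  # True: container address from the capture
print(is_private_leak("172.32.0.1"))  # False: just outside 172.16.0.0/12
```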

NAT rules from iptables-save:

*nat
:PREROUTING ACCEPT [11397418:724275374]
:INPUT ACCEPT [39095:3038067]
:OUTPUT ACCEPT [1328340:79997617]
:POSTROUTING ACCEPT [5102467:306147980]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -o enp2s0 -j MASQUERADE
-A POSTROUTING -s 172.17.0.3/32 -d 172.17.0.3/32 -p tcp -m tcp --dport 5501 -j MASQUERADE
-A POSTROUTING -s 172.17.0.3/32 -d 172.17.0.3/32 -p tcp -m tcp --dport 5500 -j MASQUERADE
-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 3000 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 48842 -j DNAT --to-destination 172.17.0.3:5501
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 48841 -j DNAT --to-destination 172.17.0.3:5500
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 13119 -j DNAT --to-destination 172.17.0.2:3000

I tried adding mangle rules to catch these packets, but without any success so far:

*mangle
:PREROUTING ACCEPT [44457014385:7315518035795]
:INPUT ACCEPT [404840097:241773793538]
:FORWARD ACCEPT [44052174279:7073744241603]
:OUTPUT ACCEPT [526370610:171137381220]
:POSTROUTING ACCEPT [44578544703:7244881613871]
:bogus - [0:0]
:spoofing - [0:0]
-A PREROUTING -s 192.168.0.0/24 -i enp2s0 -j spoofing
-A PREROUTING -s 10.0.0.0/8 -i enp2s0 -j spoofing
-A PREROUTING -s 172.16.0.0/12 -i enp2s0 -j spoofing
-A PREROUTING -s 127.0.0.0/8 ! -i lo -j spoofing
-A PREROUTING -p tcp -m tcp --tcp-flags FIN,SYN FIN,SYN -j bogus
-A PREROUTING -p tcp -m tcp --tcp-flags SYN,RST SYN,RST -j bogus
-A PREROUTING -p tcp -m tcp --tcp-flags FIN,RST FIN,RST -j bogus
-A bogus -j LOG --log-prefix "BOGUS: "
-A bogus -j DROP
-A spoofing -j LOG --log-prefix "IP SPOOF: "
-A spoofing -j DROP
COMMIT

Any idea how to block these packets?

Forwarded packets:

iptables -vnL FORWARD
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
  44G 7074G DOCKER-USER  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
  44G 7074G DOCKER-ISOLATION-STAGE-1  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
  16G 4358G ACCEPT     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
  54M 3269M DOCKER     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0           
  28G 2712G ACCEPT     all  --  docker0 !docker0  0.0.0.0/0            0.0.0.0/0           
    0     0 ACCEPT     all  --  docker0 docker0  0.0.0.0/0            0.0.0.0/0           
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            state INVALID
    0     0 ACCEPT     all  --  docker0 enp2s0  0.0.0.0/0            0.0.0.0/0           
    0     0 ACCEPT     all  --  enp2s0 docker0  0.0.0.0/0            0.0.0.0/0           
    0     0 LOG        all  --  *      *       0.0.0.0/0            0.0.0.0/0            LOG flags 0 level 4 prefix "fw forward drop "
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            state NEW

Forward rules (partly injected by Docker):

-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -m state --state INVALID -j DROP
-A FORWARD -i docker0 -o enp2s0 -j ACCEPT
-A FORWARD -i enp2s0 -o docker0 -j ACCEPT

The OUTPUT chain should also drop invalid packets:

-A OUTPUT -m state --state INVALID -j DROP

Tombart

Asked: 2017-05-25 04:54:45 +0800 CST

hwclock: Cannot access the Hardware Clock via any known method

7

On a Debian server I'm having the following problem with hwclock:

$ hwclock --show 
hwclock: Cannot access the Hardware Clock via any known method.
hwclock: Use the --debug option to see the details of our search for an access method.

The system runs a backports kernel, Debian 4.9.18-1~bpo8+1 (2017-04-10).

Here is the debug output:

$ hwclock --debug
hwclock from util-linux 2.25.2
hwclock: cannot open /dev/rtc: Device or resource busy
No usable clock interface found.
hwclock: Cannot access the Hardware Clock via any known method.

Clock source:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

Finally, the rtc device does exist:

$ ls -l /dev/rtc*
lrwxrwxrwx 1 root root      4 Apr 29 16:41 /dev/rtc -> rtc0
crw------- 1 root root 253, 0 Apr 29 16:41 /dev/rtc0

Tombart

Asked: 2015-01-09 00:42:46 +0800 CST

systemd-journal in a Debian Jessie LXC container consumes 100% CPU

3

After creating a new LXC container based on Debian Jessie on an Ubuntu 14.04 host, systemd-journal eats all available CPU.

lxc-create -n jessie -t debian

Tombart

Asked: 2014-04-11 02:51:52 +0800 CST

nginx fastcgi rewrite: Primary script unknown

1

I have the following nginx configuration:

  location / {
      try_files $uri $uri/ index.html =404;

      if (!-e $request_filename) {
        rewrite ^/(.+)$ index.php?url=$1 last;
      }
  }


 location ~ .php$ {
    # protection from known vulnerability
    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    include fastcgi_params;
    fastcgi_pass   unix:/var/run/php5-fpm.sock;
    fastcgi_index  index.php;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
 }

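(fastcgi_params is the Debian package default)

It works for the request /, but when the request is rewritten, the primary script cannot be found:

The request /contact should be rewritten to /index.php?url=contact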

 *104 FastCGI sent in stderr: "Primary script unknown" while reading response header from upstream, client: 10.0.0.1, server: localhost, request: "GET /contact HTTP/1.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "localhost:8080"

I can't work out from the logs what fastcgi is actually trying to load. Which path is it using?

Tombart

Asked: 2013-11-12 01:29:30 +0800 CST

How to update grub with puppet?

4

I want to use puppet to change a line in /etc/default/grub to this:

GRUB_CMDLINE_LINUX="cgroup_enable=memory"

I tried using augeas, which seems able to do this:

   exec { "update_grub":
    command => "update-grub",
    refreshonly => true,
   }

  augeas { "grub-serial":
    context => "/files/etc/default/grub",
    changes => [
      "set /files/etc/default/grub/GRUB_CMDLINE_LINUX[last()] cgroup_enable=memory",
    ],
    notify => Exec['update_grub'],
  }

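It seems to work, but the resulting string is not quoted, and I'd also like to make sure that any other values are separated from it by a space.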

GRUB_CMDLINE_LINUX=cgroup_enable=memory

Is there some mechanism to append a value and quote the whole thing?

GRUB_CMDLINE_LINUX="quiet splash cgroup_enable=memory"
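To pin down the transformation being asked for, here is a minimal Python sketch of it. It is not Puppet or Augeas code and not an answer from the original thread; append_kernel_arg is a hypothetical helper that only illustrates the intended behaviour: keep existing values, add the new one once, and always emit a double-quoted, space-separated value.

```python
import re

def append_kernel_arg(line, arg, key="GRUB_CMDLINE_LINUX"):
    """Append arg to a KEY=value line, keeping the value quoted and space separated."""
    pattern = re.escape(key) + r"""=(["']?)(.*?)\1\s*"""
    match = re.fullmatch(pattern, line)
    if match is None:
        return line  # not the line we are looking for, leave it untouched
    values = match.group(2).split()
    if arg not in values:  # avoid adding the same argument twice
        values.append(arg)
    joined = " ".join(values)
    return f'{key}="{joined}"'

print(append_kernel_arg('GRUB_CMDLINE_LINUX="quiet splash"', "cgroup_enable=memory"))
print(append_kernel_arg("GRUB_CMDLINE_LINUX=cgroup_enable=memory", "cgroup_enable=memory"))
```

The first call yields the quoted, appended form shown above; the second repairs the unquoted line produced by the augeas attempt without duplicating the argument.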

Tombart

Asked: 2013-05-04 07:02:12 +0800 CST

Unable to start ejabberd after a hostname change

1

When I try to start the ejabberd service, it always crashes.

Starting jabber server: ejabberd
Crash dump was written to: /var/log/ejabberd/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})

Crash dump was written to: /var/log/ejabberd/erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
.

I changed the server's hostname; before that it worked fine. However, in the configuration I have:

{hosts, ["localhost", "private.localhost", "public.localhost"]}.

Tombart

Asked: 2013-03-18 05:02:09 +0800 CST

How to grant global privileges on PostgreSQL in Puppet?

1

The official postgresql module from puppetlabs allows granting privileges on a specific database.

postgresql::database_grant{'grant to myuser':
    privilege   => 'CREATE',
    db          => 'app_production',
    role        => 'myuser',
  }

This will execute:

 GRANT ${privilege} ON database ${db} TO ${role};

But I'd like to run a query that grants global privileges to a given user:

  ALTER role myuser login createdb;

Is there a way to do that? Or should I use a different Puppet module for PostgreSQL?

Tombart

Asked: 2013-03-18 04:39:43 +0800 CST

User can't remove a symlink they own

3

I'm trying to remove a symlink, and although I have the appropriate permissions, the operation is denied (the user is called capistrano):

capistrano $ rm -f /var/www/app/current
rm: cannot remove `/var/www/app/current': Permission denied

The user should have all permissions on the file:

lrwxrwxrwx 1 capistrano capistrano 42 17. mar 13.09 /var/www/app/current -> /var/www/app/releases/20130317120932/

capistrano $ file /var/www/app/current
/var/www/app/current: symbolic link to `/var/www/app/releases/20130317120932'

Any idea what's going wrong?

Edit:

The folder /var/www/app:

$ ls -laF /var/www/app/
total 16
drwxr-xr-x 4 www-data   www-data 4096 17. mar 14.15 ./
drwxrwxr-x 4 capistrano www-data 4096 17. mar 00.01 ../
drwxrwxr-x 6 capistrano www-data 4096 17. mar 14.15 releases/
drwxrwxr-x 7 capistrano www-data 4096 17. mar 00.39 shared/

The user capistrano belongs to these groups:

$ groups
capistrano www-data rvm
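Removing a name from a directory is governed by the permissions of the containing directory, not by those of the symlink itself. The Python sketch below approximates that classic Unix check for the listing above. It is an illustrative addition: can_remove_entry is a made-up name, the numeric IDs are hypothetical stand-ins for www-data and capistrano, and root, ACLs and the sticky bit are deliberately ignored.

```python
def can_remove_entry(dir_mode, dir_uid, dir_gid, user_uid, user_gids):
    """Approximate the Unix check for deleting a name from a directory.

    Needs write and search (execute) permission on the directory.
    Ignores root, ACLs and the sticky bit.
    """
    if user_uid == dir_uid:
        bits = (dir_mode >> 6) & 0o7   # owner permission bits
    elif dir_gid in user_gids:
        bits = (dir_mode >> 3) & 0o7   # group permission bits
    else:
        bits = dir_mode & 0o7          # other permission bits
    return (bits & 0o3) == 0o3         # write (2) and execute (1)

# Hypothetical numeric IDs: www-data = 33, capistrano = 1001
WWW_DATA, CAPISTRANO = 33, 1001
# /var/www/app is drwxr-xr-x www-data:www-data in the listing above
print(can_remove_entry(0o755, WWW_DATA, WWW_DATA, CAPISTRANO, {CAPISTRANO, WWW_DATA}))
```

With mode 755 the group only has r-x on /var/www/app, so the check returns False even though capistrano owns the symlink; with group write (775) it would return True.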

Tombart

Asked: 2013-03-17 13:36:46 +0800 CST

How to install a package from source with Puppet?

1

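I want to install a package from source that doesn't have a binary package (deb, rpm) yet.

How do I stop the module from being executed if it is already installed on the machine?

I'm using: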

  Exec {
    creates => "${zookeeper_path}/zookeeper/bin/zkServer.sh"
  }

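However, all the other blocks are executed anyway. What is the best approach? Checking whether several files exist? I don't want to unpack and recompile everything each time puppet checks for changes.

Edit:

The installation process consists of several steps:

1. fetch the tar.gz package
2. unpack the package
3. create several configuration files
4. create the service
5. make sure the service is running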

Tombart

Asked: 2013-03-11 08:57:06 +0800 CST

Why can Puppet require each package only once?

6

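When defining dependencies in classes, each Package can be declared only once globally. I have a configuration hierarchy: some packages should be installed on all machines (the default configuration), while others should be installed only on certain categories of machines. When Puppet treats this as a duplicate declaration, how should I check whether the package is already on the machine?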

  Duplicate declaration: Package[wget] is already declared

Should I use a function like this?

  if defined( Package[$package] ) {
    debug("$package already installed")
  } else {
    package { $package: ensure => $ensure }
  }

I would expect a configuration tool to handle this by default... Am I missing something?

Tombart

Asked: 2013-03-11 06:31:36 +0800 CST

How can I call the Ruby function basename in Puppet?

5

I'd like to call the File.basename function available in Ruby. Is that possible in Puppet?

Something like:

$filename = basename($download_url)

Tombart

Asked: 2013-03-08 01:53:20 +0800 CST

How to pass parameters to a Puppet module?

6

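What is the best practice for configuring Puppet modules? I'm on Puppet 2.7.11. I find this approach quite messy; it looks like using global variables.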

node default {
   $always_apt_update = true
   include apt
}

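Should I create classes that inherit most of their configuration from the original one? The documentation seems to cover too many versions, and I'm not sure which one applies to me.

Update:

When I try this: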

  class { 'apt': 
    always_update => 'true',
  }

I get an error:

Error 400 on SERVER: Invalid parameter always_update at /etc/puppet/manifests/nodes.pp:32

Tombart

Asked: 2013-03-01 10:13:55 +0800 CST

Puppet: environment variable $PATH is not set

1

I'm trying to install a module with puppet 2.7 on Debian 6.0, but I keep getting this error:

returns: change from notrun to 0 failed: Could not find command 'tar'

Here is the relevant code:

 file {"zookeeper-tarball":
    path => "${zookeeper_parent_dir}/${tarball}",
    source => "puppet:///modules/zookeeper/${tarball}",
    ensure => file,
  }

  exec { "zookeeper_untar":
    path => "${zookeeper_parent_dir}",
    command => "tar -xzf ${zookeeper_parent_dir}/${tarball}",
    cwd => "${zookeeper_parent_dir}",
    user => "$user",
    require =>  File["zookeeper-tarball"],
    creates => "${zookeeper_parent_dir}/zookeeper-${zookeeper_version}",
  }

In manifests/site.pp I have this:

Exec {
  path => "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
}

The user is root. Any idea what the problem might be? It looks as if $PATH is empty...
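One possible reading of the manifests above: the exec resource sets its own path attribute to the parent directory, and an attribute set on a resource takes precedence over a resource default such as the Exec block in site.pp. The Python sketch below is only an analogy for that command lookup, not Puppet's implementation; resolve is a made-up helper, and a temporary directory stands in for ${zookeeper_parent_dir}.

```python
import os
import shutil
import tempfile

def resolve(command, search_path):
    """Look up command only in the directories listed in search_path."""
    return shutil.which(command, path=search_path)

default_path = os.pathsep.join(
    ["/usr/local/sbin", "/usr/local/bin", "/usr/sbin", "/usr/bin", "/sbin", "/bin"]
)

with tempfile.TemporaryDirectory() as parent_dir:  # stands in for ${zookeeper_parent_dir}
    print(resolve("tar", parent_dir))    # None: that directory holds no tar binary
    print(resolve("tar", default_path))  # typically a path such as /bin/tar on Linux
```

If that reading is right, the search path is not empty; it simply contains a single directory in which tar does not live.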
