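Problem description
A server that was still installed with CentOS 7 as late as 2021 has recently been becoming unresponsive from time to time: remote connections fail, keyboard and mouse input lags, and the kernel logs errors: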
Aug 11 12:45:31 <machine name> kernel: warn_alloc_failed: 1380491 callbacks suppressed
Aug 11 12:45:31 <machine name> kernel: swapper/27: page allocation failure: order 0, mode: 0x20
Aug 11 12:45:31 <machine name> kernel: CPU: 27 PID: 0 Comm: swapper/27 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.62.1.el7.x86_64 #1
Aug 11 12:45:31 <machine name> kernel: Hardware name: <name>
Aug 11 12:45:31 <machine name> kernel: Call Trace:
Aug 11 12:45:31 <machine name> kernel: <IRQ> [<ffffffffbcb865a9>] dump_stack+0x19/0x1b
Aug 11 12:45:31 <machine name> kernel: [<ffffffffbc5c4bd0>] warn_alloc_failed+0x110/0x180
Aug 11 12:45:31 <machine name> kernel: [<ffffffffbc4d3383>] ? __wake_up+0x13/0x20
Aug 11 12:45:31 <machine name> kernel: [<ffffffffbc5c976f>] __alloc_pages_nodemask+0x9df/0xbe0
Aug 11 12:45:31 <machine name> kernel: [<ffffffffbc5c9b78>] page_frag_alloc+0x158/0x170
Aug 11 12:45:31 <machine name> kernel: [<ffffffffbca4808d>] __napi_alloc_skb+0x8d/0xe0
Aug 11 12:45:31 <machine name> kernel: [<ffffffffc03df895>] mlx5e_skb_from_cqe_mpwrq_nonlinear+0x65/0x2f0 [mlx5_core]
Aug 11 12:45:31 <machine name> kernel: [<ffffffffc03dfdbf>] mlx5e_handle_rx_cqe_mpwrq+0xbf/0x8c0 [mlx5_core]
Aug 11 12:45:31 <machine name> kernel: swapper/57: page allocation failure: order 0, mode: 0x20
Aug 11 12:45:31 <machine name> kernel: CPU: 57 PID: 0 Comm: swapper/57 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.62.1.el7.x86_64 #1
Aug 11 12:45:31 <machine name> kernel: Hardware name: <name>
Aug 11 12:45:31 <machine name> kernel: Call Trace:
Aug 11 12:45:31 <machine name> kernel: <IRQ> [<ffffffffbcb865a9>] dump_stack+0x19/0x1b
Aug 11 12:45:31 <machine name> kernel: [<ffffffffc03e14be>] mlx5e_napi_poll+0xbe/0xd10 [mlx5_core]
Aug 11 12:45:31 <machine name> kernel: [<ffffffffbc78da24>] ? __radix_tree_lookup+0x84/0xf0
......
A reboot restores normal operation for a while, but the server hangs again after several hours to several days.
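To get a feel for how widespread these failures are, the log lines can be tallied by allocation order, mode, and task name. The following is only a minimal sketch in Python, assuming the kernel messages have been saved as text lines in the format shown above; the function name tally_failures and the sample host name are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: swapper/27: page allocation failure: order 0, mode: 0x20"
FAILURE_RE = re.compile(
    r"kernel: (?P<comm>\S+): page allocation failure: "
    r"order (?P<order>\d+), mode: (?P<mode>0x[0-9a-fA-F]+)"
)


def tally_failures(lines):
    """Count page allocation failures per (order, mode) and per task name."""
    by_kind = Counter()
    by_comm = Counter()
    for line in lines:
        m = FAILURE_RE.search(line)
        if m is None:
            continue  # not a "page allocation failure" line
        by_kind[(int(m.group("order")), m.group("mode"))] += 1
        # "swapper/27" -> "swapper": drop the per-CPU suffix
        by_comm[m.group("comm").split("/")[0]] += 1
    return by_kind, by_comm


# Hypothetical sample input, shaped like the log above
sample = [
    "Aug 11 12:45:31 host kernel: swapper/27: page allocation failure: order 0, mode: 0x20",
    "Aug 11 12:45:31 host kernel: Call Trace:",
    "Aug 11 12:45:31 host kernel: swapper/57: page allocation failure: order 0, mode: 0x20",
]
by_kind, by_comm = tally_failures(sample)
print(by_kind)
print(by_comm)
```

In the log above every failure is an order-0 request, i.e. a single page, raised from the mlx5 receive path in interrupt context, so even the smallest allocation could not be satisfied at that moment.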
Troubleshooting
• • • >>
When testing SYCL programs on a backend based on Clang 14, I noticed that the gdb-9 bundled with Ubuntu 20.04 cannot correctly read the debug info of binaries produced by clang-14, so no stack information is available at all.
I haven't figured out why, but compiling and installing gdb-13 fixes the problem.
NOTE: tested with a vendor-specific compiler based on clang-14; not sure whether this applies to vanilla Clang.
Strange things happened when debugging the new pipeline structure of simple-radio-telescope-backend: when emulated_fp64 is switched on (that is, when the df64 type from "Emulated double precision Double single routine header" by StickGuy, Norbert Juffa, Reimar, et al. is used), everything works fine under the Debug configuration, but the Release configuration gives wrong results!
One example is evaluating the phase delay of a radio wave in plasma caused by electrodynamics. The difference between cos() and sin() of this phase evaluated using float64 and using df64 is shown below:
Figure: Difference between the results given by df64 and double under Debug configuration, tested on NVIDIA RTX A4000
Figure: Difference between the results given by df64 and double under Release configuration, tested on NVIDIA RTX A4000
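For readers unfamiliar with double-single arithmetic, here is a rough sketch of the idea in Python: a value is stored as an unevaluated sum hi + lo of two float32 numbers, and additions use error-free transformations (Knuth's two-sum) to carry the rounding error into lo. This is not the code from the header mentioned above; the helper names f32, split_df64, two_sum and df64_add are made up for illustration, and float32 rounding is emulated with the struct module.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split_df64(x):
    """Represent x as a (hi, lo) pair of float32 values with hi + lo ~= x."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def two_sum(a, b):
    """Knuth two-sum in float32: s is the rounded sum, e its rounding error."""
    s = f32(a + b)
    v = f32(s - a)
    e = f32(f32(a - f32(s - v)) + f32(b - v))
    return s, e


def df64_add(x, y):
    """Add two (hi, lo) pairs, keeping the rounding error in the low word."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    # Renormalize so that hi again holds the leading part of the sum
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


x = 1.0 + 2.0 ** -30                      # not representable in float32
print(f32(x))                             # plain float32 drops the small term
print(split_df64(x))                      # the (hi, lo) pair keeps it
print(df64_add(split_df64(x), split_df64(2.0 ** -31)))
```

Note that the error term e in two_sum is exactly zero in real-number algebra and is nonzero only because every intermediate result is rounded. Optimizations that reassociate or simplify floating-point expressions can therefore remove it. Whether that is what goes wrong in the Release build here is only a guess; nothing above establishes it.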
• • • >>
Credit: Chenchen Miao and Cijie Zhang
1. Operate the server remotely using HPE's iLO
- Note down the iLO username and password printed on the label on the server chassis
- A screen needs to be attached for the first boot; the IP address of iLO is displayed during the boot stage
• • • >>
TL;DR: this is a stack overflow caused by a large array on the stack, so enlarge the stack size using ulimit -s <stack size in KB>
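To judge whether an array can fit on the stack at all, its size can be compared with the current soft stack limit. The sketch below is in Python and only illustrates the arithmetic; the helper names, the 1024 KB headroom, and the array size (4194304 four-byte elements) are illustrative assumptions, not values taken from dspsr's source.

```python
import resource


def stack_limit_bytes():
    """Return the soft stack limit in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return None if soft == resource.RLIM_INFINITY else soft


def fits_on_stack(n_elements, element_size, limit_bytes):
    """True if an array of this size is smaller than the given stack limit."""
    if limit_bytes is None:
        return True
    return n_elements * element_size < limit_bytes


def ulimit_kb_needed(n_elements, element_size, headroom_kb=1024):
    """Value in KB for `ulimit -s`: the array size plus some headroom."""
    array_kb = (n_elements * element_size + 1023) // 1024  # round up to KB
    return array_kb + headroom_kb


# Hypothetical array: 4194304 elements of 4 bytes each = 16 MiB
n, size = 4194304, 4
print("current soft limit (bytes):", stack_limit_bytes())
print("fits in an 8192 KB stack:", fits_on_stack(n, size, 8192 * 1024))
print("suggested ulimit -s value (KB):", ulimit_kb_needed(n, size))
```

ulimit -s only affects the current shell and the processes started from it, so it has to be run in the same shell session before launching the program.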
Incident
An error occurred when processing raw baseband data of a radio telescope using dspsr:
$ dspsr -b 4194304 -D 56.716 -A -L 1.073741824 -c 1.073741824 -O ${file}_128 -e rf -F 128:D ${file}.bin -U 4096
Only single polarization detection available
dspsr: Single archive with multiple sub-integrations
dspsr: dedispersion filter length=131072 (minimum=8192) complex samples
dspsr: 128 channel dedispersing filterbank requires 33554432 samples
dspsr: blocksize=330382096 samples or 4096 MB
dsp::Fold::choose_nbin WARNING Requested nbin=4194304 > sensible nbin=2097152. Where:
sampling period = 0.000256 ms and
requested bin width = 0.000256 ms
dsp::Archiver::finish archive '13835058401541322426_128.rf' with 1 integrations
62305 Segmentation Fault (Core dumped)
Analysis
• • • >>