multipathd using all CPU for iowait · Issue #539 · flatcar/Flatcar
Closed
@george-angel

Description

Flatcar version (from /etc/os-release):

PRETTY_NAME="Flatcar Container Linux by Kinvolk 2905.2.5 (Oklo)"

We are running Trident (https://github.com/NetApp/trident) on our bare-metal, on-prem cluster.

Our StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-ontap-san-ext4
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.trident.netapp.io
allowVolumeExpansion: true
parameters:
  backendType: "ontap-san"
  storagePools: "ontapsan_10.20.50.4:.*"
  fsType: "ext4"
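(As a sanity check that the class is registered and picked up as the default, something like the following works; this is just a sketch and assumes kubectl access to the cluster:)

# Confirm the StorageClass exists and carries the default-class annotation
kubectl get storageclass netapp-ontap-san-ext4
# Inspect the provisioner and parameters as the API server sees them
kubectl describe storageclass netapp-ontap-san-ext4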

Our running iscsid and multipathd configs are attached:

iscsi-config.txt
mpath-config.txt
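(Roughly how these were captured; multipathd can dump its effective configuration itself, and the iscsid path below is the stock open-iscsi location, which may differ on other setups:)

# Effective multipath configuration as the running daemon sees it
multipathd show config > mpath-config.txt
# iscsid settings (stock open-iscsi config path)
cat /etc/iscsi/iscsid.conf > iscsi-config.txt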

We currently run fstrim weekly.
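(The weekly trim is the usual systemd fstrim timer; checking it and triggering a manual pass looks roughly like this, assuming the stock util-linux units:)

# Show when the weekly trim last ran and when the next run is due
systemctl list-timers fstrim.timer
# Trim all mounted filesystems that support discard, with per-mount output
fstrim --all --verbose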

Previously, when one of our volumes filled up (because we aren't doing online discard on our mounts), we observed that the volume would be flipped to read-only (ro) mode.

Recently (I don't know exactly which version, or even whether this was associated with a Flatcar version change), instead of the volume being flipped to ro, we see multipathd become unresponsive and consume a lot of CPU in iowait:

Screenshot (iotop): 2021-11-01-132437_1236x265_scrot

Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 107036 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
 177230 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
 109544 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
 226892 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
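(When this happens, the state we check looks roughly like the following; sdd and dm-2 are the devices that show up in the kernel log below:)

# Multipath maps and per-path state for the affected LUNs
multipath -ll
# Was the filesystem on the multipath device remounted read-only?
findmnt -o TARGET,SOURCE,OPTIONS | grep dm-2
# Per-device read-only flag for the failing path
lsblk -o NAME,RO,TYPE,MOUNTPOINT /dev/sdd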

From kernel logs:

Nov 01 10:59:26 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Nov 01 10:59:26 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 Sense Key : Data Protect [current]
Nov 01 10:59:26 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 Add. Sense: Space allocation failed write protect
Nov 01 10:59:27 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 CDB: Write(10) 2a 00 00 5d 71 b8 00 00 80 00
Nov 01 10:59:27 worker-24.dev.merit.uw.systems kernel: blk_update_request: critical space allocation error, dev sdd, sector 6123960 op 0x1:(WRITE) flags 0x4200 phys_seg 16 prio class 0
Nov 01 10:59:29 worker-24.dev.merit.uw.systems kernel: blk_update_request: critical space allocation error, dev dm-2, sector 6123960 op 0x1:(WRITE) flags 0x4000 phys_seg 16 prio class 0
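(The "Space allocation failed write protect" sense data lines up with the volume-full situation described above. To tie the failing SCSI path back to its multipath map, something like this works; the device names are the ones from the log:)

# Which multipath map does the failing path belong to?
lsblk -o NAME,TYPE,SIZE /dev/sdd
# WWID and table details for the dm device reported in the error
dmsetup info /dev/dm-2
# Path status as multipathd currently sees it
multipathd show paths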

full log
blkid-debug.log

To help debug this problem, I want to understand what changed in kernel/multipath behavior to make it react so badly when it is unable to write a block.

Thank you

Metadata

Labels: kind/bug (Something isn't working)
Project status: Implemented