multipathd using all CPU for iowait · Issue #539 · flatcar/Flatcar
Closed
@george-angel

Description

Flatcar version (from /etc/os-release):

PRETTY_NAME="Flatcar Container Linux by Kinvolk 2905.2.5 (Oklo)"

We are running Trident (https://github.com/NetApp/trident) on our bare-metal, on-prem cluster.

Our StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netapp-ontap-san-ext4
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.trident.netapp.io
allowVolumeExpansion: true
parameters:
  backendType: "ontap-san"
  storagePools: "ontapsan_10.20.50.4:.*"
  fsType: "ext4"
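(As a sanity check that the class is registered and picked up as the default, something like the following works; this is just a sketch and assumes kubectl access to the cluster:)

# Confirm the StorageClass exists and carries the default-class annotation
kubectl get storageclass netapp-ontap-san-ext4
# Inspect the provisioner and parameters as the API server sees them
kubectl describe storageclass netapp-ontap-san-ext4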

Our running iscsid and multipathd configs are attached:

iscsi-config.txt
mpath-config.txt
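(Roughly how these were captured; multipathd can dump its effective configuration itself, and the iscsid path below is the stock open-iscsi location, which may differ on other setups:)

# Effective multipath configuration as the running daemon sees it
multipathd show config > mpath-config.txt
# iscsid settings (stock open-iscsi config path)
cat /etc/iscsi/iscsid.conf > iscsi-config.txt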

We currently run fstrim weekly.
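(The weekly trim is the usual systemd fstrim timer; checking it and triggering a manual pass looks roughly like this, assuming the stock util-linux units:)

# Show when the weekly trim last ran and when the next run is due
systemctl list-timers fstrim.timer
# Trim all mounted filesystems that support discard, with per-mount output
fstrim --all --verbose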

Previously, when one of our volumes filled up (because we aren't doing online discard on our mounts), we observed that the volume would be flipped to read-only (ro) mode.

Recently (I don't know exactly which version, or even whether this was associated with a Flatcar version change), instead of the volume being flipped to ro, we see multipathd become unresponsive and consume a lot of CPU in iowait:

Screenshot (iotop): 2021-11-01-132437_1236x265_scrot

Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 107036 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
 177230 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
 109544 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
 226892 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % multipathd -d -s
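(When this happens, the state we check looks roughly like the following; sdd and dm-2 are the devices that show up in the kernel log below:)

# Multipath maps and per-path state for the affected LUNs
multipath -ll
# Was the filesystem on the multipath device remounted read-only?
findmnt -o TARGET,SOURCE,OPTIONS | grep dm-2
# Per-device read-only flag for the failing path
lsblk -o NAME,RO,TYPE,MOUNTPOINT /dev/sdd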

From kernel logs:

Nov 01 10:59:26 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Nov 01 10:59:26 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 Sense Key : Data Protect [current]
Nov 01 10:59:26 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 Add. Sense: Space allocation failed write protect
Nov 01 10:59:27 worker-24.dev.merit.uw.systems kernel: sd 6:0:0:1035: [sdd] tag#59 CDB: Write(10) 2a 00 00 5d 71 b8 00 00 80 00
Nov 01 10:59:27 worker-24.dev.merit.uw.systems kernel: blk_update_request: critical space allocation error, dev sdd, sector 6123960 op 0x1:(WRITE) flags 0x4200 phys_seg 16 prio class 0
Nov 01 10:59:29 worker-24.dev.merit.uw.systems kernel: blk_update_request: critical space allocation error, dev dm-2, sector 6123960 op 0x1:(WRITE) flags 0x4000 phys_seg 16 prio class 0
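(The "Space allocation failed write protect" sense data lines up with the volume-full situation described above. To tie the failing SCSI path back to its multipath map, something like this works; the device names are the ones from the log:)

# Which multipath map does the failing path belong to?
lsblk -o NAME,TYPE,SIZE /dev/sdd
# WWID and table details for the dm device reported in the error
dmsetup info /dev/dm-2
# Path status as multipathd currently sees it
multipathd show paths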

full log
blkid-debug.log

To help debug this problem, I want to understand what changed in kernel/multipath behavior to make it react so badly when it is unable to write a block.

Thank you

Metadata

Labels: kind/bug (Something isn't working)
Project status: Implemented