Description
During the initial full push, now running for more than 3 days (~40700 snapshots, `zfs list` shows 2.7T used), `zrepl status` has become unable to list the job and reports `status fetch: Post "http://unix/status": EOF`. I suspect (but don't know) that the broken pipe error appeared after the full sends of all datasets had completed, since a few hours earlier only about 0.1TB remained.
journalctl on the push side shows repeated control-socket write timeouts, ending in a broken pipe:

```
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [prod_to_backups][snapshot][x32r$TEiz$nFrA$nFrA.qVMz]: callback channel is full, discarding snapshot update event
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: write: broken pipe"
```
Throughout the initial sync I kept observing `callback channel is full, discarding snapshot update event` and simply ignored those messages, assuming they were transient and not critical. Now that the initial full sends are complete, however, I'd like to capture this issue before blindly restarting `zrepl.service`.
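The CLI error suggests `zrepl status` does an HTTP `POST` to `http://unix/status` over the control socket. As an additional data point, the socket can presumably be probed directly, bypassing the CLI; a minimal sketch, assuming the socket path from the journal lines above and an empty-body POST (which is only my guess at what the CLI sends):

```
# Probe the zrepl control socket directly, bypassing the CLI.
# Socket path taken from the journalctl output above; the empty-body
# POST mirrors what the "Post http://unix/status: EOF" error suggests.
curl -sS --unix-socket /var/run/zrepl/control -X POST http://unix/status
```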
On the receiving (sink) side, `zrepl status` is likewise unable to show the sink job details. Unlike on the sending side, though, I am able to get raw status output there (`/usr/bin/zrepl status --mode raw`):
```
{
  "Jobs": {
    "_control": {
      "internal": null,
      "type": "internal"
    },
    "sink": {
      "sink": {
        "Snapper": null
      },
      "type": "sink"
    }
  },
  "Global": {
    "ZFSCmds": {
      "Active": null
    },
    "Envconst": {
      "Entries": [
        {
          "Var": "ZREPL_ENDPOINT_LIST_ABSTRACTIONS_QUERY_CREATETXG_RANGE_BOUND_ALLOW_0",
          "Value": "false",
          "ValueGoType": "bool"
        },
        {
          "Var": "ZREPL_TRACE_DEBUG_ENABLED",
          "Value": "false",
          "ValueGoType": "bool"
        },
        {
          "Var": "ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_TRACE_ID_NUM_BYTES",
          "Value": "3",
          "ValueGoType": "int"
        },
        {
          "Var": "ZFS_RECV_PIPE_CAPACITY_HINT",
          "Value": "1048576",
          "ValueGoType": "int64"
        },
        {
          "Var": "ZREPL_TRANSPORT_DEMUX_TIMEOUT",
          "Value": "10s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_DAEMON_AUTOSTART_PPROF_SERVER",
          "Value": "",
          "ValueGoType": "string"
        },
        {
          "Var": "ZREPL_ENDPOINT_RECV_PEEK_SIZE",
          "Value": "1048576",
          "ValueGoType": "int64"
        },
        {
          "Var": "ZREPL_SNAPPER_SYNCUP_WARN_MIN_DURATION",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_DAEMON_CONTROL_SERVER_READ_TIMEOUT",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_ZFS_MAX_HOLD_TAG_LEN",
          "Value": "255",
          "ValueGoType": "int"
        },
        {
          "Var": "ZREPL_ZFS_RESUME_RECV_POOL_SUPPORT_RECHECK_TIMEOUT",
          "Value": "30s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_ZFS_SEND_STDERR_MAX_CAPTURE_SIZE",
          "Value": "32768",
          "ValueGoType": "int"
        },
        {
          "Var": "ZREPL_ACTIVITY_TRACE",
          "Value": "",
          "ValueGoType": "string"
        },
        {
          "Var": "ZREPL_RPC_SERVER_VERSIONHANDSHAKE_TIMEOUT",
          "Value": "10s",
          "ValueGoType": "time.Duration"
        }
      ]
    },
    "OsEnviron": [
      "LANG=en_US.UTF-8",
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin",
      "INVOCATION_ID=2bb3bb3b2ed447ca8bda702cb8a07f30",
      "JOURNAL_STREAM=8:410484",
      "RUNTIME_DIRECTORY=/run/zrepl:/run/zrepl/stdinserver",
      "SYSTEMD_EXEC_PID=114360",
      "GOTRACEBACK=crash"
    ]
  }
}
```
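Note that the dump above shows `ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT` at its `1s` default, which matches the `i/o timeout` errors in the push-side journal. Assuming zrepl picks these `ZREPL_*` knobs up from the daemon's environment (the `Envconst` block suggests it does), a possible mitigation to try after the restart would be a systemd drop-in raising that timeout; a sketch, where the `10s` value is just my own guess:

```
# Hypothetical mitigation for after the restart: raise the control-socket
# write timeout from its 1s default (10s is an arbitrary guess).
sudo systemctl edit zrepl.service
#   [Service]
#   Environment=ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT=10s
sudo systemctl restart zrepl.service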
Performing the same `/usr/bin/zrepl status --mode raw` on the sending side produces an error:

```
$ /usr/bin/zrepl status --mode raw
Post "http://unix/status": EOF
```
Using `nload` I am still seeing transfer happening across the links, which suggests zrepl may still be operating/syncing despite being unable to produce a valid `zrepl status`.
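For completeness, replication progress can be corroborated without going through the control socket at all; a small sketch of what I can check on either side (plain ZFS/Linux tooling, nothing zrepl-specific):

```
# Corroborate ongoing replication without touching the zrepl control socket.
zpool iostat -v 5             # sustained read/write bandwidth on the pools
ps -o pid,etime,args -C zfs   # long-running `zfs send` / `zfs receive` processes
```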

If there is any additional info I can post to help identify the cause before I restart the service, I'm happy to provide it.
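One capture I could take immediately before restarting is a goroutine dump: the unit runs with `GOTRACEBACK=crash` (see `OsEnviron` above), and the Go runtime prints all goroutine stacks to stderr on `SIGQUIT` before terminating, so the stacks should land in the journal. A sketch, using the push-side PID from the journal lines above:

```
# Dump all goroutine stacks into the journal, then restart.
# SIGQUIT terminates the process, so only do this when ready to restart.
sudo kill -QUIT 1510727                       # push-side daemon PID from journalctl
journalctl -u zrepl.service --since "-5min"   # collect the stack dump
sudo systemctl restart zrepl.service
```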
Perhaps related: https://github.com/zrepl/zrepl/issues/379