/usr/bin/zrepl status --mode raw reports Post "http://unix/status": EOF #883
Open
@chrisjsimpson

Description


During an initial full push that has been running for more than 3 days (~40700 snapshots, `zfs list` reports 2.7T used), `zrepl status` is now unable to list the job and reports `status fetch: Post "http://unix/status": EOF`. I suspect (but don't know) that the broken pipe error appeared after the full sends of all datasets had completed, since a few hours prior there was only about 0.1TB remaining.

Push side:
(screenshot of `zrepl status` failing on the push side)

`journalctl` on the push side reports repeated control-handler i/o timeouts and, eventually, a broken pipe:

```
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [prod_to_backups][snapshot][x32r$TEiz$nFrA$nFrA.qVMz]: callback channel is full, discarding snapshot update event
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: write: broken pipe"
```

Throughout the initial sync I observed `callback channel is full, discarding snapshot update event` messages and simply ignored them, assuming they were transient and not critical.

However, now that the initial full sends are complete, I'd like to capture this issue before blindly restarting zrepl.service.
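For reference, here is a rough checklist of what I plan to capture from the push side before restarting (the control socket path is the one from the journal above; the SIGQUIT step is a last resort since it terminates the daemon):

```
# Confirm the control socket still exists and which process holds it.
ls -l /var/run/zrepl/control
ss -xp | grep zrepl

# Save the recent daemon log for this report.
journalctl -u zrepl.service --since "3 days ago" > zrepl-journal.txt

# Last resort (terminates the daemon): SIGQUIT makes the Go runtime dump all
# goroutine stacks to the journal before exiting, showing where the control
# handler is stuck.
# kill -QUIT "$(pidof zrepl)"
```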

On the receiving (sink) side, `zrepl status` is also unable to show the sink job details:

(screenshot of `zrepl status` on the sink side)

On the sink side (unlike on the sending side), I am able to get a raw status output (`/usr/bin/zrepl status --mode raw`):

```
{
  "Jobs": {
    "_control": {
      "internal": null,
      "type": "internal"
    },
    "sink": {
      "sink": {
        "Snapper": null
      },
      "type": "sink"
    }
  },
  "Global": {
    "ZFSCmds": {
      "Active": null
    },
    "Envconst": {
      "Entries": [
        {
          "Var": "ZREPL_ENDPOINT_LIST_ABSTRACTIONS_QUERY_CREATETXG_RANGE_BOUND_ALLOW_0",
          "Value": "false",
          "ValueGoType": "bool"
        },
        {
          "Var": "ZREPL_TRACE_DEBUG_ENABLED",
          "Value": "false",
          "ValueGoType": "bool"
        },
        {
          "Var": "ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_TRACE_ID_NUM_BYTES",
          "Value": "3",
          "ValueGoType": "int"
        },
        {
          "Var": "ZFS_RECV_PIPE_CAPACITY_HINT",
          "Value": "1048576",
          "ValueGoType": "int64"
        },
        {
          "Var": "ZREPL_TRANSPORT_DEMUX_TIMEOUT",
          "Value": "10s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_DAEMON_AUTOSTART_PPROF_SERVER",
          "Value": "",
          "ValueGoType": "string"
        },
        {
          "Var": "ZREPL_ENDPOINT_RECV_PEEK_SIZE",
          "Value": "1048576",
          "ValueGoType": "int64"
        },
        {
          "Var": "ZREPL_SNAPPER_SYNCUP_WARN_MIN_DURATION",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_DAEMON_CONTROL_SERVER_READ_TIMEOUT",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_ZFS_MAX_HOLD_TAG_LEN",
          "Value": "255",
          "ValueGoType": "int"
        },
        {
          "Var": "ZREPL_ZFS_RESUME_RECV_POOL_SUPPORT_RECHECK_TIMEOUT",
          "Value": "30s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_ZFS_SEND_STDERR_MAX_CAPTURE_SIZE",
          "Value": "32768",
          "ValueGoType": "int"
        },
        {
          "Var": "ZREPL_ACTIVITY_TRACE",
          "Value": "",
          "ValueGoType": "string"
        },
        {
          "Var": "ZREPL_RPC_SERVER_VERSIONHANDSHAKE_TIMEOUT",
          "Value": "10s",
          "ValueGoType": "time.Duration"
        }
      ]
    },
    "OsEnviron": [
      "LANG=en_US.UTF-8",
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin",
      "INVOCATION_ID=2bb3bb3b2ed447ca8bda702cb8a07f30",
      "JOURNAL_STREAM=8:410484",
      "RUNTIME_DIRECTORY=/run/zrepl:/run/zrepl/stdinserver",
      "SYSTEMD_EXEC_PID=114360",
      "GOTRACEBACK=crash"
    ]
  }
}
```
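The Envconst dump shows `ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT` defaulting to `1s`, which matches the `i/o timeout` lines in the journal. Assuming these Envconst entries can be overridden from the daemon's environment, one experiment (untested) for when I do eventually restart would be a systemd drop-in that raises the timeout:

```
# Untested idea: raise the control-server write timeout via a drop-in,
# then restart and watch whether `zrepl status` stays responsive.
sudo systemctl edit zrepl.service
# add in the editor:
#   [Service]
#   Environment=ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT=10s
sudo systemctl restart zrepl.service
```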

Performing the same `/usr/bin/zrepl status --mode raw` on the sending side produces an error:

```
/usr/bin/zrepl status --mode raw
Post "http://unix/status": EOF
```
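The status client appears to just POST to the daemon's control socket, so the same failure should be reproducible without the zrepl binary, which would confirm the daemon side is wedged rather than the CLI (a sketch, assuming `curl` with unix-socket support is available on the host):

```
# Talk to the control socket directly; an immediate EOF / empty reply here
# points at the daemon's control handler rather than the status client.
curl -v --unix-socket /var/run/zrepl/control -X POST http://unix/status
```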

Using `nload` I can still see transfer happening across the links, which suggests zrepl may still be operating/syncing despite not being able to return a valid `zrepl status`.

![Image](https://github.com/user-attachments/assets/970630aa-30c4-48c5-844b-7b99871bbd94)
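To double-check that replication itself is still progressing while the control socket is wedged, these generic ZFS-side checks (not zrepl-specific) can be run on either host:

```
# Are zfs send/receive processes still alive?
pgrep -af 'zfs (send|recv|receive)'

# Watch pool-level write activity on the sink to confirm data is still landing.
zpool iostat -v 5
```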

I'm happy to provide any additional information that would help identify the cause before I restart the service.

Perhaps related: https://github.com/zrepl/zrepl/issues/379
