Description
During the initial full push, now running for more than 3 days (~40700 snapshots, `zfs list` shows 2.7T used), `zrepl status` has become unable to list the job and reports `status fetch: Post "http://unix/status": EOF`. I suspect (but don't know) that the broken pipe error appeared after the full sends of all datasets had completed, since a few hours earlier only about 0.1TB remained.
journalctl on the push side shows repeated control-socket write timeouts, ending in a broken pipe:

```
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [prod_to_backups][snapshot][x32r$TEiz$nFrA$nFrA.qVMz]: callback channel is full, discarding snapshot update event
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: i/o timeout"
zrepl[1510727]: [_control][job][x32r$x32r]: control handler io error err="write unix /var/run/zrepl/control->@: write: broken pipe"
```
Throughout the initial sync I kept observing `callback channel is full, discarding snapshot update event` and simply ignored those messages, assuming they were transient and not critical. Now that the initial full sends are complete, however, I'd like to capture this issue before blindly restarting `zrepl.service`.
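The CLI error suggests `zrepl status` does an HTTP `POST` to `http://unix/status` over the control socket. As an additional data point, the socket can presumably be probed directly, bypassing the CLI; a minimal sketch, assuming the socket path from the journal lines above and an empty-body POST (which is only my guess at what the CLI sends):

```
# Probe the zrepl control socket directly, bypassing the CLI.
# Socket path taken from the journalctl output above; the empty-body
# POST mirrors what the "Post http://unix/status: EOF" error suggests.
curl -sS --unix-socket /var/run/zrepl/control -X POST http://unix/status
```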
On the receiving (sink) side, `zrepl status` is likewise unable to show the sink job details. Unlike on the sending side, though, I am able to get raw status output there (`/usr/bin/zrepl status --mode raw`):
```
{
  "Jobs": {
    "_control": {
      "internal": null,
      "type": "internal"
    },
    "sink": {
      "sink": {
        "Snapper": null
      },
      "type": "sink"
    }
  },
  "Global": {
    "ZFSCmds": {
      "Active": null
    },
    "Envconst": {
      "Entries": [
        {
          "Var": "ZREPL_ENDPOINT_LIST_ABSTRACTIONS_QUERY_CREATETXG_RANGE_BOUND_ALLOW_0",
          "Value": "false",
          "ValueGoType": "bool"
        },
        {
          "Var": "ZREPL_TRACE_DEBUG_ENABLED",
          "Value": "false",
          "ValueGoType": "bool"
        },
        {
          "Var": "ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_TRACE_ID_NUM_BYTES",
          "Value": "3",
          "ValueGoType": "int"
        },
        {
          "Var": "ZFS_RECV_PIPE_CAPACITY_HINT",
          "Value": "1048576",
          "ValueGoType": "int64"
        },
        {
          "Var": "ZREPL_TRANSPORT_DEMUX_TIMEOUT",
          "Value": "10s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_DAEMON_AUTOSTART_PPROF_SERVER",
          "Value": "",
          "ValueGoType": "string"
        },
        {
          "Var": "ZREPL_ENDPOINT_RECV_PEEK_SIZE",
          "Value": "1048576",
          "ValueGoType": "int64"
        },
        {
          "Var": "ZREPL_SNAPPER_SYNCUP_WARN_MIN_DURATION",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_DAEMON_CONTROL_SERVER_READ_TIMEOUT",
          "Value": "1s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_ZFS_MAX_HOLD_TAG_LEN",
          "Value": "255",
          "ValueGoType": "int"
        },
        {
          "Var": "ZREPL_ZFS_RESUME_RECV_POOL_SUPPORT_RECHECK_TIMEOUT",
          "Value": "30s",
          "ValueGoType": "time.Duration"
        },
        {
          "Var": "ZREPL_ZFS_SEND_STDERR_MAX_CAPTURE_SIZE",
          "Value": "32768",
          "ValueGoType": "int"
        },
        {
          "Var": "ZREPL_ACTIVITY_TRACE",
          "Value": "",
          "ValueGoType": "string"
        },
        {
          "Var": "ZREPL_RPC_SERVER_VERSIONHANDSHAKE_TIMEOUT",
          "Value": "10s",
          "ValueGoType": "time.Duration"
        }
      ]
    },
    "OsEnviron": [
      "LANG=en_US.UTF-8",
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin",
      "INVOCATION_ID=2bb3bb3b2ed447ca8bda702cb8a07f30",
      "JOURNAL_STREAM=8:410484",
      "RUNTIME_DIRECTORY=/run/zrepl:/run/zrepl/stdinserver",
      "SYSTEMD_EXEC_PID=114360",
      "GOTRACEBACK=crash"
    ]
  }
}
```
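Note that the dump above shows `ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT` at its `1s` default, which matches the `i/o timeout` errors in the push-side journal. Assuming zrepl picks these `ZREPL_*` knobs up from the daemon's environment (the `Envconst` block suggests it does), a possible mitigation to try after the restart would be a systemd drop-in raising that timeout; a sketch, where the `10s` value is just my own guess:

```
# Hypothetical mitigation for after the restart: raise the control-socket
# write timeout from its 1s default (10s is an arbitrary guess).
sudo systemctl edit zrepl.service
#   [Service]
#   Environment=ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT=10s
sudo systemctl restart zrepl.service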
Performing the same `/usr/bin/zrepl status --mode raw` on the sending side produces an error:

```
$ /usr/bin/zrepl status --mode raw
Post "http://unix/status": EOF
```
Using `nload` I am still seeing transfer happening across the links, which suggests zrepl may still be operating/syncing despite being unable to produce a valid `zrepl status`.
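For completeness, replication progress can be corroborated without going through the control socket at all; a small sketch of what I can check on either side (plain ZFS/Linux tooling, nothing zrepl-specific):

```
# Corroborate ongoing replication without touching the zrepl control socket.
zpool iostat -v 5             # sustained read/write bandwidth on the pools
ps -o pid,etime,args -C zfs   # long-running `zfs send` / `zfs receive` processes
```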

If there is any additional info I can post to help identify the cause before I restart the service, I'm happy to provide it.
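One capture I could take immediately before restarting is a goroutine dump: the unit runs with `GOTRACEBACK=crash` (see `OsEnviron` above), and the Go runtime prints all goroutine stacks to stderr on `SIGQUIT` before terminating, so the stacks should land in the journal. A sketch, using the push-side PID from the journal lines above:

```
# Dump all goroutine stacks into the journal, then restart.
# SIGQUIT terminates the process, so only do this when ready to restart.
sudo kill -QUIT 1510727                       # push-side daemon PID from journalctl
journalctl -u zrepl.service --since "-5min"   # collect the stack dump
sudo systemctl restart zrepl.service
```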
Perhaps related: https://github.com/zrepl/zrepl/issues/379