8000 don't stall gossip broadcasting when there are blocked connections by rade · Pull Request #1831 · weaveworks/weave · GitHub
This repository was archived by the owner on Jun 20, 2024. It is now read-only.

don't stall gossip broadcasting when there are blocked connections #1831

Closed
rade wants to merge 52 commits

Conversation

rade
Member
@rade rade commented Dec 24, 2015

See commits for explanation.

Note that this PR is based on #1826.

@rade rade force-pushed the gossip-broadcast-stall-3 branch from 71e8edc to 1306946 on December 24, 2015 17:35
@rade rade mentioned this pull request Dec 24, 2015
@rade rade force-pushed the cleaner-gossip-sender branch from dc1ebb0 to 9702faf on December 24, 2015 22:13
@rade rade force-pushed the gossip-broadcast-stall-3 branch from 1306946 to 5651a17 on December 24, 2015 22:14
@rade rade force-pushed the cleaner-gossip-sender branch from 9702faf to e952021 on December 30, 2015 12:10
@rade rade force-pushed the gossip-broadcast-stall-3 branch 2 times, most recently from fa74e07 to 4341ec2 on December 30, 2015 12:33
rade added 23 commits December 30, 2015 14:07
- level 2 headings for top-level instead of level 3
- put all status reporting as L3s under one L2 heading
- add 'Reboots' to ToC
- move 'Stopping weave' into its own section and eliminate duplication

Fixes #1842.
Also make the error message more meaningful.

Fixes #1843.
Instead, just don't auto-detect TLS args in that case.

This makes `weave --local launch` work (again).

Fixes #1844.
rade and others added 26 commits December 31, 2015 12:32
...rather than via an env var. This is cleaner.
PROCFS is always set
changed my mind

This reverts commit 20a8488.
I broke this in 3702126.

Fixes #1848.
Previously, broadcasts were handled by one GossipSender per broadcast
source (and channel), which sent each broadcast to all next hops and
hence could stall when a single destination was stalled.

Here we get rid of these per-broadcast-source GossipSenders. Instead,
broadcasts are sent to the per-connection GossipSenders of all next hops.

For this we use the existing GossipSenders we have set up for ordinary
gossip. In order to deal with broadcasts they need one GossipData cell
per broadcast source, since only broadcasts from the same source can
be Merge()ed. So we add a PeerName->GossipData map of cells, in
addition to the existing cell for ordinary gossip.
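
For illustration, here is a minimal Go sketch of such a per-connection
sender. The GossipSender, GossipData and PeerName names come from the
description above; the interface signatures, field names and helper
methods are assumptions for illustration, not the actual weave code.

```go
package gossip

import "sync"

// Stand-ins for the real types; the method set of GossipData here is
// an assumption, not the actual weave interface.
type PeerName string

type GossipData interface {
	Encode() [][]byte
	Merge(other GossipData) GossipData
}

// GossipSender sketch: one cell for ordinary gossip plus one cell per
// broadcast source, all drained by a single per-connection goroutine.
type GossipSender struct {
	sync.Mutex
	gossip     GossipData              // ordinary gossip, Merge()d in place
	broadcasts map[PeerName]GossipData // one cell per broadcast source
	more       chan struct{}           // wakes the sender goroutine
}

func NewGossipSender() *GossipSender {
	return &GossipSender{
		broadcasts: make(map[PeerName]GossipData),
		more:       make(chan struct{}, 1),
	}
}

// Send accumulates ordinary gossip into the single gossip cell.
func (s *GossipSender) Send(data GossipData) {
	s.Lock()
	if s.gossip == nil {
		s.gossip = data
	} else {
		s.gossip = s.gossip.Merge(data)
	}
	s.Unlock()
	s.prod()
}

// SendBroadcast accumulates a broadcast into the cell for its source;
// only broadcasts from the same source can be Merge()d.
func (s *GossipSender) SendBroadcast(src PeerName, data GossipData) {
	s.Lock()
	if cell, ok := s.broadcasts[src]; ok {
		s.broadcasts[src] = cell.Merge(data)
	} else {
		s.broadcasts[src] = data
	}
	s.Unlock()
	s.prod()
}

// prod nudges the sender goroutine without ever blocking the caller,
// which is the point: a slow connection cannot stall whoever is
// broadcasting.
func (s *GossipSender) prod() {
	select {
	case s.more <- struct{}{}:
	default: // a wake-up is already pending
	}
}
```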

The GossipSender goroutine picks one of the cells at a time, Encode()s
the contents and sends it. It prefers the ordinary gossip cell over
the broadcast cells since typically ordinary gossip is more important.

To reduce coupling, the GossipSenders don't actually know how to
construct protocol messages. They just invoke a couple of functions
for that - one for ordinary gossip and one for broadcast - which are
supplied by the GossipChannel.
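
Continuing the sketch above, the per-connection goroutine could drain
the cells roughly like this, with the two message-construction
callbacks supplied by the GossipChannel. The sendMsg, makeMsg,
makeBroadcastMsg and pick names are invented for illustration.

```go
// run drains cells until the connection dies. It prefers the ordinary
// gossip cell and only then picks one of the broadcast cells.
func (s *GossipSender) run(
	sendMsg func(msg []byte) error,      // writes one protocol message to the connection
	makeMsg func(payload []byte) []byte, // supplied by the GossipChannel for ordinary gossip
	makeBroadcastMsg func(src PeerName, payload []byte) []byte, // ...and for broadcasts
) {
	for range s.more {
		for {
			data, src, isBroadcast, ok := s.pick()
			if !ok {
				break // all cells empty; wait for the next wake-up
			}
			for _, payload := range data.Encode() {
				var msg []byte
				if isBroadcast {
					msg = makeBroadcastMsg(src, payload)
				} else {
					msg = makeMsg(payload)
				}
				if err := sendMsg(msg); err != nil {
					return // connection broken; give up
				}
			}
		}
	}
}

// pick removes and returns one non-empty cell, preferring ordinary
// gossip over broadcasts.
func (s *GossipSender) pick() (data GossipData, src PeerName, isBroadcast bool, ok bool) {
	s.Lock()
	defer s.Unlock()
	if s.gossip != nil {
		data, s.gossip = s.gossip, nil
		return data, "", false, true
	}
	for name, cell := range s.broadcasts {
		delete(s.broadcasts, name)
		return cell, name, true, true
	}
	return nil, "", false, false
}
```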

There are two downsides to this change:

1. broadcasts get encoded per connection, rather than just once

2. we can potentially end up with O(n_peers^2) cells (up to one
GossipSender per connection, each holding up to one broadcast cell per
source peer), each containing accumulated (via GossipData.Merge())
broadcasts. For this to happen, the peer must

- relay broadcasts originating from most other peers. This doesn't
  happen in (near-)complete connection topologies. Hypercube
  topologies are probably the worst case among uniform topologies, and
  a star topology is the worst case for a single peer (the centre).

- have backlogged connections to most of its neighbours, without
  those connections being completely stalled (since that would cause
  heartbeat timeouts to terminate them).

Furthermore, for this to matter in practice, the accumulated broadcast
GossipData must be sizeable:

- For topology gossip, GossipData is just a set of PeerNames, which
  takes very little space and is bounded, since there is a finite
  number of peers. And, if we could get rid of the workaround for
  #1793 (cbaa92d), then topology broadcasts would only ever contain
  information about the source peer, so the GossipData would contain
  just one PeerName.

- IPAM only employs broadcast during initialisation and shutdown.

- DNS broadcasts contain DNS entries for containers on the source
  peer. Each entry will typically be 100-200 bytes. In the worst case,
  the accumulated broadcast GossipData from a peer will contain all of
  that peer's DNS entries, including tombstones, i.e. entries for
  containers that have died. If there is churn, i.e. DNS entries being
  added and removed continuously, and the churn rate exceeds the rate
  at which we can forward those entries, then the accumulated
  broadcast GossipData can grow unbounded. Note that this is the case
  on master too; the difference here is that we can have up to n_peers
  copies of that GossipData.
@rade rade force-pushed the gossip-broadcast-stall-3 branch from 4341ec2 to 8d30953 on January 5, 2016 17:13
@rade
Member Author
rade commented Jan 5, 2016

replaced by #1855

@rade rade closed this Jan 5, 2016
@awh awh added this to the n/a milestone Jan 12, 2016