nodeDB.querySeeds was not safe for concurrent use but could be called
concurrenty on multiple goroutines in the following case:
- the table was empty
- a timed refresh started
- a lookup was started and initiated refresh
These conditions are unlikely to coincide during normal use, but are
much more likely to occur all at once when the user's machine just woke
from sleep. The root cause of the issue is that querySeeds reused the
same leveldb iterator until it was exhausted.
This commit moves the refresh scheduling logic into its own goroutine
(so only one refresh is ever active) and changes querySeeds to not use
a persistent iterator. The seed node selection is now more random and
ignores nodes that have not been contacted in the last 5 days.
PR #1621 changed Table locking so the mutex is not held while a
contested node is being pinged. If multiple nodes ping the local node
during this time window, multiple ping packets will be sent to the
contested node. The changes in this commit prevent multiple packets by
tracking whether the node is being replaced.
If the timeout fired (even just nanoseconds) before the deadline of the
next pending reply, the timer was not rescheduled. The timer would've
been rescheduled anyway once the next packet was sent, but there were
cases where no next packet could ever be sent due to the locking issue
fixed in the previous commit.
As timing-related bugs go, this issue had been present for a long time
and I could never reproduce it. The test added in this commit did
reproduce the issue on about one out of 15 runs.
Table.mutex was being held while waiting for a reply packet, which
effectively made many parts of the whole stack block on that packet,
including the net_peerCount RPC call.
Lookup calls would spin out of control when network connectivity was
lost. The throttling that was in place only took effect when the table
returned zero results, which doesn't happen very often.
The new throttling should not have a negative impact when the host is
online. Lookups against the network take some time and dials for all
results must complete or hit the cache before a new one is started. This
usually takes longer than four seconds, leaving online lookups
unaffected.
Fixes#1296
As of this commit, we no longer rely on the protocol handler to report
write errors in a timely fashion. When a write fails, shutdown is
initiated immediately and no new writes can start. This will also
prevent new writes from starting after Server.Stop has been called.
The most visible change is event-based dialing, which should be an
improvement over the timer-based system that we have at the moment.
The dialer gets a chance to compute new tasks whenever peers change or
dials complete. This is better than checking peers on a timer because
dials happen faster. The dialer can now make more precise decisions
about whom to dial based on the peer set and we can test those
decisions without actually opening any sockets.
Peer management is easier to test because the tests can inject
connections at checkpoints (after enc handshake, after protocol
handshake).
Most of the handshake stuff is now part of the RLPx code. It could be
exported or move to its own package because it is no longer entangled
with Server logic.
The previous limit was 10MB which is unacceptable for all kinds
of reasons, the most important one being that we don't want to
allow the remote side to make us allocate 10MB at handshake time.