The hashed wheel timer is a very interesting data structure, widely used in network servers in particular. Its low memory overhead and reasonable efficiency guarantees are a good match for servers handling millions of connections with a timer per connection. We will not spend too much time describing how hashed wheel timers work; instead we will look at a few implementations and try to evaluate their relative trade-offs. More information about hashed wheel timers can be found here.
Let’s quickly recap the basic data structure of a hashed wheel timer:
The basic data structure looks like a hash map with separate chaining. Each bucket represents a fixed number of ticks. When a timer is added to the wheel we calculate which bucket it should fall into. Though any hashing scheme would do, we usually just travel forward from the current time pointer one bucket at a time, until we have consumed enough ticks for our timeout to trigger. We then insert the timer into the list for that bucket. Given that we have a finite number of buckets, the wheel might have to circle around a few times before our timer expires, so we need to store that count with the timer. This is typically kept in a remainingRounds variable.
The first draft of an interface for our fictional timer is:
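Something like the following C sketch; the exact names here (TimerWheel, expiry_cb and friends) are illustrative, not gospel:

```c
#include <stdint.h>

typedef struct TimerWheel TimerWheel; /* opaque wheel handle */
typedef struct Timer Timer;           /* opaque timer handle */

/* Invoked once per expired timer with the user-supplied id. */
typedef void (*expiry_cb)(uint64_t id);

/* Schedule a timer to fire 'timeout_ticks' from now; returns a
   handle that can later be used to cancel it. */
Timer *add_timer(TimerWheel *wheel, uint64_t id, uint64_t timeout_ticks);

/* Cancel a pending timer using the handle returned by add_timer(). */
void cancel_timer(TimerWheel *wheel, Timer *timer);

/* Advance the wheel by one tick, invoking 'cb' for every timer
   that expires on this tick. */
void expire_timers(TimerWheel *wheel, expiry_cb cb);
```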
Usually a timer is implemented with a more generic interface, with some kind of Runnable task per timer that is executed upon expiry. We have made things a lot more specific and require that each timer has a unique id. We also leave the execution of the code corresponding to each expired timer up to the user of the library. We could use some other identifier instead of an id, e.g. a pointer to a per-connection structure if the timers are for closing stale connections.
So the absolute minimum data we need to have in our timers is:
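In C this is just (a sketch; the field names are illustrative):

```c
#include <stdint.h>

struct Timer {
    uint64_t id;               /* user-supplied identifier */
    uint64_t remaining_rounds; /* full wheel revolutions left before expiry */
};
```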
Given this basic scheme, let’s look at some of the ways we could implement the interface we showed earlier.
Each bucket simply points to an intrusive doubly linked list of timers. Our data structures look like:
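Sketched in C (NUM_BUCKETS and the other names are illustrative; I’ve also stashed a bucket pointer in the timer so cancellation can fix up the bucket’s head):

```c
#include <stdint.h>

#define NUM_BUCKETS 512 /* one bucket per tick slot */

/* Intrusive doubly linked list: the links live inside the timer itself. */
struct Timer {
    uint64_t id;
    uint64_t remaining_rounds;
    struct Timer *prev;
    struct Timer *next;
    struct Bucket *bucket; /* so cancel_timer() can fix the bucket head */
};

struct Bucket {
    struct Timer *head; /* first timer hashed onto this tick slot */
};

struct TimerWheel {
    struct Bucket buckets[NUM_BUCKETS];
    uint64_t current_tick; /* advances by one per expire_timers() call */
};
```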
Our algorithms look like:
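In C, roughly (a sketch building on the structs above; error handling omitted):

```c
#include <stdlib.h>

Timer *add_timer(TimerWheel *wheel, uint64_t id, uint64_t timeout_ticks)
{
    Timer *t = malloc(sizeof(*t)); /* one allocation per timer */
    uint64_t expiry = wheel->current_tick + timeout_ticks;
    struct Bucket *b = &wheel->buckets[expiry % NUM_BUCKETS];

    t->id = id;
    /* Full revolutions of the wheel before this timer may fire. */
    t->remaining_rounds = timeout_ticks / NUM_BUCKETS;
    t->bucket = b;

    /* Push onto the front of the bucket's list: O(1). */
    t->prev = NULL;
    t->next = b->head;
    if (b->head)
        b->head->prev = t;
    b->head = t;
    return t;
}

void cancel_timer(TimerWheel *wheel, Timer *t)
{
    (void)wheel;
    /* Unlink in O(1) thanks to the intrusive prev/next pointers. */
    if (t->prev)
        t->prev->next = t->next;
    else
        t->bucket->head = t->next; /* t was the head of its bucket */
    if (t->next)
        t->next->prev = t->prev;
    free(t); /* one free per cancellation */
}

void expire_timers(TimerWheel *wheel, expiry_cb cb)
{
    struct Bucket *b = &wheel->buckets[wheel->current_tick % NUM_BUCKETS];

    for (Timer *t = b->head, *next; t; t = next) {
        next = t->next;
        if (t->remaining_rounds == 0) {
            cb(t->id);              /* expired on this revolution */
            cancel_timer(wheel, t); /* unlink and free */
        } else {
            t->remaining_rounds--;  /* fires on a later revolution */
        }
    }
    wheel->current_tick++;
}
```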
So our algorithm characteristics in brief are:

- add_timer(): O(1)
- cancel_timer(): O(1)
- expire_timers(): O(expired timers). The actual time depends on how many timers hashed onto a bucket; we can put some bounds on this number with general hash table math.

This is our baseline. We have significant memory overhead in the form of the previous/next pointers on our timers. It’s pretty terrible that we have to malloc() and free() on each add_timer() and cancel_timer() call. The O(1) figures also hide the fact that we traverse a linked list during timer expiry, which means a lot of pointer chasing.
Instead of an intrusive linked list we can just store a vector of pointers to timers. We can grow this vector as needed. If memory is not a concern we could allocate a large enough array so that growing is never needed (though still accounted for). Our timer struct now looks like:
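A sketch of the reworked structs; to keep cancel_timer() O(1) I also stash the timer’s bucket and slot index in the timer itself (the original may have done this differently):

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 512

struct Timer {
    uint64_t id;
    uint64_t remaining_rounds;
    uint32_t bucket; /* which bucket holds our pointer */
    uint32_t index;  /* slot in that bucket's vector, kept current */
};

struct Bucket {
    struct Timer **timers; /* growable vector of pointers to timers */
    size_t count;
    size_t capacity;
};

struct TimerWheel {
    struct Bucket buckets[NUM_BUCKETS];
    uint64_t current_tick;
};
```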
Our algorithms look like:
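Roughly (a sketch; the swap-remove on cancellation keeps each vector dense, and error handling is omitted):

```c
#include <stdlib.h>

Timer *add_timer(TimerWheel *wheel, uint64_t id, uint64_t timeout_ticks)
{
    uint64_t expiry = wheel->current_tick + timeout_ticks;
    uint32_t bi = (uint32_t)(expiry % NUM_BUCKETS);
    struct Bucket *b = &wheel->buckets[bi];

    if (b->count == b->capacity) { /* grow the vector as needed */
        b->capacity = b->capacity ? b->capacity * 2 : 16;
        b->timers = realloc(b->timers, b->capacity * sizeof(*b->timers));
    }

    Timer *t = malloc(sizeof(*t)); /* still one allocation per timer */
    t->id = id;
    t->remaining_rounds = timeout_ticks / NUM_BUCKETS;
    t->bucket = bi;
    t->index = (uint32_t)b->count;
    b->timers[b->count++] = t;
    return t;
}

void cancel_timer(TimerWheel *wheel, Timer *t)
{
    struct Bucket *b = &wheel->buckets[t->bucket];

    /* Swap-remove: move the last pointer into the hole, O(1). */
    Timer *last = b->timers[--b->count];
    b->timers[t->index] = last;
    last->index = t->index;
    free(t);
}

void expire_timers(TimerWheel *wheel, expiry_cb cb)
{
    struct Bucket *b = &wheel->buckets[wheel->current_tick % NUM_BUCKETS];

    for (size_t i = 0; i < b->count; ) {
        Timer *t = b->timers[i];
        if (t->remaining_rounds == 0) {
            cb(t->id);
            cancel_timer(wheel, t); /* swap-remove: re-examine slot i */
        } else {
            t->remaining_rounds--;
            i++;
        }
    }
    wheel->current_tick++;
}
```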
So our algorithm characteristics in brief are:

- add_timer(): O(1)
- cancel_timer(): O(1)
- expire_timers(): O(expired timers). Again, the actual time depends on how many timers hashed onto a bucket.

This is a slight improvement on our baseline. By using a vector of pointers to timers we have removed the need for the previous/next pointers – a slight improvement in space usage. We still have the terrible pattern of calling malloc() and free() on each add_timer() and cancel_timer() call. During expiry we now traverse a vector of pointers to timers instead of a linked list. This is only marginally better than the first implementation, since the timers themselves are still scattered all over the heap.
Let’s see if we can improve efficiency by making our interface even more specialized. So far we have returned a Timer* from an add_timer() call. Our intended use for the Timer* is to cancel the timer using this pointer, so we will end up stashing the pointer somewhere – maybe in a per-connection structure or a hash map. What if we stored a pointer to this holder structure (the one that points to our timer) in the timer itself? We could then safely move or invalidate a timer, updating the only place where it was actually referenced. Here we assume that the pointer to the timer is stashed in exactly one place. This is somewhat like our own little garbage collector, updating references to timer objects that have been moved around. Let’s look at our new data structures:
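A sketch of what these structures might look like (I’ve added a bucket field so cancel_timer() can find the timer’s bucket; names remain illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 512

struct Timer {
    uint64_t id;
    uint64_t remaining_rounds;
    uint32_t bucket; /* which bucket's array this timer lives in */
    /* The one external location (e.g. a field in a connection
       struct) that points at this timer. Whenever the timer moves
       inside its bucket we write its new address through this
       pointer -- our own little garbage collector. */
    struct Timer **holder;
};

struct Bucket {
    struct Timer *timers; /* timers stored by value, contiguously */
    size_t count;
    size_t capacity;
};

struct TimerWheel {
    struct Bucket buckets[NUM_BUCKETS];
    uint64_t current_tick;
};
```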
Our modified interface now is:
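Something along these lines – the caller now passes in the one location where it will keep the Timer*:

```c
/* Schedule a timer and store its address in *holder. The wheel
   keeps *holder up to date as the timer moves, and sets it to
   NULL once the timer expires or is cancelled. */
void add_timer(TimerWheel *wheel, uint64_t id, uint64_t timeout_ticks,
               Timer **holder);

void cancel_timer(TimerWheel *wheel, Timer *timer);

void expire_timers(TimerWheel *wheel, expiry_cb cb);
```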
Our algorithms look like:
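Roughly, in the same sketch style (error handling omitted):

```c
#include <stdlib.h>

void add_timer(TimerWheel *wheel, uint64_t id, uint64_t timeout_ticks,
               Timer **holder)
{
    uint64_t expiry = wheel->current_tick + timeout_ticks;
    uint32_t bi = (uint32_t)(expiry % NUM_BUCKETS);
    struct Bucket *b = &wheel->buckets[bi];

    if (b->count == b->capacity) {
        /* Growing may relocate every timer in the bucket, so fix up
           each holder to point at the new addresses. */
        b->capacity = b->capacity ? b->capacity * 2 : 16;
        b->timers = realloc(b->timers, b->capacity * sizeof(*b->timers));
        for (size_t i = 0; i < b->count; i++)
            *b->timers[i].holder = &b->timers[i];
    }

    Timer *t = &b->timers[b->count++]; /* no malloc: lives in the bucket */
    t->id = id;
    t->remaining_rounds = timeout_ticks / NUM_BUCKETS;
    t->bucket = bi;
    t->holder = holder;
    *holder = t;
}

/* Swap-remove slot i from a bucket, updating the holder of the timer
   that gets moved into the hole. */
static void remove_at(struct Bucket *b, size_t i)
{
    *b->timers[i].holder = NULL; /* invalidate the departing timer */
    b->count--;
    if (i < b->count) {
        b->timers[i] = b->timers[b->count];   /* small copy, no free() */
        *b->timers[i].holder = &b->timers[i]; /* re-point its holder */
    }
}

void cancel_timer(TimerWheel *wheel, Timer *t)
{
    struct Bucket *b = &wheel->buckets[t->bucket];
    remove_at(b, (size_t)(t - b->timers));
}

void expire_timers(TimerWheel *wheel, expiry_cb cb)
{
    struct Bucket *b = &wheel->buckets[wheel->current_tick % NUM_BUCKETS];

    for (size_t i = 0; i < b->count; ) {
        Timer *t = &b->timers[i]; /* contiguous scan, no pointer chasing */
        if (t->remaining_rounds == 0) {
            cb(t->id);
            remove_at(b, i); /* another timer lands in slot i; re-check it */
        } else {
            t->remaining_rounds--;
            i++;
        }
    }
    wheel->current_tick++;
}
```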
So our algorithm characteristics in brief are:

- add_timer(): O(1)
- cancel_timer(): O(1), plus a small copy per cancel_timer() call
- expire_timers(): O(expired timers). Again, the actual time depends on how many timers hashed onto a bucket.

We no longer call malloc() and free() on every timer operation. Instead we manage large buckets and copy small amounts of data around to handle deletions. We ended up storing a pointer to the holder of the timer object, but overall we are not storing any more data than the other implementations. When we got rid of the malloc() and free() calls we also got rid of any fragmentation arising from them. Furthermore, during expiry processing we iterate through a chunk of contiguous memory instead of pointer chasing.
By analyzing the context of our timer algorithm we were able to come up with a better implementation, albeit a more tightly coupled one. Holistic analysis of a particular piece of software often opens up such opportunities for efficiency gains. Though there is often an efficiency vs. other-intangibles trade-off, I’ve been surprised by the number of times such gains have been made without tighter coupling.
Protocol design can have a surprisingly large effect on performance. Text-based protocols, while good for human readability, are far from ideal when it comes to performance. Unless your application deals almost exclusively with strings, a binary protocol could offer significant savings. There is a lot of dogma surrounding binary protocols, most of it about how hard they are to debug; in practice, a toString() method on the objects that you are transferring is already a big step towards being able to debug your protocol.

So far we have been talking about binary protocols in general. A length-prefixed protocol requires that every request be prefixed by its length in bytes, so the first 4/8 bytes of a request tell us the length of the actual payload. The payload itself could be text based instead of binary, but that’s not the norm.
Length-prefixed protocols allow us to cheaply check whether an entire request has been received: all we need to check is whether we have received all the bytes we were promised. The actual parsing and processing of the request can be done later, maybe on another thread.
The rest of the article shows some techniques for reading a length-prefixed request.
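The most direct technique is two reads per request: read the length prefix, then read exactly that many payload bytes. A minimal sketch, assuming a blocking socket and a 4-byte big-endian prefix (read_fully() is a hypothetical helper that retries short reads):

```c
#include <arpa/inet.h> /* ntohl */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Keep calling read() until 'len' bytes have arrived (or fail). */
static int read_fully(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1; /* error or peer closed the connection */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read one length-prefixed request: prefix first, then payload. */
static char *read_request(int fd, uint32_t *out_len)
{
    uint32_t net_len;
    if (read_fully(fd, &net_len, sizeof(net_len)) < 0)
        return NULL;

    uint32_t len = ntohl(net_len); /* 4-byte big-endian length prefix */
    char *payload = malloc(len);   /* exactly-sized buffer, no guessing */
    if (payload == NULL || read_fully(fd, payload, len) < 0) {
        free(payload);
        return NULL;
    }
    *out_len = len;
    return payload;
}
```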
The above approach lets us pick an appropriately sized buffer for each request instead of guessing buffer sizes and copying later. It suffers from a problem though: even for the smallest of requests we need at least two read calls. System calls are expensive and should be avoided where possible. Can we do better? We could trade the extra system call for more complexity, a worse pathological case, and larger internal fragmentation.
So if we pick a default buffer size D such that most of our requests are of size D or less, we minimize the number of read calls. When a request is larger than D we incur a copy, though the copy isn’t that bad since we move at most D bytes. Picking the default buffer size depends on the application, and each option has its shortcomings: a small D forces the copy path on more requests, while a large D wastes memory and increases internal fragmentation.
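A sketch of the default-buffer technique, reusing the read_fully() helper from the previous sketch (D and the buffer handling are illustrative; a real server would also handle short first reads and bytes belonging to the next request):

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define D 4096 /* default buffer size: most requests should fit */

/* Read a request into the connection's buffer 'conn_buf' (size D)
   when it fits; spill into an exactly-sized allocation otherwise. */
static char *read_request(int fd, char *conn_buf, uint32_t *out_len)
{
    /* One read call usually grabs the prefix and the whole payload. */
    ssize_t n = read(fd, conn_buf, D);
    if (n < (ssize_t)sizeof(uint32_t))
        return NULL; /* sketch: a real server would keep reading */

    uint32_t net_len;
    memcpy(&net_len, conn_buf, sizeof(net_len));
    uint32_t len = ntohl(net_len);
    *out_len = len;

    size_t have = (size_t)n - sizeof(uint32_t); /* payload bytes so far */
    if (len <= have)
        return conn_buf + sizeof(uint32_t); /* common case: a single read */

    /* Oversized request: copy what we have and read the remainder. */
    char *big = malloc(len);
    memcpy(big, conn_buf + sizeof(uint32_t), have);
    if (read_fully(fd, big + have, len - have) < 0) {
        free(big);
        return NULL;
    }
    return big;
}
```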
Yet another approach is to not copy the contents of the default buffer but to use a list of buffers to hold our request. Since we know the length of the request up front, we are guaranteed that the list will be of size 1 or 2. Here we use scatter reads to fill the default buffer and the overflow buffer with the same system call. We avoid the extra copy, but the request is now spread over two buffers, incurring random-read costs when parsing it.
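A sketch using readv(); by this point we have already parsed the length prefix, so we know how many payload bytes to expect (short-read handling is again elided):

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h> /* readv */
#include <unistd.h>

#define D 4096

/* Scatter-read a request of 'len' payload bytes: the first chunk
   lands in the connection's default buffer, the rest in an
   exactly-sized overflow buffer. Returns the overflow buffer,
   or NULL when everything fit in conn_buf. */
static char *scatter_read_request(int fd, char *conn_buf, uint32_t len)
{
    if (len <= D) {
        /* Buffer list of size 1: the default buffer is enough. */
        read(fd, conn_buf, len); /* sketch: ignores short reads */
        return NULL;
    }

    /* Buffer list of size 2, filled by a single system call. */
    char *overflow = malloc(len - D);
    struct iovec iov[2] = {
        { .iov_base = conn_buf, .iov_len = D       },
        { .iov_base = overflow, .iov_len = len - D },
    };
    readv(fd, iov, 2); /* sketch: a real server loops on short reads */
    return overflow;
}
```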
The right approach depends on what trade-offs make sense for the particular application.
Though other protocols are also vulnerable to slowloris attacks, servers that use length-prefixed protocols are open to a particularly nasty variant. Imagine the following attack:
1. The client sends a length prefix promising a very large payload.
2. The server allocates a buffer big enough for the promised payload and waits.
3. The client then sends the payload one byte at a time, or nothing at all.
Not only did the client tie up a connection, it also tied up memory. A flood of these attacks could cause the server to exhaust all its buffers. The standard techniques to mitigate slowloris attacks still apply: time out connections that take too long to complete a request, and mark a client as potentially evil if it only sends a trickle of bytes one too many times. We could also make our timers stricter for clients we suspect. We can further modify our default buffer size technique to make it more resistant to buffer exhaustion: rather than allocating the full promised buffer up front, allocate just the default buffer and grow the allocation only as bytes actually arrive, so a client ties up memory proportional to what it has sent rather than what it has promised.
The simplest defense though is to use these protocols on trusted networks instead of on the open web.
Length-prefixed protocols make it really easy to check whether a particular request is ready for parsing. This is especially a boon for multi-threaded servers, where one thread can read bytes off the network and pass a complete request to another thread for parsing and processing. The cheap check means the network thread doesn’t have to do much work to decide when to transfer a request to the next stage in the pipeline. The request might still be ill-formed, but the parsing/processing thread can take care of that. Let me know in the comments section if you use other techniques to read length-prefixed protocols or if I have missed something.