I've been doing some reading on the whole NAT traversal thing. I figured I'd share my preliminary findings in the hope of spurring interest in others so they join in this brainstorm, and maybe we can come up with some new ideas on how to handle NAT traversal in Demigod. In this post I will refer to several IETF documents, but note that many of them are still work in progress and not finished standards.
First let's outline the assumptions I made about what is desirable:
- low latency
- acceptable connection initialization time
- no end-user technical knowledge requirement (e.g. no port forwarding)
- support as many as possible of the varying types of NAT on the market
I will try to use terminology from RFC5128 (State of Peer-to-Peer (P2P) Communication across Network Address Translators (NATs)) but sometimes I will slip into some of the older terminology of cone and symmetric NATs or invent my own. I highly recommend reading RFC5128: it specifically mentions peer-to-peer multiplayer online gaming and is only 30 pages long. It is basically a general summary of the problems encountered by peer-to-peer applications in a NAT world and refers to several possible solutions. All those solutions, however, are still in draft state and not finished.
Before we can start with NAT hole punching techniques, we first need a way for the game to detect whether it is behind NAT and what type of NAT this is. I think the NAT facilitator already uses STUN but I am not certain whether this is classic STUN (RFC3489) or newer STUN (RFC5389). I have not yet read either of those RFCs but from what I understand the newer STUN uses the newer terminology and covers more types of NAT that classic STUN failed to account for. STUN itself is not a complete solution but just a tool to detect what type of NAT and firewall are in the way. NATs range from easy to traverse (endpoint-independent mapping) to tricky to traverse (symmetric NAT with predictable port numbers) to probably impossible (symmetric NAT with random or illogical port numbers). When I say easy I mean this from a networking standpoint; coding it will probably still be tricky. If we want to support predicting port numbers on symmetric NATs, the initial STUN testing phase might get pretty long, so I recommend that once the type of NAT is determined, this answer is cached either locally or on the facilitator for some period between an hour and a day. The benefit of caching on the facilitator is that other people from the same ISP don't need to go through the STUN testing phase again. The drawback is that if we got the type of NAT wrong because it is more complex or weird than we assumed, that whole ISP might be disconnected from Demigod for a while.
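To make the detection and caching idea a bit more concrete, here is a rough Python sketch. It is not how the facilitator actually works and it does not speak STUN: it assumes we have already sent probes from one local port to at least three different servers and collected the external port each server saw. The names NatType, classify_nat and NatCache are made up for this post, and the "constant step means predictable" rule is my own simplification.

```python
import time
from enum import Enum


class NatType(Enum):
    NONE = "directly connectable"
    EASY = "endpoint-independent mapping (cone)"
    PREDICTABLE = "symmetric, predictable ports"
    IMPOSSIBLE = "symmetric, unpredictable ports"


def classify_nat(mapped_ports, local_is_public=False):
    """Classify a NAT from the external ports seen by several probe servers.

    mapped_ports lists, in probe order, the external port each server
    reported for packets sent from the same local port (hypothetical input).
    """
    if local_is_public:
        return NatType.NONE
    if len(mapped_ports) < 3:
        raise ValueError("need at least three probes to see a pattern")
    if len(set(mapped_ports)) == 1:
        return NatType.EASY  # same mapping regardless of destination
    steps = {b - a for a, b in zip(mapped_ports, mapped_ports[1:])}
    if len(steps) == 1:
        return NatType.PREDICTABLE  # constant step between new mappings
    return NatType.IMPOSSIBLE


class NatCache:
    """Remember a classification for a limited time (default one hour)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (nat_type, expiry time)

    def put(self, key, nat_type):
        self._entries[key] = (nat_type, self._clock() + self._ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        nat_type, expires = entry
        if self._clock() >= expires:
            del self._entries[key]  # stale: force a fresh test
            return None
        return nat_type
```

The cache key is left open on purpose: a local cache could key on the client itself, while a facilitator-side cache would key on something like the ISP's address range, with the risk described above if the classification turns out to be wrong.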
After we have determined the type of NAT that is in place, we know what kinds of tricks we'll need to pull out of our hat to get people connected to each other. I see multiple possible situations:
- everybody is directly connectable: no tricks are needed, hooray!
- only one person is behind NAT: the facilitator arranges for this person to make only outgoing connections, so no other player has to connect to the NAT-person directly
- multiple people are behind NAT, but of the easier type (cone): the facilitator arranges for the NAT-people to punch holes and then they can connect to each other via those holes; if multiple ports are needed (e.g. 6112 to 6118) multiple holes must be punched; beware that not all NATs use the same mapped port as the internal port, so 6112 might become 34000, but as long as 6112 stays mapped to 34000 while that state lasts, any packets sent to the NAT on port 34000 will be passed to Demigod on the internal network
- some people are behind easy-ish NAT, one person is behind nasty NAT (symmetric): the facilitator arranges for the easy-NAT-people to punch holes and for the nasty-NAT-person to make only outgoing connections
- multiple people are behind nasty NAT: the facilitator needs to predict the next port each NAT will use and arrange for the nasty-NAT-people to try to punch those ports and connect to each other (e.g. the first connection from port 6112 to another player might get mapped on 34000 while the second connection from 6112 gets mapped on 34001, so the third connection probably gets mapped on 34002); with other heavy traffic going on, predicting ports can fail easily (someone else steals the predicted port) and multiple tries are needed; after X tries (i.e. exceeding the accepted waiting time for the connection initialisation phase) the facilitator gives up and hands over to the proxy/packet_relay
- a mix of directly connectable, easy NAT and nasty NAT with one impossible NAT (symmetric with unpredictable port numbers): we use the punching techniques discussed above for all the possible NAT-people and arrange for the impossible-NAT-person to make only outgoing connections
- multiple people are behind impossible NAT: we give up, hand over to proxy
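One way to read the list above is pair by pair: for every two players, the facilitator looks at both NAT types and picks a trick. The Python sketch below does that, and also shows the naive "next port" guess from the 34000, 34001, 34002 example. Everything in it (the Nat enum, the strategy strings, pair_strategy, predict_next_port) is invented for illustration, and reading the situations pairwise is my own assumption rather than something the list spells out.

```python
from enum import IntEnum


class Nat(IntEnum):
    NONE = 0         # directly connectable
    EASY = 1         # endpoint-independent mapping (cone)
    PREDICTABLE = 2  # symmetric, predictable ports
    IMPOSSIBLE = 3   # symmetric, unpredictable ports


DIRECT = "direct connection"
OUTGOING = "NAT peer connects out to the public peer"
PUNCH = "both peers punch holes"
PUNCH_AND_OUT = "easier peer punches a hole, harder peer connects out"
PREDICT = "predict ports and punch, retry, then fall back to relay"
RELAY = "hand over to proxy"


def pair_strategy(a, b):
    """Pick a connection trick for two players based on their NAT types."""
    easier, harder = sorted((a, b))
    if easier == Nat.NONE:
        return DIRECT if harder == Nat.NONE else OUTGOING
    if harder == Nat.EASY:
        return PUNCH
    if easier == Nat.EASY:
        return PUNCH_AND_OUT          # easy + nasty or easy + impossible
    if harder == Nat.PREDICTABLE:
        return PREDICT                # both nasty but predictable
    if easier == Nat.PREDICTABLE:
        return PUNCH_AND_OUT          # predictable + impossible
    return RELAY                      # both impossible


def predict_next_port(observed):
    """Guess the next mapped port, assuming a constant step between mappings."""
    if len(observed) < 2:
        raise ValueError("need at least two observed mappings")
    step = observed[-1] - observed[-2]
    return observed[-1] + step
```

For a whole game the facilitator would run pair_strategy over every pair of players. The prediction is only a guess: as noted above, other traffic can grab the predicted port, so the caller still needs a retry limit before giving up and using the proxy.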
All the above points assume that once connections are set up and traffic is frequent, the NAT will keep the state and the forwarding will keep working. If the NAT decides halfway through to forget the existing state and consequently map the packets to a new port or even a new address, the current connection needs to be re-established, but that is a whole new type of headache I will get into in a later post (or never). The quick and dirty solution that comes to mind to prevent people behind such a NAT from dropping mid-game is having the facilitator cache these highly evil NATs and just immediately hand those people over to the proxy; the proxy will need to use authentication so it can detect that the new stream of packets it is receiving from the new (address, port) tuple is actually just the continuation of an existing stream.
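As a sketch of what that proxy authentication could look like: each player gets a secret when the session is opened, tags every packet with an HMAC, and the proxy trusts the tag rather than the source address. This is just one possible approach using Python's standard hmac module, not something taken from TURN or from the existing proxy. RelaySessions, sign and the session ids are hypothetical names, and a real version would also need sequence numbers or similar to stop replayed packets.

```python
import hashlib
import hmac
import secrets


def sign(key, payload):
    """Tag a packet payload with HMAC-SHA256 under the session key."""
    return hmac.new(key, payload, hashlib.sha256).digest()


class RelaySessions:
    """Track relay sessions by secret key instead of by source address."""

    def __init__(self):
        self._keys = {}   # session id -> secret key
        self._addrs = {}  # session id -> last accepted (ip, port)

    def open(self, session_id):
        """Create a session and return the key to hand to the player."""
        key = secrets.token_bytes(32)
        self._keys[session_id] = key
        return key

    def handle(self, session_id, payload, tag, addr):
        """Accept a packet if its tag is valid, whatever address it came from."""
        key = self._keys.get(session_id)
        if key is None:
            return False
        if not hmac.compare_digest(sign(key, payload), tag):
            return False
        self._addrs[session_id] = addr  # follow the NAT's new mapping
        return True

    def address_of(self, session_id):
        return self._addrs.get(session_id)
```

If the NAT remaps the player mid-game, the next valid packet simply updates the stored address, so the stream continues without the other players noticing.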
Some links that might be of interest:
http://tools.ietf.org/html/rfc5128 (summary of terminology, issues, links to solutions, etc.)
http://tools.ietf.org/html/rfc5389 (new STUN)
http://tools.ietf.org/html/draft-ietf-behave-turn-14 (TURN, basically a proxy to get around the NAT)
http://tools.ietf.org/html/draft-takeda-symmetric-nat-traversal-00 (methods to predict port used by symmetric NAT and traverse them)
http://tools.ietf.org/html/draft-jennings-behave-test-results-04 (list of tested routers and their type of NAT)
https://www.guha.cc/saikat/stunt-results.php (same as above but tested with TCP)