[Brainstorm] NAT Traversal

By Aprogas on May 13, 2009 10:04:26 AM from

I've been doing some reading into the whole NAT traversal thing. I figured I'd share my preliminary findings in the hope of spurring interest in others so they join in this brainstorm, and maybe we can come up with some new ideas on how to handle NAT traversal in Demigod. In this post I will refer to several IETF documents, but note that many of them are still work in progress and not finished standards.

First let's outline the assumptions I made about what is desirable:

low latency
acceptable connection initialization time
no end-user technical knowledge requirement (e.g. no port forwarding)
support as many as possible of the varying types of NAT on the market

I will try to use terminology from RFC5128 (State of Peer-to-Peer (P2P) Communication across Network Address Translators (NATs)), but sometimes I will slip in some of the older terminology of cone and symmetric NATs or invent my own. I highly recommend reading RFC5128: it specifically mentions peer-to-peer multiplayer online gaming and is only 30 pages long. It is basically a general summary of the problems encountered by peer-to-peer applications in a NAT world and refers to several possible solutions. All those solutions, however, are still in draft state and not finished.

Before we can start with NAT hole punching techniques, we first need a way for the game to detect whether it is behind NAT and what type of NAT this is. I think the NAT facilitator already uses STUN, but I am not certain whether this is classic STUN (RFC3489) or newer STUN (RFC5389). I have not yet read either of those RFCs, but from what I understand the newer STUN uses the newer terminology and covers more types of NAT that classic STUN failed to account for. STUN itself is not a complete solution but just a tool to detect what type of NAT and firewall are in the way. NATs range from easy to traverse (endpoint-independent mapping) to tricky to traverse (symmetric NAT with predictable port numbers) to probably impossible (symmetric NAT with random or illogical port numbers). When I say easy I mean this from a networking standpoint; coding it will probably still be tricky.

If we want to support predicting port numbers on symmetric NATs, the initial STUN testing phase might get pretty long, so I recommend that once the type of NAT is determined, the answer is cached either locally or on the facilitator for some period between an hour and a day. The benefit of caching on the facilitator is that other people from the same ISP don't need to go through the STUN testing phase again. The drawback is that if we got the type of NAT wrong because it is more complex or weird than assumed, that whole ISP might be disconnected from Demigod for a while.
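The caching idea could look roughly like the sketch below. This is only an illustration under assumptions: the class name `NatTypeCache`, the string labels for NAT types and the choice of key (for example an external address or an ISP prefix) are all hypothetical, and nothing here reflects how the facilitator actually stores anything.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class NatTypeCache:
    """Remembers a detected NAT type for a limited time (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # The post suggests a lifetime somewhere between an hour and a day.
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, key: str, nat_type: str) -> None:
        """Store the result of a detection run together with its expiry time."""
        self._entries[key] = (nat_type, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[str]:
        """Return the cached type, or None if it is unknown or has expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        nat_type, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: forget it so the caller re-runs the STUN tests.
            del self._entries[key]
            return None
        return nat_type
```

Keeping the lifetime short limits the damage of a wrong classification: a bad entry expires on its own instead of locking a whole ISP out until someone notices.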

After we have determined the type of NAT that is in place, we know what kinds of tricks we'll need to pull out of our hat to get people connected to each other. I see multiple possible situations:

everybody is directly connectable: no tricks are needed, hooray!
only one person is behind NAT: the facilitator arranges for this person to make only outgoing connections, so no other player has to connect to the NAT-person directly
multiple people are behind NAT, but it is the easier type of NAT (cone): the facilitator arranges for the NAT-people to punch holes and then they can connect to each other via those holes; if multiple ports are needed (e.g. 6112 to 6118) multiple holes must be punched; beware that not every NAT uses the internal port as the mapped port, so 6112 might become 34000, but as long as 6112 stays mapped to 34000 while that state lasts, any packets sent to the NAT on port 34000 will be delivered to Demigod in the internal network
some people are behind easy-ish NAT, one person is behind nasty NAT (symmetric): the facilitator arranges for the easy-NAT-people to punch holes and for the nasty-NAT-person to make only outgoing connections
multiple people are behind nasty NAT: the facilitator needs to predict the next port each NAT will use and arrange for the nasty-NAT-people to try to punch those ports and connect to each other (e.g. the first connection from port 6112 to another player might get mapped to 34000 while the second connection from 6112 gets mapped to 34001, so the third connection probably gets mapped to 34002); with other heavy traffic going on, predicting ports can easily fail (someone else steals the predicted port) and multiple tries are needed; after X tries (i.e. once the accepted waiting time for the connection initialisation phase is exceeded) the facilitator gives up and hands over to the proxy/packet_relay
a mix of directly connectable, easy NAT and nasty NAT with one impossible NAT (symmetric with unpredictable port numbers): we use the punching techniques discussed above for all the possible NAT-people and arrange for the impossible-NAT-person to make only outgoing connections
multiple people are behind impossible NAT: we give up, hand over to proxy
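To make the list above a bit more concrete, here is a rough sketch that picks a strategy per pair of players and predicts the next port of a "nasty" NAT. It is a simplification under assumptions: the situations above are described per group, while this decides per pair, and the names `Nat`, `pair_strategy`, `predict_next_port` and the strategy labels are made up for illustration. The prediction only handles a constant step between observed ports, as in the 34000, 34001, 34002 example.

```python
from enum import IntEnum
from typing import Optional, Sequence


class Nat(IntEnum):
    OPEN = 0        # directly connectable
    EASY = 1        # cone NAT
    NASTY = 2       # symmetric NAT with predictable ports
    IMPOSSIBLE = 3  # symmetric NAT with unpredictable ports


def pair_strategy(a: Nat, b: Nat) -> str:
    """Pick a connection strategy for two players, following the list above."""
    lo, hi = sorted((a, b))
    if lo == Nat.OPEN:
        # At least one side is reachable, so the other side just connects out.
        return "direct" if hi == Nat.OPEN else "nat_side_connects_out"
    if hi == Nat.EASY:
        return "hole_punch"
    if hi == Nat.NASTY:
        return "punch_and_connect_out" if lo == Nat.EASY else "port_prediction"
    # From here on the harder side is IMPOSSIBLE.
    if lo == Nat.IMPOSSIBLE:
        return "relay"
    return "punch_and_connect_out" if lo == Nat.EASY else "predict_and_connect_out"


def predict_next_port(observed: Sequence[int]) -> Optional[int]:
    """Guess the next mapped port from earlier mappings, or None if unclear."""
    if len(observed) < 2:
        return None
    steps = {later - earlier for earlier, later in zip(observed, observed[1:])}
    if len(steps) != 1:
        return None  # irregular allocation: treat the NAT as unpredictable
    candidate = observed[-1] + steps.pop()
    return candidate if 0 < candidate <= 65535 else None
```

For example, `predict_next_port([34000, 34001])` gives 34002. A real implementation would still have to cope with another connection stealing that port between the measurement and the punch, which is why the retry limit in the list matters.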

All of the above points assume that once connections are set up and traffic is frequent, the NAT will keep the state and the forwarding will keep working. If the NAT decides halfway to forget the existing state and consequently map the packets to a new port or even a new address, the current connection needs to be re-established, but that is a whole new type of headache I will get into in a later post (or never). The quick-and-dirty solution that comes to mind to prevent people behind such NAT from dropping mid-game is having the facilitator cache these highly evil NATs and just immediately hand those people over to the proxy; the proxy will need to use authentication so it can detect that the new stream of packets it is receiving from the new address:port tuple is actually just the continuation of an existing stream.
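One way the proxy could recognise a continued stream is to have every packet carry a session id and a keyed checksum, so that a valid packet from a new address:port simply updates where the session lives. The sketch below is an assumption-heavy illustration, not a description of any existing Demigod proxy: the packet layout, the `seal` helper, the `Relay` class and the per-session shared key (presumably handed out by the facilitator) are all hypothetical.

```python
import hashlib
import hmac
import struct
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]
TAG_LEN = 16                   # truncated HMAC-SHA256 tag (arbitrary choice here)
HEADER = struct.Struct("!IQ")  # session id (32 bit), sequence number (64 bit)


def seal(key: bytes, session_id: int, seq: int, payload: bytes) -> bytes:
    """Client side: prefix the payload with header and authentication tag."""
    header = HEADER.pack(session_id, seq)
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()[:TAG_LEN]
    return header + tag + payload


class Relay:
    """Proxy side: tracks the current address:port of each known session."""

    def __init__(self) -> None:
        self.keys: Dict[int, bytes] = {}
        self.last_seq: Dict[int, int] = {}
        self.address: Dict[int, Address] = {}

    def register(self, session_id: int, key: bytes) -> None:
        self.keys[session_id] = key
        self.last_seq[session_id] = -1

    def receive(self, packet: bytes, source: Address) -> Optional[bytes]:
        """Return the payload if the packet is authentic, else None."""
        if len(packet) < HEADER.size + TAG_LEN:
            return None
        session_id, seq = HEADER.unpack_from(packet)
        key = self.keys.get(session_id)
        if key is None:
            return None
        tag = packet[HEADER.size:HEADER.size + TAG_LEN]
        payload = packet[HEADER.size + TAG_LEN:]
        expected = hmac.new(key, packet[:HEADER.size] + payload,
                            hashlib.sha256).digest()[:TAG_LEN]
        if not hmac.compare_digest(tag, expected):
            return None
        if seq <= self.last_seq[session_id]:
            return None  # crude replay guard; also drops reordered packets
        self.last_seq[session_id] = seq
        # Authentic packet: whatever address it came from is now the session's.
        self.address[session_id] = source
        return payload
```

The sequence number is there so that a captured old packet cannot be replayed from a third address to hijack the session; a real relay would want a small window instead of a strict "newer than last" check, since UDP packets can arrive out of order.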

Some links that might be of interest:

http://tools.ietf.org/html/rfc5128 (summary of terminology, issues, links to solutions, etc.)
http://tools.ietf.org/html/rfc5389 (new STUN)
http://tools.ietf.org/html/draft-ietf-behave-turn-14 (TURN, basically a proxy to get around the NAT)
http://tools.ietf.org/html/draft-takeda-symmetric-nat-traversal-00 (methods to predict port used by symmetric NAT and traverse them)
http://tools.ietf.org/html/draft-jennings-behave-test-results-04 (list of tested routers and their type of NAT)
https://www.guha.cc/saikat/stunt-results.php (same as above but tested with TCP)

+3 Karma | 4 Replies

Interesting read. Luckily I've never had to work on this kind of stuff. I'm just a simple application programmer.

If the NAT decides halfway to forget the existing state and consequently map the packets to a new port or even a new address, the current connection needs to be re-established, but that is a whole new type of headache I will get into in a later post (or never).

I've had a few "lost connection to all" situations lately (after playing for anywhere from several minutes to more than 30 minutes). Could it be that this happened at that moment? Is this traceable, for instance with tcpview? And if so, can someone tell me where and how to look?

If you are behind NAT and your NAT decides to change the external address:port mapping, you cannot see that in tcpview, but you might see it in your Demigod logs (run dbgview to see the real-time log) as a newly reported external address:port tuple. If it happens to the other side while you are playing, it is hard to see that in tcpview because UDP doesn't really form connections, and certainly not for a single packet that probably gets rejected because it comes from a different address:port than your computer was expecting. You would have more luck seeing it in Wireshark, since that logs all packets, including rejected ones. The problem is that while playing there are so many packets going around that it is hard to notice the faulty one. The bottom line is that unless you can monitor traffic on both ends of the connection (and preferably also somewhere in the middle), it is very hard to find out why exactly a connection drops.

If you suspect your NAT is being flaky, you could periodically run a tool like natcheck to see if that reports anything useful, although I haven't yet tried out that tool myself.

I have a question about the original post, regarding what STUN is, exactly. I was looking at the "New STUN" link you posted, and it seems to me like it's a big description of a message protocol, but the only functionality it really seems to provide is allowing a client behind a NAT to determine its external IP address. I don't think that this is enough information to determine what kind of NAT the client is behind.

So I'm not sure how it is possible to make that determination. I mention this because, while reading some of the posts that Frogboy makes, I got the impression that they were just trying to have users hole-punch to connect to one another as pretty much their only method of connection.

Also, when I read this article: http://www.jenkinssoftware.com/raknet/manual/natpunchthrough.html

they seem to be using an algorithm like:

1) Try to hole punch

2) If that fails, try to predict the port

3) If they both fail, try again.

4) After X tries, use the proxy.

So at least in this algorithm, they don't have a method to determine what kind of NAT each client is behind: they're just trying both methods. So is it actually possible to use STUN to determine the type of NAT? How does it work?
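Read as code, the four steps above amount to a small retry loop. The sketch below is just a paraphrase of the steps as listed in this post, not RakNet's actual implementation; the two callables and the function name are hypothetical placeholders for whatever really performs a punch or a port prediction attempt.

```python
from typing import Callable


def connect_with_fallback(try_hole_punch: Callable[[], bool],
                          try_port_prediction: Callable[[], bool],
                          max_rounds: int = 3) -> str:
    """Try both traversal methods for a few rounds, then fall back to the proxy."""
    for _ in range(max_rounds):
        if try_hole_punch():           # step 1
            return "hole_punch"
        if try_port_prediction():      # step 2
            return "port_prediction"
        # step 3: both failed, go around again
    return "proxy"                     # step 4: give up after X tries
```

Note that nothing in this loop needs to know the NAT type up front, which is exactly the point raised above: it trades a possibly longer connection phase for not having to classify the NAT first.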

On some quick skimming it seems you are right, though, that the STUN document only describes how the STUN protocol itself works, not the algorithms to determine NAT type.

I think it would work like this: the local machine (L) sends a packet from Laddr:Lport to a public STUN server (S) at Saddr:Sport (the STUN server also has a secondary IP address, but we don't need to know that). From the perspective of the STUN server, it receives the message from a remote machine (R) at Raddr:Rport (in many cases Raddr:Rport equals Laddr:Lport, i.e. there is no NAT). The STUN server sends multiple replies containing Raddr:Rport; these replies use combinations of varying Saddr or Sport on the STUN end, or varying Raddr or Rport they are being sent to. Based on which of these replies are received by the local machine, it can determine the type of NAT it is behind.
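The part that RFC5389 does spell out, learning Raddr:Rport, is small enough to sketch. The code below builds a Binding request and pulls the XOR-MAPPED-ADDRESS out of a Binding success response, following the message layout in RFC5389 as I understand it. It is a minimal sketch only: IPv4 only, no retransmission, no MESSAGE-INTEGRITY or FINGERPRINT handling, and the function names are my own.

```python
import os
import socket
import struct
from typing import Optional, Tuple

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: Optional[bytes] = None) -> bytes:
    """20-byte STUN header: type, length 0, magic cookie, 96-bit transaction id."""
    tid = transaction_id if transaction_id is not None else os.urandom(12)
    if len(tid) != 12:
        raise ValueError("transaction id must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + tid


def parse_binding_response(data: bytes,
                           transaction_id: bytes) -> Optional[Tuple[str, int]]:
    """Return (Raddr, Rport) as seen by the server, or None if not usable."""
    if len(data) < 20:
        return None
    msg_type, length, cookie = struct.unpack_from("!HHI", data)
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    if data[8:20] != transaction_id:
        return None  # reply to some other request
    offset, end = 20, min(len(data), 20 + length)
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", data, offset)
        value = data[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack_from("!HI", value, 2)
            # Port is XORed with the top 16 bits of the cookie, address with all 32.
            port = xport ^ (MAGIC_COOKIE >> 16)
            addr = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return addr, port
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are 4-byte aligned
    return None
```

In use, the request would be sent over the same UDP socket the game uses (so the mapping being measured is the one that matters), and the result compared with the local address:port to see whether a NAT is in the way at all.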

Some NATs, for example, when they map a packet from 192.168.1.2:6112 to e.g. 192.0.2.34:34000, will reuse that mapping for any other packets sent from 192.168.1.2:6112 to any host. On the most permissive of these NATs this also means that any other host can send packets to 192.168.1.2:6112 by just sending packets to the external_addr:34000, and only one hole punch is needed for that port to be open to the rest of the world. So if the STUN server sends a reply from both its primary address and its secondary address to 192.0.2.34:34000 and both arrive at the local machine, it knows it is that type of NAT (endpoint-independent mapping combined with endpoint-independent filtering, the old "full cone").

Other NATs are much more strict about which return packets they allow back in, so with them more than one hole punch is needed (one for each other client trying to connect).
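For reference, the decision tree that classic STUN (RFC3489) describes for this can be written down as a pure function over the probe outcomes. The sketch below follows that older algorithm with its cone/symmetric labels; the `ProbeResults` container and its field names are invented for illustration, and actually sending the probes is left out entirely. Real-world NATs do not always fit these boxes, so treat the result as a hint rather than a fact.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Address = Tuple[str, int]


@dataclass
class ProbeResults:
    local: Address                            # Laddr:Lport
    mapped_via_primary: Optional[Address]     # Raddr:Rport seen by primary address; None = no reply
    reply_from_other_ip_and_port: bool        # server answered from its secondary address and port
    mapped_via_secondary: Optional[Address]   # Raddr:Rport seen by secondary address; None = no reply
    reply_from_other_port: bool               # server answered from same address, different port


def classify(r: ProbeResults) -> str:
    """Classify the NAT using the RFC3489-style decision tree."""
    if r.mapped_via_primary is None:
        return "udp_blocked"
    if r.mapped_via_primary == r.local:
        # No address translation; a firewall may still filter unsolicited packets.
        return "open" if r.reply_from_other_ip_and_port else "symmetric_firewall"
    if r.reply_from_other_ip_and_port:
        return "full_cone"
    if r.mapped_via_secondary is None:
        return "unknown"
    if r.mapped_via_secondary != r.mapped_via_primary:
        # A different destination produced a different mapping.
        return "symmetric"
    return "restricted_cone" if r.reply_from_other_port else "port_restricted_cone"
```

In the terms used earlier in the thread, the cone results are the "easy" NATs, while "symmetric" covers both the nasty and the impossible kind; telling those two apart needs the extra port-prediction measurements.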

I guess at first the focus should be on getting the proxies running and, when NAT traversal fails, just deferring to the proxy. But if it then appears that many people end up at the proxies and the extra latency is high, it would be nice if the trickier forms of NAT were supported as well.
