Basic Network Troubleshooting

source link: https://www.netmeister.org/blog/basic-network-troubleshooting.html

February 12th, 2022

There are many possible reasons why you are unable to connect to a remote system, and no, it's not always the DNS. Being able to quickly identify the most likely cause of the problem (or: who would be in a position to fix it) is a useful skill, yet I often see even senior engineers waste time chasing false flags and red herrings when certain problems could quickly be ruled out by a better understanding of the specific meaning of the various error messages you might encounter.

The best way to cut down on wasted time is to quickly answer the question: "Is it the DNS, the network, or the app?" To help you answer this question and troubleshoot more efficiently, here's a quick explanation of the most common error scenarios and how to quickly identify at least where the problem is likely to lie, all based on this one weird trick: reading and understanding the error message.


Unable to resolve

Ok, so sometimes it really is the DNS:

macos$ ssh foo.netmeister.org
ssh: Could not resolve hostname foo.netmeister.org: nodename nor servname provided, or not known

or

netbsd$ ssh foo.netmeister.org
ssh: Could not resolve hostname foo.netmeister.org: No address associated with hostname
$ 

Hooold it. No need to whip out ping(8), traceroute(8), and tcpdump(1) just yet -- this is an easy one. It tells you right there what the problem is: could not resolve. That is, you never even got to the point where your system tried to send any packets to the remote side, because it couldn't translate the name foo.netmeister.org into an IP address. (Different tools and even different versions of the same tool on different platforms may give slightly different error messages, as shown above.)

In this case, you want to find out why you can't resolve the name. For starters, make sure you don't have a typo there: it's easy to accidentally slip in an extra letter or fatfinger a domain name. But suppose you are sure you have the right name, what can you check?

Try the authoritative nameserver

First, see whether the name actually exists in the authoritative nameserver in question:

$ dig +short netmeister.org ns
ns-181-a.gandi.net.
ns-143-b.gandi.net.
ns-179-c.gandi.net.
$ dig +noall +answer +comments @ns-179-c.gandi.net. foo.netmeister.org
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25060
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
$ 

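NOERROR indicates that there was, well, no error on the DNS server's side, meaning the query was correctly answered, and the aa flag tells us that the answer is authoritative. But ANSWER: 0 means that the answer is empty: there is no address record for foo.netmeister.org. (Had the name not existed at all, the status would have been NXDOMAIN instead.) Ok, so we've established that foo.netmeister.org simply has no address in the DNS, so your journey ends here. That's right, that's it: fix the DNS entry or talk to whoever owns the domain.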
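If you find yourself squinting at dig(1) headers a lot, the reading of the status, the aa flag, and the ANSWER count can be sketched as a small shell function. This is only a rough sketch and not part of dig(1): the name dig_verdict and its output labels are made up for illustration, and it looks at nothing but the two header lines shown in the example above.

```shell
# Summarize the header of a `dig +comments` response read from stdin.
# Hypothetical helper; prints one of:
#   answer    NOERROR with at least one record in the answer section
#   nodata    NOERROR, authoritative (aa), but an empty answer section
#   empty     NOERROR, empty answer, not authoritative (e.g. a referral)
#   nxdomain  the name does not exist
#   other:X   any other status X (SERVFAIL, REFUSED, ...)
dig_verdict() {
    awk '
        /->>HEADER<<-/ {
            # ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25060"
            for (i = 1; i < NF; i++)
                if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
        }
        /^;; flags:/ {
            # ";; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ..."
            aa = 0
            for (i = 3; i <= NF; i++) {
                flag = $i
                sub(/;$/, "", flag)
                if (flag == "aa") aa = 1
                if ($i ~ /;$/) break        # last flag before the counters
            }
            for (; i < NF; i++)
                if ($i == "ANSWER:") answers = $(i + 1) + 0
        }
        END {
            if (status == "NXDOMAIN")     { print "nxdomain" }
            else if (status != "NOERROR") { print "other:" status }
            else if (answers > 0)         { print "answer" }
            else if (aa)                  { print "nodata" }
            else                          { print "empty" }
        }
    '
}

# Usage:
#   dig +noall +answer +comments @ns-179-c.gandi.net. foo.netmeister.org | dig_verdict
```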

But what if you can't reach the authoritative nameserver?

$ dig +noall +answer +comments @ns-179-c.gandi.net. foo.netmeister.org
[ time elapses ]
;; connection timed out; no servers could be reached
$ 

Does this mean that the problem is with the authoritative nameserver? Not necessarily! It's possible that it's only your system that can't talk to the NS, but that your configured resolver (e.g., from /etc/resolv.conf) can. This may be the case if, for example, your network administrator blocks outgoing DNS traffic to hosts other than the configured resolvers.

Note: it may also be the case that you can talk to the second-level domain authoritative nameserver (netmeister.org in this case), but the name you're trying to resolve is under a subdomain with a different authoritative nameserver:

$ dig +noall +answer +comments @ns-179-c.gandi.net. bar.dns.netmeister.org
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39531
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
$ dig +short dns.netmeister.org ns
panix.netmeister.org
$ dig +noall +answer @panix.netmeister.org bar.dns.netmeister.org
[ time elapses ]
;; connection timed out; no servers could be reached
$ 

What can you do to verify whether the name exists without asking your resolver (which we know failed to resolve the name, since that's how we started our journey here) and without making outbound port 53 UDP calls? Here you have a few options:

Use TCP

DNS is not just port 53 UDP, and you may be able to simply ask the authoritative nameserver via TCP:

$ dig +tcp +noall +answer +comments @ns-179-c.gandi.net. foo.netmeister.org
;; communications error to 2604:3400:aaac::b4#53: host unreachable

Alas, no luck. Our query is still blocked.

Try DNS-over-HTTPS

You can try to use DNS over HTTPS to one of the public DoH resolvers, such as Google's or Cloudflare's DNS servers. dig(1) has supported DoH since March 2021 via the +https flag, but if your dig(1) is older than that, you can also use the JSON API directly:

$ curl -s 'https://dns.google/resolve?name=foo.netmeister.org&type=a' | jq '.Answer'
null
$ curl -s -H 'accept: application/dns-json' 'https://cloudflare-dns.com/dns-query?name=foo.netmeister.org' | jq '.Answer'
null

This doesn't let you ask the authoritative nameserver, but it lets you confirm whether at least some other resolver can resolve the hostname you are experiencing problems with.
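The JSON replies from both services also carry a numeric Status field holding the DNS response code, which tells you more than a null Answer does: it lets you tell "no such name" apart from "no such record". A minimal sketch of translating that number into the names dig(1) prints; rcode_name is a hypothetical helper, and only the base response codes from RFC 1035 are covered:

```shell
# Translate a numeric DNS response code into its conventional name.
# Hypothetical helper; anything outside the RFC 1035 base set is
# printed as RCODE<n>.
rcode_name() {
    case "$1" in
        0) echo NOERROR ;;
        1) echo FORMERR ;;
        2) echo SERVFAIL ;;
        3) echo NXDOMAIN ;;
        4) echo NOTIMP ;;
        5) echo REFUSED ;;
        *) echo "RCODE$1" ;;
    esac
}

# Usage (needs curl and jq, as in the examples above):
#   rcode_name "$(curl -s 'https://dns.google/resolve?name=foo.netmeister.org&type=a' | jq '.Status')"
```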

Try DNS-over-TLS

If you know a public resolver that supports DNS over TLS, then you can try that. Command-line tools such as kdig(1) or dog(1) support it, and dig(1) has supported DoT via the +tls flag since version 9.18.0, but if you want to go really bare-bones, you can even use stunnel(1):

$ cat >/tmp/dot <<EOF
[dns]
client = yes
accept = 127.0.0.1:5353
connect = 8.8.8.8:853
CAfile = /etc/ssl/cert.pem
verifyChain = yes
checkIP = 8.8.8.8
EOF
$ stunnel /tmp/dot
$ dig -p 5353 +tcp @127.0.0.1 foo.netmeister.org

Some authoritative nameservers may even support DoT directly, in which case you don't need a public resolver like Google's DNS server and can get an authoritative answer. But even when asking 8.8.8.8, you can once again confirm whether what your resolver does is in line with (or different from) what others do.

Compare various public resolvers

Taking this concept a step further, you may want to compare the results across multiple different public resolvers. Good thing I have just the tool for you! You can either make a simple HTTP GET request:

$ curl -s 'https://www.netmeister.org/puddy/?name=foo.netmeister.org&format=json' | jq '.results'
{
  "2001:4860:4860::8888": {
    "CNAME": {
      "status": "NOERROR"
    },
    "AAAA": {
      "status": "NOERROR"
    },
    "comment": "Google Public DNS",
    "A": {
      "status": "NOERROR"
    }
  },
[...]

...or you can run the command-line tool directly:

$ puddy -1 foo.netmeister.org a aaaa
2001:470:20::2 (Hurricane Electric)
        A: [NO RECORD FOUND]
        AAAA: [NO RECORD FOUND]
2001:4860:4860::8888 (Google)
        A: [NO RECORD FOUND]
        AAAA: [NO RECORD FOUND]
2606:4700:4700::1001 (Cloudflare)
        A: [NO RECORD FOUND]
        AAAA: [NO RECORD FOUND]
2620:0:ccc::2 (OpenDNS)
        A: [NO RECORD FOUND]
        AAAA: [NO RECORD FOUND]
2620:fe::fe (Quad9)
        A: [NO RECORD FOUND]
        AAAA: [NO RECORD FOUND]
$ 

...although that does again require you to be able to talk to the different resolvers on port 53. Either way, this offers you another way of checking if other people are seeing the same result as you do on your local system, which can be helpful.

Eliminate /etc/hosts confusion

A common source of frustration is an apparent discrepancy between a DNS lookup and the behavior seen by other tools. This can often be traced back to a local change made to /etc/hosts, and shows up like this:

$ host foo.netmeister.org
Host foo.netmeister.org not found: 3(NXDOMAIN)
$ ssh foo.netmeister.org
ssh: connect to host foo.netmeister.org port 22: Connection refused

or, even more frustratingly:

$ host bar.dns.netmeister.org
bar.dns.netmeister.org has address 198.51.100.1
bar.dns.netmeister.org has IPv6 address 2001:db8::c2de:2d22:5ca1:2727
$ ping6 bar.dns.netmeister.org
PING6(56=40+8+8 bytes) 2001:470:30:84:e276:63ff:fe72:3900 --> 2001:db8::9a6f:ba98:b763:574e
ping6: sendmsg: Network is unreachable
ping6: wrote 2001:db8::9a6f:ba98:b763:574e 16 chars, ret=-1
ping6: sendmsg: Network is unreachable
ping6: wrote 2001:db8::9a6f:ba98:b763:574e 16 chars, ret=-1
^C
--- 2001:db8::9a6f:ba98:b763:574e ping6 statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
$ grep bar.dns.netmeister.org /etc/hosts
2001:db8::9a6f:ba98:b763:574e   bar.dns.netmeister.org
$ 

That is, you observe one address (or no address) when performing an explicit DNS lookup (2001:db8::c2de:2d22:5ca1:2727), but another (2001:db8::9a6f:ba98:b763:574e) when running commands that use the normal library functions such as getaddrinfo(3) or gethostbyname(3), which (in most cases) will first try /etc/hosts.

If you see this, fix /etc/hosts, yell at the person who made that change (probably you yourself), and then make that file immutable (e.g., sudo chflags schg /etc/hosts on BSD and macOS, or sudo chattr +i /etc/hosts on Linux).
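To check up front whether a name is overridden locally, you can look it up in /etc/hosts by matching whole hostnames, rather than relying on grep(1), which also matches commented-out lines and partial names. A rough sketch; hosts_entries is a hypothetical helper name:

```shell
# Print every address that a hosts(5)-style file maps to the given name.
# Hypothetical helper. Usage: hosts_entries NAME [FILE]
# FILE defaults to /etc/hosts.
hosts_entries() {
    awk -v name="$1" '
        {
            sub(/#.*/, "")                  # drop comments
            # field 1 is the address, all remaining fields are names
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(name)) { print $1; break }
        }
    ' "${2:-/etc/hosts}"
}

# Usage: any output means local lookups may not match what the DNS says.
#   hosts_entries bar.dns.netmeister.org
```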

Other name resolution failures

Here's another way that hostname resolution may fail without it being a DNS failure, strictly speaking:

$ ssh foo.netmeister.org
ssh: Could not resolve hostname foo.netmeister.org: Temporary failure in name resolution
$ 

You may see this if there is no resolver configured in /etc/resolv.conf at all, for example. This is again different from (depending on OS and SSH version):

$ ssh foo.netmeister.org
ssh: Could not resolve hostname foo.netmeister.org: Name or service not known
$ 

...which you might see if none of the nameservers configured are reachable. Which is yet again different from:

$ ssh foo.netmeister.org
ssh: Could not resolve hostname foo.netmeister.org: Non-recoverable failure in name resolution
$ 

...which you might get if you are able to reach your configured nameserver, but that nameserver does not want to talk to you, e.g., it's configured to accept DNS queries, but has restrictions on who may query it for the zone in question, so returns status: REFUSED to your DNS query.

Connection refused

If you are able to resolve the hostname, you may see a different error:

$ ssh foo.netmeister.org
ssh: connect to host foo.netmeister.org port 22: Connection refused

This error message is self-explanatory: connection refused means that you are able to talk to the host, but nothing is listening on port 22. That is, after your host resolves the name to an IP address and sends a TCP SYN packet, the remote side responds with a TCP RST:

20:29:46.690536 IP 172.16.1.15.54702 > foo.netmeister.org.ssh: Flags [S], seq 1750579353, win 65535,
        options [mss 1460,nop,wscale 6,nop,nop,TS val 674037520 ecr 0,sackOK,eol], length 0
20:29:46.701705 IP foo.netmeister.org.ssh > 172.16.1.15.54702: Flags [R.], seq 0, ack 1750579354, win 0, length 0

If you're sure you have the right port and hostname, then the problem is on the remote side at this point. But do note that connection refused is notably different from...

Operation timed out

$ ssh foo.netmeister.org
ssh: connect to host foo.netmeister.org port 22: Operation timed out
$ 

While connection refused clearly shows that you can reach the remote side, operation timed out equally clearly indicates that you cannot talk to the remote system on the given port and protocol. tcpdump(1) will show your system sending repeated but unanswered SYN packets:

20:57:10.360287 IP 172.16.1.15.55111 > foo.netmeister.org.ssh: Flags [S], seq 2997342946
20:57:11.361373 IP 172.16.1.15.55111 > foo.netmeister.org.ssh: Flags [S], seq 2997342946
20:57:12.361702 IP 172.16.1.15.55111 > foo.netmeister.org.ssh: Flags [S], seq 2997342946
20:57:13.362352 IP 172.16.1.15.55111 > foo.netmeister.org.ssh: Flags [S], seq 2997342946
20:57:14.363566 IP 172.16.1.15.55111 > foo.netmeister.org.ssh: Flags [S], seq 2997342946
...

So once more, so long as you're sure you have the right port and hostname, the problem is on the remote side. Or rather, it's not on your side -- something else in between your system and the remote side may be dropping packets.
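The two tcpdump(1) traces above can be told apart mechanically: a reset in reply to the SYN means refused, a SYN with no reply at all means something dropped it. A rough sketch that reads tcpdump(1) output on stdin; syn_verdict and its labels are hypothetical, and it assumes the capture contains only the one connection attempt in question:

```shell
# Classify a captured TCP connection attempt from tcpdump output on stdin.
# Hypothetical helper; prints one of:
#   refused      a segment with the RST flag was seen
#   accepted     a SYN-ACK was seen
#   unanswered   only SYNs were seen
#   no-syn-seen  nothing relevant was captured
syn_verdict() {
    awk '
        /Flags \[R/     { rst = 1 }      # [R] or [R.]
        /Flags \[S\.\]/ { synack = 1 }   # SYN-ACK from the remote side
        /Flags \[S\]/   { syn++ }        # our SYN, possibly retransmitted
        END {
            if (rst)         { print "refused" }
            else if (synack) { print "accepted" }
            else if (syn)    { print "unanswered" }
            else             { print "no-syn-seen" }
        }
    '
}

# Usage (in a second terminal while you retry the connection):
#   sudo tcpdump -n -c 5 host foo.netmeister.org and tcp port 22 | syn_verdict
```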

ping(8) and traceroute(8)

Finally, you can bring in your trusty friends ping(8) and traceroute(8). Is the host in question up at all? ping(8) may be able to tell you, but remember that ICMP may be blocked or dropped along the way, and failure to ping a host does not necessarily mean that it's offline.

traceroute(8) may be able to help you identify where along the way traffic is getting dropped, but here, too, remain aware of the probes possibly being dropped, the UDP port actually being in use, and so on. You can work around some of that by trying e.g., traceroute -P tcp -p <port> (on Linux, the equivalent is typically traceroute -T -p <port>).

But not being able to talk to a remote host can also have different causes, for example...

No route to host

netbsd$ ssh foo.netmeister.org
ssh: connect to host foo.netmeister.org port 22: No route to host
$ 

The hostname resolves, but we can't talk to the IP address: our system doesn't know how to route the packets to the remote side. This may be the case when the hostname in question resolves to an IPv6 address, but your system doesn't have IPv6 connectivity. Another reason might simply be a wrong or missing route, but either way, this is a problem on the client side.

Infuriatingly, this error may be reported on e.g., macOS as:

macos$ ssh foo.netmeister.org
ssh: connect to host foo.netmeister.org port 22: Undefined error: 0
$ 

Gee, thanks a lot, Apple. That's really helpful.

Connection closed by remote host

Finally, you may encounter an error that sounds very similar to connection refused (see above):

$ ssh foo.netmeister.org
kex_exchange_identification: Connection closed by remote host
$ 

In this case, you were able to resolve the host and even talk to it using TCP on port 22, but the remote side decided to close the connection. This may be because whatever is listening on port 22 on that host doesn't actually speak SSH; this often happens if there is some sort of proxy or load balancing going on, or you simply have the wrong port.
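An SSH server announces itself with an identification string beginning with "SSH-" (RFC 4253), so looking at the first thing the remote side sends is a quick way to tell whether the port speaks SSH at all. A minimal sketch; is_ssh_banner is a hypothetical helper, and the nc(1) invocation in the comment is an assumption whose behavior varies between nc implementations:

```shell
# Succeed if the given line looks like an SSH identification string,
# e.g. "SSH-2.0-OpenSSH_9.6". Hypothetical helper.
# Caveat: a server may send other lines before its identification
# string, so a mismatch on the first line is a hint, not proof.
is_ssh_banner() {
    case "$1" in
        SSH-*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage (assumes an nc that keeps reading after stdin hits EOF;
# -w sets a timeout in seconds):
#   banner=$(nc -w 5 foo.netmeister.org 22 </dev/null | head -n 1)
#   is_ssh_banner "$banner" && echo "speaks SSH" || echo "got: $banner"
```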


Summary

The above examples were given using SSH; many of them translate more or less 1:1 to other TCP-based protocols, such as HTTP. Of course there are many other application-specific failure modes, and different applications differ in how well they relay the error messages to the user.

But before you get to debugging the application, we can summarize the different network errors and their meaning into this flow chart:

[Figure: flow chart representing network troubleshooting decisions]

This chart then simplifies, as promised above...

[Figure: simplified network troubleshooting flowchart]

...to "Is it the DNS, the network, or is it the app?". Knowing which one of these it is is often the first step in troubleshooting the larger problem.
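As a rough sketch, that first triage step can be written down as a shell function that maps the error messages shown in this post onto the three buckets. The function name triage, the bucket assigned to each message, and the Linux-style "Connection timed out" wording are assumptions made for this sketch rather than something read off the flow chart:

```shell
# Map an error message onto "dns", "network", or "app".
# Hypothetical helper; the message-to-bucket mapping is an assumption
# based on the error messages discussed above.
triage() {
    case "$1" in
        *"Could not resolve hostname"*)
            echo dns ;;
        *"Operation timed out"*|*"Connection timed out"*)
            echo network ;;     # SYNs go unanswered
        *"No route to host"*|*"Undefined error: 0"*)
            echo network ;;     # no usable route on the client side
        *"Connection refused"*)
            echo app ;;         # host reachable, nothing listening
        *"Connection closed by remote host"*)
            echo app ;;         # something listens, but hangs up on us
        *)
            echo unknown ;;
    esac
}

# Usage:
#   triage "$(ssh foo.netmeister.org true 2>&1)"
```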
