
The how of monitoring your services


Lately, there has been a lot of discussion about SLAs, SLOs, and SLIs. As this article states, it is hard to define the correct SLOs and SLIs. That discussion is about what part of your services you want to monitor. But it is also difficult to measure these things correctly. In this blog post, I take a look at two examples of what can (and, for us, did) go wrong in monitoring. This is about how you monitor your services.

Example: TCP connections for monitored services

The first example will be about TCP connections, a proxy, and handshakes.

Expectation vs. reality

For one of our projects we use an authentication proxy that talks to an LDAP server as its backend. We came across connections piling up on the server hosting this proxy. At first, it was not clear what was causing these connections.

After the proxy was installed, I integrated it into our Zabbix monitoring. To verify that the proxy answers requests, I used Zabbix's built-in check net.tcp.connect. At first, all seemed fine. The check was doing exactly what I expected.
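Depending on the Zabbix version, such a check is a one-line item key on the monitored host. With a current agent, the comparable built-in item is net.tcp.port, which returns 1 when a TCP connection to the given port succeeds (host and port as in the netcat examples later in this post):

    net.tcp.port[backend.example.com,8636]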

But after a while, we saw connections to the backend piling up on the server running the proxy. As no one was using the proxy for authentication at that point, I suspected Zabbix of causing the vast number of connections. The monitoring of the service was not working as expected. But what exactly was happening?

Each time Zabbix initiated the check, it was doing a three-way TCP handshake …

TCP three-way handshake (source: https://www.cs.purdue.edu/homes/park/cs536-e2e-3.pdf)

… and after that tore down the connection:

TCP connection teardown (source: https://www.cs.purdue.edu/homes/park/cs536-e2e-3.pdf)

In tcpdump, it looks like this:

TCP 3-way-handshake and tear down
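A similar trace can be captured on the proxy host with a command like the one below; the output shown is abridged and annotated, and the addresses are illustrative:

    # capture the monitoring check's traffic on the proxy's listening port
    tcpdump -nn -i any 'tcp port 8636'

    IP 10.0.0.1.40000 > 10.0.0.2.8636: Flags [S]    # SYN
    IP 10.0.0.2.8636 > 10.0.0.1.40000: Flags [S.]   # SYN-ACK
    IP 10.0.0.1.40000 > 10.0.0.2.8636: Flags [.]    # ACK, handshake complete
    IP 10.0.0.1.40000 > 10.0.0.2.8636: Flags [F.]   # FIN, teardown starts
    IP 10.0.0.2.8636 > 10.0.0.1.40000: Flags [F.]
    IP 10.0.0.1.40000 > 10.0.0.2.8636: Flags [.]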

That was expected, so why were there so many connections still left on the system?

The proxy responded correctly, so the Zabbix check reported that everything was fine. But what happened on the connection from the proxy to the backend system?

It turned out that the proxy opened a TLS connection to the backend for every incoming TCP connection; it did not matter to the proxy that no data was sent. The TLS connection to the backend should not have been a problem either: it should have been torn down when the TCP connection from Zabbix to the proxy ended. That is the theory. In reality, the TLS connection persisted even after a correct TCP teardown:

TLS handshake - 1
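These lingering TLS connections are what piled up on the proxy host. They show up, for example, in the output of ss as established connections to the backend port (the port used throughout this post):

    # list established TCP connections from the proxy host to the backend
    ss -tnp '( dport = :8636 )'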

The Swiss army knife of networking

So, I had found the connections piling up on the proxy, but I still did not know what the real problem was. I tried to get a more precise view by connecting to the proxy manually with netcat: nc -v backend.example.com 8636

But the problem did not appear: each time I opened a connection to the proxy with netcat, the proxy started a TLS connection to the backend; when I closed netcat, the proxy tore the TLS connection to the backend down again. No connections piled up on the proxy. What was different? After some more testing and man page reading, I managed to reproduce the Zabbix behaviour with netcat: nc -z -v backend.example.com 8636

The parameter that did the trick was -z. It instructs netcat to close the connection after a successful connect:

-z      Specifies that nc should just scan for listening daemons, 
        without sending any data to them. It is an error to use 
        this option in conjunction with the -l option.

So, it is not a problem specific to Zabbix; it seems to be the proxy. During the tests with netcat, I observed that the problem did not appear when I used netcat in interactive mode.
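To recap the two netcat invocations side by side (hostname and port as above):

    # interactive mode: the proxy tears its backend TLS connection down
    nc -v backend.example.com 8636

    # close immediately after a successful connect: backend TLS connections
    # are left behind, just like with the Zabbix check
    nc -z -v backend.example.com 8636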

Perhaps everything is a timing problem?

Netcat offers another handy parameter for these tests:

-w timeout
             Connections which cannot be established or are idle 
             timeout after timeout seconds. The -w flag has no 
             effect on the -l option, i.e. nc will listen forever
             for a connection, with or without the -w flag. 
             The default is no timeout.

So, I tried it again with nc -w 1 -v control01.baremetal 8636, and it turned out that this works.

I did some more tests with this parameter, and it worked without leftovers. Taking a closer look at the tcpdump traces shows: the TLS connection is not torn down when the initiating TCP connection to the proxy ends before the TLS handshake has finished. As soon as the TCP teardown sequence starts after the TLS handshake is done, the TLS connection is also ended as expected (tcpdump view):

TLS handshake - 2

Monitoring the service

So, I used this netcat command to create a new Zabbix check with the slowed-down TCP disconnect. It is not a perfect solution, but it works fine in my situation.

For completeness: implementing the check in Zabbix did not work without problems either. In short, netcat also needs the parameter -d when it is started by Zabbix. Otherwise, it does something odd with stdin, and the parameter -w 1 has no effect.
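A minimal sketch of the resulting check as a Zabbix agent UserParameter; the key name and file path are hypothetical, and it assumes nc's exit status distinguishes a successful connect from a failed one:

    # /etc/zabbix/zabbix_agentd.d/proxy_tcp_check.conf (hypothetical path)
    # -d: do not read from stdin -- needed when nc is started by Zabbix,
    #     otherwise -w 1 has no effect
    # -w 1: keep the connection for one second, giving the proxy time to
    #       finish its backend TLS handshake before the disconnect
    UserParameter=proxy.tcp.check[*],nc -d -w 1 $1 $2 && echo 1 || echo 0

The item would then be used as proxy.tcp.check[backend.example.com,8636].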

Example: State of a monitored service

The second example is about another application we monitor. There, we monitored the availability and response time of an HTTP endpoint. The first approach was a simple HTTP GET, which showed these response times:

Pile-up of monitored response times

As you can see in the graph above, the response time grew the more often we queried the endpoint.
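Such a check boils down to a timed HTTP GET, which can be reproduced with curl; the URL is a hypothetical stand-in for the monitored endpoint:

    # print only the total response time of a GET request
    curl -s -o /dev/null -w '%{time_total}\n' https://app.example.com/status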

As it turned out, the application kept state associated with the endpoint. Surprisingly, not everything is stateless. This state grew bigger with each query, so it took the application longer and longer to process our requests. The session timeout was too long for the session to be discarded between two monitoring queries.
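One way to spot such session creation from the outside is to check whether the endpoint answers every request with a fresh session cookie (again with a hypothetical URL):

    # dump only the response headers; a new Set-Cookie on every call means
    # every monitoring request creates a new server-side session
    curl -s -D - -o /dev/null https://app.example.com/status | grep -i '^set-cookie'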

The application had to be modified so that it does not create a session for the endpoints used for monitoring. This is just one example; the same effect can show up in disk space, memory, or CPU consumption.

Conclusion

Not only is it hard to define SLOs/SLIs and the right measurements from a user's perspective. As these examples show, it is also hard to monitor services correctly without impacting the selected SLOs with your measurement. It is crucial to know not only what service to monitor, but also how to monitor it. The observer effect is not only applicable to quantum physics.

