Container Networking and tcpdump

Looking at packets as they travel through the network can tell you a lot about how the network is behaving and what can potentially go wrong.

I am just starting to learn about the various ins and outs of this ecosystem, so I never lose an opportunity to use tcpdump. Sometimes I get results and sometimes I don’t, but tcpdump is always fun.

The normal scenario
In the normal scenario you have a computer whose packets you want to sniff. You log into the computer and start tcpdump on a network interface. A network interface is the logical counterpart of a physical networking device, so you get to see all the packets flowing through that interface.

However, at Shaadi some of our workloads are containerized. What do you do if you want to look at the packets of a single container?

The naive approach (as I would soon discover) is to run tcpdump on the entire instance. This is not a good idea if the instance is ingesting data at upwards of 1 Gbps. Capturing all of that traffic means writing roughly 125 MB to disk every *second* on the container host.

Since this is a production environment (when you are running tcpdump it is almost always on prod), writing such a huge file has 2 problems.

  1. Container hosts rarely have enough free disk space for that. A host with 20 GB of free disk gives me under three minutes of capture time.
  2. It's computationally expensive. Very expensive.

    Analyzing the capture would be one hell of a challenge.
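
To put rough numbers on the disk problem, here is a back-of-the-envelope sketch in shell. The function name capture_seconds is made up for this example, and the estimate ignores the small per-packet overhead of the pcap format. Note how much the answer depends on whether the rate is counted in bits or in bytes per second.

```shell
# capture_seconds DISK_BYTES RATE_BITS_PER_SEC
# Prints how many whole seconds of full-rate capture fit into DISK_BYTES.
# Hypothetical helper for illustration; pcap header overhead is ignored.
capture_seconds() {
    disk_bytes=$1
    rate_bits=$2
    # 8 bits per byte, integer division rounds down.
    echo $(( disk_bytes * 8 / rate_bits ))
}

# 20 GB of free disk at 1 Gbit/s (about 125 MB/s)
capture_seconds 20000000000 1000000000   # prints 160

# 20 GB of free disk at 8 Gbit/s (about 1 GB/s)
capture_seconds 20000000000 8000000000   # prints 20
```

Either way, a full-host capture fills the disk in seconds to minutes, which is the reason to narrow the capture down.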

The smarter way
The smarter way would be to sniff packets only from the container that we want to debug. This needs some introduction to how container networking works.

One might think that we could easily do it by capturing packets to and from the port where the container is exposed. WRONG.

I made that mistake. What we should realize is that the port Docker publishes is only used for connections coming into the container, not for connections the container opens itself. So if you capture packets on port 32763 (which maps to port 3000 inside your container, the port exposed in your Dockerfile) then you are looking at the traffic of clients connecting to your container. You are not capturing the connections that the container initiates to other services.

We need to dig deeper.

So let’s go…

Containers use a Linux isolation feature called ‘namespaces’ to isolate processes running on a host. For networking, every container runs in its own network namespace so that it is isolated from other processes. Connections between these namespaces are established using virtual Ethernet devices called veth.

We can think of a veth pair as a virtual Ethernet cable with a network interface on each end. The interfaces are like virtual Ethernet ports, similar to the Ethernet port on your computer.

So now we have to look at the scenario from two different perspectives: the host’s perspective and the container’s perspective.

I am running a simple sh shell in alpine.

# docker run -it alpine:latest /bin/sh
# echo "Hello :-) "
Hello :-)

Now I run ip link, which lists the network interfaces.

# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
17: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:1e:01:00 brd ff:ff:ff:ff:ff:ff

Here, we see that eth0@if18 has an @ifXX in it, which makes things very interesting. This signifies two things. The ‘@’ shows us that this interface is linked to another interface, and the ‘ifXX’ gives the index of that other interface, shown as a bare index rather than a name because it is not in the same network namespace.

A veth interface is always connected to a peer at its other end, and every interface has an interface index. These are the values 1 and 17 at the start of each entry in the output above. The index can also be read from /sys/class/net/<interface>/ifindex

# cat /sys/class/net/eth0/ifindex 
17

The interface it is connected to is called the peer link, and we can read its index from /sys/class/net/<interface>/iflink

# cat /sys/class/net/eth0/iflink
18

That looks surprising, because my container does not have any interface with ifindex=18. It is not a mistake. It shows that interface 17 in the container is linked to interface 18 on my host.
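
The two reads above can be wrapped in a small helper. This is only a sketch: link_pair is a name invented for this example, and the optional second argument (a directory to use in place of /sys/class/net) exists only so the function can be tried against a fake directory tree.

```shell
# link_pair IFACE [SYSFS_NET_DIR]
# Prints "<ifindex> <iflink>" for IFACE, or fails if either file is unreadable.
# Hypothetical helper; SYSFS_NET_DIR defaults to the real /sys/class/net.
link_pair() (
    iface=$1
    net_dir=${2:-/sys/class/net}
    index=$(cat "$net_dir/$iface/ifindex") || exit 1   # this interface's own index
    peer=$(cat "$net_dir/$iface/iflink") || exit 1     # index of the interface it is linked to
    echo "$index $peer"
)
```

Inside the container from this post, running link_pair eth0 should print the same two values we just read by hand, 17 and 18.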

This is what ip link shows on my host.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state [[EXTRA DETAILS TRUNCATED]]
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 [[EXTRA DETAILS TRUNCATED]]
    link/ether e8:6a:64:c1:0d:3f brd ff:ff:ff:ff:ff:ff
3: wlp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [[EXTRA DETAILS TRUNCATED]]
    link/ether 98:2c:bc:4d:d9:b0 brd ff:ff:ff:ff:ff:ff
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [[EXTRA DETAILS TRUNCATED]]
    link/ether 02:42:27:9a:cf:32 brd ff:ff:ff:ff:ff:ff
18: veth01f0f9d@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [[EXTRA DETAILS TRUNCATED]]
    link/ether f6:7c:d7:20:73:f4 brd ff:ff:ff:ff:ff:ff link-netnsid 0

NOTE: Notice how interface 18 (veth01f0f9d) is linked to interface 17 in another namespace. This will be important.
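
On a busy host with dozens of veth interfaces, scanning the ip link output by eye gets tedious. A minimal sketch of doing the lookup from sysfs instead is below. The name iface_by_index is hypothetical, and the optional directory argument is again only there so the function can be pointed at a test directory.

```shell
# iface_by_index INDEX [SYSFS_NET_DIR]
# Prints the name of the interface whose ifindex equals INDEX.
# Returns non-zero if no interface in this namespace has that index.
iface_by_index() (
    want=$1
    net_dir=${2:-/sys/class/net}
    for dir in "$net_dir"/*; do
        # Skip entries that are not interfaces (no readable ifindex file).
        [ -r "$dir/ifindex" ] || continue
        if [ "$(cat "$dir/ifindex")" = "$want" ]; then
            basename "$dir"
            exit 0
        fi
    done
    exit 1
)
```

On the host shown above, iface_by_index 18 should print veth01f0f9d. Interface indexes are only unique within one network namespace, so the lookup has to run on the host, not inside the container.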


Why don't some interfaces have an @ifXX?

Interfaces that represent physical devices (eth0, wlan0) are linked to themselves and hence the ‘@’ is not used.

# cat /sys/class/net/wlp6s0/ifindex
3
# cat /sys/class/net/wlp6s0/iflink
3
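
The same comparison can be used to pick out every interface that is linked to something other than itself. The sketch below uses an invented function name, linked_ifaces, and an optional directory argument for testing. Keep in mind that veth interfaces are not the only kind whose iflink differs from its ifindex (VLAN interfaces, for example, point at their parent device), so treat the output as a list of candidates.

```shell
# linked_ifaces [SYSFS_NET_DIR]
# Prints "<name> <ifindex> <iflink>" for every interface whose iflink
# differs from its own ifindex. Hypothetical helper for illustration.
linked_ifaces() (
    net_dir=${1:-/sys/class/net}
    for dir in "$net_dir"/*; do
        # Skip entries that lack either file.
        [ -r "$dir/ifindex" ] && [ -r "$dir/iflink" ] || continue
        index=$(cat "$dir/ifindex")
        peer=$(cat "$dir/iflink")
        if [ "$index" != "$peer" ]; then
            echo "$(basename "$dir") $index $peer"
        fi
    done
)
```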


How is this related to tcpdump?

Well, we have figured out that all traffic from the container flows through the host via a linked network interface. So, to sniff packets from that container only, we can point tcpdump at that one interface (veth01f0f9d in the example above).

tcpdump -i <interface> -w output.pcap

And voila! Now we can sit and sniff packets from a single Docker container.

Not only does this vastly reduce the size of the capture files, it also reduces complexity during the analysis phase.
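
Putting the steps together, here is a rough sketch that goes from a container name to a ready-to-run tcpdump command. Several things in it are assumptions rather than part of the setup described above: the function name capture_command, a container whose interface is called eth0 and whose image ships a cat binary, and the docker CLI being available on the host. The -C and -W options are standard tcpdump flags that rotate the capture across a fixed number of size-limited files, and they are added here only as a safety net for the disk. The function prints the command without running it, so you can check the interface before starting a capture as root.

```shell
# capture_command CONTAINER [SYSFS_NET_DIR]
# Prints (does not run) a tcpdump command for the host-side veth of CONTAINER.
# Hypothetical helper; assumes eth0 and `cat` exist inside the container.
capture_command() (
    container=$1
    net_dir=${2:-/sys/class/net}
    # Ask the container for the index of the peer of its eth0.
    peer=$(docker exec "$container" cat /sys/class/net/eth0/iflink) || exit 1
    # Find the host interface that has that index.
    for dir in "$net_dir"/*; do
        [ -r "$dir/ifindex" ] || continue
        if [ "$(cat "$dir/ifindex")" = "$peer" ]; then
            # -C 100 -W 10: cycle through ten files of about 100 MB each.
            echo "tcpdump -i $(basename "$dir") -w $container.pcap -C 100 -W 10"
            exit 0
        fi
    done
    echo "no host interface with ifindex $peer" >&2
    exit 1
)
```

For a container named web (a made-up name) attached to the veth shown earlier, capture_command web would print a tcpdump command with -i veth01f0f9d.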

That’s it! Thanks for reading and happy sniffing.

Note: You can read a similar post on Sohom’s blog signalshore.github.io which does not have the work-specific bits.