Tencrypt: Hardening OpenShift by Encrypting Tenant Traffic


tl;dr: Tencrypt implements a transparent encryption proxy for network traffic originating from Pods towards Pods of the same OpenShift Project, without the need for any changes in deployment images (hence „transparent“). As OpenShift uses the Kubernetes orchestration engine, this implementation might be of interest in the Kubernetes stack as well. This article is a shortened text version of my report, published as a paper and a presentation. I have skipped some parts of the report to focus on the implementation (although including some overhead to offer keywords for search engines as well).

Introduction

Security and data privacy are becoming increasingly important factors when companies and other user groups decide which technologies to consider for migrating software into cloud environments. OpenShift is one of the main platforms for containerised application deployment.

In this project, the requirements for transparent encryption of network traffic between multiple applications (Pods) of a Project inside OpenShift are evaluated and implemented as a proof of concept. Based on these results, measurements are taken to compare throughput performance between the vanilla and the patched environments. Lastly, the conclusion wraps up the approach, the results and further work to be done.

For an introduction to OpenShift, please see section 2 „Red Hat OpenShift, Kubernetes and Docker“ in my report.

Demo

In the following video you see a demo of Tencrypt. The demo shows a connection between two Pods, nginx1 and nginx2, which belong to the same OpenShift Project and use an example image. The tencrypt-nsexec.sh script takes the arguments and executes them in the target container namespace.

In the first run, a connection is made from the Pod nginx1 to the web Service offered by nginx2 using curl. The HEAD request is fulfilled as expected, but can be sniffed on the docker0 bridge on the host (as seen in the top left terminal). To prepare the encrypted tunnel, the tencrypt-iptables-update.sh script creates (or updates) iptables rules for the Tencrypt port towards the virtual IPs. This is needed because the originating Pod only knows the VIP and uses it to establish the encrypted tunnel.
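
As a rough sketch of what such a rule could look like (the actual script is not shown here; the port and the addresses below are placeholders, not values from the project):

# hypothetical host-side rule: forward the Tencrypt UDP port addressed to a
# Service VIP to the Pod backing it (TENC_PORT, VIP and POD_IP are assumptions)
TENC_PORT=9999
VIP=172.30.123.123
POD_IP=172.17.0.5
iptables -t nat -A PREROUTING -p udp -d "$VIP" --dport "$TENC_PORT" \
  -j DNAT --to-destination "$POD_IP:$TENC_PORT"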

Executing tencrypt-setup.sh sets up the Pods network for Tencrypt. The script is shown in section „Setting up the Pods network“. You can see that executing it inside nginx1 applies the patch without changing nginx2. Finally, nginx2 is patched as well and the tencrypt-proxy binary is executed in both Pods.

Running curl again shows the patched communication. The request finishes as expected, while the proxy applications exchange packets via UDP. What you can see in the log of nginx1 is the output of the remote address resolver, marking it as an „internal IP“ (don't mind the error messages; the demo was based on an extended version with experimental code blocks). More details on this in section „Differentiation of Project-internal and -external traffic flows“. The end result shows the sniffed traffic on docker0. Without the patches, the request was made in the clear; with the patches, only an exchange of encrypted UDP packets could be captured.

Video is also available in original size (1354x752, 32M).

Networking concepts in OpenShift

Kubernetes by default allocates IP addresses from private internal network ranges for each Pod. Additionally, Pods receive their own unique networking namespace as isolation. See figure 1 for details. As described in [Hau18], Kubernetes uses the same network namespace for all containers inside a Pod for intra-Pod traffic. When using the SDN overlay network (based on VXLAN), OpenShift differentiates between three modes of networking:

Flat: Using the ovs-subnet plug-in, a „flat“ Pod network is established. Every Pod can communicate with every other Pod.
Multi-Tenant: The ovs-multitenant plug-in isolates traffic on the Project-level. Each Project receives a unique Virtual Network ID (VNID) (comparable to a VLAN tag) which restricts traffic flows from pods to pods of the same Project. The special VNID 0 can be used to enable unrestricted traffic flows, e.g. for load balancers. It can be seen as Project-level network isolation, supporting multicast and egress network policies.
Network Policy: Lastly, the ovs-networkpolicy plug-in allows the configuration of custom isolation policies. This granular policy-based isolation provides rules like „allow all traffic inside the project“ or „allow traffic to project Purple on port 8080“.

Containers are created inside Pods and receive an isolated eth0 network adapter from a veth pair; the other end is attached to the OVS bridge br0.
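
On a plain OpenShift SDN Node (as opposed to the Docker-based Minishift setup described later), this wiring can be checked quickly with the Open vSwitch tools, assuming the OVS CLI is installed on the Node:

# list the ports attached to the SDN bridge (veth ends, vxlan0, tun0)
ovs-vsctl list-ports br0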

OpenShift networking overview

Security requirements and threat model

Using complex software infrastructures for computation and storage introduces multiple attack vectors. To analyse the security requirements Tencrypt should cover, the STRIDE/AINCAA list given in [Sho14] is taken into consideration. STRIDE can be viewed as a super-set of the CIA triad.

Authentication: Pods which receive traffic over an intra-Project connection from other Pods of the same Project must be able to ensure the authenticity of the traffic sender.
Integrity: Data transmitted over a Tencrypt channel must support validation of integrity.
Non-repudiation: Not a subject of this project.
Confidentiality: Data sent to other Pods must be encrypted so that it cannot be read by outsiders, e.g. administrators listening on the br0 bridge. Introduced in the proof of concept.
Availability: One of the key elements of the underlying Kubernetes orchestration tool. Kubernetes offers features like load balancing, migration of Pods in case of failures, automatic scaling of resources under increased load and dynamic routing, with which changes in the container topology can be handled transparently without interruptions. An extended implementation of Tencrypt should handle the discovery of new routes and other interference.
Authorisation: Tencrypt-enabled Pods should automatically reject unencrypted or malformed traffic from Project-internal Pods, reducing the threat of forged traffic from an attacker masking as a Project member.

A note regarding Containerisation

A note before we go through the list of possible attack vectors and threats in the OpenShift Pod setup: this threat analysis does not include the threat of a hostile administrator or attacker controlling the hosting Node (and especially the network bridge). The isolation of Pods in namespaces does not in any way prevent traffic interception by a third party if this third party has unlimited access on the host machine. Preventing administrative access into the underlying network, mount and PID namespaces requires an entirely different approach. Tencrypt focuses on the transparent encryption of Project-internal Pod-to-Pod traffic, both on the same Node and across Nodes.

Nevertheless, the results presented in this paper should be seen in the context of currently expanding technologies which support hardware-based isolation of containers. Once the isolation of namespaces can be technically ensured, the eth0 network interface used by Tencrypt would be placed at the boundary between the protected container and the shared resources on the host.

Additionally, reducing the possible attack points on hosts and inside the network might help mitigate threats in organisations with strict hierarchical access roles. If, for example, a group of administrators only has debug access to the networking bridge, but not to the underlying host system, Tencrypt would prevent disclosure of intra-Project data exchanges to members of this administrative group.

Threats

The following table gives an overview of the possible threats which Tencrypt takes into consideration.

T1: An attacker uses a Pod to intercept traffic originating from other namespaces (Pods) on the br0 bridge. Mitigation: encryption of traffic, hardening of isolation mechanisms (Linux kernel).
T2: An attacker not only intercepts, but is able to modify traffic on the br0 bridge or the vxlan0 adapter. Mitigation: encryption and integrity checks.
T3: Interception and modification of Master-to-Node traffic. Mitigation: IPsec.
T4: Interception of Node-to-Node traffic, both Project-internal and cross-Project. Mitigation: a combination of Node-to-Node IPsec and Tencrypt for Project-internal traffic.
T5: Incoming external Service traffic is intercepted (and possibly modified) before it reaches the handling Service namespace. Mitigation: secured routes.
T6: The Pod image used by OpenShift to deploy new Pods is maliciously modified. Mitigation: securing the image registry. The registry depends on the container technology used and might be an external component.
T7: Resources requested by a Pod limit the availability of other Pods on the same Node. Mitigation: continuous resource monitoring; migration or halting of resource-intensive Pods if needed.

(Sorry for the formatting, please see the report PDF for a nicer layout.)


Ideas and possible approaches

After the analysis of the OpenShift network architecture, combined with the security requirements, the following design and implementation approaches were collected and served as the base for further implementation.

  1. Using AES, a shared Secret can be used as the encryption and decryption key. Payload of packets would need to be transparently encrypted using the key and decrypted on arrival.
    • How to handle payload size growth and MTU?
    • How can the keys be shared and rotated?
    • Does this approach fulfil the security requirements?
  2. In contrast to symmetric encryption with AES, asymmetric encryption with periodically changed session keys could be used (hybrid encryption). With this approach, Pods could put their public key into the Secret storage, not having to share the private key.
    • Who generates the key pair? (the Pod instance?)
    • Which established system could be used? (e.g. X.509)
    • Is this approach realisable without introducing a new component into OpenShift (as would be needed for certificate chains)?
  3. WireGuard is an existing tunnel encryption software which supports moving its created endpoints into other network namespaces. Could WireGuard therefore be used to deploy network interfaces inside containers?
    • Does it scale?
    • Can the compiled Wireguard tools and kernel modules be integrated at all?
    • Which component would create the Wireguard interfaces?
    • Can the Secret storage be used to receive all existing public keys from peers?

Using Minishift for experimental implementations

Minishift, a fork of the Kubernetes Minikube project, uses Docker Machine for deployment. In the Tencrypt setup, the VirtualBox VM driver was used because it runs out of the box; in contrast, the KVM setup needs additional tools from the Docker Machine project. For the VM, the Boot2Docker image was deployed at first, but it was later replaced by the CentOS image.

The local „developer“ account was used for administrating deployments. For testing purposes, two Projects were created, with the first having two Deployments (two running Pods) while the second had one. With this environment, inter-Project reachability and later intra-Project traffic encryption can be tested.
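
The Projects were set up through the web interface; an equivalent CLI sketch (the Project names and the image are illustrative, not the ones used in the report) would be:

# first Project with two Deployments
oc new-project project-one
oc new-app nginx --name=nginx1
oc new-app nginx --name=nginx2
# second Project with a single Deployment
oc new-project project-two
oc new-app nginx --name=nginx3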

Minishift works with the default ovs-subnet method, permitting traffic from all Pods to all other Pods regardless of the Project. To further test the functionalities of OpenShift, the ovs-multitenant network plug-in should be used, isolating Pods of different Projects. Yet, configuring it did not result in the expected policy changes. An issue on GitHub states that the plug-in does not currently work with Minishift, as the feature is not implemented.

Dissecting the network configuration in Minishift

When using Minishift, the OpenShift cluster, which one would normally distribute over multiple hardware nodes (separating the Master, etcd and the compute Nodes), is simulated with Docker containers. The localhost (the VM) is pre-configured as the Node on which Pods are deployed. Containers inside these Pods are run as Docker containers bound to the same Docker daemon as the Pod instance. Inspecting the Minishift networking shows that the VM uses three interfaces: eth0, eth1 and docker0.

The interface eth0 with the address 10.0.2.15/24 is VirtualBox and KVM specific and is used for host communication via SSH. The eth1 interface corresponds to the vboxnet0 interface on the original host and uses an address like 192.168.99.100/24. The docker0 bridge uses 172.17.0.1/16 for all Pod namespaces. Each Pod is configured with an IP from this range in its isolated namespace, connected to the bridge via veth pairs. Configured Services receive a „Cluster IP“ in the network range 172.30.0.0/16. These Virtual IPs (VIPs) are routed with iptables NAT rules on the Minishift host. If a Service is requested by its DNS name, OpenShift resolves it to the 172.30.x.x address.
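
This translation can be observed on the VM itself, assuming shell access via minishift ssh:

# show the NAT rules which translate Service VIPs from the 172.30.0.0/16 range
minishift ssh "sudo iptables -t nat -L -n | grep 172.30"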

Secret management to share pre-shared keys

To test the possibility of using pre-shared keys within Pods and containers, the integrated Secret management component of OpenShift was evaluated. With it, one enters a Project, creates a Secret resource „Tencrypt“ and gives it the key „PROJECT_PSK“ with a random value. There are two ways to access this key from within a Pod:

Environment variable: The key/value entry is accessible via the generated environment variables inside the container. E.g. a script could use $TENCRYPT_PROJECT_PSK if configured with this key.
Filesystem mount: The Secret with all its keys is mounted as a volume and can be accessed by reading the mounted files. E.g. the Container can execute cat /mnt/tencrypt/project_psk when mounted under /mnt/tencrypt.
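
For reference, such a Secret can be created and wired up with the oc CLI; the deployment name and the generated value are only illustrative:

# create the Secret with a random pre-shared key
oc create secret generic tencrypt --from-literal=PROJECT_PSK=$(openssl rand -hex 32)
# variant 1: expose it as environment variables (yields TENCRYPT_PROJECT_PSK)
oc set env dc/nginx1 --from=secret/tencrypt --prefix=TENCRYPT_
# variant 2: mount it as a volume under /mnt/tencrypt
oc set volume dc/nginx1 --add --type=secret --secret-name=tencrypt --mount-path=/mnt/tencrypt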

A further look into the availability of ENV entries and mounts in the Pod container revealed that each container has its own environment. The Pod container does not receive the Secret environment entries, even if configured. Testing the mount namespaces for the second way of sharing Secrets showed that file system mounts are also not shared between the Pod and its application containers. The second option was therefore not applicable either. This leads to the conclusion that the existing Secret storage might not be sufficient for this type of Secret sharing.

Patching the Pod image

As mentioned, inside the VM a Docker daemon handles the building and orchestration of images. The internal registry contains some images used to deploy OpenShift components, e.g. the openshift/origin-haproxy-router and, more interestingly, openshift/origin-pod. Minishift allows connecting to this Docker daemon instance by executing eval $(minishift docker-env) and then using the local docker tool, which connects to the remote daemon.

The connection to the integrated Docker daemon also gives access to the VM-internal Docker registry with the mentioned openshift/origin-pod image. Using a Dockerfile and some Docker commands, we can re-tag the original Pod image. The new image is loaded when OpenShift deploys a new Pod (in the default configuration, which uses the identifier openshift/origin-{component}:v3.10 as image name). This can be verified by using the developer web interface or the oc CLI tool to re-deploy an application.

As soon as the application is re-deployed, both a new Pod container and a new application container are visible in the Docker container list. Using docker ps -n2, the two most recently created containers are displayed, and a shell can be opened in one of them with docker exec -ti -u root <id> bash. The verification is complete when ls /etc/hello_test executes successfully. We now have a new layer on top of the default origin-pod image.
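
The patching and re-tagging can be sketched as follows; the Dockerfile content and the helper files copied into the image are illustrative, only openshift/origin-pod:v3.10 is the identifier OpenShift actually pulls in this setup:

# talk to the Docker daemon inside the Minishift VM
eval $(minishift docker-env)
# keep the original image available under a different tag
docker tag openshift/origin-pod:v3.10 openshift/origin-pod:v3.10-orig
# build a patched layer on top of it and re-tag it with the original name
cat > Dockerfile <<'EOF'
FROM openshift/origin-pod:v3.10-orig
COPY tencrypt-setup.sh tencrypt-proxy /usr/bin/
RUN touch /etc/hello_test
EOF
docker build -t openshift/origin-pod:v3.10 .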

Encrypting traffic between Pods

To find a possible solution for most-early integration of transparent Tenant-level encryption, the following basic policies were defined:

  1. When speaking of „encrypted traffic“, it only includes Tenant-internal (Project-internal) traffic, not egress traffic leaving the platform or traffic exchanged with Services of other Projects.
  2. Containers inside Pods in OpenShift share a virtual ethernet interface, eth0. This interface will be the main implementation focus. All packets transmitted over the veth pair via the br0 bridge and the vxlan0 interface towards other Pods of the same Project should be encrypted.
  3. For this experiment, only data of OSI layers 5-7 is encrypted, also known as the application layer in the TCP/IP stack. This reduces the overhead which would be introduced by layer 3 or even layer 2 encryption, which would most probably result in encapsulation and additional masking techniques. The proof of concept implementation does use UDP encapsulation, because tunneling TCP with only payload encryption proved not to be feasible.

Pods receive one side of a virtual ethernet (veth) network adapter pair, connected to br0. Inside the Pod, all containers share this network interface as eth0. Technically, the whole Pod is a collection of multiple containers sharing the same network namespace. The Pod container is special: it only executes the /usr/bin/pod binary, which just waits for an interrupt and does nothing else, keeping the container alive as long as the namespace resources are needed by the containers belonging to this Pod.

Since the encryption should be done at the earliest possible point in the network stack, the modification of the eth0 adapter is the primary target. Secondly, it should not be necessary to patch any deployed application image used in OpenShift. Developers and users of the platform should not need to take any measures regarding their deployments if they wish to use Tencrypt.

In the following, part 1 shows which possibilities for the Tencrypt routing setup were explored and what the final result looks like. This is followed by part 2, in which the differentiation between Project-internal and -external traffic is examined. Part 3 explains the proof of concept implementation. The schema below shows the implementation with the proxy application which reads packets from the virtual TUN interface tenc0. Containers route their packets through this interface, as it is configured as the default gateway for hosts in the Services IP address range. The proxy application itself uses the interface address of eth0 to relay the packets over eth0.

Virtual adapter tenc0 implementation

Setting up the Pods network

This part started with a simple step in which the possibilities for manipulating network interfaces, addresses and routes from inside the container were tested. Having gained the ability to add binaries and scripts to the Pod image, the Dockerfile copied a shell script which was executed on Pod start. The script contained some basic commands like ip addr add 172.17.0.99/16 dev eth0 (background: the first attempt aimed at creating additional IP addresses which could be addressed Project-internally). Adding a simple Go binary which read all available network interface addresses worked as well.

Trying to configure any routes or addresses on the eth0 interface fails, however. After some research and a look at the Docker capabilities setup, the problem could be identified as a missing capability: NET_ADMIN. Without it, even root inside the container may not manipulate the network interface.

To verify that this capability is indeed the missing piece, a simple test is sufficient: running a Docker container based on the patched Pod image with the --cap-add NET_ADMIN flag creates a new Pod container in the background. Attaching to it with a console and running ip addr add 127.0.0.2/8 dev lo adds a new IP address to the lo interface without problems.
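
The check boils down to two commands; the container name is arbitrary and the image tag refers to the patched Pod image from the previous section:

# start a throwaway Pod container with the additional capability
docker run -d --name captest --cap-add NET_ADMIN openshift/origin-pod:v3.10
# inside it, adding an address now succeeds
docker exec -ti -u root captest bash -c 'ip addr add 127.0.0.2/8 dev lo && ip addr show lo'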

Finding a solution for this problem included multiple approaches:

  1. An extensive search in the Origin and Kubernetes code repository, to investigate if the Pod container setup could be patched to grant additional capabilities. This brought no results, as the Pod setup is complex, abstracted on multiple levels inside OpenShift and the Kubernetes libraries.
  2. Extending the Dockerfile with setcap and iptables commands. Result: setcap works, but has no effect when the binary is later executed inside the container; iptables fails because of the known capability restrictions.
  3. Adding a new Security Context Constraint (SCC) with the capability and adding securityContext.capabilities.add: ["NET_ADMIN"] to the deployment config. No improvements.
  4. Updating the hostConfig of a running Docker container, adding NET_ADMIN to CapAdd and restarting the container. Both manipulating the file on the host system and the possibilities offered by the Docker Python library were tried. This did not work as expected.
  5. Changing capabilities as root from the host machine by manipulating the aufs file system of the container in /var/lib/docker/aufs. This also failed: setcap did not work due to missing symbols on TinyCoreLinux (which is used in Boot2Docker), even when self-compiled.

At this point it was obvious that too much time had gone into a problem which needed to be solved, but which would probably take much less effort later, once the Tencrypt code was more compatible with the upstream project and used in a dedicated OpenShift instance instead of a Docker-simulated testing environment. Since the last approach revealed some limitations of Boot2Docker and TinyCoreLinux, it was decided to reduce possible limitations by switching the Minishift VM operating system to CentOS.

Running Minishift with --profile centos --iso-url file:///tmp/minishift-centos7.iso creates a new profile and a new VM. CentOS not only provides a full operating system with more libraries, it also offers more possibilities through installing packages with yum, e.g. tcpdump. TinyCoreLinux offers this as well with tce-load, but has the limitations mentioned above. yum can also be used inside the containers, since they are based on the CentOS base image. Surprisingly, after setting up test deployments and their Services via the OpenShift web console, it is directly possible to use nsenter to enter the network namespace of a running Pod.

The PID can be found by identifying the Pod container and extracting the State.Pid value with docker inspect. Finally, it was possible to manually edit the network adapter inside the container, because nsenter is executed as root from the host machine and not from within the restricted container environment.
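
Put together, entering a Pod network namespace from the VM looks roughly like this (the k8s_POD_ name filter follows the usual Kubernetes container naming convention and is an assumption here):

# find the infrastructure (Pod) container and its PID
POD_ID=$(docker ps -q --filter "name=k8s_POD_nginx1" | head -n1)
POD_PID=$(docker inspect -f '{{.State.Pid}}' "$POD_ID")
# run a command inside the Pod's network namespace as root on the host
nsenter -t "$POD_PID" -n ip addr show eth0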

Coming back to the original aim: transparently proxying traffic to other Services. Again, multiple approaches for traffic routing and transparent proxying were tested, starting with the configuration of IPtables rules and continuing by using virtual interfaces. In the end, a virtual TUN interface in combination with default routes and policy based routing did the trick. The solution for creating the interface and setting up default routes to a transparent proxy is implemented as the following shell script.

# Create TUN
mkdir /dev/net && mknod /dev/net/tun c 10 200 && \
ip tuntap add mode tun tenc0 && \
ip link set tenc0 up
ip link set mtu 1440 dev tenc0
ip addr add 10.0.0.2/24 dev tenc0

# Create routing policy
OWNIP=$(ip a show dev eth0 | grep "inet " | awk '{ print $2 }')
ip rule add from $OWNIP lookup 2

# And the routes
ip route add 172.30.0.0/16 via 10.0.0.1 dev tenc0
ip route add 172.30.0.2/32 dev eth0
ip route add default via 172.17.0.1 dev eth0 table 2

# Prevent TCP RESETs on raw sockets issued by kernel
iptables -I OUTPUT -s 10.0.0.2 -p tcp --tcp-flags RST RST -j DROP

The verification of the functionality was done with a Python script which opened the TUN interface, read the TCP payload from each packet and sent the payload via a second connection towards an external host. The received answer was then passed back to the original local application which was waiting for the reply. The parsing was done with the Scapy library. The following paragraphs summarise the approaches of this implementation step.

At first, a local application should listen for connections on a local port and act as a proxy for TCP flows. For this, IPtables should perform destination Network Address Translation (DNAT). This does not work as expected, because NAT rewrites the packets before passing them on, according to the netfilter packet flow. The information where the packet should originally have been sent is lost and can therefore not be relayed. In this experimental stage, the IPtables mangle table and the TPROXY target were also evaluated. TPROXY is only compatible with the PREROUTING chain, but packets from local applications only pass the OUTPUT and POSTROUTING chains, so TPROXY is not usable in this case.

The next approach used a TAP interface for low-level control on link layer 2. The default route to Services must then be defined via this interface. An application opens the file descriptor to this interface via ioctl and reads incoming Ethernet frames from it. This worked partly, but ultimately failed because TAP interfaces need working Address Resolution Protocol (ARP) resolution, since they operate on layer 2. Any traffic sent towards the gateway 10.0.0.1 configured on the tenc0 TAP interface resulted in ARP requests for 10.0.0.1. Using the interface itself as a route, without any interface IP, resulted in ARP requests for the remote host address, which obviously cannot be resolved on the local interface. TAP was therefore dismissed so as not to introduce more complexity.

Creating the interface as a TUN interface working on the IP layer removes the need to deploy working ARP on this interface. The application reading packets from the interface receives IP packets only. Configuring the TUN interface as the default gateway for traffic towards the Services IP range works transparently for applications running inside the network namespace. A problem occurs as soon as the application reading the packets wants to relay/proxy the received packets. The default route will be applied to this application as well, resulting in a traffic loop.

To solve this problem, two interface IP addresses were added to the TUN interface, 10.0.0.2/24 and 10.0.0.3/24, with 10.0.0.1 as the default gateway for new traffic, routing via tenc0. At this point the IPtables rules and the usage of multiple routing tables were re-introduced. An application which uses this route uses the local address 10.0.0.2 as the packet sender. The proxy application should then receive the packets on the virtual interface, rewrite the source IP address and re-send the packet. The mangle table should MARK packets which were sent from 10.0.0.3, and an fwmark rule should route those packets via another routing table and interface. The whole combination seemed plausible, but failed, because packets were sent via the tenc0 interface again. Binding to a specific interface did not work in the Python script.

The solution for this setup was a variation of the previous step. The 10.0.0.3/24 interface address and all IPtables rules were removed; only the routing policies were kept. The proxy application reads from the TUN interface and binds not as 10.0.0.3, but as 172.17.0.X on the eth0 interface. This combination lets all default traffic originating from local applications proceed according to the default routing table, but packets from 172.17.0.X/16 use another routing table which has 172.17.0.1 as the default gateway.

Differentiation of Project-internal and -external traffic flows

OpenShift makes heavy use of internal DNS hostnames and up-to-date resolvers in combination with the Virtual IP (VIP) address space. This feature was the focus when testing the possibility of differentiating traffic flows, looking for a way to utilise the DNS to distinguish remote IP addresses within the same Project namespace from the rest.

Looking at the DNS config inside a container, we see that OpenShift uses a hierarchical DNS structure which starts its search for hostnames in the Project DNS zone.

nameserver 172.30.0.2
search myproject.svc.cluster.local svc.cluster.local cluster.local

This means that resolving requested Service addresses could identify connections which should be encrypted. What is missing is a possibility for a Pod to look up its own Project namespace name. The environment of a running application container offers the variable OPENSHIFT_BUILD_NAMESPACE with the name, but this variable is not available in the Pod container, which is the main target.

Given the information contained in resolv.conf, the search domain can be utilised to look up and differentiate host names of remote IP addresses. The host names are queried via reverse IP lookups. To test this setup, the environment contained two Projects, with two Services in the first and one in the second. One important thing to know: the DNS resolver is also in the 172.30.0.0/16 address range and must be excluded from the default route which goes over the TUN interface.
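
A quick manual check from inside a Pod illustrates the mechanism, assuming nslookup is available in the container and using an illustrative VIP:

# reverse lookup of a Service VIP against the cluster resolver
nslookup 172.30.123.123 172.30.0.2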

Utilising a Python script again, this step can be done by reading the resolv.conf and then issuing reverse lookups for remote IP addresses. The following code block shows the working Python test script.

with open("/etc/resolv.conf", "r") as fh:
    for line in fh.readlines():
        if line.startswith("search"):
            project = line.split(" ")[1]

while True:
    data = os.read(fd, MTU + 18)
    pkt = parse_packet(data)
    dns_info = socket.gethostbyaddr(pkt.dst)
    if dns_info[0].endswith(project):
        print("Host {} belongs to project!".format(pkt.dst))

Section „Part 3: Encryption of traffic“ can be found in the report. It has some ideas about how to continue with the encryption implementation. The result is described in the next section, the Proof of Concept.

Proof of concept implementation

The proof of concept (PoC) implementation was written in the Go language with the following components:

  • A DNS upstream proxy which reads requests and parses them to extract information.
  • A TUN interface handler, reading packets from applications and writing received replies.
  • UDP encapsulation of payloads with encryption and decryption functionality, using a listener on a specified port for Tencrypt UDP packets.
  • Raw sockets to route traffic towards targeted Services listening on local interfaces.

For an overview of the traffic flow, see the following graph. There were multiple obstacles to overcome, which are described in the following paragraphs.

Virtual adapter tenc0 implementation

The first approach to finding out whether a host is internal or external led to one failed request at the beginning of a connection establishment. An application inside a Pod requests the IP address of a remote Service from the DNS server, receives the IP and sends the TCP SYN packet towards the host. The packet is routed over the tenc0 interface, but cannot directly be handled, because the proxy would have to examine the state (internal/external) of the requested host via a reverse lookup. It was necessary to change this approach and go one step further.

For optimisation, external hosts should be white-listed so traffic targeting these hosts does not flow through the proxy. This, and the caveat mentioned previously, led to the implementation of a DNS proxy inside the Pod. This DNS proxy opens a DNS socket on the loopback interface, and the Pod gets this „new“ DNS server assigned as the default nameserver in /etc/resolv.conf. All requests to this socket are routed to the upstream DNS server over a single, static UDP connection (this might even become an improvement on its own, reducing the number of DNS connections inside the network). Answers from the upstream DNS are inspected and parsed, and replies are matched to clients by the DNS ID field. In case the host is a Pod outside the namespace of the requesting Pod, the IP is added as a static route over eth0. This needs further tweaking, as IP addresses might change and routes must be deleted in case a host receives a new VIP.
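
The route pinning mentioned above boils down to a single command issued by the proxy when a name resolves to an external host; the VIP is illustrative and the pattern mirrors the resolver exception from the setup script:

# bypass the tenc0 default route for a host identified as Project-external
ip route add 172.30.200.10/32 dev eth0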

Even though the fundamentals, which were defined before the practical investigation, stated that only application data would be encrypted, the PoC does indeed encapsulate the whole packet received from client and service applications. This step was taken due to failing handshakes in TCP sessions when only the payload was encrypted and the TCP packet reassembled. Further development of the proxy application might be able to pick this approach up again.

As mentioned, the tenc0 TUN interface is the central point of exchange for proxied applications. But during the implementation phase it came to light that the TUN interface is not enough to interact with applications listening on the public interface of a Pod. Example: a Pod offers a web service on port 8080, and the web server listening for connections is bound to the public interface. Packets written to the tenc0 interface are not read by this web server application, because there is no listening socket waiting for packets there. A second channel is needed to push incoming packets onto the kernel network stack as a „normal“ client would do when connecting. This is solved by using a raw socket inside the target Pod as a sending endpoint. The raw socket uses 10.0.0.2 to send the received payload towards the listening service.

The whole procedure of reading, encapsulating and unpacking has one other problem to be solved: a Service is addressed by its Virtual IP (VIP), not by the IP the Pod uses on its public interface. This means that if a client application in Pod A resolves myservice.myp.svc to 172.30.123.123, the packets are tunneled over the Tencrypt proxy. But as soon as the packet is reassembled on the remote Pod B, B does not know about the virtual IP address, because the virtual IP addresses are translated by NAT rules on the docker0 bridge. Sending the packets to the Tencrypt UDP listener reachable over the virtual IP is no problem; these packets are rewritten as expected. Packets which are received as payload by the Tencrypt UDP endpoint, however, have to be translated as well. This feature is implemented in Tencrypt too, including the necessary re-calculation of IP and TCP checksums.

Section „Throughput measurements“ is not included in this blog post, as I fear the results are not meaningful in the evaluation of the whole concept.

Conclusion

Tencrypt showed that transparent encryption of Pod-to-Pod traffic with regard to the differentiation of Project-internal and -external Pods in an OpenShift environment is possible.

Security requirements and a possible threat model were gathered. We have seen that an implementation which would run in production infrastructure would need to fulfil the needs for authentication, integrity, confidentiality and possibly availability and authorisation. Even though the malicious access to the Node host and its networking cannot be prevented by Tencrypt, the concept of Tencrypt offers a practical component in Node security regarding future developments in container hardening techniques such as hardware-based isolation of memory.

Further sections gave an overview of the basic conceptual assumptions of Tencrypt and explored multiple implementation alternatives. Additionally, possible future problems of each alternative were taken into consideration. The Minishift development environment was examined in detail: we looked at the network configuration, the modes of Secret management, how the Docker daemon could be interacted with and how the internally deployed Pod image could be patched to include the changes needed to execute Tencrypt in namespaces. Going further, three parts of practical implementation details were covered. Part one highlighted the diverse possibilities in network configuration, showing which parts could be covered with IPtables and policy-based routing and the usage of TUN and TAP interfaces. The differentiation of Project-internal and -external traffic was demonstrated in part two. The section was concluded with part three, which provides ideas regarding the encryption and decryption of packets.

This was followed by a wrap-up of the actual proof of concept implementation written in Go. It includes a DNS proxy, the TUN interface reader-writer, the UDP encapsulation service, encryption and decryption of encapsulated packets and the local use of raw sockets.

As a final step, the implementation was tested regarding throughput performance. Examining the measurement results, we can see that the throughput is very low compared to unpatched connections. It should also be noted that the PoC fails the security requirements, because the tunnel only applies AES encryption of the payload without any guarantees regarding authentication or integrity. These are topics for further development iterations, as the code is intended to be an unoptimised proof of concept. Nevertheless, Tencrypt shows that the patches work as intended and that the proposed concept is a possible candidate for integration into the OpenShift software stack.

Sources & more

In the appendices of my report you can find a list of components of OpenShift and more details about namespaces, Linux cgroups, SELinux and Secure Computing Mode (seccomp).

  • [Hau18] Michael Hausenblas. Container Networking: From Docker to Kubernetes. O’Reilly Media, Apr. 2018
  • [Sho14] Adam Shostack. Threat Modeling: Designing for Security. J. Wiley & Sons, 2014

Existing container security projects and integrations

OKD Secured Routes: When using TLS for traffic coming into an OKD Router, the Router decides how to handle this „secured route“. The „Secured Routes“ feature can be configured in different variants to handle the corresponding route further within the cluster: Edge Termination (the Router serves the certificates), Passthrough Termination (TLS is handled by the Service) or Re-encryption Termination (the Router provides certificates to the external client, but also encrypts the proxied traffic to the Service). If a Route is marked as secure but no Secured Route is configured, internal traffic is routed unencrypted.

IPsec: The OKD project docs reference IPsec to be used to encrypt traffic between Master and Nodes, and for Node-to-Node communications. IPsec is an IETF standard, published as RFCs 4301 and others.

Aporeto: Aporeto provides a set of security features usable with Kubernetes in different Cloud infrastructures, including OpenShift. It implements End2End container encryption with TLS. They also develop „Aporeto integration with Kubernetes Network Policies“, which integrate into the Kubernetes policy catalogues.

Aqua Security: Container Security for Red Hat OpenShift. Enables container-level firewalls, but no encryption techniques.

Istio Auth: Istio uses Envoy service proxies inside Pods. These Envoys can be used for mTLS-encrypted Pod-to-Pod (Service-to-Service) traffic. Uses the Secure Production Identity Framework for Everyone (SPIFFE) authentication framework.

It is also worth taking a look into the OpenShift Security roadmap, which keeps track of the implemented and proposed security features by OpenShift version.

During the project run, related technologies in the context of virtualisation, SDN and encryption were looked into. The following list highlights the most interesting ones, which might also be used (or have to be taken into consideration) in further development.

Single Root I/O Virtualization (SR-IOV): With SR-IOV, PCI Express hardware resources provide an additional layer for virtualisation use cases. Hardware with SR-IOV support offers one so-called Physical Function (PF, full access to the resource) and a configurable number of Virtual Functions (VF, only supporting IO operations). This way, virtual machines (VMs) can share a physical device multiple times, while each VM sees the shared resource as an exclusive part of the virtualised system.

MACsec: MACsec provides multiple security enhancements, noticeably encryption and integrity on layer 2. It extends Ethernet frames by adding a new EtherType and a MACsec tag, inserted after the source MAC address. MACsec is standardised as IEEE 802.1AE.

Cisco Application Centric Infrastructure (ACI): ACI is an SDN-oriented, policy-based infrastructure framework developed by Cisco. It aims at better management of infrastructure and uses common technologies like VXLAN, Equal-cost Multipath (ECMP) routing and SDN controllers.