18 Jan 2024 - Posted by Francesco Lacerenza, Lorenzo Stella
During testing activities, we usually analyze the design choices and context needs in order to suggest remediations applicable to the specific Kubernetes deployment pattern in use.
Scheduling is often overlooked in Kubernetes designs. Typically, various mechanisms take precedence, including, but not limited to, Admission Controllers, Network Policies, and RBAC configurations.
Nevertheless, a compromised pod could allow attackers to move laterally to other tenants running on the same Kubernetes node. Pod-escaping techniques or shared storage systems could be exploitable to achieve cross-tenant access despite the other security measures.
Having a security-oriented scheduling strategy can help reduce the overall risk of workload compromise in a comprehensive security design. If critical workloads are separated at scheduling time, the blast radius of a compromised pod is reduced.
By doing so, lateral movement across a shared node, from low-risk tasks to business-critical workloads, is prevented.
Kubernetes provides multiple mechanisms to achieve isolation-oriented designs like node tainting or affinity. Below, we describe the scheduling mechanisms offered by Kubernetes and highlight how they contribute to actionable risk reduction.
The following methods to apply a scheduling strategy will be discussed: nodeSelector, nodeName, node affinity and anti-affinity, inter-pod affinity and anti-affinity, taints and tolerations, Pod topology spread constraints, and custom schedulers.
As mentioned earlier, isolating tenant workloads from each other helps in reducing the impact of a compromised neighbor. That happens because all pods running on a certain node will belong to a single tenant. Consequently, an attacker capable of escaping from a container will only have access to the containers and the volumes mounted to that node.
Additionally, when multiple applications with different authorization levels share a cluster, privileged pods may end up on the same node as pods mounting PII data or pods with a different security risk level.
nodeSelector is the simplest of these constraints: it operates by simply specifying the target node labels inside the pod specification.
Example Pod Spec
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-pod
spec:
  containers:
  - name: nginx
    image: nginx:latest
  nodeSelector:
    myLabel: myvalue
If multiple labels are specified, they are treated as required (AND logic), hence scheduling will happen only on nodes satisfying all of them.
While it is very useful in low-complexity environments, it could easily become a bottleneck that blocks scheduling if many labels are specified and no node satisfies all of them.
Consequently, it requires good monitoring and dynamic management of the labels assigned to nodes if many constraints need to be applied.
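As a minimal sketch (label keys and values are illustrative), a pod specifying two labels will only be scheduled on nodes carrying both of them:
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-and-example
spec:
  containers:
  - name: nginx
    image: nginx:latest
  nodeSelector:
    # both labels must be present on the node (AND logic)
    myLabel: myvalue
    workloadtype: critical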
If the nodeName field in the Spec is set, the kube-scheduler ignores the Pod and the kubelet on the named node attempts to run it.
In that sense, nodeName overrides other scheduling rules (e.g., nodeSelector, affinity, anti-affinity, etc.) since the scheduling decision is pre-defined.
Example Pod Spec
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:latest
  nodeName: node-critical-workload
Limitations:
- If the named node does not exist, the Pod will not run and may be automatically deleted.
- If the named node does not have the resources to accommodate the Pod, the Pod will fail (e.g., OutOfmemory or OutOfcpu).
- Node names in cloud environments are not always predictable or stable.
Consequently, it requires detailed management of the available nodes and of the resources allocated to each group of workloads, since the scheduling is pre-defined.
Note: De facto, such an approach invalidates all the computational efficiency benefits of the scheduler, and it should only be applied to small, easy-to-manage groups of critical workloads.
The node affinity feature makes it possible to specify scheduling rules for pods based on certain characteristics or labels of nodes. These rules can be used to ensure that pods are scheduled onto nodes meeting specific requirements (affinity rules) or to avoid scheduling pods in specific environments (anti-affinity rules).
Affinity and anti-affinity rules can be set as either “preferred” (soft) or “required” (hard):
If it’s set as preferredDuringSchedulingIgnoredDuringExecution, this indicates a soft rule. The scheduler will try to adhere to this rule but may not always do so, especially if adhering to the rule would make scheduling impossible or challenging.
If it’s set as requiredDuringSchedulingIgnoredDuringExecution, it’s a hard rule. The scheduler will not schedule the pod unless the condition is met. This can lead to a pod remaining unscheduled (pending) if the condition isn’t met.
In particular, anti-affinity rules could be leveraged to protect critical workloads from sharing a node with non-critical ones. By doing so, the lack of computational optimization will not affect the entire node pool, but just the few instances that contain business-critical units.
Example of node affinity
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-example
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: net-segment
            operator: In
            values:
            - segment-x
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workloadtype
            operator: In
            values:
            - p0wload
            - p1wload
  containers:
  - name: node-affinity-example
    image: registry.k8s.io/pause:2.0
Here, the node is preferred to be in a specific network segment (by label) and it is required to have a workloadtype label matching either p0wload or p1wload (a custom strategy).
Multiple operators are available (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators), and NotIn and DoesNotExist are the specific ones usable to obtain node anti-affinity behavior.
From a security standpoint, only hard rules that require the conditions to be respected matter. preferredDuringSchedulingIgnoredDuringExecution should be used for computational configurations that cannot affect the security posture of the cluster.
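As a minimal sketch of node anti-affinity (the workloadtype label and its values are hypothetical), a critical pod could be prevented from landing on nodes reserved for low-trust workloads:
apiVersion: v1
kind: Pod
metadata:
  name: node-anti-affinity-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # hard rule: never schedule on nodes labeled for untrusted workloads
          - key: workloadtype
            operator: NotIn
            values:
            - untrusted
  containers:
  - name: node-anti-affinity-example
    image: registry.k8s.io/pause:2.0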
Inter-pod affinity and anti-affinity constrain which nodes pods can be scheduled on, based on the labels of the pods already running on those nodes.
As specified in Kubernetes documentation:
“Inter-pod affinity and anti-affinity rules take the form “this Pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more Pods that meet rule Y”, where X is a topology domain like node, rack, cloud provider zone or region, or similar and Y is the rule Kubernetes tries to satisfy.”
Example of anti-affinity
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - testdatabase
      topologyKey: kubernetes.io/hostname  # required field; "node" is the topology domain here
In the podAntiAffinity case above, we will never see the pod running on a node where a testdatabase app is already running.
It fits designs where it is desired to schedule some pods together or where the system must ensure that certain pods are never going to be scheduled together.
In particular, the inter-pod rules allow engineers to define additional constraints within the same execution context without creating further segmentation in terms of node groups. Nevertheless, complex affinity rules could create situations with pods stuck in a pending status.
Taints are the opposite of node affinity properties, since they allow a node to repel a set of pods that do not match certain tolerations.
Taints can be applied to a node to make it repel pods unless they explicitly tolerate the taints.
Tolerations are applied to pods. Tolerations allow the scheduler to schedule pods onto nodes with matching taints. It should be highlighted that while tolerations allow scheduling, the decision is not guaranteed.
Each taint on a node also defines an effect: NoExecute (affects running pods), NoSchedule (hard rule), PreferNoSchedule (soft rule).
The approach is ideal for environments where strong isolation of workloads is required. Moreover, it allows the creation of custom node selection rules that are not based solely on labels and, with the NoSchedule effect, leaves no flexibility to the scheduler.
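A minimal sketch, assuming the reserved nodes have already been tainted with workloadtype=critical:NoSchedule (key, value, effect and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: toleration-example
spec:
  tolerations:
  # allows (but does not force) scheduling onto the tainted nodes
  - key: "workloadtype"
    operator: "Equal"
    value: "critical"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # pins the pod to the reserved nodes, since the toleration alone is not enough
          - key: workloadtype
            operator: In
            values:
            - critical
  containers:
  - name: toleration-example
    image: nginx:latest
Since tolerations only permit scheduling onto tainted nodes, pairing them with a required node affinity (or a nodeSelector) is what actually keeps the critical workloads on the reserved pool.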
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
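A minimal sketch, assuming the nodes expose the standard topology.kubernetes.io/zone label (the app label and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: spread-example
  labels:
    app: spread-example
spec:
  topologySpreadConstraints:
  # keep matching pods evenly spread across zones, allowing at most 1 pod of skew
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: spread-example
  containers:
  - name: spread-example
    image: nginx:latest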
Kubernetes by default uses the kube-scheduler, which follows its own set of criteria for scheduling pods. While the default scheduler is versatile and offers a lot of options, there might be specific security requirements that the default scheduler does not know about. Writing a custom scheduler allows an organization to apply risk-based scheduling, avoiding the pairing of privileged pods with pods processing or accessing sensitive data.
To create a custom scheduler, you would typically write a program that:
- watches for Pods whose spec.schedulerName matches the custom scheduler and that have no node assigned;
- applies custom filtering and scoring logic to select a suitable node;
- binds the Pod to the chosen node through the Kubernetes API.
Some examples of custom schedulers that can be adapted for this purpose can be found in the following GitHub repositories: kubernetes-sigs/scheduler-plugins or onuryilmaz/k8s-scheduler-example.
Additionally, a good presentation on crafting your own is Building a Kubernetes Scheduler using Custom Metrics - Mateo Burillo, Sysdig. As mentioned in the talk, this is not for the faint of heart because of the complexity involved, and you might be better off sticking with the default scheduler if you are not already planning to build one.
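For reference, a workload opts into a custom scheduler through spec.schedulerName in its manifest; the scheduler name below is hypothetical:
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  # handled by the custom scheduler instead of the default kube-scheduler
  schedulerName: risk-aware-scheduler
  containers:
  - name: nginx
    image: nginx:latest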
As described, scheduling policies could be used to attract or repel pods onto specific groups of nodes.
As previously stated, a proper strategy reduces the blast radius of a compromised pod. Nevertheless, there are still some aspects to take care of from an attacker's perspective.
In specific cases, the implemented mechanisms could be abused either to land attacker-controlled pods on nodes reserved for critical workloads, or to attract sensitive workloads onto nodes already under the attacker's control.
Bonus Section: Node labels security
Normally, the kubelet will still be able to modify the labels of its own node, potentially allowing a compromised node to tamper with its own labels in order to trick the scheduler as described above.
A security measure can be applied with the NodeRestriction admission plugin. It denies label editing from the kubelet if the label carries the node-restriction.kubernetes.io/ prefix.
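As a sketch, nodes reserved for critical workloads could be labeled under that protected prefix and selected explicitly by critical pods (the label value and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: critical-workload-pod
spec:
  nodeSelector:
    # labels under node-restriction.kubernetes.io/ cannot be set or modified
    # by the kubelet when the NodeRestriction admission plugin is enabled
    node-restriction.kubernetes.io/workloads: critical
  containers:
  - name: app
    image: nginx:latest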
Security-wise, dedicated nodes for each namespace/service would constitute the best setup. However, such a design would not exploit Kubernetes' capability to optimize computations.
Some trade-off choices are possible. The core concept for a successful approach is having a set of reserved nodes for critical namespaces/workloads.
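As a sketch of the node side of such a reserved pool (node name, taint key and values are illustrative), each reserved node would carry both a taint repelling ordinary workloads and a label for the critical pods to select:
apiVersion: v1
kind: Node
metadata:
  name: node-critical-workload
  labels:
    # protected by NodeRestriction, selectable by critical pods
    node-restriction.kubernetes.io/workloads: critical
spec:
  taints:
  # repels every pod that does not explicitly tolerate the taint
  - key: workloadtype
    value: critical
    effect: NoSchedule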
Real world scenarios and complex designs require engineers to plan the fitting mix of mechanisms according to performance requirements and risk tolerance.
This decision starts with defining the workloads’ risk:
Different teams, different trust level
It’s not uncommon for large organizations to have multiple teams deploying to the same cluster. Different teams might have different levels of trustworthiness, training or access. This diversity can introduce varying levels of risks.
Data being processed or stored
Some pods may require mounting customer data or holding persistent secrets to perform their tasks. Sharing the node with less hardened workloads may expose that data to risk.
Exposed network services on the same node
Any pod that exposes a network service increases its attack surface. Pods handling external-facing requests are more exposed and therefore more at risk of compromise.
Pod privileges and capabilities, or their assigned risk
Some workloads may need privileges to work, or may run code that by its own nature processes potentially unsafe content or third-party vendor code. All these factors can contribute to increasing a workload's assigned risk.
Once the set of risks within the environment has been identified, decide the isolation level for teams, data, network traffic and capabilities. Grouping them when they are part of the same process could do the trick.
At that point, the number of workloads in each isolation group can be evaluated and addressed by mixing the scheduling strategies according to the size and complexity of each group.
Note: Simple environments should use simple strategies and avoid mixing too many mechanisms if few isolation groups and constraints are present.