# Granular Resource Limits in Node Autoscalers

## Objective

Node Autoscalers should allow setting more granular resource limits that would
apply to arbitrary subsets of nodes, beyond the existing limiting mechanisms.

## Background

Cluster Autoscaler supports cluster-wide limits on resources (like total CPU and
memory) and per-node-group node count limits. Karpenter supports
setting [resource limits on a NodePool](https://karpenter.sh/docs/concepts/nodepools/#speclimits).
However, as mentioned
in [AWS docs](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html),
Karpenter does not support cluster-wide limits. This is not flexible enough for many
use cases.

Users often need to configure more granular limits. For instance, a user might
want to limit the total resources consumed by nodes of a specific machine
family, nodes with a particular OS, or nodes with specialized hardware like
GPUs. The current resource limits implementations in both node autoscalers do
not support these scenarios.

This proposal introduces a new API to extend the Node Autoscalers’
functionality, allowing limits to be applied to arbitrary sets of nodes.

## Proposal: The CapacityQuota API

We propose a new Kubernetes custom resource, CapacityQuota, to define
resource limits on specific subsets of nodes. Node subsets are targeted using
standard Kubernetes label selectors, offering a flexible way to group nodes.

A node's eligibility for a provisioning operation will be checked against all
CapacityQuota objects that select it. The operation will only be
permitted if it does not violate any of the applicable limits. This should be
compatible with the existing limiting mechanisms, i.e. CAS’ cluster-wide limits
and Karpenter’s NodePool limits. Therefore, if the operation doesn’t violate
any CapacityQuota but violates an existing limiting mechanism, it should still
be rejected.
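
To make the intended semantics concrete, here is a minimal Go sketch of such a
provisioning check. It is illustrative only: the `CapacityQuota` struct, the
`AllowProvisioning` helper, and the way current usage is tracked are
assumptions, not part of the proposed implementation.

```go
// Sketch only: illustrates the intended admission logic against all matching
// CapacityQuota objects. Types and helpers here are hypothetical.
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// CapacityQuota mirrors the proposed CRD in a simplified form.
type CapacityQuota struct {
	Selector *metav1.LabelSelector                          // nil means cluster-wide
	Limits   map[corev1.ResourceName]resource.Quantity      // spec.limits.resources
	Used     map[corev1.ResourceName]resource.Quantity      // current usage of selected nodes
}

// AllowProvisioning returns true if adding a node with the given labels and
// capacity would not violate any CapacityQuota that selects it.
func AllowProvisioning(quotas []CapacityQuota, nodeLabels map[string]string,
	nodeCapacity map[corev1.ResourceName]resource.Quantity) (bool, error) {
	for _, q := range quotas {
		// An absent selector selects all nodes (cluster-wide quota).
		selector := labels.Everything()
		if q.Selector != nil {
			s, err := metav1.LabelSelectorAsSelector(q.Selector)
			if err != nil {
				return false, err
			}
			selector = s
		}
		if !selector.Matches(labels.Set(nodeLabels)) {
			continue
		}
		for name, limit := range q.Limits {
			used := q.Used[name]
			used.Add(nodeCapacity[name])
			if used.Cmp(limit) > 0 {
				return false, nil // this quota would be exceeded
			}
		}
	}
	return true, nil
}
```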

### API Specification

A CapacityQuota object would look as follows:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```

* `selector`: A standard Kubernetes label selector that determines which nodes
  the limits apply to. This allows for fine-grained control based on any label
  present on the nodes, such as zone, region, OS, machine family, or custom
  user-defined labels.
* `limits`: Defines limits on the summed-up resources of the selected nodes.
    * `resources`: a map of resources (e.g. `cpu`, `memory`) that should be
      limited. This map could be put directly into `limits`, but we put it here
      instead for the sake of extensibility. For instance, if we were to
      support DRA limits via this API, we would probably define them in a
      separate field under `limits`.

This approach is highly flexible – adding a new dimension for limits only
requires ensuring the nodes are labeled appropriately, with no code changes
needed in the autoscaler.
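
For illustration, the API shape above could translate into Go types roughly as
follows. This is a sketch, not the proposed implementation; the type names,
JSON tags, and package layout are assumptions.

```go
// Sketch of possible Go API types for the CapacityQuota CRD. Field names
// follow the YAML example above; everything else is an assumption.
package v1beta1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type CapacityQuota struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CapacityQuotaSpec   `json:"spec"`
	Status CapacityQuotaStatus `json:"status,omitempty"`
}

type CapacityQuotaSpec struct {
	// Selector picks the nodes the limits apply to. A nil selector means
	// the quota applies cluster-wide.
	Selector *metav1.LabelSelector `json:"selector,omitempty"`
	// Limits holds the resource limits for the selected nodes.
	Limits CapacityQuotaLimits `json:"limits"`
}

type CapacityQuotaLimits struct {
	// Resources maps resource names (e.g. cpu, memory, nodes) to the maximum
	// summed-up quantity. Kept as a nested field so that future limit kinds
	// (e.g. DRA devices) can be added next to it.
	Resources corev1.ResourceList `json:"resources,omitempty"`
}

type CapacityQuotaStatus struct {
	// Used reports the current consumption of the selected nodes.
	Used CapacityQuotaLimits `json:"used,omitempty"`
}
```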

### Node as a Resource

The CapacityQuota API can be naturally extended to treat the number
of nodes itself as a limitable resource, as shown in one of the examples below.

### CapacityQuota Status

For better observability, the CapacityQuota resource could be
enhanced with a status field. This field, updated by a controller, would display
the current resource usage for the selected nodes, allowing users to quickly
check usage against the defined limits via `kubectl describe`. The controller can
run in a separate thread as a part of the node autoscaler component.

An example of the status field:

```yaml
status:
  used:
    resources:
      cpu: 32
      memory: 128Gi
      nodes: 50
```
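
As a rough sketch (not part of the proposal), such a controller could compute
the `used` field by summing the capacity of the selected nodes. The
`ComputeUsed` helper and the choice of node capacity (rather than allocatable)
are assumptions:

```go
// Sketch only: aggregates the usage of nodes selected by a quota. The
// surrounding controller wiring (listers, status updates) is omitted.
package quota

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/labels"
)

// ComputeUsed sums the reported capacity of all nodes matching the selector
// and also counts the nodes themselves under the synthetic "nodes" resource.
func ComputeUsed(nodes []*corev1.Node, selector labels.Selector) corev1.ResourceList {
	used := corev1.ResourceList{}
	matched := int64(0)
	for _, node := range nodes {
		if !selector.Matches(labels.Set(node.Labels)) {
			continue
		}
		matched++
		for name, quantity := range node.Status.Capacity {
			sum := used[name]
			sum.Add(quantity)
			used[name] = sum
		}
	}
	// Expose the node count itself as a resource, as proposed above.
	used["nodes"] = *resource.NewQuantity(matched, resource.DecimalSI)
	return used
}
```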

## Alternatives considered

### Minimum limits support

The initial design, besides the maximum limits, also included minimum limits.
Minimum limits were supposed to affect the node consolidation in the node
autoscalers. A consolidation would be allowed only if removing the node wouldn’t
violate any minimum limits. Cluster-wide minimum limits are implemented in CAS
together with the maximum limits, so at first, it seemed logical to include both
limit directions in the design.

Despite being conceptually similar, minimum and maximum limits cover completely
different use cases. Maximum limits can be used to control the cloud provider
costs, to limit scaling certain types of compute, or to control distribution of
compute resources between teams working on the same cluster. Minimum limits’
main use case is ensuring a baseline capacity for users’ workloads, for example
to handle sudden spikes in traffic. However, minimum limits defined as a minimum
amount of resources in the cluster or a subset of nodes do not guarantee that
the workloads will be schedulable on those resources. For example, two nodes
with 2 CPUs each satisfy a minimum limit of 4 CPUs, but if a user created a
workload requesting 2 CPUs, that workload would not fit on either of the
existing nodes (part of each node’s capacity is already taken by system
reservations and other pods), making the baseline capacity effectively useless.
This scenario will be better handled by
the [CapacityBuffer API](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md),
which allows the user to provide an exact shape of their workloads, including
the resource requests. In our example, the user would create a CapacityBuffer
with a pod template requesting 2 CPUs. Such a CapacityBuffer would ensure that a
pod with that shape is always schedulable on the existing nodes.

Therefore, we decided to remove minimum limits from the design of granular
limits, as CapacityBuffers are a better way to provide a baseline capacity for
user workloads.

### Kubernetes LimitRange and ResourceQuota

It has been discussed whether the same result could be accomplished by using the
standard Kubernetes
resources: [LimitRange](https://kubernetes.io/docs/concepts/policy/limit-range/)
and [ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

LimitRange is a resource used to configure minimum and maximum resource
constraints for a namespace. For example, it can define the default CPU and
memory requests for pods and containers within a namespace, or enforce a minimum
and maximum CPU request for a pod. However, its scope is limited to individual
objects: it doesn’t look at all pods in the namespace in aggregate, but only
checks whether each pod’s requests and limits are within the defined bounds.

ResourceQuota allows defining and limiting the aggregate resource consumption per
namespace. This includes limiting the total CPU, memory, and storage that all
pods and persistent volume claims within a namespace can request or consume. It
also supports limiting the count of various Kubernetes objects, such as pods,
services, and replication controllers. While resource quotas can be used to
limit the resources provisioned by the CA to some degree, it’s not possible to
guarantee that CA won’t scale up above the defined limit. Since the quotas
operate on pod requests, and CA does not guarantee that bin packing will yield
the optimal result, setting the quota to e.g. 64 CPUs does not mean that CA will
stop scaling at 64 CPUs.

Moreover, both of those resources are namespaced, so their scope is limited to
the namespace in which they are defined, while the nodes are global. We can’t
use namespaced resources to limit the creation and deletion of global resources.

### Soft and hard limits

We have discussed a possibility of distinguishing soft and hard limits. That
idea was initially presented for Karpenter node limits in
https://github.com/kubernetes-sigs/karpenter/pull/2525. The currently existing
limits in Karpenter behave like soft limits -- they are best effort, meaning
that they can be exceeded, for example due to race conditions. They are
respected only during provisioning operations, and can be exceeded during node
consolidation. The proposed hard limits would not be exceeded during node
consolidation either. In Cluster Autoscaler, there is currently no viable use
case for such a distinction. Moreover, enforcing hard limits could
be complex to achieve due to concurrency concerns. Because of that, and for the
sake of simplicity of this API, we have decided to support only one type of
limits. The behavior of the limits should be in line with the current
implementation of the limits in the Node Autoscalers.

## User Stories

### Story 1

As a cluster administrator, I want to configure cluster-wide resource limits to
avoid excessive cloud provider costs.

**Note:** This is already supported in CAS, but not in Karpenter.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: cluster-wide-limits
spec:
  limits:
    resources:
      cpu: 128
      memory: 256Gi
```

### Story 2

As a cluster administrator, I want to configure separate resource limits for
specific groups of nodes on top of cluster-wide limits, to avoid a situation
where one group of nodes starves others of resources.

**Note:** A specific group of nodes can be either a NodePool in Karpenter, a
ComputeClass in GKE, or simply a set of nodes grouped by a user-defined label.
This can be useful e.g. for organizations where multiple teams are running
workloads in a shared cluster, and these teams have separate sets of nodes. This
way, a cluster administrator can ensure that each team has an appropriate limit
for its resources and doesn’t starve other teams. This story is partly
supported by Karpenter’s NodePool limits.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: team-a-limits
spec:
  selector:
    matchLabels:
      team: a
  limits:
    resources:
      cpu: 32
```

### Story 3

As a cluster administrator, I want to allow scaling up machines that are more
expensive or less suitable for my workloads when better machines are
unavailable, but I want to limit how many of them can be created, so that I can
control extra cloud provider costs, or limit the impact of using non-optimal
machines for my workloads.

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-e2-resources
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 32
      memory: 64Gi
```

### Story 4

As a cluster administrator, I want to limit the number of nodes in a specific
zone if my cluster is unbalanced for any reason, so that I can avoid exhausting
IP space in that zone, or enforce better balancing across zones.

**Note:** Originally requested
in [https://github.com/kubernetes/autoscaler/issues/6940](https://github.com/kubernetes/autoscaler/issues/6940).

Example CapacityQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-nodes-us-central1-b
spec:
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-central1-b
  limits:
    resources:
      nodes: 64
```

### Story 5 (obsolete)

As a cluster administrator, I want to ensure there is always a baseline capacity
in my cluster or specific parts of my cluster below which the node autoscaler
won’t consolidate the nodes, so that my workloads can quickly react to sudden
spikes in traffic.

This user story is obsolete. The CapacityBuffer API covers this use case in a more
flexible way.

## Other CapacityQuota examples

The following examples illustrate the flexibility of the proposed API and
demonstrate other possible use cases not described in the user stories.

#### **Maximum Windows Nodes**

Limit the total number of nodes running the Windows operating system to 8.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-windows-nodes
spec:
  selector:
    matchLabels:
      kubernetes.io/os: windows
  limits:
    resources:
      nodes: 8
```

#### **Maximum NVIDIA T4 GPUs**

Limit the total number of NVIDIA T4 GPUs in the cluster to 16.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: max-t4-gpus
spec:
  selector:
    matchLabels:
      example.cloud.com/gpu-type: nvidia-t4
  limits:
    resources:
      nvidia.com/gpu: 16
```

#### **Cluster-wide Limits Excluding Control Plane Nodes**

Apply cluster-wide CPU and memory limits while excluding nodes with the
control-plane role.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityQuota
metadata:
  name: cluster-limits-no-control-plane
spec:
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist
  limits:
    resources:
      cpu: 64
      memory: 128Gi
```
