Merge pull request #450 from pohly/storage-capacity
storage capacity producer
k8s-ci-robot authored Aug 18, 2020
2 parents cb437bf + e50daf3 commit e909258
Showing 50 changed files with 7,661 additions and 34 deletions.
120 changes: 119 additions & 1 deletion README.md
@@ -25,6 +25,7 @@ Following table reflects the head of this branch.
| -------------- | ------- | ------- | --------------------------------------------------------------------------------------------- | --------------------------------- |
| Snapshots | Beta | On | [Snapshots and Restore](https://kubernetes-csi.github.io/docs/snapshot-restore-feature.html). | No |
| CSIMigration | Beta | On | [Migrating in-tree volume plugins to CSI](https://kubernetes.io/docs/concepts/storage/volumes/#csi-migration). | No |
| CSIStorageCapacity | Alpha | Off | Publish [capacity information](https://kubernetes.io/docs/concepts/storage/volumes/#storage-capacity) for the Kubernetes scheduler. | No |

All other external-provisioner features and the external-provisioner itself are considered GA and fully supported.

@@ -61,14 +62,28 @@ Note that the external-provisioner does not scale with more replicas. Only one e

* `--kube-api-burst <num>`: Burst for clients that communicate with the kubernetes apiserver. Defaults to `10`.

* `--cloning-protection-threads <num>`: Number of simultaniously running threads, handling cloning finalizer removal. Defaults to `1`.
* `--cloning-protection-threads <num>`: Number of simultaneously running threads, handling cloning finalizer removal. Defaults to `1`.

* `--metrics-address`: The TCP network address where the prometheus metrics endpoint will run (example: `:8080` which corresponds to port 8080 on local host). The default is an empty string, which means the metrics endpoint is disabled.

* `--metrics-path`: The HTTP path where prometheus metrics will be exposed. Default is `/metrics`.

* `--extra-create-metadata`: Enables the injection of extra PVC and PV metadata as parameters when calling `CreateVolume` on the driver (keys: "csi.storage.k8s.io/pvc/name", "csi.storage.k8s.io/pvc/namespace", "csi.storage.k8s.io/pv/name")

##### Storage capacity arguments

See the [storage capacity section](#capacity-support) below for details.

* `--capacity-threads <num>`: Number of simultaneously running threads, handling CSIStorageCapacity objects. Defaults to `1`.

* `--capacity-poll-interval <interval>`: How long the external-provisioner waits before checking for storage capacity changes. Defaults to `1m`.

* `--capacity-controller-deployment-mode central|none`: Enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. 'central' is currently the only supported mode. Use it when there is just one active provisioner in the cluster. Defaults to `none`.

* `--capacity-ownerref-level <levels>`: The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc. Defaults to `1` (= StatefulSet).

* `--capacity-for-immediate-binding <bool>`: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, but may be useful for other consumers or for debugging. Defaults to `false`.
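
The `central|none` choice of `--capacity-controller-deployment-mode` can be modeled as a small type that validates its own input. The following is a minimal sketch, not the project's implementation: the `DeploymentMode` type and the `parseMode` helper are hypothetical stand-ins, and the standard library `flag` package is used here instead of the `pflag` package that the provisioner imports.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// DeploymentMode is a hypothetical stand-in for the mode type used by the
// provisioner. Only the two values named in the README are accepted.
type DeploymentMode string

const (
	DeploymentModeNone    DeploymentMode = "none"
	DeploymentModeCentral DeploymentMode = "central"
)

// String returns the current value; it is required by flag.Value.
func (m *DeploymentMode) String() string {
	if m == nil {
		return ""
	}
	return string(*m)
}

// Set validates and stores a command line value; it is required by flag.Value.
func (m *DeploymentMode) Set(value string) error {
	switch DeploymentMode(value) {
	case DeploymentModeNone, DeploymentModeCentral:
		*m = DeploymentMode(value)
		return nil
	default:
		return fmt.Errorf("invalid deployment mode %q, must be central or none", value)
	}
}

// parseMode parses args with a private FlagSet so that the sketch does not
// touch the global flag state. The default is "none".
func parseMode(args []string) (DeploymentMode, error) {
	mode := DeploymentModeNone
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage output out of the example
	fs.Var(&mode, "capacity-controller-deployment-mode", "central or none")
	err := fs.Parse(args)
	return mode, err
}

func main() {
	mode, err := parseMode([]string{"--capacity-controller-deployment-mode=central"})
	fmt.Println(mode, err)
}
```

Rejecting unknown values in `Set` means a typo in the mode fails at startup instead of silently leaving capacity publishing disabled.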

#### Other recognized arguments
* `--feature-gates <gates>`: A set of comma separated `<feature-name>=<true|false>` pairs that describe feature gates for alpha/experimental features. See [list of features](#feature-status) or `--help` output for list of recognized features. Example: `--feature-gates Topology=true` to enable Topology feature that's disabled by default.

@@ -102,6 +117,109 @@ Yes | No | Yes | `Requisite` = Allowed topologies<br>`Preferred` = `Requisite` w
No | Irrelevant | No | `Requisite` = Aggregated cluster topology<br>`Preferred` = `Requisite` with randomly selected node topology as first element
No | Irrelevant | Yes | `Requisite` = Allowed topologies<br>`Preferred` = `Requisite` with randomly selected node topology as first element

### Capacity support

> :warning: *Warning:* This is an alpha feature and only supported by
> Kubernetes >= 1.19 if the `CSIStorageCapacity` feature gate is
> enabled.

The external-provisioner can be used to create CSIStorageCapacity
objects that hold information about the storage capacity available
through the driver. The Kubernetes scheduler then [uses that
information](https://kubernetes.io/docs/concepts/storage/storage-capacity)
when selecting nodes for pods with unbound volumes that wait for the
first consumer.

Currently, all CSIStorageCapacity objects created by an instance of
the external-provisioner must have the same
[owner](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents). That
owner is how external-provisioner distinguishes between objects that
it must manage and those that it must leave alone. The owner is
determined with the `POD_NAME` and `POD_NAMESPACE` environment
variables and the `--capacity-ownerref-level` parameter. Other
solutions will be added in the future.
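
The level-based owner lookup can be illustrated without a cluster. The sketch below is an assumption-laden model, not the code in `pkg/owner`: the `object` type and the `lookupOwner` function are hypothetical, and each object is assumed to have at most one controlling owner.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a hypothetical, simplified stand-in for a Kubernetes object.
// It records only its kind, its name and at most one controlling owner.
type object struct {
	Kind  string
	Name  string
	Owner *object
}

// lookupOwner walks `level` owner references upwards, starting at start.
// Level 0 returns start itself.
func lookupOwner(start *object, level int) (*object, error) {
	if start == nil {
		return nil, errors.New("no starting object")
	}
	if level < 0 {
		return nil, errors.New("level must not be negative")
	}
	current := start
	for i := 0; i < level; i++ {
		if current.Owner == nil {
			return nil, fmt.Errorf("%s %q has no owner, cannot reach level %d",
				current.Kind, current.Name, level)
		}
		current = current.Owner
	}
	return current, nil
}

func main() {
	// A pod created by a Deployment: Pod -> ReplicaSet -> Deployment.
	deployment := &object{Kind: "Deployment", Name: "csi-provisioner"}
	replicaSet := &object{Kind: "ReplicaSet", Name: "csi-provisioner-5d4f", Owner: deployment}
	pod := &object{Kind: "Pod", Name: "csi-provisioner-5d4f-abcde", Owner: replicaSet}

	for level := 0; level <= 3; level++ {
		owner, err := lookupOwner(pod, level)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("level %d: %s %s\n", level, owner.Kind, owner.Name)
	}
}
```

In this model a pod managed by a Deployment needs level 2, because a ReplicaSet sits between the pod and the Deployment.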

To enable this feature in a driver deployment (see also the
[`deploy/kubernetes/storage-capacity.yaml`](deploy/kubernetes/storage-capacity.yaml)
example):

- Set the `POD_NAME` and `POD_NAMESPACE` environment variables like this:
```yaml
env:
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
```
- Add `--capacity-controller-deployment-mode=central` to the command line flags.
- Add `storageCapacity: true` to the spec of the CSIDriver information
  object. Without it, external-provisioner will publish information, but
  the Kubernetes scheduler will ignore it. This makes it possible to
  first deploy the driver without that field and then, once sufficient
  information has been published, enable the scheduler's use of it.
- If external-provisioner is not deployed with a StatefulSet, then
configure with `--capacity-ownerref-level` which object is meant to own
CSIStorageCapacity objects.
- Optional: configure how often external-provisioner polls the driver
to detect changed capacity with `--capacity-poll-interval`.
- Optional: configure how many worker threads are used in parallel
with `--capacity-threads`.
- Optional: enable producing information also for storage classes that
  use immediate volume binding with
  `--capacity-for-immediate-binding`. This is usually not needed
  because such volumes are created by the driver without involving the
  Kubernetes scheduler and thus the published information would just
  be ignored.

To determine how many different topology segments exist,
external-provisioner uses the topology keys and values that the CSI
driver instance on each node reports to kubelet in the
`NodeGetInfoResponse.accessible_topology` field. The keys are stored
by kubelet in the CSINode objects and the actual values in Node
labels.

A CSI driver must report topology information that matches the storage
pool(s) that it has access to, with a granularity that matches the most
restrictive pool.

For example, if the driver runs in a node with region/rack topology
and has access to per-region storage as well as per-rack storage, then
the driver should report topology with region/rack as its keys. If it
only has access to per-region storage, then it should just use region
as key. If it uses region/rack, then redundant CSIStorageCapacity
objects will be published, but the information is still correct. See
the
[KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1472-storage-capacity-tracking#with-central-controller)
for details.
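
To make the region/rack example concrete, the sketch below counts distinct topology segments for a given set of keys. It is an illustration only: the `segments` function, the key names `example.com/region` and `example.com/rack`, and the node data are all hypothetical, and the real controller works on CSINode and Node objects rather than plain maps.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// segments returns the distinct topology segments found on the given nodes.
// nodeTopology maps a node name to the topology values reported for it and
// keys lists the topology keys of the driver. Nodes that lack a value for
// any of the keys are skipped.
func segments(nodeTopology map[string]map[string]string, keys []string) []string {
	seen := map[string]bool{}
	for _, values := range nodeTopology {
		parts := make([]string, 0, len(keys))
		complete := true
		for _, key := range keys {
			value, ok := values[key]
			if !ok {
				complete = false
				break
			}
			parts = append(parts, key+"="+value)
		}
		if complete {
			seen[strings.Join(parts, ",")] = true
		}
	}
	result := make([]string, 0, len(seen))
	for segment := range seen {
		result = append(result, segment)
	}
	sort.Strings(result)
	return result
}

func main() {
	nodes := map[string]map[string]string{
		"node-1": {"example.com/region": "east", "example.com/rack": "r1"},
		"node-2": {"example.com/region": "east", "example.com/rack": "r2"},
		"node-3": {"example.com/region": "west", "example.com/rack": "r1"},
	}
	// Region and rack as keys: one segment per region/rack combination.
	fmt.Println(segments(nodes, []string{"example.com/region", "example.com/rack"}))
	// Region only: fewer segments, hence fewer capacity objects.
	fmt.Println(segments(nodes, []string{"example.com/region"}))
}
```

With both keys the three nodes form three segments, with the region key alone only two. This is why a driver that only has per-region storage causes redundant objects when it also reports the rack.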

For each segment and each storage class, CSI `GetCapacity` is called
once with the topology of the segment and the parameters of the
class. If there is no error and the capacity is non-zero, a
CSIStorageCapacity object is created or updated (if it
already exists from a prior call) with that information. Obsolete
objects are removed.
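
The reconciliation described above can be sketched with an in-memory map in place of the API server. Everything here is hypothetical (`key`, `getCapacityFunc`, `syncCapacity`); in particular, leaving an existing entry untouched when the call fails or returns zero is an assumption of this sketch, not a statement about the controller.

```go
package main

import "fmt"

// key identifies one published capacity entry.
type key struct {
	Segment      string
	StorageClass string
}

// getCapacityFunc is a hypothetical stand-in for the CSI GetCapacity call.
type getCapacityFunc func(segment, storageClass string) (int64, error)

// syncCapacity updates published in place: it calls getCapacity once per
// segment and storage class, stores non-zero results, and removes entries
// whose segment or storage class no longer exists.
func syncCapacity(published map[key]int64, segments, classes []string, getCapacity getCapacityFunc) {
	wanted := map[key]bool{}
	for _, segment := range segments {
		for _, class := range classes {
			k := key{Segment: segment, StorageClass: class}
			wanted[k] = true
			capacity, err := getCapacity(segment, class)
			if err != nil || capacity == 0 {
				// Assumption of this sketch: nothing new is published
				// and an existing entry is left as it is.
				continue
			}
			published[k] = capacity // create or update
		}
	}
	for k := range published {
		if !wanted[k] {
			delete(published, k) // obsolete entry
		}
	}
}

func main() {
	published := map[key]int64{
		{Segment: "rack-old", StorageClass: "fast"}: 5,
	}
	fake := func(segment, class string) (int64, error) {
		if segment == "rack-b" {
			return 0, nil // no capacity left
		}
		return 100, nil
	}
	syncCapacity(published, []string{"rack-a", "rack-b"}, []string{"fast"}, fake)
	fmt.Println(published)
}
```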

To ensure that CSIStorageCapacity objects get removed when the
external-provisioner gets removed from the cluster, they all have an
owner and therefore get garbage-collected when that owner
disappears. The owner is not the external-provisioner pod itself but
rather one of its parents as specified by `--capacity-ownerref-level`.
This way, it is possible to switch between external-provisioner
instances without losing the already gathered information.

CSIStorageCapacity objects are namespaced and get created in the
namespace of the external-provisioner. Only CSIStorageCapacity objects
with the right owner are modified by external-provisioner and their
name is generated, so it is possible to deploy different drivers in
the same namespace. However, Kubernetes does not check who is creating
CSIStorageCapacity objects, so in theory a malfunctioning or malicious
driver deployment could also publish incorrect information about some
other driver.

### CSI error and timeout handling
The external-provisioner invokes all gRPC calls to the CSI driver with the timeout provided by the `--timeout` command line argument (10 seconds by default).

75 changes: 74 additions & 1 deletion cmd/csi-provisioner/csi-provisioner.go
@@ -26,7 +26,9 @@ import (
"strings"
"time"

"github.com/container-storage-interface/spec/lib/go/csi"
flag "github.com/spf13/pflag"
"k8s.io/apimachinery/pkg/runtime/schema"
utilfeature "k8s.io/apiserver/pkg/util/feature"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
@@ -43,7 +45,10 @@ import (
"github.com/kubernetes-csi/csi-lib-utils/deprecatedflags"
"github.com/kubernetes-csi/csi-lib-utils/leaderelection"
"github.com/kubernetes-csi/csi-lib-utils/metrics"
"github.com/kubernetes-csi/external-provisioner/pkg/capacity"
"github.com/kubernetes-csi/external-provisioner/pkg/capacity/topology"
ctrl "github.com/kubernetes-csi/external-provisioner/pkg/controller"
"github.com/kubernetes-csi/external-provisioner/pkg/owner"
snapclientset "github.com/kubernetes-csi/external-snapshotter/v2/pkg/client/clientset/versioned"
)

@@ -58,7 +63,8 @@ var (
retryIntervalStart = flag.Duration("retry-interval-start", time.Second, "Initial retry interval of failed provisioning or deletion. It doubles with each failure, up to retry-interval-max.")
retryIntervalMax = flag.Duration("retry-interval-max", 5*time.Minute, "Maximum retry interval of failed provisioning or deletion.")
workerThreads = flag.Uint("worker-threads", 100, "Number of provisioner worker threads, in other words nr. of simultaneous CSI calls.")
finalizerThreads = flag.Uint("cloning-protection-threads", 1, "Number of simultaniously running threads, handling cloning finalizer removal")
finalizerThreads = flag.Uint("cloning-protection-threads", 1, "Number of simultaneously running threads, handling cloning finalizer removal")
capacityThreads = flag.Uint("capacity-threads", 1, "Number of simultaneously running threads, handling CSIStorageCapacity objects")
operationTimeout = flag.Duration("timeout", 10*time.Second, "Timeout for waiting for creation or deletion of a volume")
_ = deprecatedflags.Add("provisioner")

@@ -76,6 +82,15 @@ var (
kubeAPIQPS = flag.Float32("kube-api-qps", 5, "QPS to use while communicating with the kubernetes apiserver. Defaults to 5.0.")
kubeAPIBurst = flag.Int("kube-api-burst", 10, "Burst to use while communicating with the kubernetes apiserver. Defaults to 10.")

	capacityMode = func() *capacity.DeploymentMode {
		mode := capacity.DeploymentModeNone
		flag.Var(&mode, "capacity-controller-deployment-mode", "Enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. 'central' is currently the only supported mode. Use it when there is just one active provisioner in the cluster.")
		return &mode
	}()
capacityImmediateBinding = flag.Bool("capacity-for-immediate-binding", false, "Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging.")
capacityPollInterval = flag.Duration("capacity-poll-interval", time.Minute, "How long the external-provisioner waits before checking for storage capacity changes.")
capacityOwnerrefLevel = flag.Int("capacity-ownerref-level", 1, "The level indicates the number of objects that need to be traversed starting from the pod identified by the POD_NAME and POD_NAMESPACE environment variables to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet, 2 for a Deployment, etc.")

featureGates map[string]bool
provisionController *controller.ProvisionController
version = "unknown"
@@ -181,6 +196,7 @@ func main() {
identity := strconv.FormatInt(timeStamp, 10) + "-" + strconv.Itoa(rand.Intn(10000)) + "-" + provisionerName

factory := informers.NewSharedInformerFactory(clientset, ctrl.ResyncPeriodOfCsiNodeInformer)
var factoryForNamespace informers.SharedInformerFactory // usually nil, only used for CSIStorageCapacity

// -------------------------------
// Listers
@@ -266,15 +282,72 @@ func main() {
controllerCapabilities,
)

	var capacityController *capacity.Controller
	if *capacityMode == capacity.DeploymentModeCentral {
		podName := os.Getenv("POD_NAME")
		namespace := os.Getenv("POD_NAMESPACE")
		if podName == "" || namespace == "" {
			klog.Fatalf("need POD_NAMESPACE/POD_NAME env variables, have only POD_NAMESPACE=%q and POD_NAME=%q", namespace, podName)
		}
		controller, err := owner.Lookup(config, namespace, podName,
			schema.GroupVersionKind{
				Group:   "",
				Version: "v1",
				Kind:    "Pod",
			}, *capacityOwnerrefLevel)
		if err != nil {
			klog.Fatalf("look up owner(s) of pod: %v", err)
		}
		klog.Infof("using %s/%s %s as owner of CSIStorageCapacity objects", controller.APIVersion, controller.Kind, controller.Name)

		topologyInformer := topology.NewNodeTopology(
			provisionerName,
			clientset,
			factory.Core().V1().Nodes(),
			factory.Storage().V1().CSINodes(),
			workqueue.NewNamedRateLimitingQueue(rateLimiter, "csitopology"),
		)

		// We only need objects from our own namespace. The normal factory would give
		// us an informer for the entire cluster.
		factoryForNamespace = informers.NewSharedInformerFactoryWithOptions(clientset,
			ctrl.ResyncPeriodOfCsiNodeInformer,
			informers.WithNamespace(namespace),
		)

		capacityController = capacity.NewCentralCapacityController(
			csi.NewControllerClient(grpcClient),
			provisionerName,
			clientset,
			// TODO: metrics for the queue?!
			workqueue.NewNamedRateLimitingQueue(rateLimiter, "csistoragecapacity"),
			*controller,
			namespace,
			topologyInformer,
			factory.Storage().V1().StorageClasses(),
			factoryForNamespace.Storage().V1alpha1().CSIStorageCapacities(),
			*capacityPollInterval,
			*capacityImmediateBinding,
		)
	}

	run := func(ctx context.Context) {
		factory.Start(ctx.Done())
		if factoryForNamespace != nil {
			// Starting is enough, the capacity controller will
			// wait for sync.
			factoryForNamespace.Start(ctx.Done())
		}
		cacheSyncResult := factory.WaitForCacheSync(ctx.Done())
		for _, v := range cacheSyncResult {
			if !v {
				klog.Fatalf("Failed to sync Informers!")
			}
		}

		if capacityController != nil {
			go capacityController.Run(ctx, int(*capacityThreads))
		}
		if csiClaimController != nil {
			go csiClaimController.Run(ctx, int(*finalizerThreads))
		}
15 changes: 15 additions & 0 deletions deploy/kubernetes/rbac.yaml
@@ -87,6 +87,21 @@ rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "watch", "list", "delete", "update", "create"]
  # Permissions for CSIStorageCapacity are only needed when enabling the
  # publishing of storage capacity information.
  - apiGroups: ["storage.k8s.io"]
    resources: ["csistoragecapacities"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # The GET permissions below are needed for walking up the ownership chain
  # for CSIStorageCapacity. They are sufficient for deployment via
  # StatefulSet (only needs to get Pod) and Deployment (needs to get
  # Pod and then ReplicaSet to find the Deployment).
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get"]

---
kind: RoleBinding
57 changes: 57 additions & 0 deletions deploy/kubernetes/storage-capacity.yaml
@@ -0,0 +1,57 @@
# This YAML file demonstrates how to enable the
# storage capacity feature when deploying the
# external provisioner, in this example together
# with the mock CSI driver.
#
# It depends on the RBAC definitions from rbac.yaml.
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: csi-provisioner
spec:
  replicas: 3
  selector:
    matchLabels:
      app: csi-provisioner
  template:
    metadata:
      labels:
        app: csi-provisioner
    spec:
      serviceAccount: csi-provisioner
      containers:
        - name: csi-provisioner
          image: k8s.gcr.io/sig-storage/csi-provisioner:v2.0.0
          args:
            - "--csi-address=$(ADDRESS)"
            - "--leader-election"
            - "--capacity-controller-deployment-mode=central"
            - "--capacity-ownerref-level=2"
          env:
            - name: ADDRESS
              value: /var/lib/csi/sockets/pluginproxy/mock.socket
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: socket-dir
              mountPath: /var/lib/csi/sockets/pluginproxy/

        - name: mock-driver
          image: quay.io/k8scsi/mock-driver:canary
          env:
            - name: CSI_ENDPOINT
              value: /var/lib/csi/sockets/pluginproxy/mock.socket
          volumeMounts:
            - name: socket-dir
              mountPath: /var/lib/csi/sockets/pluginproxy/
      volumes:
        - name: socket-dir
          emptyDir:
1 change: 1 addition & 0 deletions go.mod
@@ -18,6 +18,7 @@ require (
k8s.io/csi-translation-lib v0.19.0-rc.2
k8s.io/klog v1.0.0
k8s.io/kubernetes v1.19.0-rc.2
sigs.k8s.io/controller-runtime v0.6.2
sigs.k8s.io/sig-storage-lib-external-provisioner/v6 v6.1.0-rc1
)
