Kubernetes Cloud/Sterling

[Image: The physical servers running the cluster in the Europa Data Center.]

Sterling is RENCI's on-premise Kubernetes cluster, which anyone can use to run containerized applications at no cost.

The Sterling cluster is named after another North Carolina mountain: Mount Sterling.

If you are an existing Sterling user who needs support, see Kubernetes Cloud/Things you can request.

Overview

  • Nodes: 14
  • CPU cores per node: 96
  • RAM per node: 1.5TB
  • GPUs: 4x Nvidia A100
  • Storage medium: NFS via NetApp Trident or NVMe drives.
  • Inter-node networking: 1.22 GB/sec bandwidth, 26.4 µs latency

Value Proposition

Kubernetes' value proposition includes:

  • More efficient use of resources than VMs
  • Self-healing (pods automatically restart on failure)
  • Built-in load balancing between multiple replicas of an application
  • Easy zero-downtime upgrades
  • A vendor-neutral, cloud-agnostic API that can run essentially any application, allowing you to move your app to other clusters with minimal changes

Some of the unique features in Sterling specifically include:

  • Security hardened using the Kubernetes CIS Benchmarks
  • Updated regularly
  • Professionally supported by ACIS
  • Centralized logging/monitoring/alerting for applications
  • Automatic DNS/TLS setup (no need to file tickets)
  • Full multi-tenancy protections (tighter RBAC, storage/CPU/RAM limits and quotas, no privileged pods)
  • Support for Persistent Volumes with encryption at rest, snapshotting, and off-site backups
  • Support for hardware like GPUs and NVMe drives
  • Support for running high-memory pods
  • And many more!

Resilience

Since Sterling's creation, applications hosted in Sterling have had approximately 99.994% uptime, excluding power outages and planned maintenance. That is 78 minutes of downtime over almost 1.5 years. See Kubernetes Cloud/Postmortems for a full analysis of all past outages.

Sterling and its dependent services are all connected to the Europa generator, meaning it can survive most power outages. For multi-hour outages on hot days, data center temperatures may require powering down Sterling.

Access

[Screenshot: The Dex login page.]

To access the cluster, email help@renci.org to request a kubeconfig file for Sterling. Sterling's kubeconfig files do not contain secret keys or passwords; instead they just describe how your client should authenticate to Sterling using your RENCI username/password.

As part of the onboarding process, you should expect to receive a copy of the Service Level Agreement (SLA) for on-premise Kubernetes at RENCI. This digital document will come in the form of a DocuSign email which will ask you to sign it to acknowledge you are familiar with the general guidelines of the service. After the SLA is signed, you will receive your kubeconfig file. To access the cluster, follow these steps to set up your client tools:

  1. Ensure you are either connected to the RENCI VPN or the RENCI WiFi. Sterling can only be accessed from RENCI IP addresses.
  2. Ensure you have kubectl installed: https://kubernetes.io/docs/tasks/tools/#kubectl
  3. Install krew (a kubectl package manager): https://krew.sigs.k8s.io/docs/user-guide/setup/install/
  4. Install oidc-login with krew: kubectl krew install oidc-login
  5. When you receive your kubeconfig file from ACIS, save it to the location ~/.kube/config
    1. If Sterling is the first and only cluster you need to access, you can save the Sterling kubeconfig as ~/.kube/config, replacing any existing file. You may have to create the .kube folder if it doesn't exist.
    2. If you already have a ~/.kube/config file for other clusters, you can merge it with the new Sterling kubeconfig using the konfig plugin (kubectl krew install konfig); see the example after this list.
  6. When you first run any kubectl command (e.g. kubectl get pods), your browser will open dex.apps.renci.org and you will be asked to log in with your RENCI username and password.
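
If you need to merge kubeconfigs (step 5.2 above), the following is a minimal sketch using the konfig plugin. The file path and the context name "sterling" are illustrative and may differ from the kubeconfig you receive; see the konfig README for other merge options.

# Merge the new kubeconfig into ~/.kube/config
kubectl krew install konfig
kubectl konfig import --save ~/Downloads/sterling-kubeconfig
# Verify the new context exists, then switch to it
kubectl config get-contexts
kubectl config use-context sterling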

You will remain logged in for 24 hours before you are automatically prompted to log in again. To "log out" and log in again, you can run rm -rf ~/.kube/cache/oidc-login.

We also recommend installing either https://k9scli.io/ or https://k8slens.dev/, which are excellent Kubernetes client UIs.

By default, users are only given access to a "personal" namespace named after their RENCI username. Your kubeconfig file sets this as your default namespace for kubectl commands, which you are free to change. Personal namespaces are useful for testing and learning, but production workloads should be put into a dedicated namespace. See Requesting a new namespace.
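
If you'd rather not pass -n to every command, you can change your default namespace with standard kubectl config commands. A quick sketch (the namespace name is a placeholder):

# Set the default namespace for your current context
kubectl config set-context --current --namespace=<namespace>
# Confirm which namespace is now the default
kubectl config view --minify | grep namespace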

If you are not familiar with Kubernetes concepts, all UNC employees have access to LinkedIn Learning courses that cover the essentials: https://www.linkedin.com/learning/kubernetes-essential-training-application-development. To access the course, first link your UNC account to LinkedIn here: https://software.sites.unc.edu/linkedin/

Troubleshooting

  • If you have a Mac or Linux computer, you're using Lens, and you get the error Error: unknown command "oidc_login", you'll need to run this command: sudo ln -s ~/.krew/bin/kubectl-oidc_login /usr/local/bin/kubectl-oidc_login. That creates a symbolic link in a folder that Lens searches for kubectl plugins; without it, Lens won't be able to find your kubectl plugins. On Windows, you should just be able to update your PATH to include the "C:\Users\<name>\.krew\bin\" folder.
  • If you are NOT using Lens and you get the error Error: unknown command "oidc_login", ensure that the location where Krew installed the "oidc-login" binary (by default, ~/.krew/bin/kubectl-oidc_login) is in your PATH.
  • If using Windows Subsystem for Linux, ensure that all of your tools (kubectl, krew, oidc-login, k9s/Lens) are all installed inside WSL since they must work together.
  • Logging in starts a temporary web server on localhost:8000 (or 18000 as a fallback) and then opens your browser to that address. Ensure that port is available where oidc-login runs and is reachable from your browser (WSL in particular may have issues with this and require extra troubleshooting).
  • If you're using Lens and you don't see any namespaces in the namespace dropdown box, that's because users don't have permission to list namespaces. Right click on the cluster in the sidebar > Settings, then click "Namespaces" to specify the list of namespaces you can access. Now those namespaces will appear in the dropdown box.
  • If you get error: You must be logged in to the server (Unauthorized) and you merged your Sterling kubeconfig with an existing one, double-check that your Sterling "user" (name: "username@sterling") is being used to authenticate to the Sterling "cluster" (name: "sterling"). Kubeconfigs consist of a user, a cluster, and a link between the two called a context, all of which are defined in your kubeconfig file.
  • If you are using Lens on Windows and you get the error exec: executable kubectl not found, you may just need to re-install the latest version of Lens.
  • For all other issues, don't hesitate to reply to help@renci.org and request assistance! You can also ask questions in the #kubernetes-users Slack channel.

Office Hours

Every Thursday from 2:30 to 3:30 PM, feel free to join the Kubernetes office hours Zoom call. Office hours are a chance to ask some of the more conceptual questions that may be hard to express over Slack or email. Check the #kubernetes-users Slack channel or the email announcement for the Zoom link.

Restrictions

The Sterling cluster strives to be secure by default. As a result, some applications will require modifications before they will run in Sterling.

Containers cannot run with extra Capabilities/privileges

By default, pods in Sterling are assigned a "restricted" PodSecurityPolicy. As a result, any pods that specify any additional capabilities (like CAP_SYS_ADMIN, CAP_NET_RAW, etc.) in spec.containers.securityContext.capabilities.add will be blocked from running. Only the default set of capabilities will be granted to pods. Additionally, pods that set securityContext.privileged: true will also be blocked.

The reason is to prevent attackers from breaking out of containers, or stealing secrets from other pods on the same node (via strace'ing or tcpdump'ing someone else's process).
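
As an illustration, a Pod spec along these lines should be compatible with the restricted policy: it adds no capabilities, does not request privileged mode, and (as a good practice) runs as a non-root user. The image name and UID are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
spec:
  containers:
  - name: app
    image: <your image>
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      runAsUser: 1001
      capabilities:
        drop: ["ALL"]   # no "add" list, so no extra capabilities are requested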

No hostNetwork, hostPath volumes

Setting hostNetwork: true or mounting a hostPath volume is not allowed. Both can be exploited to access data or secrets from other users' Pods.

Pods must have resource limits

Pods without resource limits are rejected, so containers that do not specify CPU and RAM limits are assigned the defaults below. You should override the defaults with limits sized for your application. Current per-container defaults and maximums:

defaultLimit:
  memory: 256Mi
  cpu: 250m
  ephemeral-storage: 256Mi
defaultRequest:
  memory: 128Mi
  cpu: 100m
  ephemeral-storage: 32Mi
max:
  memory: 64Gi
  cpu: "8"
  ephemeral-storage: 10Gi

Your limits can be checked by running kubectl describe limitrange -n <your namespace>.

These default limits are intentionally set low to encourage users to set their own custom limits if they need them. Proper requests and limits are critical to protecting the whole cluster from cascading RAM/CPU shortages.
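
For example, here is a sketch of a per-container resources block that overrides the defaults while staying under the per-container maximums. The numbers are purely illustrative; tune them to your application's measured usage.

resources:
  requests:
    memory: 512Mi
    cpu: 250m
    ephemeral-storage: 64Mi
  limits:
    memory: 2Gi
    cpu: "1"
    ephemeral-storage: 512Mi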

Request an exemption from CPU limits

Expert users may have applications that would benefit from not having CPU limits. Removing CPU limits means a pod can use any unused CPU cycles on the node. It also means bursty and latency-sensitive applications (such as webservers) will not be throttled; throttling can add delays of up to 100ms. However, this has a risk: if other applications (with limits enabled) need to use all the available CPUs, your application will be limited to its requests.cpu. If you set that value too low, your application may become sluggish or unresponsive.

To request an exemption from CPU limits, email help@renci.org and include the namespace name to exempt. You must attest to the following before your request will be approved:

  1. I understand that removing CPU limits in my namespace means my application can use unused CPU cycles on the node
  2. I understand that if those unused CPU cycles aren't available (other applications use them all up), my application will be throttled in proportion to my requests.cpu value
  3. I understand that I need to set a realistic value for requests.cpu to ensure my application functions and plays nice with other tenants
  4. I understand that relying on the default value for requests.cpu is NOT sufficient; I need to set it explicitly

(You only have to attest to this once; future requests from you to remove CPU limits will be auto-approved.)

Each namespace has resource limits

Each namespace has a maximum amount of allocatable CPU, RAM, ephemeral-storage, and Persistent Volume Storage. These limits are in place to ensure tenants are protected from resource hogs or misconfigured applications that infinitely spawn new pods.

The current defaults are:

    limits.memory: 8Gi
    requests.memory: 8Gi
    limits.ephemeral-storage: 4Gi
    requests.ephemeral-storage: 2Gi
    limits.cpu: "4"
    requests.cpu: "4"
    requests.storage: 128Gi
    nvme-ephemeral.storageclass.storage.k8s.io/requests.storage: "0"
    requests.nvidia.com/gpu: "0"
    # Also limit the number of certain API resources in each namespace
    pods: "100"
    persistentvolumeclaims: "50"
    services: "100"

Your limits can be checked by running kubectl describe resourcequota -n <your namespace>, and you can compare your actual usage with kubectl top pods -n <your namespace>. Email help@renci.org if you need your limits raised and you have verified your current limits/requests are properly tuned.

Limited disk space (outside of PVs)

There is only 20GB of storage available on each node to store ephemeral data in container file systems. This is due to the majority of the space being used for caching container images. Because Sterling uses physical rack servers, each server has a small built-in hard drive that cannot be expanded. That's why we enforce a default limit of 256MB of "ephemeral-storage" per Pod. Pod logs also contribute to ephemeral-storage, up to 50MB (which is why the minimum limit is 50MB).

If you wish to store a lot of data on disk, create and mount a PersistentVolumeClaim in your pod. Data in PersistentVolumes is stored in a separate NetApp server with plenty of disk space.

If you specifically need "scratch space" that should be deleted when your Pod is deleted, you can use "generic ephemeral volumes" as described in the Kubernetes docs (just leave the storageClass blank so you get assigned the default one). See also #Using fast NVMe drives.
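
A minimal sketch of such a scratch volume inside a Pod spec (names and sizes are placeholders; omitting storageClassName means the default NFS-backed class is used):

spec:
  containers:
  - name: myapp
    image: <image>
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          # storageClassName omitted, so the default ("basic") class is assigned
          resources:
            requests:
              storage: 5Gi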

Inter-namespace network traffic is disallowed

By default, each namespace has a NetworkPolicy that prevents pods in other namespaces from sending traffic to pods in your namespace, while traffic between pods in the same namespace is allowed. If you need to accept traffic from pods in a different namespace, you'll need to create your own custom NetworkPolicy like the following:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-other-namespace
spec:
  podSelector: {}   # applies to all pods in your namespace
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: <other namespace>

NodePorts are not allowed

"NodePorts configure host ports that cannot be secured using Kubernetes network policies and require upstream firewalls. Also, multiple tenants cannot use the same host port numbers." (Source)

Services of type: NodePort should be converted to Ingresses (preferably) or Services of type: LoadBalancer.

ModSecurity Web Application Firewall is enabled

These days, public-facing websites are constantly being scanned by nefarious actors for known vulnerabilities (like exposed dashboards, shell injection, log4shell, etc.). As a result, suspicious HTTP requests sent to Sterling will be blocked with a 403 error by modsecurity and the OWASP coreruleset for all applications that use Ingresses. In fact, between 20% and 40% of all HTTP requests sent to Sterling are blocked.

All web applications must use Ingresses (instead of services of type: LoadBalancer or NodePorts) to take advantage of the enhanced security, auditing, and independence (not having to ask ACIS for DNS/IP/TLS setup). See the section on Ingress below.

If you encounter any false positive 403 errors, email help@renci.org.

In case of emergencies when you can't contact help@renci.org to create fine-grained rule exceptions, you can apply this annotation to your ingress temporarily:

nginx.ingress.kubernetes.io/modsecurity-snippet: |
  SecRule REQUEST_FILENAME "@rx ^/put-some-path-here"
    "id:21999,phase:1,t:none,nolog,pass,ctl:ruleEngine=DetectionOnly"

That annotation will COMPLETELY disable all firewall rules for a specific route, so it is only an emergency measure. Contact help@renci.org ASAP for a proper fix. ACIS reserves the right to reinstate the firewall at any time unless you are explicitly granted an exception.

If you would like to request to disable the firewall completely, ACIS will only grant that request if you have done all of the following:

  1. All of your containers are running as a non-root user (to protect other tenants from container-breakouts)
  2. You have added a NetworkPolicy which restricts egress traffic to a specific range of IP addresses (to protect other applications within the RENCI firewall)

Public internet requires approval

New applications running in Sterling that need to be exposed on the public internet (instead of restricted to RENCI IP addresses) require approval from ACIS. Email help@renci.org to request a public Ingress.

Pods that are directly internet-facing must be running as a non-root user before ACIS will grant public internet access because non-root pods significantly lower the risk of an attacker moving laterally from your pod if it is ever compromised.

Currently, the approval process is mainly for security awareness, but potentially ACIS could provide security guidance in the future.

No kubernetes dashboard

Unfortunately, the Kubernetes dashboard depends on handing out ServiceAccount tokens (an insecure practice), which we've replaced with OIDC login using your RENCI credentials. It also requires granting users extra read permissions, which they otherwise wouldn't need, just to avoid errors in the dashboard.

Client-side dashboards like https://k9scli.io/ or https://k8slens.dev/ can be good replacements.

All tenants can view logs

All tenants have access to https://sterling-grafana.apps.renci.org to view and search logs. Since the Log Query Language allows for complex searches that cannot be restricted to specific namespaces, that means any tenant can read the logs of any other tenant by default.

As a security best-practice, do not include sensitive information in your logs.

Pods can't access Ingress hostnames

Prior to Kubernetes v1.24, traffic from Pods to other sites hosted in Sterling via Ingresses would time out. Users were told to use proxy.renci.org to work around the issue, but that is no longer necessary.

If you are still using HTTPS_PROXY=proxy.renci.org:8080 in Sterling apps, please update your code.

imagePullPolicy is set to Always

When you run a Pod in Sterling, Sterling will ALWAYS attempt to fetch the latest version of the container image, ignoring whatever you put in the imagePullPolicy field on your Pod. This is a security feature for private images: it prevents users from running a cached private image unless they have the credentials to pull it.

Exceptions

If you have a legitimate need to bypass one of these security limits, email help@renci.org and we can exempt specific restrictions from specific namespaces.

How to set up Ingress

This section describes how users from within RENCI or from the internet can access web-based applications hosted inside Sterling.

Historically, users would have to ask ACIS for an IP address, a TLS cert, a DNS name, and (if needed) to open the application up to the internet. Now, this process is automated and does not require ACIS's involvement for DNS records under *.apps.renci.org.

To allow only RENCI users (and not users from the public internet) to access an application, create an Ingress like the following:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # This line will automatically generate a Let's Encrypt TLS certificate which will be stored in the secretName below. See https://cert-manager.io/docs/usage/ingress/
    # This only works for DNS names in public zones like *.renci.org or *.apps.renci.org. See https://wiki.renci.org/index.php/Kubernetes_Cloud/Let%27s_Encrypt_Migration
    cert-manager.io/cluster-issuer: letsencrypt
    # Ensure you don't have any "kubernetes.io/ingress.class" annotations; those are deprecated in favor of IngressClasses, but you want to use the default class anyway.
  name: mywebservice-ingress
spec:
  tls:
  - hosts:
      - mywebservice.apps.renci.org
    secretName: mywebservice.apps.renci.org-tls
  rules:
  # This line will automatically create the DNS record using https://github.com/kubernetes-sigs/external-dns/
  # This only works for hostnames within *.apps.renci.org. If you need a DNS name outside that zone, contact ACIS.
  - host: mywebservice.apps.renci.org
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            # The service name and port should match the name and port on your own Kubernetes Service: https://kubernetes.io/docs/concepts/services-networking/service/
            name: mywebservice
            port:
              number: 8080

NOTE: After your TLS certificate gets generated and stored in a Kubernetes Secret, don't delete it! That will trigger the certificate to be re-generated, which could hit rate limits that might lock you out for up to a week!

NOTE: If you have hit the Let's Encrypt rate limits (or you anticipate that you might), replace your cert-manager.io/cluster-issuer: letsencrypt annotation with cert-manager.io/cluster-issuer: zerossl. That will request a certificate from ZeroSSL, which has no rate limits. The ZeroSSL root certificate is also trusted on older devices than Let's Encrypt's.

NOTE: After creating a new Ingress, try to wait a few minutes for the DNS record to be created before trying to access the site. If you access the site too soon, your computer might cache the missing DNS record for up to 1 hour, which can only be fixed by flushing your DNS cache.

Exposing your app on the public internet

To allow users from the public internet to access your application, first request approval from help@renci.org if this is a newly-public application. Then, add the following annotation to the above Ingress:

annotations:
  nginx.ingress.kubernetes.io/whitelist-source-range: "0.0.0.0/0,::/0"

That will replace the default whitelist (which only allowed RENCI IPs) with a new whitelist that allows any IP from the internet.

ACIS will only allow public internet access if a) you are migrating an existing public site into Sterling or b) the pod directly exposed to the internet is running as a non-root user. In the future, all public applications must run ALL pods as a non-root user.

Using an external domain name

ACIS administrates the renci.org domain and most of the external domains such as irods.org, so ACIS can configure most DNS records for you upon request. However, if you own a domain name and manage the DNS records yourself (using services like Route53, DigitalOcean, etc.) then you have to create the following DNS record yourself:

  • Record type: CNAME
  • Destination: ingress-nginx.apps.renci.org

You can use any subdomain such as "www.example.com" or "dev.example.com", but keep in mind that you cannot create CNAME records at the "apex" such as "example.com". See here for more info.

Once your DNS records are configured, you would edit your Ingress to update the spec.tls.hosts field and the spec.rules.host field to match your external domain, such as "dev.example.com".
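
For example, assuming the hypothetical domain dev.example.com, the relevant parts of the Ingress from the earlier example would become (the http/paths section stays the same):

spec:
  tls:
  - hosts:
      - dev.example.com
    secretName: dev.example.com-tls
  rules:
  - host: dev.example.com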

If you are migrating an existing site into Sterling, you should first update your Ingress before updating DNS records to minimize downtime.

How to expose non-HTTP services

If you need to expose a typical web application that speaks HTTP, then you should use Ingresses. But if you need to expose an application that speaks plain TCP or UDP, Ingresses don't support that, so you need to use a Service of type: LoadBalancer.

First, email help@renci.org to request a LoadBalancer IP address (include the hostname, port, and purpose in your email). Public IP addresses are only granted if the pod is running as a non-root user. Once you have received your IP address, create a Service like the following (please read the comments):

apiVersion: v1
kind: Service
metadata:
  name: my-tcp-service
  annotations:
    # You can optionally specify this annotation to have the DNS record automatically created if under *.apps.renci.org
    external-dns.alpha.kubernetes.io/hostname: <your hostname>
spec:
  type: LoadBalancer
  # You must request an IP address from ACIS at help@renci.org
  loadBalancerIP: <your IP>
  ports:
  - name: <some name>
    port: <port number>
    # Can also be "UDP"
    protocol: TCP
  selector:
    # Ensure this selector matches whichever Pods should be handling this traffic
    app.kubernetes.io/name: <something>

Since by default, your namespace has a NetworkPolicy that blocks traffic from outside your namespace, you will also need to create an override NetworkPolicy to allow traffic to your new service:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-loadbalancer-ports
spec:
  podSelector:
    matchLabels:
      # Ensure this selector matches whichever Pods should be handling this traffic
      app.kubernetes.io/name: <something>
  ingress:
  - from: []
    ports:
      - port: <the ports your Pod is listening on>

Traffic flowing through these services does not have Web Application Firewall Protection, nor does it perform automatic TLS termination; you'll have to handle those things inside your own pods. You can at least auto-generate a TLS cert by creating a Certificate: https://cert-manager.io/docs/usage/certificate/

Persistent Storage

Users can create PersistentVolumeClaims to request NFS volumes of any size (up to the default limit of 512GB which can be raised upon request). These NFS volumes are stored in a commercially-supported, highly-available, redundant NetApp appliance, with 1PB+ of total space. This is the preferred location for storing persistent data.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: basic

The default StorageClass "basic" (which will be automatically assigned if the storageClassName is left blank) has no special settings, but users can choose a different storage class to enable snapshotting, encryption at rest, or both:

  • storageClassName: snapshots (enables snapshotting)
  • storageClassName: encrypted (enables encryption at rest)
  • storageClassName: encrypted-with-snapshots (enables both)

Enabling snapshots means that 8 hourly and 5 daily backups of old data will be retained. Do NOT enable snapshotting if you regularly need to create and delete large files, since this will cause the size of the snapshot data to balloon. You can email help@renci.org to ask for a previous backup to be restored.

On-Demand Snapshotting

Sterling supports the VolumeSnapshot API for capturing snapshots of PersistentVolumes. To create a snapshot, kubectl apply the following yaml file in your namespace:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: example-snapshot-name
spec:
  source:
    persistentVolumeClaimName: <name>

Kubernetes will then capture a point-in-time snapshot of your data. This process is near instantaneous and does not cause downtime. Note that some types of databases like sqlite could get corrupted if the snapshot is taken the exact moment that a transaction is writing to the db file, so be careful.

See here for how to restore data from a snapshot: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-snapshot-and-restore-volume-from-snapshot-support
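
In practice, restoring means creating a new PersistentVolumeClaim whose dataSource references the VolumeSnapshot. A minimal sketch (names and sizes are placeholders, and the storage class should generally match the original volume's class):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc-restored
spec:
  dataSource:
    name: example-snapshot-name
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: basic   # match the original volume's storage class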

Important Notes:

  • Deleting the VolumeSnapshot from Kubernetes will delete the backing snapshot
  • Deleting the PersistentVolume that was used to take a snapshot will result in the volume being orphaned (not visible in k8s), but the volume will be retained on the NetApp appliance. This means you can still restore from the snapshot, which essentially "un-orphans" the volume back to normal.
  • The snapshot data is stored in a hidden partition inside your PersistentVolume (and thus counts towards your quota), but only the delta between the snapshot and the live data is stored. If many file blocks are changed frequently, the snapshot size will balloon.
  • If your persistent volume is controlled by a StatefulSet, you can change the size of your volume or set the dataSource (for recovery) on your statefulset using this guide.

Volume Cloning

Snapshotting allows for point-in-time recovery of data within a volume, but the snapshots use up space inside the volume and the snapshots only exist as long as the volume exists. If you want a full second copy of all data inside a new persistentvolume, you can clone a volume as described here: https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/

After cloning, the original volume could be wiped or removed from the cluster completely and the data will still be present in the new volume.
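
A minimal sketch of a clone (names and sizes are placeholders; the requested size must be at least as large as the source volume):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc-clone
spec:
  dataSource:
    name: my-pvc              # the existing PVC to clone
    kind: PersistentVolumeClaim
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: basic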

Offsite Backups

For particularly valuable data, you can email help@renci.org to request a specific PersistentVolume to be mirrored from Europa to the Manning datacenter on campus. This mirroring works by taking a snapshot of your volume every day at midnight and sending it to Manning. Manning will then retain up to 1 week of these snapshots. These offsite backups allow for your data to survive physical damage to the Europa datacenter or malicious/accidental deletion of the entire volume (kubectl delete pvc <name>).

If your volume is ever corrupted or even completely deleted, ACIS can recover the latest snapshot from Manning. To request a restore, email help@renci.org ASAP after a data loss event.

If you no longer need your volume, email help@renci.org so we can remove the offsite copy as well.

NOTE: The daily snapshots are considered "crash-consistent", meaning that the restored data will look just like the power cord was unplugged at midnight. Most databases can recover from crash-consistent backups by rolling back pending transactions, but other applications might not (like if the snapshot is taken halfway through saving a large file, the file will appear partially-written). Know your applications!

Monitoring with Prometheus Operator

Sterling has a shared monitoring stack that anyone can use, which includes Prometheus Operator. Prometheus Operator allows you to create Kubernetes manifests that automatically set up monitoring and alerting behaviors (ServiceMonitors, AlertmanagerConfigs, and PrometheusRules).

  • ServiceMonitor: Describes how prometheus should scrape a Kubernetes Service for Prometheus metrics (labelSelector, port, path, etc.)
  • PrometheusRule: Describes a Prometheus expression that triggers an alert if it evaluates to true (e.g. absent(up{job="your-app"} == 1))
  • AlertmanagerConfig: Describes how to convert alerts from PrometheusRules into notifications to slack, email, and other notification systems

If you create at least a ServiceMonitor in your namespace, Prometheus will begin collecting metrics from your service. Your service must already be exporting Prometheus metrics, for example by enabling the Redis Prometheus exporter. The metrics collected from your service are viewable at https://sterling-grafana.apps.renci.org/ (log in with RENCI credentials).
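
As an illustration, a ServiceMonitor along these lines tells Prometheus to scrape any Service carrying a given label. The label, port name, and path are placeholders and must match your own Service:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  endpoints:
  - port: metrics        # a *named* port on your Service
    path: /metrics
    interval: 30s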

Alerting

If you would like to receive alerts for your namespace, you can apply the following AlertmanagerConfig yaml file (insert your email address first):

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: "email-alerts"
spec:
  route:
    groupBy: ['alertname', 'job']
    repeatInterval: 24h
    receiver: 'email'
    matchers:
    - matchType: '!='
      name: alertname
      value: InfoInhibitor
  receivers:
    - name: 'email'
      emailConfigs:
        - to: 'youremail@renci.org'
          sendResolved: true

Migrating existing applications to Sterling

Sterling is the preferred location to run containerized applications at RENCI. If your application is already containerized but it's running in a VM, we recommend learning about Kubernetes (via this course which is free for UNC employees) and migrating your application to Sterling. ACIS is always happy to consult with you for these migrations via help@renci.org.

If you are using docker-compose, you can use kompose to generate equivalent Kubernetes manifests. With minor tweaks, those can often be deployed straight to Sterling.
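
For example, a rough sketch of the kompose workflow (paths are placeholders):

# From the directory containing docker-compose.yml:
kompose convert -f docker-compose.yml -o k8s/
# Review the generated manifests (storage classes, Ingresses, resource limits),
# then apply them to your namespace:
kubectl apply -n <your namespace> -f k8s/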

When moving applications from VMs into Sterling, ACIS provides a tool called "rsync-migrator" which can move data from a VM into a k8s PersistentVolume automatically.

The final step for migrating any application will require changing the IP addresses in your application's DNS records. Coordinate with ACIS via help@renci.org to schedule the cutover, which may require 5-10 minutes of downtime.

Using GPUs

Sterling has the following GPUs available:

  • Ampere A100 40GB (x4) divided into multiple sub-GPUs of the following sizes:
    • 3g.20gb (4x)
      • Reserved: 4/4
    • 2g.10gb (4x)
      • Reserved: 2/4
    • 1g.5gb (8x)
      • Reserved: 3/8

If you would like to request the ability to reserve a GPU, email help@renci.org so we can understand your use-case.

Once you are granted access, reserving a GPU is the same as reserving RAM or CPU: you just specify the number and types that you need in your Pod's limits:

resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
    # To reserve a larger sub-GPU:
    # nvidia.com/mig-2g.10gb: 1
    # nvidia.com/mig-3g.20gb: 1

NOTE: Only one Pod can use a particular sub-GPU at a time. If pods are currently reserving all sub-GPUs, the next pod that requests a GPU will remain in the "Pending" state until one of the other pods dies. This is why you must be granted approval from ACIS to reserve GPUs, so we can help ensure fair access for other tenants.

NOTE: Requesting multiple MIG GPUs does not grant you one larger GPU which is the sum of the two MIG GPUs. MIG GPUs are treated as physically separate GPUs, meaning application code must be able to distribute work between multiple GPUs to properly utilize them.

NOTE: There is no need to specify requests; the request will be set to the limit you specify. If you do decide to specify a request manually, make sure it matches the limit exactly or else you will get an error about "non overcommitable resources".

Using fast NVMe drives

One of the nodes in Sterling has 6TB of NVMe storage which you can use as ephemeral scratch space for your applications. These drives have read/write speeds of >1GB per second, which is between 2x and 400x faster than the default NFS storage depending on your workload.

NVMe Caveats

  • Since only one node has the NVMe drives, pods that require NVMe drives can only run on that one node. If that node goes down (like during maintenance), your pod will be stuck "Pending" until the node comes back up. Only use NVMe drives if your applications can tolerate this additional risk of downtime!
  • Data stored in ephemeral volumes only survives as long as the Pod is running. After the Pod restarts, you are given a fresh volume.
  • Don't store data on these NVMe drives that you aren't prepared to lose. The NVMe drives have no RAID/mirroring/snapshotting/backups, so if a single drive dies, all data in all volumes will be lost.
  • Ephemeral volumes still count toward your namespace-wide quota of PersistentVolumeClaim space.
  • If you have a use-case for storing non-ephemeral data on the NVMe drives, and the above caveats aren't an issue, email help@renci.org and we can potentially grant you the ability to use regular (non-ephemeral) PersistentVolumes backed by NVMe drives.

To claim some space on the NVMe drives, create a generic ephemeral volume from your Pod spec like so (Pod specs can appear inside Deployments or StatefulSets as well):

spec:
  containers:
  - name: myapp
    image: alpine:latest
    command: ["sleep", "999999"]
    volumeMounts:
    - mountPath: "/data"
      name: scratch-volume
  securityContext:
    runAsUser: 1001
    # When running as a non-root user, setting the fsGroup is required to allow you to write to the volume
    fsGroup: 1001
  volumes:
  - name: scratch-volume
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: [ "ReadWriteOnce" ]
          # This storage class is how Kubernetes knows to give you an NVMe volume instead of a normal NFS volume
          storageClassName: nvme-ephemeral
          resources:
            requests:
              storage: 10Gi

Mounting /projects volumes

Many teams across RENCI store research data in NFS volumes called "/projects volumes". They are named after the path at which they are usually mounted within RENCI VMs, such as "/projects/hydroshare". You must email help@renci.org to request for a specific namespace to have access to a specific volume, otherwise your pod will be blocked from running for security purposes. See Things you can request.

Pods in Sterling can mount these volumes directly like so (docs reference):

spec:
  containers:
  - name: mycontainer
    image: <image>
    volumeMounts:
      - name: nfs
        mountPath: "/projects/<volumename>"
  securityContext:
    # You must specify a non-zero UID. You can use either your own UID or just use "30000" which is Sterling's storage service account
    runAsUser: 30000
    # You should use your "storage group" that owns the root of your /projects volume
    runAsGroup: <GID>
    # Optional: Additional GIDs can be added like so:
    # supplementalGroups:
    #   - <GID1>
    #   - <GID2>
  volumes:
  - name: nfs
    nfs:
      server: na-projects.edc.renci.org
      path: /<volumename>
      # You can optionally mount the volume in read-only mode by uncommenting the next line
      # readOnly: true

Be sure the nfs.path fields under spec.volumes match exactly what you requested, i.e. be aware of any leading path elements (/projects) or trailing path elements (a trailing / or a subdirectory).

ServiceAccounts for external automation

Some users may want to create Kubernetes ServiceAccounts in order to automate tasks such as deploying Helm charts from Jenkins. This practice is allowed, provided you follow these guidelines:

  • ServiceAccounts MAY NOT be used by individual users to access the cluster. Individual users must still log in with their RENCI username and password for auditing purposes.
  • ServiceAccounts MAY NOT have permissions in more than one namespace. Create a separate ServiceAccount per namespace rather than binding multiple roles to the same ServiceAccount.
  • ServiceAccounts MUST follow the principle of least privilege by only having the smallest subset of permissions needed for deployment. For example, don't allow your ServiceAccount to list all Secrets or delete PVCs unless absolutely necessary.
  • ServiceAccounts MUST have descriptive names so that the Kubernetes API audit logs are traceable in the event of a breach. For example, you might name the ServiceAccount after the VM on which it will be placed like "myvm-edc-renci-org" (dots replaced with dashes).
  • Ensure that a limited number of users have access to the ServiceAccount token.

Tenants have the authority to create their own ServiceAccounts, Roles, and RoleBindings, so you do not need ACIS to do this for you:

kubectl create serviceaccount <name> -n <namespace>
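
A minimal sketch of a namespace-scoped Role and RoleBinding for such a ServiceAccount, limited (purely for illustration) to managing Deployments; adjust the resources and verbs to the smallest set your automation actually needs:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: <namespace>
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer-binding
  namespace: <namespace>
subjects:
- kind: ServiceAccount
  name: <name>
  namespace: <namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deployer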

FAQ

When should I use Sterling instead of a VM?

If your application is already containerized, you should use Sterling instead of a VM since Kubernetes provides a long list of improvements to containerized applications. See #Value Proposition

If your application is not containerized, but it is a web service or uses common open source software like redis, elasticsearch, postgres, mysql, etc., then Sterling is still recommended since it makes operating those kinds of applications easier. Kubernetes was developed by Google primarily to automate the deployment of web services, so it is well-suited for running them. Databases are slightly more advanced, but advancements in Kubernetes over the past few years have made it just as easy to manage stateful workloads like databases.

You should choose a VM instead of Sterling if:

  • You need an SSH or RDP-accessible machine (you can't SSH or RDP into a Kubernetes pod)
  • You need to be able to build docker images, like for a Jenkins worker (you generally can't build containers in Kubernetes without privileged access)
  • You need a Windows machine (Sterling does not support Windows)
  • Your application requires a different maintenance window than Sterling's third-Thursday window (like if an SLA requires weekend maintenance or something)
  • Your application relies on third-party software that is not easily containerizable
  • Your application requires advanced kernel/networking features that Sterling doesn't support (like a specific kernel version, CAP_NET_ADMIN/RAW, etc.)

Without a non-production cluster, where should I deploy my dev/val/staging/non-prod apps?

You should install your non-production apps in Sterling in a separate namespace (contact help@renci.org to request a new namespace). Having both your production and non-production applications in the same cluster minimizes the chances of surprise bugs when deploying to production. It also means ACIS can provide production-level stability for even your non-production applications.

ACIS will maintain a private test cluster for testing new changes so none of your applications (production or not) are negatively affected by our testing.

Do I have to modify my existing Kubernetes application in order to deploy it to Sterling?

Usually no. Typically you only need to tweak Ingress (ingressClass, annotations), PersistentVolumes (storageClass), and resource limits (they're required).

HOWEVER: Eventually, we will begin enforcing a ban on containers that run as the root user for increased security. Plan to ensure your containers do not depend upon running as the root user as soon as possible using this guide.

What IP ranges does Sterling use for egress?

The Sterling nodes have different source IPs depending on whether traffic is staying within the Europa datacenter or not.

For traffic to other RENCI machines:

The source IP for egress from Sterling to another machine within RENCI will be in the range 172.25.13.0/24. The exact IP depends on which node your Pod is running on at the time.

NOTE: Allowing this CIDR to access your VM will allow ANY pod in Sterling to access your VM too, since all the traffic appears to be coming from the underlying Kubernetes node IP. Ensure you have some form of authentication in place before allowing any Sterling pod to access your VM.

For traffic to the internet:

The source IP for egress from Sterling to the internet will be 152.54.3.118. This is the NAT gateway used for all machines hosted in Europa. Soon, we plan to change this behavior so that traffic will no longer flow through the NAT gateway and instead will come from the range 152.54.15.128/25. We will announce this when we do.

NOTE: Allowing this CIDR to access your server will allow ANY machine in RENCI to access your server too. Ensure you have some other form of authentication in place aside from an IP allow-list.

Can I send kubectl commands to Sterling from a server without a browser?

Typically no. In order to authenticate to Sterling, you must have a browser in order to use the oidc-login process, which allows you to log into Sterling with your RENCI username/password. This is a security feature since you can rely on TLS certificates to trust that you aren't being phished.

However, if you install kubectl and oidc-login on a server, you can login locally then copy the contents of ~/.kube/cache/oidc-login onto the server. This will allow you to access Sterling for 24 hours until your token expires.
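
A rough sketch of that process (the server hostname is hypothetical, and the copied token still expires after 24 hours):

# On your workstation: trigger a browser login, then copy the cache
# (and your kubeconfig, if it isn't already there) to the server
kubectl get pods
ssh myserver.edc.renci.org 'mkdir -p ~/.kube/cache'
scp -r ~/.kube/cache/oidc-login myserver.edc.renci.org:~/.kube/cache/
scp ~/.kube/config myserver.edc.renci.org:~/.kube/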

I'm getting a 403 error from my internal site even though I'm connected to the VPN. What gives?

If your browser sends a request before the VPN has fully turned on (which can happen in the background, like if a tab reloads or if some background javascript tries to fetch something), your browser will re-use that connection for 60-75 seconds. Re-using an old connection from before the VPN turned on means that you will be connecting to the website using your normal IP address rather than the IP address the VPN gives you, causing the error.

If this happens to you, you must either quit your browser or wait 75 seconds. Unfortunately, connection re-use is built into the HTTP/2.0 protocol and I cannot force-close connections without violating the protocol.

What happens if the datacenter loses power?

Sterling is connected to generator power, so it can survive medium-length power outages. However, the cooling systems are NOT connected to the generator, so the datacenter may reach critical temperatures after a few hours of no power. In that event, Sterling will be powered down to avoid damaging equipment. If that happens, here are some tips for preparing.

All data written to PersistentVolumes will be safe, but there is a small risk that a power outage may cause data loss for applications that don’t fsync data to disk regularly. Most databases do this automatically, but e.g. default redis instances only fsync periodically for performance. Double-check the config of any applications that need to store persistent data to ensure they write to disk regularly.

Another way to improve the shutdown process is to ensure your containers properly handle the SIGTERM signal. Kubernetes sends this signal to PID 1 inside each container as a "warning" to shut down. If your pod ignores the signal, it will be forcibly killed 30 seconds later. This is an easy mistake to make if, for example, your container's entrypoint is a shell script that launches a webserver: shell scripts do not pass signals along to child processes, so they typically ignore SIGTERM. Ensure that your containers respond to SIGTERM and perform whatever cleanup steps they need before shutting down.
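
For example, if your entrypoint is a shell script, use exec so the server process replaces the shell, becomes PID 1, and receives SIGTERM directly (the gunicorn command is just an illustration):

#!/bin/sh
# entrypoint.sh: run any setup steps here, then hand over the process with exec
exec gunicorn --bind 0.0.0.0:8080 myapp:app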

Finally, some applications may take a long time to gracefully shutdown (like redis if it has a lot of data it needs to write, or webservers if there are any long-running connections to wait on). Test your applications to see how long they take to shutdown, then set that value in terminationGracePeriodSeconds on your Pods. That value controls the amount of time between when Kubernetes sends SIGTERM and SIGKILL. If you need a terminationGracePeriodSeconds value greater than 60 seconds, inform ACIS since 60 seconds is the default grace period for Kubernetes upgrades. NOTE: Hardware failures will NOT respect terminationGracePeriodSeconds, so prepare for that possibility for workloads that are hyper-sensitive to data loss.
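
A sketch of setting a longer grace period on a Pod (the 120-second value is purely illustrative; use whatever your own shutdown testing shows):

spec:
  terminationGracePeriodSeconds: 120
  containers:
  - name: myapp
    image: <image>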

Is the data I store in PersistentVolumes backed up?

No. The SLA mentions this, but ACIS cannot support logical backups for every type of application/database that could potentially use Sterling.

We do provide optional snapshotting (see #Persistent Storage) to protect against accidental data loss, but the snapshots are still stored in the same physical location. We also provide opt-in #Offsite Backups.

When is maintenance performed on Sterling?

When non-emergency maintenance is required, that takes place on the third Thursday of every month. Maintenance is not required every month, so you will receive an email if it is. This maintenance is important to keep the operating system and Kubernetes itself updated. Kubernetes versions are only officially supported for 1 year, requiring regular upgrades.

Most maintenance events only result in pods being restarted. Stateless pods with multiple replicas will not see any downtime, but stateful pods (without active-active failover or replication) may see a few seconds of downtime while the pod is restarted on a new node.

Processes inside my container are randomly dying, why is that? (OOMKilled)

If you have multiple processes inside a container that are randomly crashing, it could be caused by running out of memory. If a non-primary process dies, it happens silently and the pod stays running. If the primary process dies, the pod will restart and you will see an "OOMKilled" (out-of-memory killed) event via kubectl get events -n <mynamespace>.

If you set up alerting in your namespace, you will automatically receive alerts for OOMKills for non-primary processes. Otherwise, you can confirm the number of times processes in your container have been killed by doing the following in your pod:

$ cat /sys/fs/cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 0

# Allocate a bunch of RAM to force an OOMKill
$ cat /dev/zero | head -c 2G | tail
Killed

$ cat /sys/fs/cgroup/memory/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 1   <--- Note the increase

Ref: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

If you need precise details about a particular oomkill event, file a help@renci.org ticket and ACIS can grab the full oomkiller logs.

Changelog

Subscribe to the kubernetes-users@renci.org mailing list to be alerted of changes/upgrades.

  • 2021-09-24
    • Closed beta begins
  • 2021-10-18
    • Open beta begins
  • 2021-11-04
    • Added support for requesting PersistentVolumes with encryption, snapshotting, or both (see new section on storage).
  • 2021-11-09
    • General Availability (Soft Launch). Sterling is ready for production workloads, but limited to the open beta participants while the remaining 14 nodes are brought online.
  • 2021-11-19
    • Added support for VolumeSnapshots for on-demand backups of PersistentVolumes.
  • 2021-11-29
    • Sterling is now Generally Available for all users!
  • 2021-12-06
    • Enabled generic ephemeral volumes to be used as scratch space.
    • Lowered the default ephemeral-storage request and max per pod to encourage use of ephemeral volumes instead.
    • Lowered the memory, cpu, and ephemeral-storage per namespace since the cluster would have been over-subscribed otherwise.
  • 2022-01-03
    • The Harpo server was removed from Mitchell and added to Sterling. The Arrival server will be removed on Feb 14th.
  • 2022-01-20
    • Half of the Sterling nodes were unplugged and plugged back into a more reliable power source.
  • 2022-02-01
    • 2 GPUs now available in Sterling
  • 2022-02-14
    • The Arrival server was removed from Mitchell and added to Sterling.
  • 2022-02-17
  • 2022-02-24
    • 4 total GPUs now available in Sterling
  • 2022-03-10
  • 2022-04-15
    • The Mitchell and BlackBalsam clusters were officially decommissioned
  • 2022-04-21
    • Upgraded to Kubernetes v1.22
  • 2022-05-02
  • 2022-06-16
    • Upgraded to Kubernetes v1.23
  • 2022-08-29
  • 2022-09-15
    • Upgraded to Kubernetes v1.24
  • 2022-09-29
    • Sterling is now configured to use generator power in case of power loss. Note that extended outages may still require Sterling to be shut down to reduce heat accumulation in the datacenter.
  • 2023-01-19
    • Upgraded to Kubernetes v1.25. Added ability to list Namespaces, Ingresses, and StorageClasses.
  • 2023-04-25
    • Created an alert to detect OOMKills for processes inside containers other than PID 1, which is a common source of silent failures in web servers like gunicorn.

Roadmap

The items on this roadmap are not set in stone and may change at any time. Items are listed in no particular order.

  • Secure Enclave Kubernetes project: A separate OpenShift cluster in the RENCI Secure Enclave which will be approved for processing sensitive data. [IN PROGRESS]
  • SSD-backed PersistentVolumes [IN PROGRESS]
  • Allowing egress directly from nodes instead of flowing through a NAT gateway
  • IPv6 dual stack support
  • Upgrading nodes to RHEL9 or RHCOS, which would provide myriad improvements thanks to cgroups v2
  • Improvements to Ingress reliability