The Ephemeral Runner: A Cost-Optimized Architecture for GitHub Self-Hosted Runners on AKS

Introduction:

If your organization, like many modern development teams, relies heavily on GitHub Actions for its continuous integration and continuous deployment (CI/CD) pipelines, you've experienced the convenience of its hosted runners. These readily available, cloud-based environments get your CI/CD processes up and running with virtually zero setup or infrastructure management. They abstract away the provisioning and maintenance of build agents, letting developers focus on writing code and defining their workflows.

However, as your team scales, your codebase matures, and your build and deployment frequency accelerates, that initial convenience starts to come at a substantial cost, both literally in financial outlay and figuratively in operational flexibility and control. The limitations of a one-size-fits-all public cloud offering quickly become a bottleneck for organizations striving for efficiency, security, and specialized capabilities.

The Problem: Unpacking the Challenges of GitHub-Hosted Runners

As your adoption of GitHub Actions deepens and your CI/CD footprint expands, you're likely to encounter a combination of challenges that force a re-evaluation of your runner strategy: per-minute billing that grows linearly with usage, limited control over the hardware and software image your jobs run on, and no direct network path to private resources. These growing pains impact budgets, security posture, and the ability to innovate.

The Flawed Approach: The Pitfalls of Traditional GitHub Actions Runner Deployment

Many organizations, in their initial foray into self-hosted GitHub Actions runners, instinctively adopt a straightforward yet ultimately inefficient strategy: deploying persistent Virtual Machines (VMs) in their cloud environment, such as Azure. The immediate appeal of this approach is its apparent simplicity. By spinning up a few Azure VMs, installing the GitHub runner software, and leaving them running continuously, teams quickly address their most immediate pain points with hosted runners.

However, this seemingly elegant solution is riddled with significant drawbacks that often become apparent only as the CI/CD system scales and operational costs mount. The most glaring issue is gross inefficiency. A typical CI/CD pipeline, even in a busy development environment, experiences bursts of activity followed by periods of dormancy. When runners are provisioned as always-on VMs, you are effectively paying for compute resources that are idle for a substantial portion of the day – often 90% or more. This translates directly into wasted expenditure on CPU, memory, and storage, making it an incredibly expensive way to run CI/CD workloads.
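
To make the waste concrete, here is a back-of-the-envelope comparison. The hourly rate and build hours below are purely illustrative placeholders, not real Azure pricing; substitute your own figures.

# Illustrative only; plug in your actual VM rate and build hours.
# Always-on VM:       730 h/month x $0.10/h                   = ~$73.00/month
# Ephemeral runner:   2 build-hours/day x 30 days x $0.10/h   = ~$6.00/month
# Ephemeral + Spot:   same 60 h at an assumed ~70% discount   = ~$1.80/month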

Beyond the financial drain, the "always-on VM" approach introduces a considerable operational burden. These static VMs require constant attention from your engineering team: OS patching, runner software upgrades, dependency and toolchain management, security hardening, and capacity planning all become recurring chores.

In essence, while solving immediate technical hurdles, the flawed approach creates a new set of problems centered around cost overruns and increased operational overhead, diverting valuable engineering resources from core product development.

The Thesis (The Solution): Embracing Cloud-Native for Cost-Optimized, Ephemeral GitHub Actions Runners

The traditional VM-based approach has inherent limitations, and there is a fundamentally superior, more cloud-native paradigm for deploying self-hosted GitHub Actions runners. This guide champions a modern, automated, and highly cost-optimized solution built on the robust foundation of Azure Kubernetes Service (AKS).

The core of this thesis lies in leveraging the power and flexibility of Kubernetes to manage ephemeral runner workloads. Instead of static VMs, runners are dynamically provisioned only when needed and automatically de-provisioned once their task is complete. This architecture focuses on ephemeral, per-job runners, scale-to-zero elasticity, and deeply discounted Spot compute.

To achieve this, this guide provides a detailed, hands-on architectural blueprint that integrates several key technologies: Azure Kubernetes Service (AKS), the Actions Runner Controller (ARC), the Kubernetes Cluster Autoscaler, and Azure Spot Virtual Machines.

By combining these elements, we construct a CI/CD runner platform that is not only highly performant and secure but also extraordinarily economical, aligning perfectly with cloud-native best practices.

The Outcome: A Production-Grade, Future-Proof CI/CD Runner Platform

Upon successfully implementing the architectural blueprint detailed in this guide, organizations will possess a production-grade CI/CD runner platform that fundamentally transforms their development operations: compute spend drops dramatically thanks to scale-to-zero and Spot pricing, the runner environment is fully under your control, and day-to-day maintenance shrinks compared with static VMs.

6. The Architectural Blueprint: Step-by-Step Implementation

flowchart TD
    subgraph GitHub["GitHub"]
        A["Workflow Job Queued"]
    end
    subgraph subGraph1["actions-runner Namespace"]
        B["Actions Runner Controller"]
    end
    subgraph subGraph2["Runner Spot Node Pool"]
        D{"Ephemeral Runner Pod"}
    end
    subgraph subGraph3["Azure AKS Cluster"]
        subGraph1
        subGraph2
        C["Cluster Autoscaler"]
    end
    subgraph Azure["Azure"]
        E["Azure Spot VMs"]
    end
    A -- "1. Webhook" --> B
    B -- "2. Creates Pod" --> D
    D -- "4. Registers with" --> A
    D -.-> C
    C -- "3. No room for pod" --> E
    E -- "Provisions Node" --> C

Phase 1: Architecting the AKS Cluster Foundation

Prerequisites

Before you begin, you will need: an Azure subscription with permission to create resource groups and AKS clusters; the Azure CLI (az), kubectl, and Helm installed locally; and admin access to the GitHub organization or repository where the runners will register.

A best practice is to use a multi-node-pool strategy to separate your stable system workloads from your volatile, cost-optimized runner workloads.

1. Create the AKS Cluster with a system Node Pool: This pool will run critical services like CoreDNS and the ARC controller itself. It uses reliable, on-demand VMs.

# Set up some variables
export RESOURCE_GROUP="my-runners-rg"
export CLUSTER_NAME="my-runner-cluster"
export LOCATION="eastus"

az group create --name $RESOURCE_GROUP --location $LOCATION

az aks create \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --node-count 1 \
  --node-vm-size "Standard_B2s" \
  --nodepool-name systempool \
  --enable-oidc-issuer

2. Add the runner Spot Node Pool: This is where all your runner pods will live. It's configured to scale from zero and use discounted Spot instances.

az aks nodepool add \
  --resource-group $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME \
  --name runnerspotpool \
  --node-vm-size "Standard_D2s_v3" \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10 \
  --labels "purpose=github-runners" \
  --node-taints "sku=github-runners:NoSchedule"
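
Before installing anything onto the cluster, fetch kubeconfig credentials for kubectl and Helm and sanity-check that both node pools exist. These commands reuse the variables defined above:

az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
az aks nodepool list --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME -o table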

Phase 2: Installing and Configuring the Actions Runner Controller (ARC)

1. Create a GitHub App: To allow the controller in your cluster to communicate securely with the GitHub API, create a GitHub App. This is more secure than using a Personal Access Token.

You should now have the three critical pieces of information needed for the next step: the App ID, the Installation ID, and the downloaded private key (.pem) file.
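
If you prefer not to paste identifiers directly into the Helm command, you can export them first and substitute the variables for the <YOUR_APP_ID> and <YOUR_INSTALLATION_ID> placeholders below. The values here are hypothetical examples, not real IDs:

# Hypothetical placeholder values; use the IDs from your own GitHub App
export GITHUB_APP_ID="123456"
export GITHUB_APP_INSTALLATION_ID="7890123"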

2. Install ARC via Helm: Add the ARC repository and install the chart, passing in your GitHub App credentials. Note that the legacy (summerwind) ARC chart relies on cert-manager for its admission webhooks, so make sure cert-manager is installed in the cluster first.

# Add the Helm repository
helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
helm repo update

# Put your private key contents into a variable
export GITHUB_APP_PRIVATE_KEY="$(cat /path/to/your/private-key.pem)"

helm upgrade --install --namespace actions-runner-system --create-namespace \
  --set=authSecret.create=true \
  --set=authSecret.github_app_id=<YOUR_APP_ID> \
  --set=authSecret.github_app_installation_id=<YOUR_INSTALLATION_ID> \
  --set=authSecret.github_app_private_key="${GITHUB_APP_PRIVATE_KEY}" \
  actions-runner-controller actions-runner-controller/actions-runner-controller
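
Before defining any runners, it's worth confirming the controller actually came up. A quick check, using the namespace created above:

# The controller (and its webhook) should be Running
kubectl get pods -n actions-runner-system

# The ARC custom resource definitions should be registered
kubectl get crds | grep actions.summerwind.dev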

Phase 3: Defining the RunnerDeployments

Now, we create a RunnerDeployment custom resource. This tells ARC how to configure the runner pods it creates.

1. Create a file named runner-deployment.yaml:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: linux-runners
  namespace: actions-runner-system
spec:
  replicas: 1  # ARC will manage this dynamically, but a base is needed
  template:
    spec:
      repository: your-org/your-repo  # Target repository
      labels:
        - self-hosted
        - linux-spot
      # Set resource requests to help Kubernetes schedule efficiently
      resources:
        limits:
          cpu: "1"
          memory: "2Gi"
        requests:
          cpu: "500m"
          memory: "1Gi"
      # This is the crucial part to target our Spot node pool.
      # The first toleration allows the pod onto the nodes we tainted ourselves;
      # the second covers the taint AKS adds automatically to every Spot node pool.
      tolerations:
        - key: "sku"
          operator: "Equal"
          value: "github-runners"
          effect: "NoSchedule"
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      # The nodeSelector ensures it ONLY runs on Spot nodes
      nodeSelector:
        purpose: "github-runners"

2. Apply it: kubectl apply -f runner-deployment.yaml.
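
Optionally, pair the RunnerDeployment with a HorizontalRunnerAutoscaler so ARC adjusts the replica count itself rather than leaving it pinned at the baseline. The following is a minimal sketch assuming the summerwind ARC CRDs and the queued-runs polling metric; the name and repository are placeholders you should adjust, and the thresholds should be tuned to your workload.

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: linux-runners-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: linux-runners        # must match the RunnerDeployment above
  minReplicas: 0
  maxReplicas: 10
  metrics:
    # ARC polls the GitHub API for queued and in-progress workflow runs
    # on these repositories, so this variant can scale up from zero.
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - your-org/your-repo

Apply it with kubectl apply -f the same way as the RunnerDeployment. ARC also supports webhook-driven scaleUpTriggers as a lower-latency alternative to polling.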

Phase 4: Triggering the System from a GitHub Actions Workflow

Finally, update your workflow to target your new self-hosted runners using the labels you defined.

1. Create a file in your repository at .github/workflows/test-runners.yml:

name: Test Self-Hosted Runner

on: [push]

jobs:
  build:
    # This 'runs-on' key targets your runners by matching the labels.
    runs-on: [self-hosted, linux-spot]
    steps:
      - uses: actions/checkout@v4
      - name: Run a test command
        run: |
          echo "🎉 This job is running on a self-hosted runner on AKS!"
          echo "Hostname: $(hostname)"

2. Commit and push this file.

7. The System in Action: The Ephemeral Lifecycle

  1. Trigger: You push a commit, and the GitHub Actions workflow starts. The build job is queued.
  2. Detect: The Actions Runner Controller (ARC) sees the queued job that matches the self-hosted, linux-spot labels.
  3. Create: ARC creates a new Runner pod based on your RunnerDeployment template.
  4. Scale (If Needed): The pod is now Pending because the Spot node pool is at 0 nodes. The AKS Cluster Autoscaler sees the pending pod, calls the Azure API, and provisions a new Spot VM for the runnerspotpool.
  5. Run: The new node joins the cluster, and the runner pod is scheduled onto it. The pod registers itself with GitHub, picks up the job, and executes your workflow steps.
  6. Destroy: As soon as the job is finished, ARC terminates the runner pod.
  7. Scale Down: The Cluster Autoscaler now sees that the Spot node is empty. After a cooldown period (typically 10 minutes), it terminates the node, and your cost drops back to zero.
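
If you want to watch these stages happen, the whole lifecycle is visible from a couple of terminals. These commands assume the namespace and node label used in the earlier steps:

# Watch runner pods appear, execute the job, and terminate
kubectl get pods -n actions-runner-system -w

# Watch Spot nodes join the cluster and later get removed by the autoscaler
kubectl get nodes -l purpose=github-runners -w

# ARC also exposes each runner as a custom resource
kubectl get runners -n actions-runner-system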

8. Conclusion

You have successfully architected and deployed a sophisticated, event-driven CI/CD runner platform. By moving beyond static VMs and embracing a cloud-native, Kubernetes-driven approach, you have built a system that is not only powerful but also incredibly efficient.

Call to Action:

The era of paying for idle CI machines is over. Start small: migrate one of your non-critical workflows to this new model. Watch the lifecycle in action, see the cost savings appear on your next Azure bill, and you'll be ready to scale this architecture across your entire organization.