Self-hosted Runners with ACA Job and KEDA

By combining Azure Container Apps (ACA) Jobs with the KEDA (Kubernetes Event-driven Autoscaling) github-runner scaler, you can build ephemeral self-hosted runners that scale on demand based on the GitHub Actions workflow queue.

What are Self-hosted Runners?

GitHub Actions offers two types of "runners" that execute jobs.

Comparison	GitHub-hosted Runner	Self-hosted Runner
Infrastructure management	Managed by GitHub	You provision and manage
OS	Ubuntu / Windows / macOS	Any (Linux, Windows, macOS, containers, etc.)
Network	Public internet	Customizable (private network possible)
Custom software	Limited	Freely installable
Cost	Billed by usage time	Infrastructure cost only
Startup time	Typically 1–3 minutes	Customizable

Self-hosted runners run the runner agent on infrastructure you manage.

Use Cases for Self-hosted Runners

1. Access to Private Networks Required

GitHub-hosted runners connect from the public internet, so they cannot directly reach resources on Azure Virtual Networks (VNets) that are protected by private endpoints, such as Azure SQL Database, ACR, or Key Vault. Placing self-hosted runners inside a VNet enables private access to these resources.

2. Specific Hardware or Specs Required

ML workflows requiring GPU: Use GPU instances for model training or inference testing
Large memory builds: .NET or Java large-scale projects requiring 32GB+ memory
Fast storage: NVMe SSD for cache-heavy builds

3. Security and Compliance Requirements

Data sovereignty: GDPR and similar regulations may prevent placing code or artifacts on GitHub's infrastructure
Proprietary code protection: Source code or build artifacts cannot be placed on external shared cloud infrastructure
Security auditing: Need full control and auditability of the runner execution environment
Secret management: Integration with private secret management tools like Azure Key Vault

4. Custom Environment Requirements

Pre-installed internal tools: Specific SDKs, licensed software, internal tools
Fixed IP addresses: When external services require IP whitelisting
Stateful caching: Persist Docker layer caches or dependency caches

5. Cost Optimization

Large-scale CI/CD pipelines: For organizations with high GitHub-hosted Runner usage, self-hosted infrastructure may be more cost-efficient
Spot instance utilization: Reduce costs with Azure Spot VMs or ACA Jobs spot features

What are Ephemeral Runners?

Traditional self-hosted runners were "always-on." However, this approach has problems:

Security risk: Execution environment is shared across multiple jobs — secrets and artifacts may persist
Resource waste: Runners stay running even when there are no jobs
Scaling difficulty: Cannot handle bursts of jobs

Ephemeral runners (with the --ephemeral flag) are disposable runners that automatically deregister after executing one job. They are recommended as best practice for both security and efficiency.

What are Azure Container Apps (ACA) Jobs?

Azure Container Apps Jobs run containerized tasks that execute for a finite duration and then exit, unlike container apps, which run continuously.

Key Container Apps Concepts

Job Types

Type	Description	Use Case
Manual	Triggered manually via API or CLI	Batch processing, migration tasks
Scheduled	Cron-based scheduling	Periodic reports, cleanup
Event-driven	Auto-execution based on KEDA scaler	Self-hosted runners ← here

Event-driven jobs automatically create and delete job instances based on KEDA scaling rules, proportional to the number of events.

KEDA and the github-runner Scaler

What is KEDA?

KEDA (Kubernetes Event-driven Autoscaling) is an open-source component that scales containers based on external event sources (queues, topics, metrics, etc.). It is a CNCF graduated project, and Azure Container Apps uses KEDA internally for its scale rules.

How the github-runner Scaler Works

The KEDA github-runner scaler monitors the Actions workflow queue for a specific GitHub repository or organization, and scales ACA Job instances based on the number of pending jobs in the queue.

Scaling Logic

KEDA determines the number of instances based on the targetWorkflowQueueLength value:

desired replicas = ⌈ pending jobs / targetWorkflowQueueLength ⌉

For example, if there are 5 jobs in the queue and targetWorkflowQueueLength is 1, 5 runner instances will start.
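The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not KEDA's actual implementation; the function name is hypothetical, and clamping the result to min_executions and max_executions is an assumption that mirrors the scale settings used later in this guide.

```python
import math


def desired_executions(pending_jobs: int, target_queue_length: int = 1,
                       min_executions: int = 0, max_executions: int = 10) -> int:
    """Return ceil(pending_jobs / target_queue_length), clamped to the configured bounds."""
    if target_queue_length < 1:
        raise ValueError("target_queue_length must be at least 1")
    wanted = math.ceil(pending_jobs / target_queue_length)
    # Assumption: the platform never starts fewer than min or more than max executions.
    return max(min_executions, min(wanted, max_executions))


if __name__ == "__main__":
    print(desired_executions(5, 1))   # 5 queued jobs, one runner per job
    print(desired_executions(5, 2))   # two queued jobs per runner, rounded up
    print(desired_executions(25, 1))  # capped by max_executions
```

With targetWorkflowQueueLength set to 1, every queued job gets its own runner, which is the setting that fits ephemeral runners.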

Overall Architecture

Setup Guide

1. Create a GitHub App (Recommended)

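Using a GitHub App is strongly recommended over personal access tokens (PATs). A GitHub App authenticates with short-lived installation access tokens that expire after one hour and carry only the permissions granted to the app, whereas a PAT is long-lived and tied to a user account.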

1. Go to GitHub organization settings → Developer settings → GitHub Apps → New GitHub App.
2. Configure the following permissions:
   - Repository permissions
     - Actions: Read-only
   - Organization permissions
     - Self-hosted runners: Read and write
3. Install the app on the organization and note the Installation ID.
4. Save the App ID and Private Key.

2. Prepare Azure Resources

Container Apps Environment (VNet integrated)
resource "azurerm_container_app_environment" "runner_env" {
  name                           = "cae-github-runners"
  location                       = var.location
  resource_group_name            = var.resource_group_name
  infrastructure_subnet_id       = azurerm_subnet.aca.id
  internal_load_balancer_enabled = true

  tags = var.tags
}

3. Build the Runner Container Image

You can use the official actions-runner image as a base, though you'll often want to add custom tools.

Dockerfile
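FROM ghcr.io/actions/actions-runner:latest

# Install custom tools (e.g., Azure CLI) as root
USER root
# The azure-cli package in the default Ubuntu repositories may be outdated or absent,
# so use Microsoft's install script, which adds the official apt repository.
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates \
    && curl -sL https://aka.ms/InstallAzureCLIDeb | bash \
    && rm -rf /var/lib/apt/lists/*

# Drop back to the unprivileged runner user
USER runner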

Info: ghcr.io/actions/actions-runner is the official runner image provided by GitHub. It is updated regularly, so avoid pinning to a specific version tag for long periods, and rebuild your image periodically via CI/CD.

4. ACA Job Terraform Definition

terraform/modules/aca-runner/main.tf
resource "azurerm_container_app_job" "github_runner" {
  name                          = "caj-github-runner"
  location                      = var.location
  resource_group_name           = var.resource_group_name
  container_app_environment_id  = var.aca_environment_id

  # Ephemeral runner: exits after 1 job
  replica_timeout_in_seconds = 1800  # 30-minute timeout
  replica_retry_limit        = 0     # No retries (failures managed by job)

  # KEDA github-runner scaler
  event_trigger_config {
    parallelism              = 1
    replica_completion_count = 1

    scale {
      min_executions              = 0  # Zero-scale when idle
      max_executions              = 10 # Max concurrent executions
      polling_interval_in_seconds = 30 # Polling interval

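      rules {
        name             = "github-runner-scaler"
        custom_rule_type = "github-runner"
        metadata = {
          owner                     = var.github_org
          runnerScope               = "org"  # "repo", "org", or "ent" (enterprise)
          targetWorkflowQueueLength = "1"    # 1 runner per job
          labels                    = "self-hosted,linux,azure"
          # GitHub App authentication for the scaler.
          # var.github_app_installation_id is an assumed variable, not defined elsewhere in this guide.
          applicationID  = var.github_app_id
          installationID = var.github_app_installation_id
        }
        authentication {
          secret_name       = "github-app-private-key"
          trigger_parameter = "appKey"
        }
      }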
    }
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.runner.id]
  }

  template {
    container {
      name   = "runner"
      image  = "${var.acr_login_server}/github-runner:latest"
      cpu    = 2.0
      memory = "4Gi"

      env {
        name  = "GITHUB_APP_ID"
        value = var.github_app_id
      }
      env {
        name        = "GITHUB_APP_PRIVATE_KEY"
        secret_name = "github-app-private-key"
      }
      env {
        name  = "GITHUB_ORGANIZATION"
        value = var.github_org
      }
      env {
        name  = "RUNNER_LABELS"
        value = "self-hosted,linux,azure"
      }
      env {
        name  = "EPHEMERAL"
        value = "true"  # Enable ephemeral mode
      }
    }
  }

  secret {
    name                = "github-app-private-key"
    identity            = azurerm_user_assigned_identity.runner.id
    key_vault_secret_id = azurerm_key_vault_secret.github_app_private_key.id
  }

  registry {
    server   = var.acr_login_server
    identity = azurerm_user_assigned_identity.runner.id
  }
}

5. Workflow Configuration

.github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build:
    # Use self-hosted runner
    runs-on: [self-hosted, linux, azure]
    steps:
      - uses: actions/checkout@v4
      - name: Access private resources
        run: |
          # Access via private endpoint within VNet.
          # With a user-assigned identity, az login may also need the identity's client ID.
          az login --identity
          az acr login --name myprivateregistry

Best Practices

Security

1. Always Use Ephemeral Runners

# Specify --ephemeral flag when starting the runner
./config.sh --url ... --token ... --ephemeral

Ephemeral runners automatically deregister after executing one job. This ensures:

No secret leakage between jobs
No environment contamination
Each job runs in a clean environment
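The official runner image does not read variables such as EPHEMERAL or RUNNER_LABELS by itself, so a custom entrypoint has to translate them into config.sh flags. The sketch below shows one way that translation could look. The function name is hypothetical, the environment variable names are the ones used in the Terraform definition in this guide, and obtaining the registration token is left out.

```python
def build_config_args(env: dict[str, str], token: str) -> list[str]:
    """Build the argument list for the runner's config.sh from environment variables."""
    org = env["GITHUB_ORGANIZATION"]
    args = [
        "./config.sh",
        "--unattended",  # never prompt for input inside a container
        "--url", f"https://github.com/{org}",
        "--token", token,
    ]
    labels = env.get("RUNNER_LABELS", "")
    if labels:
        args += ["--labels", labels]
    # Treat EPHEMERAL=true (case-insensitive) as a request for a one-job runner.
    if env.get("EPHEMERAL", "").lower() == "true":
        args.append("--ephemeral")
    return args


if __name__ == "__main__":
    sample_env = {
        "GITHUB_ORGANIZATION": "my-org",  # placeholder organization name
        "RUNNER_LABELS": "self-hosted,linux,azure",
        "EPHEMERAL": "true",
    }
    print(" ".join(build_config_args(sample_env, "<registration-token>")))
```

An entrypoint would run the resulting command and then start run.sh. Because the runner is ephemeral, the container exits after one job, which is what lets the ACA Job execution complete.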

2. Use GitHub Apps Instead of Personal Access Tokens

Approach	Security	Recommendation
Personal Access Token (PAT)	Tied to user, difficult expiration management	❌ Not recommended
Fine-grained PAT	Can restrict permissions but still user-tied	△ Acceptable
GitHub App + JIT Token	App-dedicated, least privilege, auto-expiring	✅ Recommended

3. Leverage Managed Identity (Workload Identity)

Authenticate Key Vault secret retrieval and ACR login with Managed Identity, avoiding long-lived secrets embedded in containers.

# Grant Managed Identity access to Key Vault
resource "azurerm_role_assignment" "runner_kv_secrets" {
  scope                = azurerm_key_vault.main.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_user_assigned_identity.runner.principal_id
}

# Grant Managed Identity access to ACR
resource "azurerm_role_assignment" "runner_acr_pull" {
  scope                = azurerm_container_registry.main.id
  role_definition_name = "AcrPull"
  principal_id         = azurerm_user_assigned_identity.runner.principal_id
}

4. Restrict Execution with Runner Groups

Create Runner Groups at the organization level to limit which repositories can use self-hosted runners.

Runner Group settings in org settings
# Organization Settings > Actions > Runner Groups
# - Group name: azure-private-runners
# - Repository access: Allow selected repositories only
# - Allow public repositories: Off (important!)

Warning: Using self-hosted runners with public repositories lets pull requests from forks execute arbitrary code in the runner environment. For public repositories, restrict pull_request_target usage or use GitHub-hosted runners instead.

Performance and Scalability

5. Configure Scaling Parameters Appropriately

scale {
  # Full zero-scale when idle (cost savings)
  min_executions = 0
  # Set based on peak CI/CD job count for your organization
  max_executions = 20
  # KEDA polling interval (be mindful of API rate limits if too low)
  polling_interval_in_seconds = 30
}
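To get a feel for how the polling interval interacts with API rate limits, a rough estimate like the following can help. It is a back-of-the-envelope sketch: the function names are hypothetical, requests_per_poll is an assumption you would need to measure (it grows with the number of repositories and queued runs the scaler inspects), and the default limit of 5,000 requests per hour and the 50% headroom are illustrative values to adjust for your authentication method.

```python
def estimated_requests_per_hour(polling_interval_seconds: int, requests_per_poll: int) -> int:
    """Estimate hourly GitHub API calls made by the scaler."""
    if polling_interval_seconds <= 0:
        raise ValueError("polling_interval_seconds must be positive")
    polls_per_hour = 3600 // polling_interval_seconds
    return polls_per_hour * requests_per_poll


def fits_rate_limit(polling_interval_seconds: int, requests_per_poll: int,
                    hourly_limit: int = 5000, headroom: float = 0.5) -> bool:
    """Return True if the estimate stays within a fraction (headroom) of the hourly limit."""
    estimate = estimated_requests_per_hour(polling_interval_seconds, requests_per_poll)
    return estimate <= hourly_limit * headroom


if __name__ == "__main__":
    print(estimated_requests_per_hour(30, 20))  # 120 polls per hour * 20 requests
    print(fits_rate_limit(30, 20))
    print(fits_rate_limit(10, 20))
```

Under these assumed numbers a 30-second interval stays inside the budget while a 10-second interval does not, which is why lowering the interval should be done with care.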

6. Size CPU and Memory for Your Workload

container {
  # For heavy builds (.NET / Java, etc.)
  cpu    = 4.0
  memory = "8Gi"

  # For simple script execution
  # cpu    = 0.5
  # memory = "1Gi"
}

7. Runner Timeout Configuration

resource "azurerm_container_app_job" "github_runner" {
  # Timeout (set longer than the max workflow execution time)
  replica_timeout_in_seconds = 3600  # 1 hour
}

Cost Optimization

8. Leverage Zero-scaling

Setting min_executions = 0 means no runners start when there are no jobs, bringing idle costs to zero.

9. Dependency Caching Strategy

In ephemeral environments where containers start fresh every time, caching strategy is critical for reducing build times.

.github/workflows/ci.yml
steps:
  # actions/cache saves and restores through GitHub's cache service, so it works even though each runner starts empty
  - uses: actions/cache@v4
    with:
      path: ~/.nuget/packages
      key: ${{ runner.os }}-nuget-${{ hashFiles('**/*.csproj') }}
      restore-keys: |
        ${{ runner.os }}-nuget-
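How key and restore-keys interact is easier to see in code. The following is a simplified model of the lookup, not the action's real implementation: the function name is hypothetical, stored_keys is assumed to be ordered from oldest to newest, and details such as branch scoping are ignored.

```python
from typing import Optional


def resolve_cache_key(key: str, restore_keys: list[str],
                      stored_keys: list[str]) -> Optional[str]:
    """Pick the cache entry to restore; stored_keys is ordered oldest to newest."""
    # 1. An exact match on the primary key always wins.
    if key in stored_keys:
        return key
    # 2. Otherwise try each restore key in order, matching by prefix.
    for prefix in restore_keys:
        matches = [k for k in stored_keys if k.startswith(prefix)]
        if matches:
            return matches[-1]  # most recently created match
    # 3. Nothing matched: the job starts with a cold cache.
    return None


if __name__ == "__main__":
    stored = ["Linux-nuget-aaa", "Linux-nuget-bbb"]
    print(resolve_cache_key("Linux-nuget-ccc", ["Linux-nuget-"], stored))
```

When a .csproj file changes, the hash in the key changes too, so the exact match fails and the restore key falls back to the most recent NuGet cache. Only the changed packages then need to be downloaded.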

Operations and Monitoring

10. Container Image Lifecycle Management

.github/workflows/update-runner-image.yml
name: Update Runner Image

on:
  schedule:
    # Update runner image every Monday
    - cron: '0 1 * * 1'
  workflow_dispatch:

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
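      - uses: actions/checkout@v4
      # Authenticate to ACR before pushing (the username and password secret names are placeholders)
      - name: Log in to ACR
        uses: docker/login-action@v3
        with:
          registry: ${{ secrets.ACR_LOGIN_SERVER }}
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}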
      - name: Build and push runner image
        uses: docker/build-push-action@v5
        with:
          context: ./runner
          push: true
          tags: |
            ${{ secrets.ACR_LOGIN_SERVER }}/github-runner:latest
            ${{ secrets.ACR_LOGIN_SERVER }}/github-runner:${{ github.sha }}

11. Monitor Runners with Azure Monitor

Collect and configure alerts for the following metrics and logs in Azure Monitor:

Job execution count: Sudden spikes or drops in jobs per hour
Execution time: Alerts for jobs approaching replica_timeout_in_seconds
Failure rate: Early detection of increased job failures
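The alert conditions in this list boil down to simple comparisons. The sketch below illustrates them on already-collected numbers. The function names and the 80% threshold are assumptions chosen for illustration, and how you obtain durations and counts (for example from job execution logs) is left open.

```python
def is_near_timeout(duration_seconds: float, replica_timeout_seconds: int,
                    threshold: float = 0.8) -> bool:
    """Flag an execution whose duration reached the given fraction of the timeout."""
    return duration_seconds >= replica_timeout_seconds * threshold


def failure_rate(failed: int, total: int) -> float:
    """Return the share of failed executions, or 0.0 when nothing ran."""
    return failed / total if total else 0.0


if __name__ == "__main__":
    print(is_near_timeout(1500, 1800))  # 25 minutes of a 30-minute timeout
    print(is_near_timeout(600, 1800))   # 10 minutes of a 30-minute timeout
    print(failure_rate(3, 20))
```

Executions that repeatedly come close to replica_timeout_in_seconds are a signal to raise the timeout or split the workflow before jobs start getting killed.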

Azure Monitor alert example
resource "azurerm_monitor_metric_alert" "runner_job_timeout" {
  name                = "github-runner-job-timeout"
  resource_group_name = var.resource_group_name
  scopes              = [azurerm_container_app_job.github_runner.id]
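  description         = "Detect repeated GitHub runner job failures (including timeouts)"

  criteria {
    metric_namespace = "Microsoft.App/jobs"
    # Confirm this metric name against the metrics your job exposes in Azure Monitor
    metric_name      = "FailedCount"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 3
  }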

  action {
    action_group_id = var.alert_action_group_id
  }
}

12. Separate Runners by Purpose Using Labels

When using runners for multiple purposes, separate them with labels.

# Standard runner
env {
  name  = "RUNNER_LABELS"
  value = "self-hosted,linux,azure"
}

# High-spec runner (for ML / large builds)
env {
  name  = "RUNNER_LABELS"
  value = "self-hosted,linux,azure,high-memory"
}

Specifying in workflow
jobs:
  ml-training:
    runs-on: [self-hosted, linux, azure, high-memory]
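Label routing follows a subset rule: a runner can pick up a job only if it carries every label the job asks for. A minimal sketch of that rule, with a hypothetical function name and assuming labels are compared case-insensitively:

```python
def runner_can_take_job(runs_on: list[str], runner_labels: list[str]) -> bool:
    """Return True if the runner offers every label requested by the job."""
    wanted = {label.strip().lower() for label in runs_on}
    offered = {label.strip().lower() for label in runner_labels}
    return wanted.issubset(offered)


if __name__ == "__main__":
    standard = "self-hosted,linux,azure".split(",")
    high_spec = "self-hosted,linux,azure,high-memory".split(",")
    ml_job = ["self-hosted", "linux", "azure", "high-memory"]
    ci_job = ["self-hosted", "linux", "azure"]

    print(runner_can_take_job(ml_job, standard))   # standard runner lacks high-memory
    print(runner_can_take_job(ml_job, high_spec))
    print(runner_can_take_job(ci_job, high_spec))  # extra labels do not prevent a match
```

The last case is worth noting: because extra labels do not prevent a match, standard jobs can also land on the high-spec runners. If that is undesirable, give the standard pool a label of its own and request it in runs-on.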

Troubleshooting

Runner Fails to Start

Check KEDA scaler logs: Review system logs in the ACA Environment
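GitHub API rate limits: A PAT allows 5,000 requests per hour; a GitHub App installation allows at least 5,000 requests per hour, and up to 15,000 for organizations on GitHub Enterprise Cloud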
JIT token expiration: Just-in-Time tokens expire quickly, so the registration process must be fast

Job Not Being Assigned

Runner label mismatch: Verify that every label listed under runs-on in the workflow is set on the runner and included in the scaler's labels metadata
Runner group repository access: Confirm the runner group allows access to the target repository
KEDA polling delay: The default polling interval (30s) causes a delay before new jobs become visible

Network Connection Errors

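Outbound rules: Allow outbound HTTPS (443) to *.github.com and *.githubusercontent.com. NSGs filter by IP address or service tag rather than by FQDN, so wildcard rules like these require Azure Firewall or a similar egress proxy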
VNet integration check: Confirm the ACA Environment is correctly integrated with your VNet
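A quick way to separate network problems from runner problems is to test outbound TCP connectivity from inside the environment. The helper below is a minimal sketch using only the Python standard library. The function names are hypothetical, and a successful TCP connection does not prove that TLS inspection or a proxy is not interfering.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_endpoints(hosts: list[str], port: int = 443) -> dict[str, bool]:
    """Check each host and return a mapping of host to reachability."""
    return {host: can_connect(host, port) for host in hosts}
```

Calling check_endpoints(["github.com", "api.github.com"]) from a container in the same subnet shows whether the egress rules let traffic through. A False result for both hosts points to NSG, firewall, or DNS configuration rather than to the runner itself.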

Summary

The combination of ACA Job + KEDA github-runner scaler provides an excellent self-hosted runner platform with the following strengths:

Zero idle cost: Starts only when jobs arrive; zero cost when idle
Fully ephemeral execution: Each job runs in a clean environment, minimizing security risk
VNet integration: Seamless access to private Azure resources
Managed Identity: Passwordless access to Azure resources
Terraform-managed: Declarative, infrastructure-as-code management

These characteristics solve the challenge of "GitHub-hosted Runners cannot meet our requirements, but always-on self-hosted runners are too costly to manage."
