
Deploying Self-Hosted Runners on Almost Any Platform

Written by: Ethan Holz

A few months ago, we launched the gha-runner project to help our developers use AWS for GPU testing as a part of our U.S. National Science Foundation POSE Grant. This library and tool enabled us to start using AWS for new compute tasks in CI. However, we have to ask: Is AWS really the best platform for this application?

The short answer: No.

AWS does not have competitive GPU pricing when you compare against platforms like RunPod or Nebius, which offer much cheaper and faster GPU options for your money. This is where the power of the gha-runner library comes in.

Our implementation abstracts away all communication with GitHub and lets you spin up compute on any provider that accepts an initialization script, such as cloud-init. Furthermore, we added an llms.txt file that you can provide to an LLM to give it context about the project.

In this post, we will dive into how you can build out your own provider and some wacky implementations we have tested out using this library.

Understanding the architecture

Every gha-runner workflow should look something like this:

jobs:
  start-runner:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    outputs:
      mapping: ${{ steps.<start-step>.outputs.mapping }}
      instances: ${{ steps.<start-step>.outputs.instances }}
    steps:
        # A start action to create an instance
  test:
    runs-on: ${{ fromJSON(needs.start-runner.outputs.instances) }}
    defaults:
      run:
        shell: bash -leo pipefail {0}
    needs:
      - start-runner
    steps:
        # Whatever steps you want to run
  stop-runner:
    runs-on: ubuntu-latest
    needs:
      - start-runner
      - test
    if: ${{ always() }}
    permissions:
      id-token: write
      contents: read
    steps:
        # The step to stop your instance

At a high level, we use GitHub's hosted runners to spin up an instance on a provider, pass the created instances to the jobs that need them, and then ensure that everything is torn down afterward. Let's start with the start step:

from abc import ABC, abstractmethod


class CreateCloudInstance(ABC):
    """Abstract base class for starting a cloud instance.

    This class defines the interface for starting a cloud instance.

    """

    @abstractmethod
    def create_instances(self) -> dict[str, str]:
        """Create instances in the cloud provider and return their IDs.

        The number of instances to create is defined by the implementation.

        Returns
        -------
        dict[str, str]
            A dictionary of instance IDs and their corresponding GitHub runner labels.

        """
        raise NotImplementedError

    @abstractmethod
    def wait_until_ready(self, ids: list[str], **kwargs):
        """Wait until instances are in a ready state.

        Parameters
        ----------
        ids : list[str]
            A list of instance IDs to wait for.
        **kwargs : dict, optional
            Additional arguments to pass to the waiter.

        """
        raise NotImplementedError

    @abstractmethod
    def set_instance_mapping(self, mapping: dict[str, str]):
        """Set the instance mapping in the environment.

        Parameters
        ----------
        mapping : dict[str, str]
            A dictionary of instance IDs and their corresponding GitHub runner labels.

        """
        raise NotImplementedError

This class handles the creation of runner instances.

How it does so is up to the developer, but you must inherit from this class. An important piece to note is the set_instance_mapping function; it should export data back out to the GitHub environment so that other steps can explicitly reference the instances we created.
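
To make that concrete, here is a minimal, hypothetical provider sketch. Everything in it is illustrative: a real provider would call its cloud's API, and the GITHUB_OUTPUT plumbing and output names (mapping and instances, matching the workflow above) are assumptions about one plausible way to export the mapping, not the library's actual contract.

import json
import os
from dataclasses import dataclass, field


@dataclass
class ToyCloudProvider(CreateCloudInstance):
    """Hypothetical provider used only to illustrate the interface."""

    region: str = "us-east-1"  # provider-specific parameter, illustrative
    gh_runner_tokens: list[str] = field(default_factory=list)  # injected by gha-runner
    runner_release: str = ""  # injected by gha-runner

    def create_instances(self) -> dict[str, str]:
        # A real provider would call its cloud API here, passing a
        # startup script that registers the runner with a token.
        return {"instance-123": "toy-runner-abc"}

    def wait_until_ready(self, ids: list[str], **kwargs):
        # Poll the provider's API until each instance reports ready.
        pass

    def set_instance_mapping(self, mapping: dict[str, str]):
        # Export the mapping and labels so later workflow steps can
        # reference them; appending JSON to GITHUB_OUTPUT is one
        # plausible approach for an action running in CI.
        with open(os.environ["GITHUB_OUTPUT"], "a") as out:
            out.write(f"mapping={json.dumps(mapping)}\n")
            out.write(f"instances={json.dumps(list(mapping.values()))}\n")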

To stop an instance, we use the following class. It handles the removal of instances and the fetching of the instance mapping from the environment. Again, this is up to you to implement!

class StopCloudInstance(ABC):
    """Abstract base class for stopping a cloud instance.

    This class defines the interface for stopping a cloud instance.

    """

    @abstractmethod
    def remove_instances(self, ids: list[str]):
        """Remove instances from the cloud provider.

        Parameters
        ----------
        ids : list[str]
            A list of instance IDs to remove.

        """
        raise NotImplementedError

    @abstractmethod
    def wait_until_removed(self, ids: list[str], **kwargs):
        """Wait until instances are removed.

        Parameters
        ----------
        ids : list[str]
            A list of instance IDs to wait for.
        **kwargs : dict, optional
            Additional arguments to pass to the waiter.

        """
        raise NotImplementedError

    @abstractmethod
    def get_instance_mapping(self) -> dict[str, str]:
        """Get the instance mapping from the environment.

        Returns
        -------
        dict[str, str]
            A dictionary of instance IDs and their corresponding GitHub runner labels.

        """
        raise NotImplementedError
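
A matching sketch for the stop side might look like the following. How the mapping gets back into the environment depends on your action's inputs; reading it from a single INSTANCE_MAPPING environment variable is purely an assumption for this sketch.

import json
import os


class ToyCloudStopper(StopCloudInstance):
    """Hypothetical counterpart to ToyCloudProvider, for illustration."""

    def remove_instances(self, ids: list[str]):
        # A real implementation would call the provider's API to
        # terminate each instance.
        for instance_id in ids:
            print(f"Terminating {instance_id}...")

    def wait_until_removed(self, ids: list[str], **kwargs):
        # Poll until the provider reports each instance as gone.
        pass

    def get_instance_mapping(self) -> dict[str, str]:
        # Read back the mapping that the start step exported; the
        # environment variable name here is assumed, not prescribed.
        return json.loads(os.environ["INSTANCE_MAPPING"])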

These two classes culminate in a simple interface for deploying a new runner. It works by taking your custom subclass and spinning up an instance of that class with some required boilerplate for interacting with GitHub.

from dataclasses import dataclass, field
from typing import Type


@dataclass
class DeployInstance:
    """Class that is used to deploy instances and runners.

    Parameters
    ----------
    provider_type : Type[CreateCloudInstance]
        The type of cloud provider to use.
    cloud_params : dict
        The parameters to pass to the cloud provider.
    gh : GitHubInstance
        The GitHub instance to use.
    count : int
        The number of instances to create.
    timeout : int
        The timeout to use when waiting for the runner to come online


    Attributes
    ----------
    provider : CreateCloudInstance
        The cloud provider instance
    provider_type : Type[CreateCloudInstance]
    cloud_params : dict
    gh : GitHubInstance
    count : int
    timeout : int

    """

    provider_type: Type[CreateCloudInstance]
    cloud_params: dict
    gh: GitHubInstance
    count: int
    timeout: int
    provider: CreateCloudInstance = field(init=False)

    def __post_init__(self):
        """Initialize the cloud provider.

        This function is called after the object is created to correctly
        init the provider.

        """
        # We need to create runner tokens for use by the provider
        runner_tokens = self.gh.create_runner_tokens(self.count)
        self.cloud_params["gh_runner_tokens"] = runner_tokens
        architecture = self.cloud_params.get("arch", "x64")
        release = self.gh.get_latest_runner_release(
            platform="linux", architecture=architecture
        )
        self.cloud_params["runner_release"] = release
        self.provider = self.provider_type(**self.cloud_params)

    def start_runner_instances(self):
        """Start the runner instances.

        This function starts the runner instances and waits for them to be ready.

        """
        print("Starting up...")
        print("Creating GitHub Actions Runner")

        mappings = self.provider.create_instances()
        instance_ids = list(mappings.keys())
        github_labels = list(mappings.values())
        # Output the instance mapping and labels so the stop action can use them
        self.provider.set_instance_mapping(mappings)
        # Wait for the instance to be ready
        print("Waiting for instance to be ready...")
        self.provider.wait_until_ready(instance_ids)
        print("Instance is ready!")
        # Confirm the runner is registered with GitHub
        for label in github_labels:
            print(f"Waiting for {label}...")
            self.gh.wait_for_runner(label, self.timeout)

A similar class exists for the stop piece as well. This keeps the process repeatable and lets you focus only on the actual interaction with your provider of choice, as in the sketch below. If you are like us, you might be thinking of all the ways you can hack on this to work with wherever you have compute resources. SSH using Paramiko, Docker daemon connections, and integrations with services like Nebius immediately come to mind.
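
Putting it together, wiring a custom provider into DeployInstance might look something like this. The GitHubInstance constructor arguments and import path are assumptions for illustration; check the library's API docs for the real signatures.

from gha_runner.gh import GitHubInstance  # import path is an assumption

# Hypothetical wiring: cloud_params holds whatever your provider's
# __init__ accepts; gha-runner injects gh_runner_tokens and
# runner_release for you before instantiating the provider.
deployment = DeployInstance(
    provider_type=ToyCloudProvider,
    cloud_params={"region": "us-east-1"},
    gh=GitHubInstance(token="<github-token>", repo="owner/repo"),  # assumed signature
    count=1,
    timeout=600,
)
deployment.start_runner_instances()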

Leveraging an LLM

As we mentioned before, we publish an LLM-friendly version of our docs at https://gha-runner.readthedocs.io/en/latest/llms-full.txt, as well as a preliminary parsed version on the Context7 MCP: https://context7.com/llmstxt/gha-runner_readthedocs_io-en-latest-llms.txt.

These tools let you rapidly prototype against providers with well-defined APIs. Simply provide the llms-full.txt to your LLM of choice as context and ask for what you want to build. For example: "Write me an Azure gha-runner using the docs provided."

I will note that this is not always accurate, and documentation is king. If you find that things are not accurately represented in our docs, let us know or open a PR!

Using LLM-native docs, we were able to quickly prototype provider skeletons for both Azure and RunPod. While we have not released these, they let us see whether it was possible to add these providers and what it might take to maintain them.

Caveats

As you may have noticed, quite a bit is still missing from this implementation when it comes to communicating with the engines that actually provision infrastructure. We assume that those prior steps happen before starting or stopping a runner. This is not to say you can't add this for your own platform, but in most cases an action already exists for authenticating to your service.

We also assume that your class has both a gh_runner_tokens and a runner_release field. The gha_runner library provides both to your class: the tokens provisioned to register your runners, and the most recent actions/runner release to run on your machine. There are cases where you may not need the second (for example, if you are using a managed Docker image or the runner binary is preinstalled), but you must be able to at least accept the parameter. Generally, you would then add some sort of setup script to install and start the runner binary, along the lines of the sketch below.
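
For instance, a provider might render those fields into a shell script that it passes as cloud-init user data. The helper below is hypothetical, but the config.sh/run.sh flow matches the standard actions/runner setup:

def build_startup_script(token: str, release_url: str, label: str, repo: str) -> str:
    # Hypothetical helper: renders a shell script suitable for use as
    # cloud-init user data. The token comes from gh_runner_tokens and
    # the download URL from runner_release.
    return f"""#!/bin/bash
mkdir -p /actions-runner && cd /actions-runner
curl -L -o runner.tar.gz {release_url}
tar xzf runner.tar.gz
export RUNNER_ALLOW_RUNASROOT=1
./config.sh --url https://github.com/{repo} --token {token} \\
    --labels {label} --ephemeral --unattended
./run.sh
"""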

Pi for dessert


While I spent a lot of time developing this tool for scientists, I wanted to see how far I could take it. In my free time, I write software that I self-host on a set of Raspberry Pis in my house, and I have wanted to run CI directly on those Pis to validate my software on the hardware it will run on. This led me to develop a version of this tool that talks to the Docker socket directly, allowing me to provision containers as runners.
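
As a rough sketch of that idea using the Docker SDK for Python (the image name and environment variables below are placeholders, not what my setup actually uses):

import docker

# Connect to the local Docker socket (e.g. on a Raspberry Pi host).
client = docker.from_env()

# Launch an ephemeral runner container. Any image that registers a
# GitHub Actions runner on startup would work here.
container = client.containers.run(
    "my-registry/actions-runner:latest",  # placeholder image
    detach=True,
    environment={
        "RUNNER_TOKEN": "<registration-token>",  # placeholder
        "RUNNER_LABELS": "pi-runner",
    },
    auto_remove=True,  # clean the container up when the runner exits
)
print(f"Started runner container {container.short_id}")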

This project flips a few of the expectations behind gha-runner on their head, but it helped highlight the flexibility of this tooling. I did not need to worry about handling GitHub interactions; I could just focus on how to get a runner into my server rack. Even better, I could build the solution that works for me.

Because gha-runner is published on PyPI and is a pure Python package, I was able to leverage my build tool of choice, uv. I can even leverage tools like uv2nix to consume the package with Nix. I mention this to highlight one of the original goals of gha-runner: choice.

As a developer, it is up to you to pick the tools you want to leverage for provisioning your infrastructure. Maybe you love conda? Use it! Do you want to run your CI on Raspberry Pis? Go for it! We wanted to make a library that is minimal and extensible so that our community can build what they need for the future of compute.

What can you build?

So with all of this in mind, what can you build? Do you have compute you want to run your GitHub Actions on? Let us know on Bluesky or LinkedIn how you use this tool or how we can help you better use gha-runner!

Happy hacking!