
Why We Built Our Own Self-hosted GitHub Actions Runner Tool

Written by: Ethan Holz


In our previous post about the gha-runner project, we talked about how the OMSF built a tool to help provision ephemeral GitHub Actions runners based on the ec2-github-runner project. We mentioned how this project was fantastic for setup, but why did we choose to build our own tool?

As an objective of our grant, we wish to strengthen the molecular software ecosystem by providing scientists with the tools to run testing and validation on more hardware options.

For this we had a few different criteria:

    • Needs to be easily maintainable by existing teams
    • Must be cloud-agnostic to reduce vendor lock-in
    • Must have the ability to "scale-to-zero" when not in use

What we found was that most open source solutions were great for small deployments but struggled at moderate scale, or were too heavyweight for the scale of our projects. Furthermore, maintenance of many of these projects was inconsistent. The ec2-github-runner was already in use by members of the ecosystem, and we wanted something that mirrored its simplicity; however, it had a few notable problems when it came to long-term support:

    • The runner has many open PRs for new features that would be beneficial for cost management
    • The runner lacks a way to grab the latest GitHub Actions runner code
    • The runner only works on AWS
    • Python is common in the ecosystem, and we wanted a solution that encourages contribution
    • The runner has little unit or integration testing

We loved how this tool worked and its ease of setup for projects, but we wanted more.

With this in mind, we set out to design and architect a tool inspired by this project, one that would enable OMSF projects to be ready for other cloud providers in the future.

Designing for community

Maintainable research software requires a few things to last: testing, documentation, ease of contribution, and great research. Testing and documentation are the two areas where cloud infrastructure projects most often fall short as they scale.

    • Tests provide developers and users the confidence that a given piece of software will run as expected.
    • Documentation allows for users to learn and understand the features of your software while also providing developers with an avenue to contribute new features.

These first two points are easy to implement for many projects and help ensure that your project can outlast its initial scope, but ease of contribution takes those ideas and extends them to your community. Your users can use your docs to understand functionality and your tests to validate usability.

Contributors may use documentation to understand design choices and your tests to help ensure changes don't break existing experiences. Ease of contribution asks the harder questions: what language do I use? How do I architect my project to be extensible yet simple?

For the OMSF community, the first question was straightforward: Python.
If we were going to build a solution written for the developers on our teams, we needed to build a solution that used a language and environment they were familiar with. While there are many advantages to writing GitHub Actions in JavaScript (especially around performance), we opted to use Python so that our community could more easily get involved in development. We have already seen people using this tool and contributing to its development from outside of the OMSF community!

Not only should the code be easy to work with, it should also be easy to extend. Initially, we wrote a solution that required every implementation to live in one repo. This made it really hard for developers to write their own implementations and also made the project harder for us to maintain. One of the problems mentioned earlier was that the original runner only works on AWS, and we absolutely needed to support more than that, so we leveraged some common object-oriented and functional design patterns to ensure we could extend this action.

Software Architecture

The software architecture of our action is built around abstract base classes that define the operations all compute providers have in common. We identified these commonalities as:

  • Creating multiple instances
  • Removing multiple instances
  • Waiting until an instance is online
  • Waiting until an instance is removed

Notice how these descriptions avoid specific compute-provider names when referencing infrastructure. This is intentional: the goal is to provide a common basis across all providers.

To support a specific provider, you subclass this base class and implement each of these methods for that provider, as sketched below.
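To make this concrete, here is a minimal sketch of what such a base class can look like in Python. The class and method names are illustrative only, not the actual gha-runner API.

```python
# Illustrative sketch only -- the class and method names are hypothetical,
# not the real gha-runner API.
from abc import ABC, abstractmethod


class ComputeProvider(ABC):
    """Operations that every compute provider has in common."""

    @abstractmethod
    def create_instances(self) -> list[str]:
        """Launch one or more instances and return their IDs."""

    @abstractmethod
    def remove_instances(self, ids: list[str]) -> None:
        """Terminate the given instances."""

    @abstractmethod
    def wait_until_ready(self, ids: list[str]) -> None:
        """Block until every instance is online and registered as a runner."""

    @abstractmethod
    def wait_until_removed(self, ids: list[str]) -> None:
        """Block until every instance has been torn down."""
```

A provider that forgets to implement one of these methods cannot even be instantiated, which keeps the contract between the library and its extensions explicit.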

With this in mind, we created a collection of projects to solve this problem.
The first, gha-runner, is a library that handles the complexity of registering compute with your repo on GitHub and provides a set of abstract base classes that map to the commonalities above.

We also added some utilities that make things like input parsing and output rendering easy in a GitHub Actions environment. Instead of fighting with complicated and often buggy parsing, we developed deterministic input parsing built on functional programming principles. This model gave us an intuitive and repeatable way of testing potential input parameters.
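As a rough illustration of the approach (the input names and fields below are hypothetical, not gha-runner's real inputs): GitHub Actions exposes each action input as an INPUT_* environment variable, so a pure function that maps a plain dictionary onto an immutable configuration object is deterministic by construction and easy to unit test with ordinary dicts.

```python
# Sketch of deterministic, functional input parsing; the input names and
# fields here are made up for illustration.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerConfig:
    instance_count: int
    labels: tuple[str, ...]


def parse_inputs(env: dict[str, str]) -> RunnerConfig:
    """Pure function: the same mapping in always yields the same config out."""
    return RunnerConfig(
        instance_count=int(env.get("INPUT_INSTANCE_COUNT", "1")),
        labels=tuple(
            label.strip()
            for label in env.get("INPUT_LABELS", "").split(",")
            if label.strip()
        ),
    )


# The action passes the real environment; tests can pass any plain dict.
config = parse_inputs(dict(os.environ))
```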

Next, we abstracted the initial repo into a library of abstract base classes, giving us a repeatable flow of events without any direct cloud interaction. This let us split the starting and stopping of runners into their own actions, making for a cleaner and more explicit usage experience.
In the long term, this ensures 100% test coverage of the underlying process while letting developers use whatever tooling they want to integrate their own compute provider.

For AWS, the most relevant cloud platform for our teams right now, we implemented the start-aws-gha-runner and stop-aws-gha-runner actions, which provide a concrete reference implementation of our abstract classes and meet the most urgent needs of our teams.
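To give a feel for what a concrete implementation involves, here is a rough sketch of an AWS provider built on boto3. It assumes the hypothetical ComputeProvider base class sketched earlier; the real start-aws-gha-runner and stop-aws-gha-runner actions differ in their details.

```python
# Rough sketch of an AWS provider using boto3. Class and method names mirror
# the hypothetical base class above, not the actual gha-runner code.
import boto3


class AWSProvider:  # in practice this would subclass the ComputeProvider ABC
    def __init__(self, image_id: str, instance_type: str, region: str, count: int = 1):
        self.ec2 = boto3.client("ec2", region_name=region)
        self.image_id = image_id
        self.instance_type = instance_type
        self.count = count

    def create_instances(self) -> list[str]:
        # Launch the requested number of EC2 instances.
        response = self.ec2.run_instances(
            ImageId=self.image_id,
            InstanceType=self.instance_type,
            MinCount=self.count,
            MaxCount=self.count,
        )
        return [instance["InstanceId"] for instance in response["Instances"]]

    def remove_instances(self, ids: list[str]) -> None:
        # Terminate the instances started above.
        self.ec2.terminate_instances(InstanceIds=ids)

    def wait_until_ready(self, ids: list[str]) -> None:
        # boto3 waiters poll EC2 until every instance reports "running".
        self.ec2.get_waiter("instance_running").wait(InstanceIds=ids)

    def wait_until_removed(self, ids: list[str]) -> None:
        self.ec2.get_waiter("instance_terminated").wait(InstanceIds=ids)
```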

Providing software projects with choice

This project aims to give developers and maintainers the flexibility to "bring your own infrastructure" while enabling the community to contribute new additions with well-documented code and accessible testing. This means we follow best practices so that developers can use whatever Python tools they want in the future to configure infrastructure. In the case of the OMSF, we want to ensure that we can deploy tests on any hardware architecture our users want to test on. As of today, this tool is being used across the OMSF, from GPU testing to job submission, with the flexibility to integrate with other compute needs in the future.