From a Single Repo, to Multi-Repos, to Monorepo, to Multi-Monorepo

By Leonardo Losoviz

I’ve been working on the same project for several years. Its initial version was a huge monolithic app containing thousands of files. It was poorly architected and non-reusable, but it was hosted in a single repo, which made it easy to work with. Later, I “fixed” the mess in the project by splitting the codebase into autonomous packages, hosting each of them on its own repo, and managing them with Composer. The codebase became properly architected and reusable, but being split across multiple repos made it a lot more difficult to work with.

As the code was reformatted time and again, its hosting in the repo also had to adapt, going from the initial single repo, to multiple repos, to a monorepo, to what may be called a “multi-monorepo.”

Let me take you on the journey of how this took place, explaining why and when I felt I had to switch to a new approach. The journey consists of four stages (so far!) so let’s break it down like that.

Stage 1: Single repo

The project is leoloso/PoP and it’s been through several hosting schemes, following how its code was re-architected at different times.

It was born as this WordPress site, comprising a theme and several plugins. All of the code was hosted together in the same repo.

Some time later, I needed another site with similar features so I went the quick and easy way: I duplicated the theme and added its own custom plugins, all in the same repo. I got the new site running in no time.

I did the same for another site, and then another one, and another one. Eventually the repo was hosting some 10 sites, comprising thousands of files.

A single repository hosting all our code.

Issues with the single repo

While this setup made it easy to spin up new sites, it didn’t scale well at all. The big thing is that a single change involved searching for the same string across all 10 sites. That was completely unmanageable. Let’s just say that copy/paste/search/replace became a routine thing for me.

So it was time to start coding PHP the right way.

Stage 2: Multirepo

Fast forward a couple of years. I completely split the application into PHP packages, managed via Composer and dependency injection.

Composer uses Packagist as its main PHP package repository. In order to publish a package, Packagist requires a composer.json file placed at the root of the package’s repo. That means we are unable to have multiple PHP packages, each with its own composer.json, hosted on the same repo.

As a consequence, I had to switch from hosting all of the code in the single leoloso/PoP repo, to using multiple repos, with one repo per PHP package. To help manage them, I created the organization “PoP” in GitHub and hosted all repos there, including getpop/root, getpop/component-model, getpop/engine, and many others.

In the multirepo, each package is hosted on its own repo.

Issues with the multirepo

Handling a multirepo can be easy when you have a handful of PHP packages. But in my case, the codebase comprised over 200 PHP packages. Managing them was no fun.

The project was split into so many packages because I also decoupled the code from WordPress (so that the packages could also be used with other CMSs), which requires every package to be very granular and deal with a single goal.

Now, 200 packages is not ordinary. But even if a project comprises only 10 packages, it can be difficult to manage across 10 repositories. That’s because every package must be versioned, and every version of a package depends on some version of another package. When creating pull requests, we need to configure the composer.json file on every package to use the corresponding development branch of its dependencies. It’s cumbersome and bureaucratic.
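To give a feel for that bookkeeping, here is a minimal sketch that prints the Composer commands needed to point a package at a feature branch of each of its dependencies. The function name, the branch name, and the choice of dependencies are hypothetical; the only thing taken from Composer itself is the dev-<branch> syntax for requiring a Git branch.

```shell
# Print (rather than run) the commands that point a package at the
# feature branch of each of its dependencies.
# Usage: require_dev_branch <branch> <package>...
require_dev_branch() {
  branch=$1
  shift
  for package in "$@"; do
    # "dev-<branch>" is how Composer refers to a Git branch
    echo "composer require ${package}:dev-${branch}"
  done
}

# Hypothetical feature branch that touches two packages
require_dev_branch add-feature getpop/engine getpop/component-model
```

The commands would have to be repeated in every package that depends on the branch, and reverted once the branch is merged, which is where the bureaucracy comes from.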

In my case, I ended up not using feature branches at all and simply pointed every package to the dev-master version of its dependencies (i.e. I was not versioning packages). I wouldn’t be surprised to learn that this is a common practice.

There are tools to help manage multiple repos, like meta. It creates a project composed of multiple repos, and running git commit -m "some message" on the project executes that same command on every repo, keeping them in sync with each other.

However, meta will not help manage the versioning of each dependency on their composer.json file. Even though it helps alleviate the pain, it is not a definitive solution.
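The core idea behind such tools can be approximated with a small shell loop. This is only a sketch of the concept, not how meta is implemented: the function name is hypothetical, and it assumes all package repos are cloned side by side under one parent folder.

```shell
# Run the same command inside every Git repository found directly under a folder.
# Usage: in_all_repos <parent_dir> <command> [arguments]...
in_all_repos() {
  parent=$1
  shift
  for repo in "$parent"/*/; do
    # Skip folders that are not Git repositories
    [ -d "${repo}.git" ] || continue
    echo "==> ${repo}"
    # Run the command in a subshell so the working directory is restored
    ( cd "$repo" && "$@" ) || return 1
  done
}

# Example (assumes the repos live under ./repos):
# in_all_repos ./repos git commit -am "some message"
```

As with meta, a loop like this keeps the commits in sync but does nothing about the versions declared in each composer.json file.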

So, it was time to bring all packages to the same repo.

Stage 3: Monorepo

The monorepo is a single repo that hosts the code for multiple projects. Since it hosts different packages together, we can version control them together too. This way, all packages can be published with the same version, and linked across dependencies. This makes pull requests very simple.

The monorepo hosts multiple packages.

As I mentioned earlier, we are not able to publish PHP packages to Packagist if they are hosted on the same repo. But we can overcome this constraint by decoupling development and distribution of the code: we use the monorepo to host and edit the source code, and multiple repos (at one repo per package) to publish them to Packagist for distribution and consumption.

The monorepo hosts the source code, multiple repos distribute it.

Switching to the Monorepo

Switching to the monorepo approach involved the following steps:

First, I created the folder structure in leoloso/PoP to host the multiple projects. I decided to use a two-level hierarchy, first under layers/ to indicate the broader project, and then under packages/, plugins/, clients/ and whatnot to indicate the category.

Showing the GitHub repo for a project called PoP. The screen is in dark mode, so the background is near black and the text is off-white, except for blue links.
The monorepo layers indicate the broader project.

Then, I copied all source code from all repos (getpop/engine, getpop/component-model, etc.) to the corresponding location for that package in the monorepo (i.e. layers/Engine/packages/engine, layers/Engine/packages/component-model, etc).

I didn’t need to keep the Git history of the packages, so I just copied the files with Finder. Otherwise, we can use hraban/tomono or shopsys/monorepo-tools to port repos into the monorepo, while preserving their Git history and commit hashes.

Next, I updated the description of all downstream repos, to start with [READ ONLY], such as this one.

Showing the GitHub repo for the component-model project. The screen is in dark mode, so the background is near black and the text is off-white, except for blue links. There is a sidebar to the right of the screen that is next to the list of files in the repo. The sidebar has an About heading with a description that reads: "Read only, component model for Pop, over which the component-based architecture is based." This is highlighted in red.
The downstream repo’s “READ ONLY” is located in the repo description.

I executed this task in bulk via GitHub’s GraphQL API. I first obtained all of the descriptions from all of the repos, with this query:

{
  repositoryOwner(login: "getpop") {
    repositories(first: 100) {
      nodes {
        id
        name
        description
      }
    }
  }
}

…which returned a list like this:

{
  "data": {
    "repositoryOwner": {
      "repositories": {
        "nodes": [
          {
            "id": "MDEwOlJlcG9zaXRvcnkxODQ2OTYyODc=",
            "name": "hooks",
            "description": "Contracts to implement hooks (filters and actions) for PoP"
          },
          {
            "id": "MDEwOlJlcG9zaXRvcnkxODU1NTQ4MDE=",
            "name": "root",
            "description": "Declaration of dependencies shared by all PoP components"
          },
          {
            "id": "MDEwOlJlcG9zaXRvcnkxODYyMjczNTk=",
            "name": "engine",
            "description": "Engine for PoP"
          }
        ]
      }
    }
  }
}

From there, I copied all descriptions, added [READ ONLY] to them, and for every repo generated a new query executing the updateRepository GraphQL mutation:

mutation {
  updateRepository(
    input: {
      repositoryId: "MDEwOlJlcG9zaXRvcnkxODYyMjczNTk="
      description: "[READ ONLY] Engine for PoP"
    }
  ) {
    repository {
      description
    }
  }
}
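Generating one mutation per repo is easy to script. The following is a minimal sketch, with a hypothetical function name, that prints the mutation for a given repository ID and its current description. It assumes the description contains no double quotes or backslashes, which would need escaping.

```shell
# Print an updateRepository mutation that prefixes a description with [READ ONLY].
# Assumes the description contains no double quotes or backslashes.
# Usage: read_only_mutation <repository_id> <current_description>
read_only_mutation() {
  cat <<EOF
mutation {
  updateRepository(
    input: {
      repositoryId: "$1"
      description: "[READ ONLY] $2"
    }
  ) {
    repository {
      description
    }
  }
}
EOF
}

# ID and description taken from the query results shown earlier
read_only_mutation "MDEwOlJlcG9zaXRvcnkxODYyMjczNTk=" "Engine for PoP"
```

Each printed mutation can then be sent to the GraphQL API with whatever client is at hand.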

Finally, I introduced tooling to help “split the monorepo.” Using a monorepo relies on synchronizing the code between the upstream monorepo and the downstream repos, triggered whenever a pull request is merged. This action is called “splitting the monorepo.” Splitting the monorepo can be achieved with a git subtree split command but, because I’m lazy, I’d rather use a tool.

I chose Monorepo builder, which is written in PHP. I like this tool because I can customize it with my own functionality. Other popular tools are the Git Subtree Splitter (written in Go) and Git Subsplit (bash script).
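For reference, doing the split by hand with git subtree split could look roughly like the sketch below. It only prints the commands so they can be reviewed first. The function name, the temporary branch naming, and the assumption that the downstream repo publishes from master are all hypothetical, and this is not what Monorepo builder runs internally.

```shell
# Print the Git commands that split one package out of the monorepo and
# push it to its downstream repo.
# Usage: split_commands <path_in_monorepo> <downstream_repo_url>
split_commands() {
  prefix=$1
  remote=$2
  # Temporary branch named after the package folder (hypothetical convention)
  branch="split-$(basename "$prefix")"
  # Build a branch containing only the history of the package's folder
  echo "git subtree split --prefix=${prefix} --branch=${branch}"
  # Publish that branch as the downstream repo's master branch (assumed name)
  echo "git push ${remote} ${branch}:master"
  # Remove the temporary branch
  echo "git branch -D ${branch}"
}

split_commands layers/Engine/packages/engine https://github.com/getpop/engine.git
```

With 200 packages this would have to run once per package on every merge, which is the repetitive work a dedicated tool takes care of.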

What I like about the Monorepo

I feel at home with the monorepo. The speed of development has improved because dealing with 200 packages feels pretty much like dealing with just one. The boost is most evident when refactoring the codebase, i.e. when executing updates across many packages.

The monorepo also allows me to release multiple WordPress plugins at once. All I need to do is provide a configuration to GitHub Actions via PHP code (when using the Monorepo builder) instead of hard-coding it in YAML.

To generate a WordPress plugin for distribution, I had created a generate_plugins.yml workflow that triggers when creating a release. With the monorepo, I have adapted it to generate not just one, but multiple plugins, configured via PHP through a custom command in plugin-config-entries-json, and invoked like this in GitHub Actions:

- id: output_data
  run: |
    echo "::set-output name=plugin_config_entries::$(vendor/bin/monorepo-builder plugin-config-entries-json)"

This way, I can generate my GraphQL API plugin and other plugins hosted in the monorepo all at once. The configuration defined via PHP is this one.

class PluginDataSource
{
  public function getPluginConfigEntries(): array
  {
    return [
      // GraphQL API for WordPress
      [
        'path' => 'layers/GraphQLAPIForWP/plugins/graphql-api-for-wp',
        'zip_file' => 'graphql-api.zip',
        'main_file' => 'graphql-api.php',
        'dist_repo_organization' => 'GraphQLAPI',
        'dist_repo_name' => 'graphql-api-for-wp-dist',
      ],
      // GraphQL API - Extension Demo
      [
        'path' => 'layers/GraphQLAPIForWP/plugins/extension-demo',
        'zip_file' => 'graphql-api-extension-demo.zip',
        'main_file' => 'graphql-api-extension-demo.php',
        'dist_repo_organization' => 'GraphQLAPI',
        'dist_repo_name' => 'extension-demo-dist',
      ],
    ];
  }
}

When creating a release, the plugins are generated via GitHub Actions.

Dark mode screen in GitHub showing the actions for the project.
This figure shows plugins generated when a release is created.

If, in the future, I add the code for yet another plugin to the repo, it will also be generated without any trouble. Investing some time and energy producing this setup now will definitely save plenty of time and energy in the future.

Issues with the Monorepo

I believe the monorepo is particularly useful when all packages are coded in the same programming language, tightly coupled, and relying on the same tooling. If instead we have multiple projects based on different programming languages (such as JavaScript and PHP), composed of unrelated parts (such as the main website code and a subdomain that handles newsletter subscriptions), or tooling (such as PHPUnit and Jest), then I don’t believe the monorepo provides much of an advantage.

That said, there are downsides to the monorepo:

  • We must use the same license for all of the code hosted in the monorepo; otherwise, we’re unable to add a LICENSE.md file at the root of the monorepo and have GitHub pick it up automatically. Indeed, leoloso/PoP initially provided several libraries using MIT and the plugin using GPLv2. So, I decided to simplify it using the lowest common denominator between them, which is GPLv2.
  • There is a lot of code, a lot of documentation, and plenty of issues, all from different projects. As such, potential contributors that were attracted to a specific project can easily get confused.
  • When tagging the code, all packages are versioned with that tag, whether their particular code was updated or not. This is an issue with the Monorepo builder and not necessarily with the monorepo approach (Symfony has solved this problem for its monorepo).
  • The issues board needs proper management. In particular, it requires labels to assign issues to the corresponding project, or risk it becoming chaotic.
Showing the list of reported issues for the project in GitHub in dark mode. The image shows just how crowded and messy the screen looks when there are a bunch of issues from different projects in the same list without a way to differentiate them.
The issues board can become chaotic without labels that are associated with projects.
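On the tagging issue, one conceivable way to find out which packages actually changed since the previous tag is to filter the list of modified files by package folder. The sketch below is an illustration only; it is neither what Monorepo builder does nor how Symfony solves it. The function name and the tag in the example are hypothetical, and it assumes the layers/<layer>/packages/<package> layout described earlier.

```shell
# Read changed file paths on stdin and print the package folders they
# belong to, one per line, without duplicates.
# Assumes the layout layers/<layer>/packages/<package>/...
changed_packages() {
  # Keep the first four path segments of files inside a packages/ folder
  sed -n 's#^\(layers/[^/]*/packages/[^/]*\)/.*#\1#p' | sort -u
}

# Example: packages touched since a (hypothetical) previous tag
# git diff --name-only v1.0.0 HEAD | changed_packages
```

Files outside a packages/ folder, such as a README at the root, are simply ignored by the filter.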

All these issues are not roadblocks though. I can cope with them. However, there is an issue that the monorepo cannot help me with: hosting both public and private code together.

I’m planning to create a “PRO” version of my plugin which I plan to host in a private repo. However, the code in the repo is either public or private, so I’m unable to host my private code in the public leoloso/PoP repo. At the same time, I want to keep using my setup for the private repo too, particularly the generate_plugins.yml workflow (which already scopes the plugin and downgrades its code from PHP 8.0 to 7.1) and its possibility to configure it via PHP. And I want to keep it DRY, avoiding copy/pastes.

It was time to switch to the multi-monorepo.

Stage 4: Multi-monorepo

The multi-monorepo approach consists of different monorepos sharing their files with each other, linked via Git submodules. At its most basic, a multi-monorepo comprises two monorepos: an autonomous upstream monorepo, and a downstream monorepo that embeds the upstream repo as a Git submodule that’s able to access its files:

A giant red folder illustration is labeled as the downstream monorepo and it contains a smaller green folder showing the upstream monorepo.
The upstream monorepo is contained within the downstream monorepo.

This approach satisfies my requirements by:

  • having the public repo leoloso/PoP be the upstream monorepo, and
  • creating a private repo leoloso/GraphQLAPI-PRO that serves as the downstream monorepo.
The same illustration as before, but now the large folder is bright pink and labeled with the project name, and the smaller folder is purplish-blue and labeled with the name of the public upstream monorepo.
A private monorepo can access the files from a public monorepo.

leoloso/GraphQLAPI-PRO embeds leoloso/PoP under subfolder submodules/PoP (notice how GitHub links to the specific commit of the embedded repo):

This figure shows how the public monorepo is embedded within the private monorepo in the GitHub project.

Now, leoloso/GraphQLAPI-PRO can access all the files from leoloso/PoP. For instance, script ci/downgrade/downgrade_code.sh from leoloso/PoP (which downgrades the code from PHP 8.0 to 7.1) can be accessed under submodules/PoP/ci/downgrade/downgrade_code.sh.
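In Git terms, the setup boils down to a few standard submodule commands plus a path prefix. The sketch below wraps them in functions with hypothetical names. The run_upstream helper assumes the upstream script can be run with sh, which may not hold for every script in the repo.

```shell
# One-time setup in the downstream repo: embed the upstream monorepo.
embed_upstream() {
  git submodule add https://github.com/leoloso/PoP.git submodules/PoP
  git commit -m "Embed leoloso/PoP as a submodule"
}

# After cloning the downstream repo, fetch the contents of the submodule.
fetch_upstream() {
  git submodule update --init --recursive
}

# Run a script that lives in the upstream monorepo from the downstream root.
# Assumes the script is compatible with sh.
# Usage: run_upstream <path_inside_upstream> [arguments]...
run_upstream() {
  script="submodules/PoP/$1"
  shift
  if [ ! -f "$script" ]; then
    echo "Missing $script (is the submodule checked out?)" >&2
    return 1
  fi
  sh "$script" "$@"
}

# Example, from the root of the downstream monorepo:
# run_upstream ci/downgrade/downgrade_code.sh
```

Because the submodule is pinned to a specific commit, the downstream repo only sees upstream changes after the submodule reference is updated and committed.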

In addition, the downstream repo can load the PHP code from the upstream repo and even extend it. This way, the configuration to generate the public WordPress plugins can be overridden to produce the PRO plugin versions instead:

class PluginDataSource extends UpstreamPluginDataSource
{
  public function getPluginConfigEntries(): array
  {
    return [
      // GraphQL API PRO
      [
        'path' => 'layers/GraphQLAPIForWP/plugins/graphql-api-pro',
        'zip_file' => 'graphql-api-pro.zip',
        'main_file' => 'graphql-api-pro.php',
        'dist_repo_organization' => 'GraphQLAPI-PRO',
        'dist_repo_name' => 'graphql-api-pro-dist',
      ],
      // GraphQL API Extensions
      // Google Translate
      [
        'path' => 'layers/GraphQLAPIForWP/plugins/google-translate',
        'zip_file' => 'graphql-api-google-translate.zip',
        'main_file' => 'graphql-api-google-translate.php',
        'dist_repo_organization' => 'GraphQLAPI-PRO',
        'dist_repo_name' => 'graphql-api-google-translate-dist',
      ],
      // Events Manager
      [
        'path' => 'layers/GraphQLAPIForWP/plugins/events-manager',
        'zip_file' => 'graphql-api-events-manager.zip',
        'main_file' => 'graphql-api-events-manager.php',
        'dist_repo_organization' => 'GraphQLAPI-PRO',
        'dist_repo_name' => 'graphql-api-events-manager-dist',
      ],
    ];
  }
}

GitHub Actions will only load workflows from under .github/workflows, and the upstream workflows are under submodules/PoP/.github/workflows; hence we need to copy them. This is not ideal, though we can avoid editing the copied workflows and treat the upstream files as the single source of truth.

To copy the workflows over, a simple Composer script can do:

{
  "scripts": {
    "copy-workflows": [
      "php -r \"copy('submodules/PoP/.github/workflows/generate_plugins.yml', '.github/workflows/generate_plugins.yml');\"",
      "php -r \"copy('submodules/PoP/.github/workflows/split_monorepo.yaml', '.github/workflows/split_monorepo.yaml');\""
    ]
  }
}

Then, each time I edit the workflows in the upstream monorepo, I also copy them to the downstream monorepo by executing the following command:

composer copy-workflows

Once this setup is in place, the private repo generates its own plugins by reusing the workflow from the public repo:

This figure shows the PRO plugins generated in GitHub Actions.

I am extremely satisfied with this approach. I feel it has removed all of the burden from my shoulders concerning the way projects are managed. I read about a WordPress plugin author complaining that managing the releases of his 10+ plugins was taking a considerable amount of time. That doesn’t happen here—after I merge my pull request, both public and private plugins are generated automatically, like magic.

Issues with the multi-monorepo

First off, it leaks. Ideally, leoloso/PoP should be completely autonomous and unaware that it is used as an upstream monorepo in a grander scheme—but that’s not the case.

When doing git checkout, the downstream monorepo must pass the --recurse-submodules option so that the submodules are checked out too. In the GitHub Actions workflows for the private repo, the checkout must be done like this:

- uses: actions/checkout@v2
  with:
    submodules: recursive

As a result, we have to input submodules: recursive to the downstream workflow, but not to the upstream one even though they both use the same source file.

To solve this while maintaining the public monorepo as the single source of truth, the workflows in leoloso/PoP receive the value for submodules through an environment variable, CHECKOUT_SUBMODULES, like this:

env:
  CHECKOUT_SUBMODULES: ""

jobs:
  provide_data:
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: ${{ env.CHECKOUT_SUBMODULES }}

The environment value is empty for the upstream monorepo, so doing submodules: "" works well. Then, when copying the workflows from upstream to downstream, I change the value of the environment variable to "recursive" so that it becomes:

env:
  CHECKOUT_SUBMODULES: "recursive"

(I have a PHP command to do the replacement, but we could also pipe sed in the copy-workflows composer script.)
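A minimal sketch of the sed variant follows. It is not the PHP command mentioned above; the function name is hypothetical, and it assumes the upstream file contains the line exactly as CHECKOUT_SUBMODULES: "".

```shell
# Copy a workflow from the upstream submodule, switching the checkout
# mode from "" to "recursive" on the way.
# Usage: copy_workflow <source_file> <target_file>
copy_workflow() {
  sed 's/CHECKOUT_SUBMODULES: ""/CHECKOUT_SUBMODULES: "recursive"/' "$1" > "$2"
}

# Example, run from the root of the downstream monorepo:
# copy_workflow submodules/PoP/.github/workflows/generate_plugins.yml \
#               .github/workflows/generate_plugins.yml
```

If the line is missing or formatted differently upstream, sed copies the file unchanged without any error, so the substitution fails silently.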

This leakage reveals another issue with this setup: I must review all contributions to the public repo before they are merged, or they could break something downstream. The contributors would also be completely unaware of those leakages (and they couldn’t be blamed for it). This situation is specific to the public/private-monorepo setup, where I am the only person who is aware of the full setup. While I share access to the public repo, I am the only one accessing the private one.

As an example of how things could go wrong, a contributor to leoloso/PoP might remove CHECKOUT_SUBMODULES: "" since it is superfluous. What the contributor doesn’t know is that, while that line is not needed, removing it will break the private repo.

I guess I need to add a warning!

env:
  ### ☠️ Do not delete this line! Or bad things will happen! ☠️
  CHECKOUT_SUBMODULES: ""
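A comment relies on people reading it, so an automated check could complement it. The sketch below, with a hypothetical function name, fails when a workflow file no longer declares the variable. It is one possible safeguard, not something that exists in the repo, and it only checks that the variable name is present.

```shell
# Fail if any workflow file given as argument no longer declares
# CHECKOUT_SUBMODULES.
# Usage: check_checkout_variable <workflow_file>...
check_checkout_variable() {
  status=0
  for file in "$@"; do
    if ! grep -q 'CHECKOUT_SUBMODULES:' "$file"; then
      echo "CHECKOUT_SUBMODULES is missing from $file" >&2
      status=1
    fi
  done
  return $status
}

# Example, run in the upstream monorepo before merging a pull request:
# check_checkout_variable .github/workflows/generate_plugins.yml
```

Run as a step in the public repo's own CI, a check like this would flag the removal before it reaches the private repo.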

Wrapping up

My repo has gone through quite a journey, being adapted to the new requirements of my code and application at different stages:

  • It started as a single repo, hosting a monolithic app.
  • It became a multirepo when splitting the app into packages.
  • It was switched to a monorepo to better manage all the packages.
  • It was upgraded to a multi-monorepo to share files with a private monorepo.

Context means everything, so there is no “best” approach here—only solutions that are more or less suitable to different scenarios.

Has my repo reached the end of its journey? Who knows? The multi-monorepo satisfies my current requirements, but it hosts all private plugins together. If I ever need to grant contractors access to a specific private plugin, while preventing them from accessing other code, then the monorepo may no longer be the ideal solution for me, and I’ll need to iterate again.

I hope you have enjoyed the journey. And, if you have any ideas or examples from your own experiences, I’d love to hear about them in the comments.