A hack to load test data in GitHub Actions

19 Aug 2025

We run our integration test suite in GitHub Actions, and it requires some test data to be present in a database.

The test data was loaded into the Postgres container from a dump stored as a .sql file in our repository.

db:
    image: postgres:14
    command: ["postgres", "-c", "log_statement=all"]
    volumes:
      - postgres_test_data:/var/lib/postgresql/data
      # Scripts mounted into docker-entrypoint-initdb.d run when the container
      # initialises an empty data directory, so the dump is loaded
      # automatically on first startup.
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - 5432:5432

This was done to save the time taken to generate test data on every run (so changes are pushed faster and costs don't shoot up).

While this setup worked well for a long time, there were a couple of downsides to it:

  1. Every time the database schema changed, developers had to manually regenerate the test data locally, dump it, and commit the file (a sketch of this chore follows the list).
  2. The dump file itself was quite large, and since the generation script had random components (for example, a UUID column), its contents changed considerably on every regeneration. This bloated PR diffs even when the actual code change was small.
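
The manual flow looked roughly like this (the generate_test_data command and the database name are illustrative, not our actual script):

python manage.py generate_test_data   # hypothetical data generation command
pg_dump -U postgres --no-owner testdb > init.sql
git add init.sql
git commit -m "Regenerate test data dump"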

Now, the ideal solution would look something like this:

  1. A change in database schema is detected
  2. The test data generation script runs in GitHub Actions (ideally not on every run).
  3. This dump is stored somewhere and loaded on the next run.

The hack

To store the dump for a change, we can use the GitHub Actions cache instead of the repository. Each schema change is made by Django via a migration, which it stores as files in migrations/ directories. We calculate a hash of all such files and use it as the cache key.

- name: Generate migrations hash
  id: hash
  run: |
    # Hash every migration file (sorted so the order is stable), then hash
    # the combined list to produce a single key for the current schema state.
    HASH=$(find . -path "*/migrations/*.py" -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum | cut -d' ' -f1)
    echo "migrations_hash=$HASH" >> "$GITHUB_OUTPUT"

If there's a cache miss, we assume the schema has changed. We generate the dump and store it against this hash. When the workflow runs again (presumably when the developer pushes a fix in the same PR), the cache is hit: the file is downloaded and loaded into the Docker container as before!

- name: Restore cached SQL dumps
  id: cache-restore-dumps
  uses: actions/cache/restore@v5
  with:
    key: sql-dumps-${{ steps.hash.outputs.migrations_hash }}
    path: |
      docker/test/init.sql
      docker/test/init_oauth.sql

- name: Setup test data
  if: steps.cache-restore-dumps.outputs.cache-hit != 'true'
  run: |
    # Test data generation commands go here; they run only on a cache miss

# Cache the data in case new data was generated
- name: Save SQL dumps to cache
  if: steps.cache-restore-dumps.outputs.cache-hit != 'true'
  uses: actions/cache/save@v5
  with:
    key: sql-dumps-${{ steps.hash.outputs.migrations_hash }}
    path: |
      docker/test/init.sql
      docker/test/init_oauth.sql      

# Test-running jobs follow

The jobs that follow can now use the dump whether or not the cache was hit, since the file is guaranteed to exist by then.
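
For a downstream job to reference the key, the hash has to be exposed as a job output. A minimal sketch, assuming the steps above live in a job named prepare-test-data (the job names and the compose file path are illustrative):

prepare-test-data:
  runs-on: ubuntu-latest
  outputs:
    migrations_hash: ${{ steps.hash.outputs.migrations_hash }}
  steps:
    # ... the hash / restore / generate / save steps shown above ...

integration-tests:
  needs: prepare-test-data
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Restore cached SQL dumps
      uses: actions/cache/restore@v5
      with:
        key: sql-dumps-${{ needs.prepare-test-data.outputs.migrations_hash }}
        path: |
          docker/test/init.sql
          docker/test/init_oauth.sql
    - name: Start the database with the dump mounted
      run: docker compose -f docker/test/docker-compose.yml up -d db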

Note: In case there's some issue in the generated data (perhaps caused by a bug in the generation script), we can manually evict the cache in GitHub, fix the generation logic and run the workflow again.
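
Eviction can be done from the repository's Actions → Caches page, or with the GitHub CLI (2.32+); the key below is a placeholder for the actual migrations hash:

gh cache list                                  # find the offending key
gh cache delete "sql-dumps-<migrations-hash>"  # next run regenerates the dump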

Thanks for reading!