A hack to load test data in GitHub Actions
We run our integration test suite in GitHub Actions, and it needs some test data to be present in a database. The test data was loaded into the Postgres container from a dump stored as a .sql file in our repository:
```yaml
db:
  image: postgres:14
  command: ["postgres", "-c", "log_statement=all"]
  volumes:
    - postgres_test_data:/var/lib/postgresql/data
    # The stock postgres image runs any .sql file placed in
    # /docker-entrypoint-initdb.d/ on first startup, when the
    # data directory is empty
    - ./init.sql:/docker-entrypoint-initdb.d/init.sql
  ports:
    - 5432:5432
```
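The dump itself can be produced with pg_dump; a minimal sketch, assuming the generation script has seeded a local database called testdb (the connection details here are illustrative, not from our actual setup):

```sh
# Plain-SQL dump of schema + data; the stock postgres entrypoint can
# replay this file directly, no custom tooling required
pg_dump --host localhost --username postgres testdb > init.sql
```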
This was done to save the time taken to generate test data on every run, so changes get pushed faster and CI costs don't shoot up.
While this setup worked well for a long time, it had a couple of downsides:
- Every time the database schema changed, developers had to regenerate the test data locally, dump it, and commit the file.
- The dump file itself was quite large, and since the test data generation script had random components (for example, a UUID column), the file changed considerably every time it was regenerated. This bloated PR diffs even when the actual code change was small.
Now, the ideal solution would look something like this:
- A change in the database schema is detected.
- The test data generation script runs in GitHub Actions (ideally not on every run).
- The resulting dump is stored somewhere and loaded on subsequent runs.
The hack
To store a dump per schema change, we can use the GitHub Actions cache instead of the repository.
Each schema change goes through Django as a migration, which it stores as files in migrations/ directories. We calculate a hash over all such files and use it as the cache key:
```yaml
- name: Generate migrations hash
  id: hash
  run: |
    # Hash each migration file in deterministic (sorted) order, then
    # hash the combined output so the key is stable across runs
    HASH=$(find . -path "*/migrations/*.py" -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum | cut -d' ' -f1)
    echo "migrations_hash=$HASH" >> $GITHUB_OUTPUT
```
If there's a cache miss, we assume there's been a schema change. We generate the dump and store it against this hash. When the workflow runs again (presumably when the developer pushes a fix in the same PR), the cache is hit. This downloads the file, which is loaded into the Docker container as before!
```yaml
- name: Restore cached SQL dumps
  id: cache-restore-dumps
  uses: actions/cache/restore@v5
  with:
    key: sql-dumps-${{ steps.hash.outputs.migrations_hash }}
    path: |
      docker/test/init.sql
      docker/test/init_oauth.sql

# Test data generation steps are run only if the cache is not hit
- name: Setup test data
  if: steps.cache-restore-dumps.outputs.cache-hit != 'true'
  run: |
    # ... generate test data and dump it to docker/test/*.sql ...

# Cache the data in case new data was generated
- name: Save SQL dumps to cache
  if: steps.cache-restore-dumps.outputs.cache-hit != 'true'
  uses: actions/cache/save@v5
  with:
    key: sql-dumps-${{ steps.hash.outputs.migrations_hash }}
    path: |
      docker/test/init.sql
      docker/test/init_oauth.sql

# Test running jobs
```
The jobs that follow can use the dump whether or not the cache was hit, since the file is guaranteed to exist by then.
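For illustration, a downstream test job might restore the same cache and bring up the database with the dump in place; a minimal sketch, where the job names, compose file path, and output wiring are assumptions rather than our actual workflow:

```yaml
integration-tests:
  needs: prepare-test-data   # hypothetical name for the job shown above
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Restore cached SQL dumps
      uses: actions/cache/restore@v5
      with:
        # Recompute the hash here, or expose it as a job output
        # of the previous job
        key: sql-dumps-${{ needs.prepare-test-data.outputs.migrations_hash }}
        path: |
          docker/test/init.sql
          docker/test/init_oauth.sql
    - name: Start Postgres and run tests
      # The compose file mounts init.sql into /docker-entrypoint-initdb.d/,
      # so Postgres loads the data on first start
      run: docker compose -f docker/test/docker-compose.yml up -d db
```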
Note: In case there's some issue in the generated data (perhaps caused by a bug in the generation script), we can manually evict the cache in GitHub, fix the generation logic, and run the workflow again.
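Eviction can be done from the repository's Actions caches page, or with the GitHub CLI (assuming a gh version recent enough to include the cache commands):

```sh
# Find the stale entry, then delete it by key
gh cache list
gh cache delete sql-dumps-<stale-hash>
```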
Thanks for reading!