Shifting Loki log storage to S3
We recently had an incident at work where Loki logs were not appearing in Grafana. While investigating, I found that the EC2 instance hosting our Loki containers had run out of storage. This was bound to happen: we were using Loki's default local storage backend and hadn't bothered to change it for a long time. The immediate fix was to re-create the instances, but the old logs were lost as a result. In hindsight, increasing the size of the attached EBS volume would have been a less destructive stopgap, but the proper fix was to move the logs to S3 or DocumentDB. I chose S3, since configuring it seemed less complicated.
Links to docs-
https://grafana.com/docs/loki/latest/operations/storage/
https://grafana.com/docs/loki/latest/configure/storage/#aws-deployment-s3-single-store
Some obvious advantages of using S3-
- Unlimited storage
- Persistent
- Allows archiving older logs periodically, e.g. by moving them to cheaper storage classes like S3 Glacier with lifecycle rules (sketched below).
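For example, a lifecycle rule on the bucket is enough to move older log objects into a cheaper storage class. This is only a minimal sketch, assuming the logs bucket defined later in this post; the 90-day/365-day windows are arbitrary:

resource "aws_s3_bucket_lifecycle_configuration" "logs_bucket_lifecycle" {
  bucket = aws_s3_bucket.logs_bucket.id

  rule {
    id     = "archive-old-logs"
    status = "Enabled"

    filter {} # apply to all objects in the bucket

    transition {
      days          = 90
      storage_class = "GLACIER" # or DEEP_ARCHIVE for even colder storage
    }

    expiration {
      days = 365
    }
  }
}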
Another persistent problem was that the data and config files for all the monitoring services were stored on the instance itself and mounted into the containers from there. If the instance was destroyed, that data was lost and the services had to be re-configured. This was also a barrier to configuring Loki to use an S3 backend. It made scaling the monitoring infra tedious too, since every instance launched by the capacity provider needed its own copy of all the required configs (Prometheus and Loki so far). The solution was EFS: the same file system is accessible to all the instances in the subnet, and hence available to mount into every container running in the cluster.
So there were two things that had to be done in order-
- Shift ECS to use EFS and replicate the old config files there.
- Configure Loki to use S3 as a backend.
Shifting to EFS
These were the steps I took-
- Created an EFS file system with a backup policy-
resource "aws_efs_file_system" "monitoring_efs" {
creation_token = "monitoring-efs"
encrypted = true
}
resource "aws_efs_backup_policy" "monitoring_efs_backup_policy" {
file_system_id = aws_efs_file_system.monitoring_efs.id
backup_policy {
status = "ENABLED"
}
}
- Created mount targets in each private subnet, along with a security group allowing NFS access on port 2049-
resource "aws_security_group" "monitoring_efs_sg" {
name = "monitoring-efs-sg"
vpc_id = aws_vpc.vpc.id
ingress {
from_port = 2049
to_port = 2049
protocol = "tcp"
cidr_blocks = [var.vpc_cidr]
}
}
resource "aws_efs_mount_target" "monitoring_efs_mount_targets" {
count = length([aws_subnet.vpc_subnet_private_az_a.id, aws_subnet.vpc_subnet_private_az_b.id])
file_system_id = aws_efs_file_system.monitoring_efs.id
subnet_id = [aws_subnet.vpc_subnet_private_az_a.id, aws_subnet.vpc_subnet_private_az_b.id][count.index]
security_groups = [aws_security_group.monitoring_efs_sg.id]
}
- ECS does not create the directories it mounts in EFS. One option was to mount the EFS on a temporary EC2 instance and create the directories manually. But I felt it was better to use EFS Access Points, which cleanly keep each service's data separate, with its own UNIX user and group permissions.
Doc - https://docs.aws.amazon.com/efs/latest/ug/efs-access-points.html
Each Docker image runs with its own particular UID (user ID) and GID (group ID), which need to be configured in the creation_info field.
The above discovery was made after going through a series of confusing errors such as-
CannotCreateContainerError: Error response from daemon: failed to copy file info for /var/lib/ecs/volumes/ecs-loki-td-8-loki-aad0c9cd67a989eb8612: failed to chown /var/lib/ecs/volumes/ecs-loki-td-8-loki-aad0c9cd67a989eb8612: lchown /var/li
On the bright side, I got to learn about these permissions and their significance.
Ref- https://www.docker.com/blog/understanding-the-docker-user-instruction/
The UID and GID can be found as follows (using the Grafana image as an example)-
$ docker run -d grafana/grafana
$ docker exec -it <container_id> id
uid=472(grafana) gid=0(root) groups=0(root)
The corresponding Terraform Access Point config will then be-
resource "aws_efs_access_point" "monitoring_efs_grafana_access_point" {
file_system_id = aws_efs_file_system.monitoring_efs.id
root_directory {
path = "/grafana-data"
creation_info {
owner_uid = 472
owner_gid = 0
permissions = "0775"
}
}
posix_user {
uid = 472
gid = 0
}
}
resource "aws_ecs_task_definition" "ecs_grafana_task_definition" {
...
volume {
name = "grafana"
efs_volume_configuration {
file_system_id = aws_efs_file_system.monitoring_efs.id
transit_encryption = "ENABLED"
authorization_config {
access_point_id = aws_efs_access_point.monitoring_efs_grafana_access_point.id
iam = "ENABLED"
}
}
}
...
...
}
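The volume also has to be referenced from the container side of the task definition. A minimal sketch of what that fragment might look like (the container name and image tag are placeholders to check against your own task definition; /var/lib/grafana is Grafana's default data directory):

container_definitions = jsonencode([
  {
    name  = "grafana"
    image = "grafana/grafana:latest"
    # Mount the EFS-backed volume declared above into the container
    mountPoints = [
      {
        sourceVolume  = "grafana"          # must match the volume name above
        containerPath = "/var/lib/grafana" # Grafana's default data directory
        readOnly      = false
      }
    ]
  }
])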
- Once each service had been configured as above, applied the Terraform config.
- Forced redeployment for the ECS services with the new task definitions
- Launched a temporary EC2 instance and mounted the EFS on it to edit the configs (a Terraform sketch of such an instance is included after this list).
- Redeployed the services after editing and verified that they are using the new configs.
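For reference, the temporary edit instance can itself be captured in Terraform. This is only a sketch: the AMI and instance type are placeholders, the instance's security group must allow outbound NFS (2049) to the mount targets, and it assumes an Amazon Linux AMI with access to the amazon-efs-utils package:

resource "aws_instance" "efs_edit_instance" {
  ami           = "ami-xxxxxxxxxxxxxxxxx" # placeholder: e.g. an Amazon Linux 2 AMI
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.vpc_subnet_private_az_a.id # a subnet with an EFS mount target

  # Mount the monitoring EFS at boot so its config files can be edited
  user_data = <<-EOF
    #!/bin/bash
    yum install -y amazon-efs-utils
    mkdir -p /mnt/efs
    mount -t efs ${aws_efs_file_system.monitoring_efs.id}:/ /mnt/efs
  EOF
}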
Shifting log storage to S3
This step was significantly easier since the config was editable from the temporary instance.
- Created a bucket for logs and attached the required policy to the ECS task role-
resource "aws_s3_bucket" "logs_bucket" {
bucket = "<logs bucket name>"
}
data "aws_iam_policy_document" "loki_s3_access" {
statement {
actions = [
"s3:ListBucket"
]
resources = [
aws_s3_bucket.logs_bucket.arn
]
}
statement {
actions = [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
]
resources = [
"${aws_s3_bucket.logs_bucket.arn}/*"
]
}
}
resource "aws_iam_policy" "loki_s3_access_policy" {
name = "LokiS3AccessPolicy"
description = "IAM policy for Loki to access S3 bucket for log storage"
policy = data.aws_iam_policy_document.loki_s3_access.json
}
resource "aws_iam_role_policy_attachment" "loki_s3_ecs_task_attachment" {
role = aws_iam_role.ecs_task_role.name
policy_arn = aws_iam_policy.loki_s3_access_policy.arn
}
- Created a config file in the EFS mounted on the EC2 instance-
/<mount point>/loki/config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9095
common:
  path_prefix: /loki
  storage:
    s3:
      bucketnames: <logs bucket name>
      region: us-east-1
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/boltdb-cache
  aws:
    s3: s3://<logs bucket name>
    region: us-east-1
limits_config:
  allow_structured_metadata: false
- Added a custom command in the Loki container definition to point it at the new config file-
command = [
  "-config.file=/loki/config.yaml",
],
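For context, here is roughly how that command sits alongside the EFS mount in the Loki container definition. This is a sketch, not the exact task definition: the volume name "loki", container name, and image tag are assumptions, but the containerPath needs to be /loki so that the config file and the path_prefix/index directories from the config above all resolve onto EFS:

container_definitions = jsonencode([
  {
    name    = "loki"
    image   = "grafana/loki:latest"
    # Point Loki at the config file stored on the EFS volume
    command = ["-config.file=/loki/config.yaml"]
    mountPoints = [
      {
        sourceVolume  = "loki" # assumed volume name in the task definition
        containerPath = "/loki"
      }
    ]
  }
])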
- Applied the config and force deployed the service with the new task definition.
- Confirmed that logs are appearing in the S3 Bucket
- Stopped the temporary EC2 instance (did not terminate it, so it can be reused for future edits to the config)