Shifting Loki log storage to S3
We recently had an incident at work where Loki logs were not appearing in Grafana. While investigating, I found that the EC2 instance hosting our Loki containers had run out of storage. This was bound to happen: we were using Loki's default local storage backend and hadn't bothered to change it for a long time. The immediate fix was to re-create the instances, but the old logs were lost as a result. In hindsight, increasing the size of the attached EBS volume would have been a less destructive stopgap, but the proper fix was to move the logs to S3 or DocumentDB. I chose S3, since configuring it seemed less complicated.
Links to docs-
https://grafana.com/docs/loki/latest/operations/storage/
https://grafana.com/docs/loki/latest/configure/storage/#aws-deployment-s3-single-store
Some obvious advantages of using S3-
- Unlimited storage
- Persistent
- Allows archiving older logs periodically, e.g. by moving them to cheaper storage classes like S3 Glacier with lifecycle rules (sketched below).
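For example, a lifecycle rule on the bucket is enough to move older log objects into a cheaper storage class. This is only a minimal sketch, assuming the logs bucket defined later in this post; the 90-day/365-day windows are arbitrary:

resource "aws_s3_bucket_lifecycle_configuration" "logs_bucket_lifecycle" {
  bucket = aws_s3_bucket.logs_bucket.id

  rule {
    id     = "archive-old-logs"
    status = "Enabled"

    filter {} # apply to all objects in the bucket

    transition {
      days          = 90
      storage_class = "GLACIER" # or DEEP_ARCHIVE for even colder storage
    }

    expiration {
      days = 365
    }
  }
}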
Another persistent problem was that the data and config files for all the monitoring services were stored on the instance itself and mounted into the containers from there. If the instance was destroyed, that data was lost and the services had to be re-configured. This was also a barrier to configuring Loki to use an S3 backend. It made scaling the monitoring infra tedious too, since every instance launched by the capacity provider needed its own copy of all the required configs (Prometheus and Loki so far). The solution was EFS: the same file system is accessible to all the instances in the subnet, and hence available to mount into every container running in the cluster.
So there were two things that had to be done in order-
- Shift ECS to use EFS and replicate the old config files there.
- Configure Loki to use S3 as a backend.
Shifting to EFS
These were the steps I took-
- Created an EFS file system with a backup policy-
resource "aws_efs_file_system" "monitoring_efs" {
creation_token = "monitoring-efs"
encrypted = true
}
resource "aws_efs_backup_policy" "monitoring_efs_backup_policy" {
file_system_id = aws_efs_file_system.monitoring_efs.id
backup_policy {
status = "ENABLED"
}
}
- Created mount targets in each private subnet, along with a security group allowing NFS access on port 2049-
resource "aws_security_group" "monitoring_efs_sg" {
name = "monitoring-efs-sg"
vpc_id = aws_vpc.vpc.id
ingress {
from_port = 2049
to_port = 2049
protocol = "tcp"
cidr_blocks = [var.vpc_cidr]
}
}
resource "aws_efs_mount_target" "monitoring_efs_mount_targets" {
count = length([aws_subnet.vpc_subnet_private_az_a.id, aws_subnet.vpc_subnet_private_az_b.id])
file_system_id = aws_efs_file_system.monitoring_efs.id
subnet_id = [aws_subnet.vpc_subnet_private_az_a.id, aws_subnet.vpc_subnet_private_az_b.id][count.index]
security_groups = [aws_security_group.monitoring_efs_sg.id]
}
- ECS does not create the directories it mounts in EFS. One option was to mount the EFS on a temporary EC2 instance and create the directories manually. But I felt it was better to use EFS Access Points, which cleanly keep each service's data separate, with its own UNIX user and group permissions.
Doc - https://docs.aws.amazon.com/efs/latest/ug/efs-access-points.html
Each Docker image runs with its own particular UID (user ID) and GID (group ID), which need to be configured in the creation_info field.
The above discovery was made after going through a series of confusing errors such as-
CannotCreateContainerError: Error response from daemon: failed to copy file info for /var/lib/ecs/volumes/ecs-loki-td-8-loki-aad0c9cd67a989eb8612: failed to chown /var/lib/ecs/volumes/ecs-loki-td-8-loki-aad0c9cd67a989eb8612: lchown /var/li
On the bright side, I got to learn about these permissions and their significance.
Ref- https://www.docker.com/blog/understanding-the-docker-user-instruction/
The UID and GID can be found as follows (using the Grafana image as an example)-
$ docker run -d grafana/grafana
$ docker exec -it <container_id> id
uid=472(grafana) gid=0(root) groups=0(root)
The corresponding Terraform Access Point config will then be-
resource "aws_efs_access_point" "monitoring_efs_grafana_access_point" {
file_system_id = aws_efs_file_system.monitoring_efs.id
root_directory {
path = "/grafana-data"
creation_info {
owner_uid = 472
owner_gid = 0
permissions = "0775"
}
}
posix_user {
uid = 472
gid = 0
}
}
resource "aws_ecs_task_definition" "ecs_grafana_task_definition" {
...
volume {
name = "grafana"
efs_volume_configuration {
file_system_id = aws_efs_file_system.monitoring_efs.id
transit_encryption = "ENABLED"
authorization_config {
access_point_id = aws_efs_access_point.monitoring_efs_grafana_access_point.id
iam = "ENABLED"
}
}
}
...
...
}
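The volume also has to be referenced from the container side of the task definition. A minimal sketch of what that fragment might look like (the container name and image tag are placeholders to check against your own task definition; /var/lib/grafana is Grafana's default data directory):

container_definitions = jsonencode([
  {
    name  = "grafana"
    image = "grafana/grafana:latest"
    # Mount the EFS-backed volume declared above into the container
    mountPoints = [
      {
        sourceVolume  = "grafana"          # must match the volume name above
        containerPath = "/var/lib/grafana" # Grafana's default data directory
        readOnly      = false
      }
    ]
  }
])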
- Once each service had been configured as above, applied the Terraform config.
- Forced redeployment for the ECS services with the new task definitions
- Launched a temporary EC2 instance and mounted the EFS on it to edit the configs (a Terraform sketch of such an instance is included after this list).
- Redeployed the services after editing and verified that they are using the new configs.
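For reference, the temporary edit instance can itself be captured in Terraform. This is only a sketch: the AMI and instance type are placeholders, the instance's security group must allow outbound NFS (2049) to the mount targets, and it assumes an Amazon Linux AMI with access to the amazon-efs-utils package:

resource "aws_instance" "efs_edit_instance" {
  ami           = "ami-xxxxxxxxxxxxxxxxx" # placeholder: e.g. an Amazon Linux 2 AMI
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.vpc_subnet_private_az_a.id # a subnet with an EFS mount target

  # Mount the monitoring EFS at boot so its config files can be edited
  user_data = <<-EOF
    #!/bin/bash
    yum install -y amazon-efs-utils
    mkdir -p /mnt/efs
    mount -t efs ${aws_efs_file_system.monitoring_efs.id}:/ /mnt/efs
  EOF
}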
Shifting log storage to S3
This step was significantly easier since the config was editable from the temporary instance.
- Created a bucket for logs and attached the required policy to the ECS task role-
resource "aws_s3_bucket" "logs_bucket" {
bucket = "<logs bucket name>"
}
data "aws_iam_policy_document" "loki_s3_access" {
statement {
actions = [
"s3:ListBucket"
]
resources = [
aws_s3_bucket.logs_bucket.arn
]
}
statement {
actions = [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
]
resources = [
"${aws_s3_bucket.logs_bucket.arn}/*"
]
}
}
resource "aws_iam_policy" "loki_s3_access_policy" {
name = "LokiS3AccessPolicy"
description = "IAM policy for Loki to access S3 bucket for log storage"
policy = data.aws_iam_policy_document.loki_s3_access.json
}
resource "aws_iam_role_policy_attachment" "loki_s3_ecs_task_attachment" {
role = aws_iam_role.ecs_task_role.name
policy_arn = aws_iam_policy.loki_s3_access_policy.arn
}
- Created a config file in the EFS mounted on the EC2 instance-
/<mount point>/loki/config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9095
common:
  path_prefix: /loki
  storage:
    s3:
      bucketnames: <logs bucket name>
      region: us-east-1
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/boltdb-cache
  aws:
    s3: s3://<logs bucket name>
    region: us-east-1
limits_config:
  allow_structured_metadata: false
- Added a custom command in the Loki container definition to point it at the new config file-
command = [
  "-config.file=/loki/config.yaml",
],
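For context, here is roughly how that command sits alongside the EFS mount in the Loki container definition. This is a sketch, not the exact task definition: the volume name "loki", container name, and image tag are assumptions, but the containerPath needs to be /loki so that the config file and the path_prefix/index directories from the config above all resolve onto EFS:

container_definitions = jsonencode([
  {
    name    = "loki"
    image   = "grafana/loki:latest"
    # Point Loki at the config file stored on the EFS volume
    command = ["-config.file=/loki/config.yaml"]
    mountPoints = [
      {
        sourceVolume  = "loki" # assumed volume name in the task definition
        containerPath = "/loki"
      }
    ]
  }
])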
- Applied the config and force deployed the service with the new task definition.
- Confirmed that logs are appearing in the S3 Bucket
- Stopped the temporary EC2 instance (did not terminate it, so it can be reused for future edits to the config)