Testing Athena partition projection
This Terraform recipe creates a full, minimal infrastructure to test the partition projection feature of Amazon Athena. With this, we don’t have to manage anymore partitions, manually, using Glue Crawlers or using Lambda functions that have to run periodically.
This recipe is focused primarily on the analysis of ELB logs, although it can be modified to analyze other kinds of data or logs, like CloudFront logs.
Why should I partition the data? By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost.
I always recommend enabling logs on the load balancers and CloudFront, so, later, we can get statistics. We don’t need to pay anything for having Athena configured to query the logs, only for the storage of the logs in S3. When we need to query the logs, we will pay for the queries, according to the data scanned.
We will use 2 additional tools, apart from the AWS CLI:
- athenacli, that provides an easier way to run queries and get information about them (time, data analyzed, estimated cost)
- hey, to generate traffic (like ab (Apache Benchmark))
You can test it by cloning this gist repo:
git clone https://gist.github.com/okelet/ba9a5b98233362a96f22c33c465289dd.git tf-athenapartitionprojectiontest
The recipe creates the following elements:
- VPC and related objects
- A bastion EC2 instance (not needed, just for testing)
- An application load balancer
- An ECS service with 2 tasks using the mendhak/http-https-echo image, deployed using Fargate Spot
- Related Athena objects (database, table, S3 buckets)
After cloning, create the infrastructure (the variable key_pair
is optional, just to access the EC2 bastion instance):
terraform init
terraform apply -auto-approve -var key_pair=mykeypair
Wait for the ECS service and load balancer to be ready…
SERVICE_ADDR="http://$(terraform output -raw lb_addr)/echo/"
while [[ "$(curl -s -o /dev/null -w '%{http_code}' ${SERVICE_ADDR})" != "200" ]]; do sleep 5; done
Generate some traffic using hey, and wait up to 5 minutes for the log files to be created:
hey -z 5m -q 30 ${SERVICE_ADDR}
Configure the environment for athenacli:
export AWS_ALB_S3_LOGS_PATH=$(terraform output -raw alb_s3_logs_path)
export AWS_ATHENA_WORK_GROUP=$(terraform output -raw athena_workgroup_name)
export AWS_ATHENA_S3_STAGING_DIR=$(terraform output -raw athena_workgroup_bucket)
export ATHENA_DB_NAME=$(terraform output -raw athena_db_name)
athenacli ${ATHENA_DB_NAME}
Get today’s number of requests (should be greater than 0, data scanned greater than 0):
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_timestamp, '%Y') as integer) and month=cast(date_format(current_timestamp, '%m') as integer) and day=cast(date_format(current_timestamp, '%d') as integer);
The number of requests for 2001/01/01 should be 0, and the data scanned also 0, as the date is out of the range of the configured table projection parameters (the automatic partitions start in 2020):
SELECT COUNT(*) FROM alb_logs WHERE year=2001 and month=1 and day=1;
Wait for tomorrow and generate some traffic again:
hey ${SERVICE_ADDR}
Get the number of requests for day 1 and 2; the results (number of requests and data scanned) should be different, so the partitions are working:
--- Yesterday
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_date - interval '1' day, '%Y') as integer) and month=cast(date_format(current_date - interval '1' day, '%m') as integer) and day=cast(date_format(current_date - interval '1' day, '%d') as integer);
--- Today
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_timestamp, '%Y') as integer) and month=cast(date_format(current_timestamp, '%m') as integer) and day=cast(date_format(current_timestamp, '%d') as integer);
We can repeat the test with projection disabled:
terraform apply -auto-approve -var key_pair=mykeypair -var athena_projection_enabled=false
So no matter the query, the number of results and the data scanned is always 0, because the table is partitioned but no partition exists yet:
--- 2001/01/01
SELECT COUNT(*) FROM alb_logs WHERE year=2001 and month=1 and day=1;
--- Yesterday
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_date - interval '1' day, '%Y') as integer) and month=cast(date_format(current_date - interval '1' day, '%m') as integer) and day=cast(date_format(current_date - interval '1' day, '%d') as integer);
--- Today
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_timestamp, '%Y') as integer) and month=cast(date_format(current_timestamp, '%m') as integer) and day=cast(date_format(current_timestamp, '%d') as integer);
We can create manually a partition for today:
athenacli ${ATHENA_DB_NAME} -e "ALTER TABLE alb_logs ADD PARTITION (year=$(date +%Y), month=$(date +%m), day=$(date +%d)) LOCATION '${AWS_ALB_S3_LOGS_PATH}/$(date +%Y)/$(date +%m)/$(date +%d)';"
So if we repeat the queries, we will only get results for today (although there are requests from yesterday, that partition doesn’t exist):
--- 2001/01/01
SELECT COUNT(*) FROM alb_logs WHERE year=2001 and month=1 and day=1;
--- Yesterday
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_date - interval '1' day, '%Y') as integer) and month=cast(date_format(current_date - interval '1' day, '%m') as integer) and day=cast(date_format(current_date - interval '1' day, '%d') as integer);
--- Today
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_timestamp, '%Y') as integer) and month=cast(date_format(current_timestamp, '%m') as integer) and day=cast(date_format(current_timestamp, '%d') as integer);
We can enable again partition projection omiting the variable athena_projection_enabled
or setting it to true
:
terraform apply -auto-approve -var key_pair=mykeypair -var athena_projection_enabled=true
And repeat the tests, getting data for yesterday and today:
--- Yesterday
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_date - interval '1' day, '%Y') as integer) and month=cast(date_format(current_date - interval '1' day, '%m') as integer) and day=cast(date_format(current_date - interval '1' day, '%d') as integer);
--- Today
SELECT COUNT(*) FROM alb_logs WHERE year=cast(date_format(current_timestamp, '%Y') as integer) and month=cast(date_format(current_timestamp, '%m') as integer) and day=cast(date_format(current_timestamp, '%d') as integer);
In the case that partition projection is enabled, manually created partitions are ignored.
Finally, you can destroy the infrastructure:
terraform destroy -auto-approve
References: