Optimizing ELB Access Log Queries at Bizom

By Rahul Ghosh, Senior Tech Lead | 03rd March. 2026 | 08 min read

Introduction: Unlocking the Gold Mine

Load Balancer Access Logs are often described as a gold mine of information. They provide a high-fidelity record of every request, every 5xx error, and every millisecond of latency hitting your infrastructure. At Bizom, the potential of this data was clear, but the reality was frustrating.

Early in my journey here, my mentor pointed out that our ability to debug and improve the system was directly tied to how well we could navigate these logs. The advice was simple but powerful: Treat Athena not just as a tool, but as a library. If you can master the art of searching it, you unlock a wealth of insights that are otherwise buried in the noise.

However, we soon encountered a significant architectural bottleneck. Querying raw ELB access logs directly in S3 was both performance-prohibitive and cost-inefficient. At Bizom’s scale, processing tens of millions of daily requests, the sheer volume of our logs meant that even simple analytical queries required scanning massive amounts of unoptimized data, creating a drag on our engineering velocity

The problem wasn’t Amazon Athena (AWS’s serverless query service) itself, but the raw, row-oriented GZIP format that Application Load Balancers use by default. Even with daily partitioning, fetching just one hour of data was sluggish. Trying to aggregate trends across days or months was practically impossible. Every query felt like digging through a mountain of granite with a hand shovel because Athena had to scan every single byte of text to find a specific record.

This wasn’t just an inconvenience; it was a long-term blocker for our engineering velocity. We needed a solution that could scale with our growth. We needed to transform our “gold mine” from a dormant, expensive storage bin into a high-speed, cost-effective analytical resource.

Moving from Reactive to Predictive

Before diving into the architecture, it is important to understand why ELB access logs are the “first point of contact” for every request hitting Bizom.

What are these logs?

An ELB access log captures detailed information about requests sent to your load balancer. A typical raw log looks like this:

				
					https 2026-01-12T10:53:26.123456Z app/bizom-load-balancer/50dc6c495c0c9188 192.168.1.1:54321 10.0.0.1:80 0.001 0.320 0.001 200 200 1024 2048 "GET https://api.bizom.com:443/v1/orders HTTP/1.1" "Mozilla/5.0..." ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
				
			

To an external observer, this might look like a wall of text. But to an engineer, these fields (client_ip, target_processing_time, elb_status_code, and request_url) are the vital signs of the application.

Predictive vs. Reactive Maintenance

Historically, we treated these logs reactively. When a customer or support team reported an error, we went “hunting” for the specific issue. By transforming this data into a high-speed analytical resource, we shifted our entire engineering culture toward predictive maintenance:

  • Catching Latency Creep: We can monitor target_processing_time across all API versions to catch performance degradation before it becomes a bottleneck.
  • Error Spiking: We can set up queries to find 4xx or 5xx trends per User-Agent or Client IP, helping us identify broken app versions or malicious bot activity in real-time.

Phase 1: Solving the Slowness (The Athena Win)

The primary hurdle was the storage format. ELBs dump raw, GZIP-compressed text logs into S3. Because text is row-oriented, Athena must scan every single byte of a file to locate a specific trace_id or status_code. Searching a single day’s logs resulted in significant latency and high scan costs.

The Architecture

The following flow describes how we transitioned from raw logs to an optimized analytical layer:

1. Data Ingress: ELB writes raw .log.gz files to the S3 Raw Bucket, triggering an ObjectCreated event.

2. Asynchronous Decoupling: The event is pushed to Amazon SQS. This decouples the bursty log generation from our processing layer, ensuring zero data loss during traffic spikes.

3. Converter Service (AWS Fargate): A Spring Boot application polls SQS, downloads the GZIP files, parses them via Regex into POJOs, and aggregates them into 128KB JSON chunks in memory.

4. Optimized Ingestion: The service sends these chunks to Kinesis Data Firehose via PutRecordBatch, using surgical retry logic for reliability.

5. Final Transformation & Storage: Firehose performs inline conversion to Apache Parquet and writes partitioned files to the destination S3 bucket.

6. Consumption: Engineers query the Parquet data via Amazon Athena.

Why Fargate and SQS?

We built a dedicated Converter Service to bridge the gap between raw S3 logs and Kinesis Firehose. While AWS Lambda is a common choice for S3 triggers, we opted for AWS Fargate for several reasons:

  • Execution Limits: ELB access logs arrive in massive bursts. Lambda’s 15-minute timeout and memory constraints for high-throughput stream processing risked partial executions and “Out of Memory” errors.
  • Resource Predictability: Fargate allowed us to define specific CPU and Memory profiles that could handle the sustained heavy lifting of GZIP decompression and Regex parsing without cold-start penalties.
  • Managed Scaling: By using SQS as a buffer, Fargate tasks consume messages at a steady, manageable pace, preventing downstream service exhaustion.
Why Kinesis Data Firehose?
We chose Firehose as our delivery stream because it acts as a managed “Swiss Army Knife” for data ingestion:
  • Native Format Conversion: It handles the complex JSON-to-Parquet conversion natively using the AWS Glue Schema Registry, removing the need for us to write and maintain custom Parquet serialization logic in our application code.
  • Automated Partitioning: Firehose manages the S3 prefixing (year/month/day) automatically, ensuring our data lake remains organized and query-efficient.
  • Managed Scale & Durability: As a fully managed service, it scales automatically with our throughput and handles the retries and buffering to S3, allowing our team to focus on the business logic rather than infrastructure plumbing.
Choosing the Columnar Format: Parquet vs. ORC
We evaluated both Apache Parquet and ORC for our columnar storage. We ultimately chose Parquet for our stack because:
  • Athena Native: Parquet is the industry standard for Athena, offering superior predicate pushdown and compression.
  • Schema Evolution: Parquet handles schema changes more gracefully, which is critical as ELB access log formats evolve.
  • Ecosystem Maturity: The integration between Firehose and Parquet is highly optimized and battle-tested in the AWS ecosystem.
Result

Our metrics spoke for themselves. Query times dropped from 20s to under 2s, and our scan volume dropped from 8GB to ~200MB for many queries. We had achieved the speed; however, our billing dashboard soon revealed we had simply shifted our costs to a new bottleneck.

Phase 2: Solving the Cost

While our Athena bill started reducing, our Kinesis Firehose costs began to climb. After an audit, we found the culprit: Firehose’s 5KB billing tax, which was causing us to be billed for nearly 3.5x more data than we were actually sending.

Firehose rounds every incoming record up to the nearest 5KB for billing. Our average ELB access log is only 1.4KB. By sending logs one by one, we were paying for 3.6KB of empty air per line, a 3.5x markup.

The Fix: In-Memory Aggregation
We refactored our Fargate service to “pack” logs before transmission. Instead of sending one record per log line, we aggregate them into larger chunks.
Earlier
Each record was sent to Firehose separately, so even a small record was billed as 5KB
				
					{"timestamp": "2026-01-12...", "request_id": "abc-123", ...}
				
			
Optimized Approach

This involved multiple json objects joined by a new line (\n), which is sent as One Firehose record.

				
					{"timestamp": "2026-01-12...", "request_id": "abc-1"}\n 
{"timestamp": "2026-01-12...", "request_id": "abc-2"}\n 
{"timestamp": "2026-01-12...", "request_id": "abc-3"}\n
				
			

We implemented an aggregation threshold of 128KB. This ensures that we consistently exceed the 5KB billing floor, aligning our billed volume with our actual data footprint. Consequently, instead of sending 100 separate logs (which AWS would bill as 500KB due to rounding), we now transmit a single 128KB aggregated record. This shift effectively eliminates the ‘padding tax,’ reducing our billed ingestion volume by nearly 75% for the same amount of raw data.

Surgical Retry Logic
Aggregated records mean higher stakes; if one 128KB chunk fails, we lose nearly 100 log lines. We implemented “surgical” retry logic to handle partial failures (like throttling) without duplicating data that was successfully ingested.
				
					// Simplified Surgical Retry Logic
public void sendWithRetry(List<Record> batch, int attempt) {
    PutRecordBatchRequest request = PutRecordBatchRequest.builder()
            .deliveryStreamName(streamName)
            .records(batch).build();
    
    PutRecordBatchResponse response = firehoseClient.putRecordBatch(request);
    
    if (response.failedPutCount() > 0) {
        List<Record> retryBatch = new ArrayList<>();
        for (int i = 0; i < response.requestResponses().size(); i++) {
            // Only retry records that specifically returned an error code
            if (response.requestResponses().get(i).errorCode() != null) {
                retryBatch.add(batch.get(i));
            }
        }
        // Exponential backoff before retrying subset
        backoff(attempt);
        sendWithRetry(retryBatch, attempt + 1);
    }
}
				
			
Result
By implementing record aggregation, we reduced our daily Firehose expenditure by ~75%. This shift effectively eliminated the ‘padding tax’ and aligned our billed volume with our actual data footprint.

A Word of Caution: The Cost of Bad Queries

While Parquet is a “Gold Mine”, it can quickly become an expensive one if queried incorrectly. Athena charges based on the amount of data scanned ($5.00 per TB).

Even with a columnar format, a “bad” query can spike costs:
  1. Avoid SELECT *: Always specify only the columns you need. Because Parquet is columnar, Athena only reads the files for the columns you name. SELECT * forces Athena to scan the entire dataset, negating the benefits of Parquet.
  2. Always use Partition Keys: Never run a query without a WHERE clause on the partition keys (e.g., year, month, day). Without them, Athena performs a full table scan.
  3. Be Selective with Time Ranges: If you are debugging a specific incident, filter by time (hours or minutes) in addition to your partition keys. The goal is to be as selective as possible; scanning 10 minutes of logs is significantly cheaper and faster than scanning a full 24 hours.
  4. Use LIMIT for Previewing: If you just want to see the data structure or verify your logic, use LIMIT. This allows the engine to stop the scan as soon as it finds enough records to satisfy your request.

Remember: Parquet makes queries fast, but query discipline makes them cheap.

Engineering References

Appendix 1: Performance Benchmarks

To quantify the success of this transition, we ran identical analytical queries against both the Legacy (Raw GZIP) and Optimized (Parquet) datasets.

Query: 5xx count for 1 day, taken on hourly basis

				
					SELECT SUBSTRING(request_url, 1, 100) as URL______________________________, count(*) as count 
FROM <database>
WHERE year IN ('2026') AND month IN ('01') AND day IN ('11', '12') 
AND target_status_code >= '500' AND request_url LIKE '%' 
AND time between '2026-01-11T18:30:00' AND '2026-01-12T18:29:59' 
GROUP BY request_url 
ORDER BY count(*) desc
				
			
Metric Raw GZIP (Legacy) Aggregated Parquet (Optimized) Improvement (%) X Times Improvement
Data Scanned
10.85 GB
163.27 MB
98.53% Reduction
68X Smaller
Query Execution Time (Seconds)
22.31
2.307
89.66% Faster
9.7X Faster

Query: 5xx count for one customer for one week

				
					SELECT time,client_ip,target_ip,request_verb,request_url,trace_id,user_agent,target_status_code
FROM <database>
WHERE year = '2026'
AND month = '01'
AND (( time >= '2026-01-02T00:00:00)' and time <= '2026-01-12T23:59:59)' ) )
And target_status_code >= '500'
AND ( request_url LIKE '%customer%' )
				
			
Metric Raw GZIP (Legacy) Aggregated Parquet (Optimized) Improvement (%) X Times Improvement
Data Scanned
82.13 GB
2.11 GB
97.43% Reduction
38.9X Smaller
Query Execution Time (Seconds)
129.156
3.221
97.51% Faster
40X Faster

Share this:

Like this:

Like Loading…

Discover more from Algorithmic Jugaad @ Bizom

Subscribe now to keep reading and get access to the full archive.

Continue reading