Rate limiting is one of those things every backend engineer knows they need but few implement well from scratch. In this post, I walk through three approaches I evaluated — fixed window, sliding window log, and token bucket — and explain why I chose a hybrid for a production Spring Boot service handling 10K+ RPM.
Why Build Your Own?
Off-the-shelf solutions like API gateways handle rate limiting at the edge, but sometimes you need fine-grained, business-logic-aware throttling. Our use case required per-tenant, per-endpoint limits with dynamic configuration.
The Token Bucket Algorithm
The token bucket is elegant: imagine a bucket that fills with tokens at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. This naturally allows short bursts while enforcing average rates.
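The refill-and-consume step can be sketched in a few lines of plain Java. This is a hypothetical single-node version for illustration (class and field names are mine, not from the production code, which keeps this state in Redis):

```java
/** Minimal in-memory token bucket sketch (single node, illustrative only). */
class TokenBucket {
    private final long capacity;
    private final double refillPerNano;  // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;              // start full, so short bursts pass
        this.lastRefill = System.nanoTime();
    }

    /** Consume one token if available; reject the request otherwise. */
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill proportionally to elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Because the bucket starts full and refills continuously, a burst up to `capacity` is admitted immediately while the long-run rate converges to `tokensPerSecond`.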
I implemented this using Redis for distributed state, with a Lua script to make the check-and-decrement atomic. The key insight was using a single Redis key per tenant-endpoint pair with a TTL matching the refill window.
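A script along these lines keeps the read-refill-decrement sequence atomic on the Redis side. This is a sketch, not the production script; the hash field names, argument order, and units are illustrative assumptions:

```lua
-- Hypothetical check-and-decrement script (sketch).
-- KEYS[1] = tenant:endpoint bucket key
-- ARGV: capacity, refill rate (tokens/ms), now (ms), ttl (s)
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local ttl      = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill_ms')
local tokens = tonumber(state[1]) or capacity
local last   = tonumber(state[2]) or now

-- Refill proportionally to elapsed time, capped at capacity.
tokens = math.min(capacity, tokens + (now - last) * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill_ms', now)
redis.call('EXPIRE', KEYS[1], ttl)
return allowed
```

Running this via `EVAL` means no other client can observe the bucket between the read and the write, which is exactly the race a naive GET-then-DECR sequence would have.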
Sliding Window Counters
For endpoints where we needed stricter smoothing, I implemented sliding window counters. Instead of hard window boundaries (which cause burst problems at window edges), this approach weights the previous window's count proportionally.
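The weighting itself is one line of arithmetic. A sketch of the estimate (names are mine, for illustration):

```java
/** Sketch of the sliding-window estimate: blend the previous window's
 *  count with the current one, weighted by overlap with the sliding window. */
final class SlidingWindow {
    private SlidingWindow() {}

    /**
     * @param prevCount requests counted in the previous fixed window
     * @param currCount requests so far in the current window
     * @param elapsedMs time elapsed in the current window
     * @param windowMs  window length
     */
    static double estimate(long prevCount, long currCount,
                           long elapsedMs, long windowMs) {
        // Fraction of the previous window still covered by the sliding window.
        double prevWeight = (windowMs - elapsedMs) / (double) windowMs;
        return prevCount * prevWeight + currCount;
    }
}
```

For example, 15 s into a 60 s window with 100 requests in the previous window and 20 so far, the estimate is 100 × 0.75 + 20 = 95, so a burst right after the boundary no longer sees a freshly zeroed counter.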
Production Considerations
The real complexity isn't the algorithm; it's the operational concerns. How do you handle Redis failures? (I used a local fallback with a generous limit.) How do you surface rate limit status to callers? (Standard headers: X-RateLimit-Remaining and X-RateLimit-Reset.) How do you monitor and tune limits without a deployment? (Dynamic config via a feature flag service.)
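The Redis-failure fallback can be sketched as a small wrapper: try the distributed check, and if it throws, degrade to a generous local decision instead of rejecting traffic. The class and supplier names here are illustrative, not the production code:

```java
import java.util.function.BooleanSupplier;

/** Sketch: fail over to a generous local limit when Redis is unreachable. */
class FallbackLimiter {
    private final BooleanSupplier distributed; // e.g. the Redis-backed check
    private final BooleanSupplier local;       // generous in-process bucket

    FallbackLimiter(BooleanSupplier distributed, BooleanSupplier local) {
        this.distributed = distributed;
        this.local = local;
    }

    boolean tryAcquire() {
        try {
            return distributed.getAsBoolean();
        } catch (RuntimeException e) {
            // Redis is down or timing out: decide locally rather than 429 everyone.
            return local.getAsBoolean();
        }
    }
}
```

The trade-off is deliberate: during a Redis outage, each node enforces its limit independently, so the effective global limit loosens, which is usually preferable to rejecting legitimate traffic.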
Results
After deploying, we saw a 73% reduction in abuse-related incidents with no impact on legitimate traffic. The system makes rate limit decisions in under 2 ms at p99.