Autocomplete / Type-ahead System for a Search Box – Part 2


In Part 1, we covered the problem statement, requirements, and capacity estimation.
In this part, we'll deep-dive into the core data structures, ranking optimizations, and high-level system architecture that make autocomplete systems fast and scalable at Google-level traffic.




📌 Table of Contents (Part 2)

  1. Choosing the Core Data Structure
  2. Optimizing the Trie Structure
  3. How Top-K Suggestions Are Retrieved
  4. Building and Maintaining Top-K Lists
  5. High-Level System Architecture
  6. Storage, Replication, and Availability



Choosing the Core Data Structure

Autocomplete is fundamentally a prefix-matching problem.

Given a prefix, return the most relevant strings that start with it, fast.

This immediately eliminates naive solutions like scanning a list or database table. We need a structure that is prefix-aware by design.




Why a Trie Works Well

A Trie (Prefix Tree) stores strings character by character.

  • Each node represents a prefix
  • All descendants share that prefix



Key Advantages

  • Lookup time depends only on prefix length
  • No scanning of unrelated strings
  • Naturally groups similar queries
  • Perfect fit for prefix search



Example

Stored queries:

be, bee, bell, bent, belt

Trie traversal for prefix "be" lands on the node whose subtree contains:

be, bee, bell, bent, belt

This is exactly what autocomplete needs: fast prefix isolation.
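
To make this concrete, here is a minimal Python sketch of the structure (the article itself ships no code, and the class and method names here are our own illustration):

class TrieNode:
    def __init__(self):
        self.children = {}        # char -> TrieNode
        self.is_terminal = False  # a stored query ends at this node
        self.freq = 0             # search count, used in later sections

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query, freq=1):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.is_terminal = True
        node.freq += freq

    def find_prefix_node(self, prefix):
        """Walk one edge per character: cost depends only on len(prefix)."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

trie = Trie()
for q in ["be", "bee", "bell", "bent", "belt"]:
    trie.insert(q)
assert trie.find_prefix_node("be") is not None  # roots the "be..." subtree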




Optimizing the Trie Structure

A naive trie becomes extremely large when storing billions of queries.



Problem with a Naive Trie

  • Too many nodes
  • Deep chains with single children
  • High memory overhead
  • Excessive pointer chasing



Compressed Trie (Radix Tree)

In many cases, nodes have only one child.
These chains can be merged into a single edge storing multiple characters.



Benefits

  • Reduced memory usage
  • Fewer pointer traversals
  • Faster cache-friendly lookups

This optimized trie is often called a Radix Tree or Compact Trie.
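
As a sketch of that merge step, building on the Trie above (after compression, prefix lookup must match multi-character edge labels, which is omitted here):

def compress(node):
    """Fold non-terminal single-child chains into one multi-character
    edge label, turning the plain trie into a radix tree."""
    for label in list(node.children):
        child = node.children[label]
        # A non-terminal node with exactly one child carries no information
        # beyond its edge label, so absorb it into a longer edge.
        while len(child.children) == 1 and not child.is_terminal:
            (sub, grandchild), = child.children.items()
            del node.children[label]
            label += sub
            node.children[label] = grandchild
            child = grandchild
        compress(child)

compress(trie.root)  # the b-e chain above collapses into a single "be" edge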




Storing Popularity Information

How popular is a full search query?

Each complete query ends at a terminal node in the trie.

At that node, we store:

  • Search frequency
  • Or a weighted score combining frequency + recency



Example

User searches:

"apple" β†’ 100 searches
"application" β†’ 60 searches
"app store" β†’ 200 searches

Trie view:

app
 ├── apple (freq=100)
 ├── application (freq=60)
 └── app store (freq=200)

At this stage:

  • We know popularity of full queries
  • We cannot yet answer prefix queries fast

This layer answers:

"How popular is each query?"
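
Continuing the earlier sketch, the frequencies from this example could be recorded and read back like so:

trie = Trie()
for query, count in [("apple", 100), ("application", 60), ("app store", 200)]:
    trie.insert(query, freq=count)

node = trie.find_prefix_node("apple")
print(node.is_terminal, node.freq)  # True 100: popularity of the full query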




How Top-K Suggestions Are Retrieved

Given a prefix, how do we return the best suggestions instantly?




Naive Approach ❌

For prefix "be":

  1. Traverse to prefix node
  2. Traverse entire subtree
  3. Collect all terminal queries
  4. Sort by frequency
  5. Return top K

🚫 This is far too slow for real-time systems.
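
For contrast, a sketch of this naive path over the earlier Trie; its cost grows with the size of the whole subtree, not with K:

def naive_top_k(trie, prefix, k):
    start = trie.find_prefix_node(prefix)
    if start is None:
        return []
    results = []
    def dfs(node, path):
        if node.is_terminal:
            results.append((node.freq, prefix + path))
        for ch, child in node.children.items():
            dfs(child, path + ch)
    dfs(start, "")                 # visits every node under the prefix
    results.sort(reverse=True)     # O(n log n) over the entire subtree
    return [query for _, query in results[:k]]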




Pre-Computed Top-K Optimization ✅

To guarantee fast responses:

👉 Each trie node stores the top-K suggestions for that prefix.

So at query time:

Traverse to prefix node → return stored top-K

No subtree traversal.
No sorting.
No computation on request path.
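
Assuming each node carries a pre-computed top_k list (its construction is sketched in the next section), the serving path collapses to a short walk:

def lookup(trie, prefix, k=5):
    """Walk len(prefix) edges and return the stored list as-is."""
    node = trie.find_prefix_node(prefix)
    if node is None:
        return []
    return [query for _, query in node.top_k[:k]]  # already ranked offline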




Storage Optimization

To reduce memory overhead:

  • Store references to terminal nodes
  • Store frequency alongside references

This is a classic space-for-latency trade-off, and autocomplete is extremely read-heavy, so it's worth it.




Building and Maintaining Top-K Lists

Top-K lists are built bottom-up.



Build Process

  1. Leaf nodes know their own frequency
  2. Parent nodes merge children's top-K lists
  3. Only the top K entries are retained
  4. The process propagates upward to the root

This ensures:

  • Every prefix node has pre-sorted suggestions
  • Lookup time is O(prefix length)
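
A sketch of that bottom-up pass over the earlier Trie; heapq.nlargest keeps the K best candidates at each node without a full sort:

import heapq

def build_top_k(node, path="", k=5):
    """Offline pass: set node.top_k to the k best (freq, query) pairs
    among this node's own terminal entry and its children's lists."""
    candidates = []
    if node.is_terminal:
        candidates.append((node.freq, path))
    for label, child in node.children.items():
        build_top_k(child, path + label, k)
        candidates.extend(child.top_k)   # children are already finished
    node.top_k = heapq.nlargest(k, candidates)

build_top_k(trie.root)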



Two-Layer Mental Model

Think of autocomplete as two separate layers:



Layer 1: Popularity Tracking (Data Layer)

  • Tracks how often full queries are searched
  • Stored at terminal nodes
  • Updated offline via logs and aggregation



Layer 2: Fast Retrieval (Serving Layer)

  • Uses popularity data to compute rankings
  • Stores only top-K per prefix
  • Optimized purely for read latency



High-Level System Architecture

Autocomplete is a read-heavy, latency-critical pipeline.

Serve from memory first → fall back to structured storage → never compute on the request path

Autocomplete is not a compute problem; it's a data access problem.




Logical Components

User
 ↓
Browser / Mobile Client
 ↓
Global Load Balancer
 ↓
Stateless Application Servers
 ↓
In-Memory Cache (Redis)
 ↓
Distributed Trie Storage

Each layer has a single responsibility, which keeps the system scalable and debuggable.




Step-by-Step Request Flow



1️⃣ Client Layer (Browser / Mobile)

  • Sends only the current prefix
  • Requests are debounced (e.g., 150 ms)
  • Avoids flooding backend while typing

Example:


  "prefix": "how",
  "k": 5,
  "locale": "en-IN"

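
In a browser this debouncing normally lives in JavaScript; to keep one language throughout, here is the same idea as a Python sketch using threading.Timer:

import threading

class Debouncer:
    """Call fn only after `wait` seconds with no newer events."""
    def __init__(self, fn, wait=0.15):
        self.fn, self.wait = fn, wait
        self._timer = None

    def call(self, *args):
        if self._timer is not None:
            self._timer.cancel()  # a newer keystroke supersedes the old one
        self._timer = threading.Timer(self.wait, self.fn, args)
        self._timer.start()

send = Debouncer(lambda p: print("GET /suggest?prefix=" + p))
for prefix in ["h", "ho", "how"]:
    send.call(prefix)  # only "how" actually triggers a request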




2️⃣ Load Balancer

Responsibilities:

  • Route to nearest region
  • Distribute traffic evenly
  • Remove unhealthy servers

At scale:

  • Global LB (GeoDNS / Anycast)
  • Regional LB inside data centers



3️⃣ Application Servers (Stateless)

Act as coordinators, not data holders.

They handle:

  • Input normalization ("How " → "how", sketched below)
  • Validation
  • Cache orchestration
  • Fallback logic

Stateless design means:

  • Easy horizontal scaling
  • No data migration during scale-up
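
The normalization and validation steps listed above might look like this (the 100-character cap is an assumed limit, not from the source):

MAX_PREFIX_LEN = 100  # assumed limit, for illustration only

def normalize(prefix: str) -> str:
    return prefix.strip().lower()  # "How " -> "how"

def is_valid(prefix: str) -> bool:
    return 0 < len(prefix) <= MAX_PREFIX_LEN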



4️⃣ Redis Cache (Hot Prefix Layer)

First data stop.

Cache Key

autocomplete:locale:prefix

Example:

autocomplete:en:how

Value:

[
  "how to cook rice",
  "how to lose weight",
  "how to make tea"
]

Why Redis works well:

  • Small set of prefixes dominates traffic
  • Sub-millisecond memory access
  • TTL removes stale data automatically

Prefixes like "a", "th", and "how" are often served from cache 99% of the time.
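
With the redis-py client, this cache layer could be sketched as follows (host, port, and TTL are placeholder values):

import json
import redis  # redis-py client, assumed installed

r = redis.Redis(host="localhost", port=6379)  # placeholder address

def cache_key(locale: str, prefix: str) -> str:
    return f"autocomplete:{locale}:{prefix}"

def get_cached(locale, prefix):
    raw = r.get(cache_key(locale, prefix))
    return json.loads(raw) if raw else None

def set_cached(locale, prefix, suggestions, ttl_seconds=1800):
    # ex= attaches the TTL, so stale entries expire on their own
    r.set(cache_key(locale, prefix), json.dumps(suggestions), ex=ttl_seconds)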




5️⃣ Cache Miss → Trie Lookup

On cache miss:

  • Query distributed trie storage
  • Trie node already contains pre-computed top-K
  • No traversal or sorting at runtime

Important:

  • Trie servers are read-only on serving path
  • Writes and rebuilds happen offline



6️⃣ Response & Cache Population

  • Results returned to client
  • Same response written back to Redis
  • TTL applied (10–30 minutes)

Cold prefixes naturally become hot.
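
Tying the earlier sketches together, the whole request path is a read-through cache. Here trie stands for whatever handle the app server has to the distributed trie store; in the local sketches above it is simply the in-memory Trie:

def suggest(locale: str, prefix: str, k: int = 5):
    prefix = normalize(prefix)
    cached = get_cached(locale, prefix)
    if cached is not None:
        return cached                    # hot path: memory only
    results = lookup(trie, prefix, k)    # pre-computed top-K, no sorting
    set_cached(locale, prefix, results)  # the cold prefix is now hot
    return results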




Failure Handling



Redis Down

  • App servers bypass cache
  • Read directly from trie
  • Slight latency increase, system remains available
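
In code, the bypass is a narrow wrapper around the cache read (continuing the redis-py sketch above):

def get_cached_safe(locale, prefix):
    try:
        return get_cached(locale, prefix)
    except redis.exceptions.ConnectionError:
        return None  # treat a cache outage as a miss; the trie still answers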



Trie Shard Down

  • Traffic routed to replica
  • Load balancer avoids unhealthy node



App Server Down

  • Instantly removed
  • No data loss (stateless)



Regional Deployment

User (India)
   ↓
India Load Balancer
   ↓
India App Servers
   ↓
India Redis
   ↓
India Trie Replica

Each region:

  • Serves local users
  • Has independent cache
  • Shares replicated trie data



Conceptual Architecture Diagram

+---------+      +---------------+      +--------------------+
|  Client | ---> | Load Balancer | ---> | Application Server |
+---------+      +---------------+      +--------------------+
                                                  |
                         Cache Hit                | Cache Miss
                        +-----------+             v
                        |   Redis   |     +-------------------+
                        +-----------+     | Distributed Trie  |
                                          |      Storage      |
                                          +-------------------+




Storage, Replication, and Availability

A full autocomplete trie cannot live on a single machine.




Distributed Trie Storage

The trie is:

  • Logically one structure
  • Physically distributed across many servers



Partitioning Strategy

Common approach: prefix-based partitioning

Examples:

  • a–c → Shard 1
  • d–f → Shard 2
  • Hot prefixes (a, s, th) further subdivided

This keeps shards balanced and scalable.
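
A hypothetical routing table for this scheme (the shard names and ranges are illustrative only):

SHARD_RANGES = [
    ("a", "c", "shard-1"),
    ("d", "f", "shard-2"),
    # hot single letters would get dedicated, further-split shards
]

def shard_for(prefix: str) -> str:
    first = prefix[0]
    for lo, hi, shard in SHARD_RANGES:
        if lo <= first <= hi:
            return shard
    return "shard-default"  # catch-all for unmapped characters

print(shard_for("banana"))  # shard-1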




Replication Strategy

Each trie partition is replicated across multiple servers.



Why Replication Matters

  • High Availability
    Node failures don't impact service

  • Fault Tolerance
    Handles hardware failures and deployments

  • Read Scalability
    Replicas share massive QPS load




Storage Technology Choices



Cassandra (Durable Storage)

Best fit when:

  • Large trie partitions
  • Very high read throughput
  • Cross-region replication needed

Usage pattern:

  • Trie partitions stored as serialized blocks
  • Loaded into memory on startup
  • Cassandra acts as durable backing store

Strengths:

  • Horizontal scalability
  • High availability
  • Tunable consistency
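
With the DataStax Python driver, that usage pattern could be sketched as below; the keyspace, table, and column names are hypothetical:

from cassandra.cluster import Cluster  # DataStax driver, assumed installed

cluster = Cluster(["127.0.0.1"])           # placeholder contact point
session = cluster.connect("autocomplete")  # hypothetical keyspace

# Hypothetical schema: one serialized trie partition per (shard, version):
#   CREATE TABLE trie_blocks (shard text, version text, block blob,
#                             PRIMARY KEY (shard, version));
row = session.execute(
    "SELECT block FROM trie_blocks WHERE shard=%s AND version=%s",
    ("shard-1", "v42"),
).one()
# On startup, a serving node deserializes row.block into its in-memory trie.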



ZooKeeper (Metadata & Coordination)

ZooKeeper is not used to store trie data.

It is used for:

  • Partition metadata
  • Prefix-to-server mappings
  • Leader election (if required)
  • Atomic version switching during trie rebuilds
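
The atomic version switch could be sketched with the kazoo client; the znode path and version string are hypothetical:

from kazoo.client import KazooClient  # kazoo client, assumed installed

zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder address
zk.start()

VERSION_PATH = "/autocomplete/trie/active_version"  # hypothetical layout
zk.ensure_path(VERSION_PATH)

# The offline rebuild publishes a new snapshot, then flips one pointer.
# Readers see either the old or the new version, never a mix:
zk.set(VERSION_PATH, b"v42")

@zk.DataWatch(VERSION_PATH)
def on_version_change(data, stat):
    if data:
        print("hot-swap to trie snapshot", data.decode())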






🎯 Final Takeaway (Part 2)

A real-world autocomplete system succeeds because it:

  • Uses tries for prefix isolation
  • Pre-computes top-K suggestions
  • Serves requests from memory first
  • Pushes all heavy computation offline
  • Scales via partitioning and replication

This is the level of clarity interviewers look for.


