Russian nesting dolls (Matryoshka) are wooden dolls where each one opens to reveal a smaller doll inside, which opens to reveal another, and so on. Each doll represents an operation in your distributed system, and the way they nest shows how operations relate in time and causality. That’s distributed tracing with spans.
The Single Doll (Monolithic System)
In monolithic systems, debugging is like examining a single wooden doll. You see everything at once:
- Start time: When you picked it up
- End time: When you put it down
- What happened: You looked at it
Simple. But modern systems aren’t single dolls - they’re entire collections working together.
The Nesting Begins (Distributed Calls)
A request unfolds like opening a set of nesting dolls:
- Outermost Doll (API Gateway): “Handle user request” - 500ms total
- Second Doll (User Service): “Validate authentication” - 50ms
- Third Doll (Database): “Look up user” - 20ms
- Fourth Doll (Cache): “Check permissions” - 5ms
Each doll (span) fits inside its parent, showing the relationship and timing:
API Gateway: handle user request (500ms)
└── User Service: validate authentication (50ms)
    └── Database: look up user (20ms)
        └── Cache: check permissions (5ms)
Why Dolls Have Unique IDs
Each doll needs identification:
- Trace ID: The family name (all dolls in this set share it)
- Span ID: Each doll’s unique serial number
- Parent ID: Which doll this one was inside
Like real nesting dolls, you can’t have a tiny doll without the larger ones that contain it. Every span knows its parent.
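The ID scheme above can be sketched in a few lines. This is a minimal, hypothetical `Span` type, not a real tracing SDK (tracers like OpenTelemetry generate these IDs for you):

```python
import secrets
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str             # family name: shared by every doll in the set
    span_id: str              # this doll's unique serial number
    parent_id: Optional[str]  # the doll this one was inside (None for the root)

def new_root_span(name: str) -> Span:
    # The outermost doll starts a new trace: fresh trace_id, no parent.
    return Span(name, trace_id=secrets.token_hex(16),
                span_id=secrets.token_hex(8), parent_id=None)

def new_child_span(name: str, parent: Span) -> Span:
    # Children inherit the trace_id and record their parent's span_id.
    return Span(name, trace_id=parent.trace_id,
                span_id=secrets.token_hex(8), parent_id=parent.span_id)

root = new_root_span("handle_user_request")
child = new_child_span("validate_authentication", root)
assert child.trace_id == root.trace_id   # same doll set
assert child.parent_id == root.span_id   # nested inside the root
```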
The Doll Workshop (Span Creation)
When your system handles a request, it’s like a workshop creating dolls:
- First, open the outermost doll labeled “handle_order”
- Inside that, create and open a “validate_payment” doll
- Inside the payment doll, nest a “check_card” doll
- The actual credit card check happens in this innermost doll
- As each operation completes, its doll closes, recording the time
This nesting structure captures the relationship between operations - payment validation contains card checking, and order handling contains payment validation.
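The workshop steps above map naturally onto nested context managers. Here is a sketch using a hypothetical in-memory tracer (not the real OpenTelemetry API), just to show how the nesting and close-order fall out:

```python
import time
from contextlib import contextmanager

finished = []   # closed dolls, in the order they complete
_stack = []     # currently open dolls

@contextmanager
def span(name):
    record = {"name": name,
              "parent": _stack[-1]["name"] if _stack else None,
              "start": time.monotonic()}
    _stack.append(record)
    try:
        yield record
    finally:
        _stack.pop()
        record["end"] = time.monotonic()  # closing the doll records the time
        finished.append(record)

with span("handle_order"):
    with span("validate_payment"):
        with span("check_card"):
            pass  # the actual credit card check happens here

# Innermost dolls close first, and each one knows who contained it.
assert [s["name"] for s in finished] == ["check_card", "validate_payment", "handle_order"]
assert finished[0]["parent"] == "validate_payment"
```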
Each doll records:
- When it was opened (start time)
- When it was closed (end time)
- What happened inside (attributes)
- Any problems (errors)
Decorating the Dolls (Span Attributes)
Real nesting dolls have painted details. Spans have attributes:
Order Processing Doll:
- customer_id: 12345
- order_total: $99.99
- items_count: 3
- payment_method: "credit_card"
- region: "us-west"
- status: "completed"
These attributes help you understand what each operation was doing.
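In code, painting the details is just attaching key-value pairs as the span runs. A sketch with a plain dict standing in for a span object (a real SDK exposes the same idea as something like a `set_attribute` call):

```python
order_span = {"name": "process_order", "attributes": {}}

def set_attribute(span, key, value):
    # Paint one detail onto the doll.
    span["attributes"][key] = value

set_attribute(order_span, "customer_id", 12345)
set_attribute(order_span, "order_total", 99.99)
set_attribute(order_span, "items_count", 3)
set_attribute(order_span, "payment_method", "credit_card")
set_attribute(order_span, "region", "us-west")
set_attribute(order_span, "status", "completed")
```

Later, these attributes are what let you filter and group traces: all slow orders from `us-west`, all failures paid by `credit_card`, and so on.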
Finding the Broken Doll
When something goes wrong, distributed tracing gives you X-ray vision for your nesting dolls:
- User complains: “My order took forever!”
- Look at the outermost doll: Total time 10 seconds
- Open each nested doll:
  - API Gateway: 100ms
  - Order Service: 9.8s
  - Payment Service: 9.5s
  - External Payment API: 9.4s (found it!)
Without tracing, you’d be shaking the entire set wondering which doll is rattling.
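That drill-down can be automated: starting at the root, keep opening the slowest inner doll until you hit a leaf. A sketch over hypothetical span dicts mirroring the example above:

```python
spans = [
    {"name": "request",              "parent": None,              "ms": 10000},
    {"name": "api_gateway",          "parent": "request",         "ms": 100},
    {"name": "order_service",        "parent": "request",         "ms": 9800},
    {"name": "payment_service",      "parent": "order_service",   "ms": 9500},
    {"name": "external_payment_api", "parent": "payment_service", "ms": 9400},
]

def slowest_path(spans):
    children = {}
    for s in spans:
        children.setdefault(s["parent"], []).append(s)
    path, node = [], children[None][0]               # start at the outermost doll
    while True:
        path.append(node["name"])
        kids = children.get(node["name"])
        if not kids:
            return path
        node = max(kids, key=lambda s: s["ms"])      # open the slowest inner doll

print(slowest_path(spans))
# ['request', 'order_service', 'payment_service', 'external_payment_api']
```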
Parallel Dolls (Concurrent Spans)
Sometimes operations happen in parallel, like having multiple smaller dolls at the same level:
Shopping Cart Doll contains:
├── Check Inventory Doll (100ms)
├── Calculate Shipping Doll (150ms)
└── Apply Discounts Doll (50ms)
All happening simultaneously!
The shopping cart operation finishes when all inner dolls are complete.
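A quick sketch of that parallelism with a thread pool; the span names come from the tree above, and the work is simulated with sleeps:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def traced(name, seconds):
    start = time.monotonic()
    time.sleep(seconds)          # stand-in for the real work
    return name, time.monotonic() - start

t0 = time.monotonic()
with ThreadPoolExecutor() as pool:   # the shopping-cart span opens here
    results = list(pool.map(
        traced,
        ["check_inventory", "calculate_shipping", "apply_discounts"],
        [0.10, 0.15, 0.05]))
cart_duration = time.monotonic() - t0

# The parent span's duration tracks the slowest child, not the sum.
assert cart_duration < 0.10 + 0.15 + 0.05
```

In a trace view, the three child spans overlap on the timeline instead of stacking end to end, which is exactly how you spot work that is (or could be) parallel.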
The Doll Collection System (Trace Collectors)
A distributed system creates thousands of these doll sets every second. You need:
- Doll Collectors (Agents): Gather dolls from each service
- Doll Warehouse (Storage): Keep organized collections
- Doll Catalog (UI): Browse and search your collection
- Doll Analyzer (Analytics): Find patterns in your dolls
Jaeger, Zipkin, and AWS X-Ray are sophisticated doll museums.
Sampling: Not Every Doll Makes It to the Museum
Keeping every single doll would fill your warehouse instantly. Instead, you sample:
- Keep all broken dolls (errors)
- Keep all slow dolls (high latency)
- Keep 1% of normal dolls (sampling)
- Keep all VIP dolls (important customers)
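Those rules boil down to a small decision function. A sketch, where the thresholds and the VIP list are made-up values:

```python
import random

VIP_CUSTOMERS = {"acme-corp"}   # hypothetical "important customer" list
SLOW_MS = 1000                  # hypothetical latency threshold
BASELINE_RATE = 0.01            # keep 1% of normal traces

def keep_trace(error: bool, duration_ms: float, customer: str) -> bool:
    if error:                               # keep all broken dolls
        return True
    if duration_ms > SLOW_MS:               # keep all slow dolls
        return True
    if customer in VIP_CUSTOMERS:           # keep all VIP dolls
        return True
    return random.random() < BASELINE_RATE  # 1% of everything else

assert keep_trace(error=True, duration_ms=20, customer="anyone")
assert keep_trace(error=False, duration_ms=5000, customer="anyone")
assert keep_trace(error=False, duration_ms=20, customer="acme-corp")
```

Note that error- and latency-based rules require seeing the whole trace first (tail-based sampling), while the 1% baseline can be decided up front (head-based sampling).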
The Context Propagation Magic
The clever bit is how each service knows which doll it’s supposed to be inside. When Service A calls Service B, it passes along a note: “You’re inside Doll XYZ, your parent is Doll ABC.” This context propagation ensures each doll ends up in the right set.
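In practice that "note" is usually an HTTP header. A sketch using the W3C Trace Context `traceparent` format (`version-traceid-spanid-flags`); the service call itself is simulated:

```python
def inject(trace_id: str, span_id: str) -> dict:
    # Service A writes the note it sends along with the outgoing request.
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}

def extract(headers: dict):
    # Service B reads the note: which trace it belongs to, who its parent is.
    _, trace_id, parent_span_id, _ = headers["traceparent"].split("-")
    return trace_id, parent_span_id

headers = inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
trace_id, parent = extract(headers)
assert trace_id == "4bf92f3577b34da6a3ce929d0e0e4736"
assert parent == "00f067aa0ba902b7"
```

Service B then starts its own span with that `trace_id` and `parent_span_id`, so its doll lands inside the right set.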
Real-World Doll Collections
Here’s what tracing reveals in production:
The Mysterious Slowdown: Opening the dolls shows a service making 100 database calls in a loop - each a tiny doll adding up to massive delay.
The Cascade Failure: One broken doll (failed service) contains dozens of retry dolls, each containing timeout dolls. The pattern is clear in the nesting.
The Performance Optimization: You notice some dolls are unnecessarily nested - sequential operations that could be parallel dolls instead.
Decision Rules
Add tracing when:
- Your system has more than three services communicating
- You’re debugging latency issues that are hard to reproduce
- You need to understand dependencies between services
Start with sampling at 1% and increase for errors and slow traces. Use OpenTelemetry for vendor-neutral instrumentation.
Somewhere in that nested collection is the doll that’s taking too long, failing silently, or just painted the wrong color. With proper tracing, you can open each layer methodically until you find exactly which doll needs fixing.