Parallel Log Cruncher — Language-Agnostic Learning Spec

Build this in any language you’re learning.


Mandatory

1. Parse a log file

timestamp level ip path method status duration
2026-04-23T14:22:01Z INFO 192.168.1.10 /api/users GET 200 15ms

What to do: Read the file one line at a time. Split each line by whitespace. Validate each field:

- timestamp: ISO-8601 UTC, e.g. 2026-04-23T14:22:01Z
- level: one of INFO, WARN, ERROR, DEBUG
- ip: a dotted-quad IPv4 address
- path: begins with /
- method: an HTTP verb (GET, POST, PUT, DELETE, PATCH)
- status: a three-digit integer
- duration: an integer with an ms suffix, e.g. 15ms

Skip any line that doesn't match. Print a warning to stderr for each skipped line. Keep a count of valid vs. skipped lines.

Line-by-line matters — don’t read the whole file into memory at once. Real log files can be gigabytes. Use the language’s buffered reader.
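A sketch of that loop in Python (the LogLine record and the regex are illustrative choices, not part of the spec; any split-and-validate approach works):

```python
import re
import sys
from dataclasses import dataclass
from typing import Iterator, TextIO

# One regex validates all seven fields at once; split-and-check is equally valid.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) "
    r"(?P<level>INFO|WARN|ERROR|DEBUG) "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<path>/\S*) "
    r"(?P<method>GET|POST|PUT|DELETE|PATCH) "
    r"(?P<status>\d{3}) "
    r"(?P<duration>\d+)ms$"
)

@dataclass
class LogLine:
    ts: str
    level: str
    ip: str
    path: str
    method: str
    status: int
    duration_ms: int

def parse_lines(src: TextIO) -> Iterator[LogLine]:
    """Yield valid lines one at a time; warn on stderr for the rest."""
    for lineno, raw in enumerate(src, 1):  # file objects are already buffered
        m = LINE_RE.match(raw.rstrip("\n"))
        if m is None:
            print(f"warning: line {lineno} skipped: {raw.rstrip()}", file=sys.stderr)
            continue
        yield LogLine(m["ts"], m["level"], m["ip"], m["path"],
                      m["method"], int(m["status"]), int(m["duration"]))
```

Because parse_lines is a generator over an already-buffered stream, memory use stays flat no matter how large the file is.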

2. Compute analytics

From the parsed lines, compute:

- total_requests: the number of valid lines
- skipped_lines: the number of lines that failed validation
- errors_per_endpoint: the count of requests with status >= 400, grouped by path
- p95_latency_ms: the 95th-percentile request duration for each path
- top_ips: the most frequent client IPs, with their request counts

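
A sketch of these aggregations in Python, assuming parsed records with path, ip, status, and duration_ms attributes (names are illustrative) and using the nearest-rank definition of P95, which is one common choice:

```python
import math
from collections import Counter, defaultdict

def p95(durations):
    """Nearest-rank 95th percentile: sort, take the ceil(0.95 * n)-th element (1-based)."""
    ranked = sorted(durations)
    idx = max(0, math.ceil(0.95 * len(ranked)) - 1)
    return ranked[idx]

def analyze(rows, skipped=0, top_n=10):
    """rows: an iterable of records with .path, .ip, .status, .duration_ms."""
    total = 0
    errors = Counter()
    durations_by_path = defaultdict(list)
    ips = Counter()
    for r in rows:
        total += 1
        durations_by_path[r.path].append(r.duration_ms)
        ips[r.ip] += 1
        if r.status >= 400:
            errors[r.path] += 1
    return {
        "total_requests": total,
        "skipped_lines": skipped,
        "errors_per_endpoint": dict(errors),
        "p95_latency_ms": {p: p95(d) for p, d in durations_by_path.items()},
        "top_ips": [{"ip": ip, "count": n} for ip, n in ips.most_common(top_n)],
    }
```

The off-by-one in the percentile index is the classic bug here; writing p95 as its own function makes it easy to test in isolation.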
3. Output

Write report.json containing:

{
  "total_requests": 15000,
  "skipped_lines": 3,
  "errors_per_endpoint": { "/api/users": 12, "/api/checkout": 45 },
  "p95_latency_ms": { "/api/users": 120, "/api/checkout": 890 },
  "top_ips": [
    { "ip": "192.168.1.10", "count": 2300 },
    { "ip": "10.0.0.5", "count": 1800 }
  ]
}

Use the language’s standard JSON serializer. Don’t build strings manually.
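
In Python, for instance, this is the standard json module (the write_report helper name is mine, not from the spec):

```python
import json

def write_report(report, path="report.json"):
    """Serialize with the stdlib encoder; never hand-build the JSON string."""
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
```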

4. CLI

logcrunch <file>

The tool runs the analytics and writes report.json.
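
With Python's argparse, the CLI might look like this sketch (the optional flags come from the sections below; the --workers default is my assumption, since the spec gives none):

```python
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser(
        prog="logcrunch",
        description="Parse a log file and write report.json")
    p.add_argument("file", help="log file to process")
    # Flags for the optional features; --port default 8080 is from the spec.
    p.add_argument("--workers", type=int, default=4, help="concurrent lookup tasks")
    p.add_argument("--port", type=int, default=8080, help="HTTP server port")
    p.add_argument("--bench", action="store_true",
                   help="print timings, skip the server and report.json")
    return p.parse_args(argv)
```

A library like this gives you --help text, type validation, and missing-argument errors for free.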


Optional

| Option | What you learn |
| --- | --- |
| Geolocation + rate limiting | Async HTTP client, bounded concurrency, rate limiter, shared state, graceful error handling |
| HTTP server | Web framework, routing, JSON responses, graceful shutdown |
| Benchmark mode | Timing instrumentation, reporting, separating measurement from logic |
| Unit tests | Test structure, fixtures, assertion patterns, code coverage |
| Streaming parse (stdin) | Reading from pipes, handling unbounded input, testable with echo ... \| logcrunch |

Geolocation + rate limiting

For each unique IP, fetch its country code:

GET http://ip-api.com/json/{ip}
{"countryCode": "US", ...}

Constraints:

- Run at most --workers lookups concurrently.
- Rate-limit the requests; ip-api.com throttles free-tier clients, so stay under its documented quota.
- Put a timeout on every request so one slow response can't stall a worker.
- An IP whose lookup fails is reported as unknown.

How: Collect unique IPs from parsed lines. Spawn --workers concurrent tasks. Each worker acquires a permit from a rate limiter before making the request. Write results to a shared map protected by the language’s concurrency primitive (mutex, channel, actor).

If the API is down: The tool should still complete — mark all unknown, print a warning, continue.
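
One way to sketch this in Python with asyncio (the HTTP call is injected as fetch so the shape is testable offline; a real fetch would GET http://ip-api.com/json/{ip} and read countryCode; the pacing-style RateLimiter is one simple design, a token bucket is another):

```python
import asyncio
import sys
import time

class RateLimiter:
    """Paces acquisitions so at most `rate` happen per second."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next = 0.0  # earliest monotonic time the next permit may be issued

    async def acquire(self):
        async with self._lock:
            now = time.monotonic()
            wait = self._next - now
            self._next = max(now, self._next) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)

async def lookup_all(ips, fetch, workers=4, rate=10.0):
    """fetch(ip) is an async callable returning a country code (or raising)."""
    queue = asyncio.Queue()
    for ip in set(ips):  # unique IPs only
        queue.put_nowait(ip)
    limiter = RateLimiter(rate)
    results = {}  # one event loop, no preemption inside dict ops: no mutex needed

    async def worker():
        while True:
            try:
                ip = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await limiter.acquire()
            try:
                results[ip] = await fetch(ip)
            except Exception as exc:
                print(f"warning: lookup failed for {ip}: {exc}", file=sys.stderr)
                results[ip] = "unknown"

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

In a threaded language the results map would need a mutex or channel; under a single asyncio event loop a plain dict is safe, which is exactly the kind of trade-off this exercise is meant to surface.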

HTTP server

After processing the file, start a web server on the port specified by --port (default 8080).

Routes:

How: Start the server only after all processing is done. Bind to 0.0.0.0. Handle SIGINT/SIGTERM for graceful shutdown.
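
A minimal Python sketch of that lifecycle using the stdlib server (the /report route is hypothetical, since the route list is not specified above):

```python
import json
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def make_handler(report):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/report":  # hypothetical route
                body = json.dumps(report).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)
        def log_message(self, *args):  # silence per-request logging
            pass
    return Handler

def serve(report, port=8080):
    server = ThreadingHTTPServer(("0.0.0.0", port), make_handler(report))
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    stop = threading.Event()
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, lambda *_: stop.set())
    stop.wait()          # block until a signal arrives
    server.shutdown()    # finish in-flight requests, then stop accepting
    server.server_close()
    thread.join()
```

Note that shutdown() must be called from a different thread than serve_forever(), which is why the signal handler only sets an event.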

Benchmark mode

--bench flag: process the file, print timing breakdown to stdout, then exit. Do NOT start the HTTP server. Do NOT write report.json.

Output:

parse: 1.24s
analytics: 0.08s
total: 1.32s
lines: 15000
parse_rate: 12096 lines/s
memory_peak: 24.2 MB

How: Wrap each phase with a high-resolution timer. Report memory if the language provides it (otherwise skip).
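
In Python, that wrapper might be a small context manager around each stage (names are illustrative; memory reporting via tracemalloc is left out for brevity):

```python
import time

class Phase:
    """Context manager that records one phase's wall-clock time into `timings`."""
    def __init__(self, name, timings):
        self.name, self.timings = name, timings
    def __enter__(self):
        self._start = time.perf_counter()  # high-resolution monotonic clock
        return self
    def __exit__(self, *exc):
        self.timings[self.name] = time.perf_counter() - self._start

def print_bench(timings, lines):
    for name, secs in timings.items():
        print(f"{name}: {secs:.2f}s")
    print(f"total: {sum(timings.values()):.2f}s")
    print(f"lines: {lines}")
    print(f"parse_rate: {lines / timings['parse']:.0f} lines/s")
```

Usage: `with Phase("parse", timings): ...` around parsing, another around analytics, then print_bench(timings, line_count). Keeping the timer outside the measured functions keeps measurement separate from logic.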

Unit tests

Test the parser in isolation — feed it known-good and known-bad lines, verify it returns the right struct or error.

Test analytics — feed it a parsed dataset with known values, verify P95, error counts, and top IPs match.

Test output — parse a small file, run analytics, check that report.json has the expected schema and values.

How: Use the language’s standard test framework. Put test data inline (small strings) or in a testdata/ directory. Test edge cases: empty file, all errors, single line, malformed fields.
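
For example, with Python's standard unittest framework and a toy stand-in parser (parse_line here is illustrative, not the real parser):

```python
import unittest

def parse_line(line):
    """Toy stand-in: returns the fields if the line has exactly 7, else None."""
    parts = line.split()
    return parts if len(parts) == 7 else None

class ParserTest(unittest.TestCase):
    GOOD = "2026-04-23T14:22:01Z INFO 192.168.1.10 /api/users GET 200 15ms"

    def test_valid_line(self):
        self.assertIsNotNone(parse_line(self.GOOD))

    def test_malformed_line(self):
        self.assertIsNone(parse_line("not enough fields"))

    def test_empty_line(self):
        self.assertIsNone(parse_line(""))

if __name__ == "__main__":
    unittest.main()
```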

Streaming parse (stdin)

Accept input from stdin when no file argument is given:

cat access.log | logcrunch

How: If <file> is missing or -, read from stdin. Same line-by-line loop, different source.
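
In Python, the source selection can live in one small helper (open_source is an illustrative name), so the parsing loop never knows or cares where its lines come from:

```python
import sys
from contextlib import nullcontext

def open_source(arg):
    """Return a context manager yielding a line-iterable text stream.
    stdin is wrapped in nullcontext so a `with` block won't close it."""
    if arg is None or arg == "-":
        return nullcontext(sys.stdin)
    return open(arg)
```

Usage: `with open_source(args.file) as src:` then iterate src line by line exactly as before.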


Concepts

| Concept | Why it matters |
| --- | --- |
| Buffered I/O | Reading line by line without loading the whole file into memory. Every language has a BufReader or equivalent. |
| String splitting / pattern matching | Parsing structured text. Some languages use regex, some use destructuring, some use split-and-index. The trade-offs matter. |
| Structs / records / data classes | Defining typed data. How the language models structured data affects everything downstream. |
| Sum types / enums | Level (INFO, WARN, ERROR, DEBUG) is a natural enum. Does the language have real sum types or do you fake it with strings? |
| Serialization (JSON) | Converting memory to wire format. Derive macros vs reflection vs manual. |
| Error handling | What happens when a line is malformed, a file doesn't exist, or a network request fails. Idiomatic vs try-catch vs Result types. |
| Concurrency primitives | Threads, async/await, goroutines, fibers: how the language models parallel work. |
| Rate limiting | A real-world constraint that forces you to think about time and resource fairness. |
| Shared mutable state | Multiple tasks writing to the same map. Mutex? Channels? Actors? STM? |
| HTTP client / server | Making and serving requests. Timeouts, connection pooling, middleware. |
| CLI argument parsing | Libraries that generate help text, validate types, and handle missing arguments. |
| Percentile calculation | Sorting + indexing. Straightforward but easy to get wrong (off-by-one on the index). |
| Graceful shutdown | Catching SIGINT, finishing in-flight work, flushing output, then exiting cleanly. |

Resources

Buffered I/O

JSON serialization

Async I/O

Rate limiting patterns

Worker pool pattern

Concurrency vs parallelism

Shared mutable state

Graceful shutdown

Percentile calculation

Signal handling

Testing philosophy

CLI argument parsing patterns (general)


All resources are about concepts, not language APIs. If you need the API docs for a specific feature, search: "<language>" "<concept>".


Quick log generator (bash)

i=0; s=42; l=(INFO WARN ERROR DEBUG); m=(GET POST PUT DELETE PATCH); p=(/api/users /api/checkout /api/login /api/products /api/search /api/orders /api/reviews)
RANDOM=$s; while ((i++ < ${1:-1000})); do
  t=$(date -u -d "2026-04-23T00:00:00+00:00 + $i seconds" +%FT%TZ 2>/dev/null || date -u -j -v+${i}S -f "%Y-%m-%dT%H:%M:%S%z" "2026-04-23T00:00:00+0000" +%FT%TZ 2>/dev/null || printf "2026-04-23T%02d:%02d:%02dZ" $((i/3600)) $(((i%3600)/60)) $((i%60)))
  st=200; ((RANDOM > 28000)) && st=$((400+RANDOM%100))
  echo "$t ${l[$RANDOM%4]} $((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)) ${p[$RANDOM%7]} ${m[$RANDOM%5]} $st $((RANDOM%500+1))ms"
done > test.log