Level 06 · 85 → 100
Production-grade thinking
What changes when real users hit your code on a real URL. Security at the conceptual level, error handling, logs, monitoring, performance, scaling, technical debt, code review culture, on-call, incident response, postmortems. The mental shift from 'does it work' to 'does it keep working under stress.'
The shift to production
Up to here, the goal has been 'does it work'. From here, the goal is 'does it keep working under stress'. Same code, different question.
Production introduces things you didn't have to think about locally:
- Adversaries. People who want to break in, scrape data, or knock the site over.
- Concurrency. A thousand users hitting the same endpoint at the same instant.
- Imperfect networks. Slow connections, dropped requests, retries.
- Drift. The data in production isn't your tidy seed data — it's everything users have managed to put in.
- Visibility. When something breaks, you need to know. Without monitoring, you're flying blind.
- Cost. Every query, log line, and request is a real bill someone pays.
What follows is the toolkit and vocabulary for handling all of that — at the conceptual level, not implementation level.
Security at the conceptual level
Security is a deep field. You don't need to be a practitioner to be fluent — you need to recognize the categories and know not to roll your own. The starting point is the OWASP Top 10, the canonical, periodically updated list of common web vulnerabilities (see References).
The vocabulary you'll meet:
XSS (Cross-Site Scripting)
An attacker injects JavaScript that runs in another user's browser, on your site. Classic example: a comment field where someone enters `<script>steal cookies</script>` and you display it raw. Defense: escape user-supplied content before rendering it. React does this by default; `dangerouslySetInnerHTML` is the escape hatch where things go wrong.
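To show what "escape before rendering" means, here is a minimal hand-rolled escaper. It's a sketch of the concept, not a recommendation: in practice your framework does this for you.

```ts
// Minimal HTML escaper, for illustration only. Frameworks like React
// escape by default, so you rarely write this yourself.
function escapeHtml(unsafe: string): string {
  return unsafe
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The injected script renders as inert text instead of executing:
console.log(escapeHtml("<script>steal cookies</script>"));
// -> &lt;script&gt;steal cookies&lt;/script&gt;
```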
CSRF (Cross-Site Request Forgery)
Attacker tricks a logged-in user's browser into making a request to your site, using their cookies. Defense: CSRF tokens on state-changing requests, and SameSite cookies. Frameworks handle most of this for you.
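A sketch of the cookie half of the defense, assuming a hypothetical Node/Express-style response object:

```ts
// Hypothetical Express-style response object, for illustration.
declare const res: { setHeader(name: string, value: string): void };

res.setHeader(
  "Set-Cookie",
  // SameSite=Lax: browsers omit the cookie on most cross-site requests,
  // which blunts CSRF. HttpOnly keeps it out of reach of page JavaScript.
  "session=abc123; HttpOnly; Secure; SameSite=Lax; Path=/"
);
```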
SQL injection
Attacker puts SQL syntax into a form field and your code concatenates it into a query: `'; DROP TABLE users; --`. Defense: parameterized queries. ORMs (Prisma, Drizzle) do this by default; the trap is when someone writes raw SQL with string concatenation.
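To make the contrast concrete, a sketch assuming the node-postgres (`pg`) client; the shape is the same in any driver:

```ts
import { Pool } from "pg"; // assuming node-postgres

const pool = new Pool();

// VULNERABLE: user input is spliced into the SQL string, so
// "'; DROP TABLE users; --" becomes part of the query itself.
async function findUserUnsafe(email: string) {
  return pool.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// SAFE: the value travels separately from the SQL text and can
// never be interpreted as syntax.
async function findUser(email: string) {
  return pool.query("SELECT * FROM users WHERE email = $1", [email]);
}
```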
Auth bypass / IDOR (Insecure Direct Object Reference)
Endpoint expects authentication but checks weakly, or doesn't check ownership. `GET /api/orders/123` returns order 123 — but did you check the requester actually owns it? Most "data leak" stories are this. Defense: always check both authn (who is this?) and authz (are they allowed?) on every endpoint.
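The pattern in sketch form, with a hypothetical `db` standing in for your ORM:

```ts
type Order = { id: string; ownerId: string };
declare const db: { findOrder(id: string): Promise<Order | null> }; // hypothetical

async function getOrder(requesterId: string | null, orderId: string) {
  if (!requesterId) throw new Error("401: not authenticated"); // authn
  const order = await db.findOrder(orderId);
  if (!order) throw new Error("404: not found");
  if (order.ownerId !== requesterId) throw new Error("403: forbidden"); // authz: the IDOR check
  return order;
}
```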
Secrets in code / repos
Database password, API key, signing key committed to git. Even a private repo isn't safe — repos get cloned, leaked, made public by mistake. Defense: secrets in environment variables, never in code. Rotate immediately if leaked. Use secret-scanning (GitHub does this for free).
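In code, the rule looks like this (`STRIPE_API_KEY` is a hypothetical variable name):

```ts
// Read the secret from the environment; never hardcode it.
const apiKey = process.env.STRIPE_API_KEY;
if (!apiKey) {
  throw new Error("STRIPE_API_KEY is not set"); // fail fast at boot, not mid-request
}
```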
Rate limiting / abuse
An endpoint that's expensive (sending email, calling an LLM, hitting your DB hard) without a rate limit can be turned into a denial-of-service or cost-attack vector. Defense: rate limit by user, by IP, and globally. Most platforms (Cloudflare, Vercel) offer this at the edge.
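A toy fixed-window limiter to show the idea. It is in-memory only, so it resets on restart and doesn't work across multiple servers; real deployments use a shared store like Redis or the platform's edge limits.

```ts
// Fixed-window rate limiter, in-memory sketch.
const hits = new Map<string, { count: number; windowStart: number }>();

function allow(key: string, limit = 10, windowMs = 60_000): boolean {
  const now = Date.now();
  const entry = hits.get(key);
  if (!entry || now - entry.windowStart >= windowMs) {
    hits.set(key, { count: 1, windowStart: now }); // new window
    return true;
  }
  entry.count += 1;
  return entry.count <= limit;
}

// allow("user:4321") returns true for the first 10 calls per minute,
// then false until the window rolls over.
```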
Error handling — graceful degradation
Two failure modes:
- Hard failures
- Code throws, the page crashes, the user sees a blank screen or an unhelpful "something went wrong." Worst experience.
- Soft failures (graceful degradation)
- Code catches the error, logs it, and shows the user something useful — a fallback UI, a retry button, a clear message. The system stays usable.
The patterns engineers use to handle errors well:
- Try/catch boundaries at the right level — not too low (you swallow errors silently), not too high (one error takes down the page).
- Error boundaries in React — let part of the UI fail without unmounting the whole app.
- Retry with backoff for transient failures — the API was momentarily down; try again in 1s, then 2s, then 4s before giving up (see the sketch after this list).
- Idempotency for operations that might be retried — calling the endpoint twice should be safe. Especially critical for payments.
- Fail loudly to engineers, gently to users. Log full stack traces; show users a friendly message.
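A minimal sketch of retry with exponential backoff, matching the 1s/2s/4s schedule above. Note it is only safe to wrap around idempotent operations:

```ts
// Retry a transient operation with exponential backoff: 1s, 2s, 4s.
// Only wrap idempotent operations; retrying a non-idempotent call
// (like a payment) without an idempotency key can double-charge.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of attempts: fail loudly
      const delayMs = 1000 * 2 ** i; // 1000, 2000, 4000...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("unreachable"); // satisfies the type checker
}

// Usage: await withRetry(() => fetch("https://api.example.com/health"));
```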
Logs — the breadcrumbs
A log is a line of text your code writes when something interesting happens. "User 4321 signed up." "Payment failed: card declined." "Query took 2.4s." Logs are the breadcrumbs you follow when something goes wrong in production.
Log levels (you'll see these everywhere):
| Level | For | Example |
|---|---|---|
| debug | Verbose, dev-only detail | "Cache lookup for key user:42" |
| info | Normal lifecycle events | "User 4321 signed up" |
| warn | Unusual but recoverable | "Retried payment 2x before success" |
| error | Something failed; needs attention | "Failed to send email to user 4321: connection timeout" |
| fatal | Process is going to die | "Database unreachable, shutting down" |
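A sketch of how those levels work in practice. Real projects usually reach for a library like pino or winston, but the shape is the same:

```ts
type Level = "debug" | "info" | "warn" | "error" | "fatal";
const rank: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3, fatal: 4 };

// Drop debug noise in production; keep everything in dev.
const minLevel: Level = process.env.NODE_ENV === "production" ? "info" : "debug";

function log(level: Level, message: string, context: object = {}) {
  if (rank[level] < rank[minLevel]) return;
  // Structured JSON to stdout; the platform's log viewer indexes it.
  console.log(JSON.stringify({ level, message, ...context, ts: new Date().toISOString() }));
}

log("info", "User signed up", { userId: 4321 });
log("debug", "Cache lookup", { key: "user:42" }); // silent in production
```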
Where logs go:
- Stdout/stderr in dev. The terminal.
- Hosting platform's log viewer in production (Cloudflare, Vercel, Render, Railway all have one).
- Centralized log services for serious operations: Datadog, Better Stack (formerly Logtail), ELK (Elasticsearch + Logstash + Kibana). They aggregate, search, alert.
Monitoring & observability
Monitoring rests on three kinds of signal, often called the three pillars of observability:
- Logs
- Discrete events. "What happened?"
- Metrics
- Numerical values over time. Request rate, error rate, latency percentiles, CPU usage. "How's the system doing?"
- Traces
- The path of a single request through your system — the request hit the API, which queried the DB, which called an external service. "Where did the time go?"
The tools you'll meet:
- Sentry — error tracking. When code throws in production, Sentry catches it, deduplicates, and notifies you. The first thing most projects add when they go live.
- Datadog — full APM (application performance monitoring): metrics, logs, traces, dashboards. Heavyweight; common at scale.
- Grafana — dashboards on top of any data source. Often paired with Prometheus for metrics.
- OpenTelemetry — vendor-neutral standard for emitting telemetry. Lets you switch backends without re-instrumenting. (A trace sketch follows this list.)
- PostHog, Plausible, Fathom — product analytics + lightweight observability for indie projects.
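To make traces concrete, a sketch using OpenTelemetry's API (assuming the `@opentelemetry/api` package; spans only go anywhere once an SDK and exporter are configured separately):

```ts
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service"); // hypothetical service name

async function handleCheckout(cartId: string) {
  // Each unit of work becomes a span; nested calls become child spans,
  // so the finished trace shows where the time went.
  return tracer.startActiveSpan("handleCheckout", async (span) => {
    try {
      span.setAttribute("cart.id", cartId);
      // ... query the DB, call the payment provider ...
    } finally {
      span.end(); // always close the span, even on error
    }
  });
}
```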
Vocabulary:
- p50, p95, p99 latency
- The 50th, 95th, and 99th percentiles of response time. p50 is the median. p99 is the threshold the slowest 1% of requests exceed — often the number that matters most, because the worst experiences are what people remember. (A worked example follows this list.)
- SLI / SLO / SLA
- Service Level Indicator (the metric you measure), Service Level Objective (the target), Service Level Agreement (the contractual promise to customers).
- Alert
- Automated notification when a metric crosses a threshold. Bad alerts wake people up at 3am for things that don't matter; good alerts only fire on real problems.
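A small worked example of what the percentile vocabulary means, using the nearest-rank method:

```ts
// Nearest-rank percentile over a list of response times.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

const latencies = [12, 15, 14, 13, 980, 16, 14, 15, 13, 14]; // one slow outlier

percentile(latencies, 50); // 14: the median looks healthy
percentile(latencies, 99); // 980: the tail tells the real story
```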
Performance
"Make it fast" is too vague. Performance work happens in three places, each with different tools:
- Frontend / page load. Time-to-first-paint, time-to-interactive, layout shift. Tools: Lighthouse, WebPageTest, Chrome DevTools. The metrics that matter: LCP (largest contentful paint), INP (interaction to next paint), CLS (cumulative layout shift). Google calls these the Core Web Vitals.
- Backend / API. Response time per endpoint. Tools: APM (Datadog, New Relic), platform analytics, custom logging. The big lever is usually the database query — slow endpoints are usually slow queries.
- Database. Indexes, query plans, N+1 problems (one query for the list, then N queries — one per item — defeating the point; see the sketch after this list). Tools: `EXPLAIN` in Postgres, slow query logs, ORM-aware tools.
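The N+1 shape and its fix, with a hypothetical `db` standing in for your ORM:

```ts
type Post = { id: number; authorId: number };
type Author = { id: number; name: string };
declare const db: {
  listPosts(): Promise<Post[]>;
  getAuthor(id: number): Promise<Author>;       // one query per call
  getAuthors(ids: number[]): Promise<Author[]>; // one query: WHERE id IN (...)
};

// N+1: one query for the list, then one more per post.
async function slow() {
  const posts = await db.listPosts();
  return Promise.all(
    posts.map(async (p) => ({ ...p, author: await db.getAuthor(p.authorId) }))
  );
}

// Two queries total: batch-fetch the authors, join in memory.
async function fast() {
  const posts = await db.listPosts();
  const authors = await db.getAuthors([...new Set(posts.map((p) => p.authorId))]);
  const byId = new Map(authors.map((a) => [a.id, a]));
  return posts.map((p) => ({ ...p, author: byId.get(p.authorId) }));
}
```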
The order to optimize in: measure first. "Premature optimization is the root of all evil" is a real engineering quote (Knuth) and it's accurate. You can spend a week speeding up code that's never the bottleneck. Profile, find what's actually slow, fix that.
Scaling
Two flavors:
- Vertical scaling (scale up)
- Bigger machine. More CPU, more RAM. Simple — change a setting, pay more, done. Hits a ceiling at the largest single machine available.
- Horizontal scaling (scale out)
- More machines. Distribute load across them. No theoretical ceiling, but introduces complexity: load balancers, shared state, distributed databases.
Where things bottleneck, in order:
- Database. Almost always first. One Postgres can handle a lot, but eventually you need read replicas, caching, sharding, or a different DB.
- Compute. Slow endpoints, CPU-bound work, hot loops. Mitigated by horizontal scaling and async/queues.
- Memory. Leaky caches, in-memory queues. Solved by external services (Redis) and discipline.
- Network. Bandwidth in/out. Mitigated by CDN, compression, and not shipping huge payloads.
Technical debt
Code you wrote knowing it wasn't quite right, with a plan to come back. The metaphor is financial — short-term shortcut, long-term interest. Like financial debt, some is wise (taken on deliberately to ship faster), some is unwise (accidental, accumulated by neglect).
The categories:
- Strategic debt. "We're hardcoding this list because the proper admin UI takes a week. We'll build it after launch." Healthy if you actually pay it back.
- Reluctant debt. "We don't have time to refactor this; we know it's bad." Tolerated. Address opportunistically.
- Bit-rot. The code was fine; the world moved. The library deprecated the function; the framework changed conventions; the OS updated. Maintenance work, not glamorous.
- Accidental debt. Bad code shipped because no one knew better, or no one looked. The kind that creeps in without a decision.
How to talk about it:
- "We owe ourselves X." Names the debt as something to be paid.
- "Every quarter we spend N% on debt." Most working teams allocate 10–25% of capacity to maintenance and refactoring.
- "This is a leaky abstraction." The interface promises one thing, the implementation reveals another. You can feel where it leaks.
Code review culture
Past a certain size, the difference between a healthy team and a struggling one is largely visible in how they review code. Patterns of healthy review:
- Small PRs. Reviewers do better work on diffs they can hold in their head — under ~400 lines is a good target.
- Fast turnaround. A PR open for three days is a tax on the author and the team. Most strong teams aim for review within four hours during work time.
- "Why" comments. Reviewers explain reasoning, not just demand changes. "Use X here" is okay; "Use X here because Y bites us when..." is much better.
- Author leaves their own comments. "I'm not sure about this; happy to change." Lowers the temperature, surfaces uncertainty.
- Disagreement is normal. Two reasonable engineers disagree about implementation regularly. The pattern is "disagree, discuss, decide, document the decision."
- The author is responsible for landing. Reviewers comment; the author resolves and merges. Not the other way around.
Incident response
Something is broken in production. Real users are affected. What do you do?
Acknowledge
The alert fired. Someone is looking at it. Update a status page or internal channel so others know it's being handled.
Assess severity
How many users are affected? Is data at risk? Is money on the line? Sev levels (1=critical, 4=minor) shape urgency and who to wake up.
Stop the bleeding
The first job is not fix it — it's stop making it worse. Roll back the recent deploy. Disable the broken feature. Rate-limit the misbehaving endpoint. Buy yourself time.
Communicate
Customers should hear it from you, not from Twitter. A public status page update — even "we're aware and investigating" — is far better than silence.
Investigate
What changed? Recent deploys, traffic spikes, dependencies, infra. Logs, metrics, traces.
Fix it forward or restore
If rollback fixed it, you have time to find the real cause without pressure. If not, you fix and ship.
Resolve and update
System is healthy. Status page returns to green. Begin the postmortem.
Postmortems
After an incident, the team writes a postmortem — a document explaining what happened, why, and how to prevent it next time. Healthy postmortem culture is one of the strongest predictors of a team that gets reliably better.
The structure that works:
- Summary
- Two sentences. What happened, what was the impact, when was it resolved.
- Timeline
- Minute-by-minute log of who saw what when. Often automatic from chat logs and incident tools.
- Root cause
- The chain of events that led here. Often there are several contributing causes, not one.
- Impact
- Users affected, money lost, data at risk, time to restore.
- What went well
- The detection was fast. The on-call responded in two minutes. The runbook worked.
- What didn't
- The alert fired but went to the wrong channel. The deploy lacked a feature flag.
- Action items
- Concrete, owned, with deadlines. "Add feature flag to deploy script — owner: X — by: next Friday."
Wrap-up
Jargon recap
- XSS / CSRF / SQL injection
- Three common attack categories. Recognize them by name.
- OWASP Top 10
- The canonical list of web vulnerabilities. Look it up; don't memorize.
- Logs / metrics / traces
- The three pillars of observability.
- Sentry
- Error monitoring. The first observability tool most projects add.
- p50, p95, p99
- Latency percentiles. p99 is what users remember.
- Core Web Vitals
- LCP, INP, CLS. Google's frontend performance metrics.
- N+1 query
- 1 query for the list, then N more — one per item. Classic perf bug.
- Vertical / horizontal scaling
- Bigger machine vs. more machines.
- Tech debt
- Code shipped knowing it wasn't quite right. Strategic when paid back; rotting when not.
- Sev 1–4
- Severity levels for incidents. Sev 1 is everyone-wakes-up.
- Rollback
- Revert the deploy. The first move in most incidents.
- Postmortem
- Blameless writeup after an incident. The system, not the person.
You should now be able to
- Name the common attack categories (XSS, CSRF, SQL injection, IDOR) and the standard defense for each.
- Tell logs, metrics, and traces apart, and read p50/p95/p99 latency figures.
- Walk through an incident: acknowledge, assess severity, stop the bleeding, communicate, investigate, resolve.
- Talk about technical debt, code review, and postmortems in the vocabulary teams actually use.
Mini-exercise
Pick a recent outage of a famous company that had a public postmortem (Cloudflare, GitHub, AWS, OpenAI all publish them). Read the writeup. Then, in your own words, write the timeline, root cause, and action items as if you were on that team. Notice what details mattered, what jargon they used, and how the tone stayed blameless even when something serious happened.