Level 08 · 120 → 140 · Yardmaster path
SysOps
What's actually happening on the metal. Linux, networking, DNS deeper, load balancing, storage, backups & DR, capacity planning, FinOps, compliance at concept-level, on-call hygiene, vulnerability management. The deeper layer that lets you reason about a real outage instead of just reading dashboards.
What SysOps actually is
SysOps (systems operations) is the practice of keeping the infrastructure underneath your application alive, performant, secure, and cost-effective. Where DevOps is about how you ship, SysOps is about where the shipped thing actually runs.
For most modern teams on managed platforms (Cloudflare, Vercel, Render), the SysOps work is mostly outsourced to the platform. You don't tune the kernel. But the day something breaks, the gap between fluent and lost is huge — and it's all SysOps fluency. This level builds that.
Linux fundamentals — what to recognize in a terminal
Almost every server you'll ever touch runs Linux. You don't need to master it; you need to read it. The vocabulary:
- Shell
- The program that takes commands and runs them. Default is bash or zsh. The thing you type into.
- Process
- A running program. Has a PID (process ID), an owner, a state. `ps aux` lists them; `top`/`htop` shows them live.
- Signal
- A message sent to a process. `SIGTERM` ("please stop"), `SIGKILL` ("stop now"), `SIGHUP` ("reload config"). `kill` sends them.
- Permissions
- Every file has an owner, a group, and a mode (rwxrwxrwx). `chmod` changes the mode; `chown` changes the owner.
- Package manager
- Installs software. `apt` on Ubuntu/Debian, `yum`/`dnf` on RHEL/Fedora, `brew` on macOS, `apk` on Alpine.
- systemd
- The init system on most modern Linux. Manages services. `systemctl status nginx`, `systemctl restart nginx` — you'll do this in real life.
- cron
- The scheduler. Runs jobs at intervals. `crontab -e` opens the editor. Reading a cron line is a rite of passage.
- Pipes & redirection
- `|` sends the output of one command into another. `>` writes to a file. `cat access.log | grep 500 | wc -l` = "count 500-error lines."
- SSH
- Secure shell. The way you log into remote servers. Keys (not passwords) for authentication.
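Reading a cron line, as the list above promises, is mostly knowing the five time fields: minute, hour, day-of-month, month, day-of-week. A sketch with a hypothetical backup job:

```shell
# ┌ minute  ┌ hour  ┌ day of month  ┌ month  ┌ day of week (0 = Sunday)
# 30        2       *               *        1     /usr/local/bin/backup.sh
# = "02:30 every Monday". A quick decoder for the time fields:
line='30 2 * * 1 /usr/local/bin/backup.sh'
echo "$line" | awk '{printf "runs at %02d:%02d, day-of-week %s -> %s\n", $2, $1, $5, $6}'
# prints: runs at 02:30, day-of-week 1 -> /usr/local/bin/backup.sh
```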
```shell
# A real day-of-incident scratch — no need to memorize.
# What's running and using CPU?
top -b -n 1 | head -20
# Tail the app logs in real time
sudo journalctl -u tasklane.service -f
# Find recent error lines
grep -i "error\|exception" /var/log/tasklane/app.log | tail -50
# How much disk space is left?
df -h
# Restart the service
sudo systemctl restart tasklane
# Open ports?
sudo ss -tlnp
```
Networking — TCP/IP, VPCs, security groups
The mental model: every network communication is one machine talking to another over an IP address on a specific port, using a protocol (TCP or UDP).
- IP address — the machine's network address. IPv4 (`192.0.2.45`) or IPv6 (`2001:db8::1`).
- Port — the socket on that machine. HTTP is 80, HTTPS is 443, SSH is 22, Postgres is 5432.
- TCP — reliable, ordered, connection-oriented. Used by HTTP, SSH, most things.
- UDP — fire-and-forget, fast, unreliable. Used by DNS lookups, video streaming.
- Firewall — rules controlling which traffic is allowed through. "Allow port 443 from anywhere; allow port 22 only from the office IP."
- NAT — Network Address Translation. Your laptop's local IP isn't visible on the internet; your router rewrites packets so the world sees one shared IP.
- VPN — Virtual Private Network. A tunnel that makes your traffic appear to come from somewhere else, encrypted end-to-end.
Cloud networking — VPCs, subnets, security groups
Inside any cloud (AWS, GCP, Azure), you build a virtual network called a VPC (Virtual Private Cloud). You divide it into subnets — typically a public subnet (load balancers, bastion hosts) and private subnets (databases, app servers) per availability zone. Security groups are stateful firewalls attached to instances, controlling inbound and outbound traffic at the level of "this instance can talk to that instance on port 5432."
DNS, deeper — recursion, TTLs, glue, propagation
L4 introduced DNS as the internet's phone book. At this level, you understand how the lookup actually works — because when DNS is the cause of an outage (it often is), you'll need to read what's happening.
The recursion
When a browser asks for tasklane.example, the lookup walks a tree:
- Root nameservers — the `.` at the top. They know who handles `.example`.
- TLD nameservers — for `.example`, `.com`, `.io`. They know who handles `tasklane.example`.
- Authoritative nameservers — for the specific domain. They have the actual record (the A, CNAME, MX, etc.).
The browser doesn't do this walk itself — it asks a resolver (your ISP's, or 1.1.1.1, or 8.8.8.8) which does it on its behalf and caches the answer.
TTLs and propagation
Every DNS record has a TTL (time to live) in seconds. Resolvers cache the answer for that long before re-checking. A TTL of 300 means changes take up to 5 minutes to be visible everywhere; 86400 means up to a day.
"DNS propagation" is just resolvers getting a fresh answer as their cached one expires. There's no global flag-flip; it's millions of independent caches updating at different times.
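The caching math is why teams lower TTLs before a planned cutover: the worst case is bounded by whatever TTL the *old* record had. A back-of-envelope sketch (the times are illustrative):

```shell
old_ttl=86400   # a day-long TTL on the record you're about to change
# A resolver that cached the old answer just before your change
# may serve the old IP for up to the full TTL:
echo "worst-case staleness: $(( old_ttl / 3600 )) hours"

# The standard move: drop the TTL to 300 well before the cutover,
# wait out the old TTL so every cache has picked up the short one,
# then change the record. Worst case is now:
lowered=300
echo "after lowering: $(( lowered / 60 )) minutes"
```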
Common failure modes
- Stale cache — a resolver still has the old IP. Solution: wait, or flush DNS locally (`sudo dscacheutil -flushcache` on macOS).
- Wrong nameservers at the registrar — you set up Cloudflare DNS but never updated your domain's nameservers at the registrar. Records exist in Cloudflare but no one looks there.
- Glue records missing — the registrar needs to know the IP of your nameservers if they're inside the same domain.
- SPF / DKIM / DMARC misconfigured — email-related TXT records. Get these wrong and your emails go to spam.
Tools you'll reach for: dig, nslookup, online tools like dnschecker.org for "what does this record look like from 30 places worldwide?"
Load balancing — L4 vs L7, ALB vs NLB
A load balancer sits in front of multiple instances of your service and distributes incoming requests. It also handles failover (a sick instance is removed from rotation), TLS termination, and often other things.
| Layer | Operates at | Use for | AWS name |
|---|---|---|---|
| L4 (transport) | TCP/UDP packets — doesn't understand HTTP | Anything non-HTTP, ultra-low latency, raw TCP | NLB (Network Load Balancer) |
| L7 (application) | HTTP requests — sees URLs, headers, methods | Web apps, path-based routing, header rules | ALB (Application Load Balancer) |
You'll mostly meet L7. It can route tasklane.com/api/* to one service and tasklane.com/* to another, all through the same DNS name.
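That path-based split can be sketched as an Nginx config. The upstream names and addresses here are hypothetical, and TLS termination is omitted for brevity:

```nginx
# Hypothetical pools — the L7 split described above.
upstream api_pool { server 10.0.2.10:3000; server 10.0.2.11:3000; }
upstream web_pool { server 10.0.2.20:3000; }

server {
    listen 80;
    server_name tasklane.com;

    # Longest-prefix match: /api/* goes to the API pool...
    location /api/ { proxy_pass http://api_pool; }
    # ...everything else goes to the web pool.
    location /     { proxy_pass http://web_pool; }
}
```

A request to `tasklane.com/api/users` lands on `api_pool`; `tasklane.com/dashboard` lands on `web_pool` — one DNS name, two services.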
Self-hosted alternatives:
- Nginx — the workhorse. Reverse proxy + load balancer + web server. Configuration is its own art form.
- HAProxy — even faster, even more configurable, reputation for absolute reliability.
- Caddy — modern, automatic HTTPS, simpler config. Newer; growing fast.
- Traefik — cloud-native, integrates with Docker and Kubernetes, auto-discovers services.
- Envoy — the proxy underneath service meshes (Istio, Linkerd). High-end use cases.
Health checks
The load balancer polls each instance (e.g. GET /health) on an interval. Instances that fail the health check are removed from rotation. What the endpoint actually checks matters: shallow health checks (does the process answer?) versus deep health checks (can it reach the database?). Both have failure modes: a shallow check can keep a broken instance in rotation, while a deep check that fails when a shared database blips can pull every instance out at once. Use them deliberately.
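The removal-from-rotation logic, in miniature. This is a toy: the instance IPs are hypothetical, and the `healthy` function stands in for a real GET /health probe:

```shell
instances="10.0.1.5 10.0.1.6 10.0.1.7"
healthy() { [ "$1" != "10.0.1.6" ]; }   # pretend .6 fails its health check

# Keep only instances that pass — the load balancer's core loop.
live=""
for i in $instances; do healthy "$i" && live="$live $i"; done
echo "in rotation:$live"
# prints: in rotation: 10.0.1.5 10.0.1.7
```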
Storage — block, object, file
Three categories of cloud storage, with different jobs:
- Block storage
- Acts like a hard drive. Mounted to one instance at a time. Low latency, suitable for databases. AWS: EBS. GCP: Persistent Disk.
- Object storage
- Files identified by keys, accessed over HTTP. Massively scalable, cheap, durable. Suitable for user uploads, backups, static assets. AWS: S3. GCP: GCS. Azure: Blob Storage. Cloudflare: R2.
- File storage
- Network file system shared across many instances. POSIX-compatible. Slower than block. AWS: EFS. GCP: Filestore.
The decision tree most teams use: databases on block, user files and backups on object, almost nothing on file.
Durability vs availability vs consistency
Three different guarantees that get casually confused:
- Durability
- Once written, will it stay there? S3 is designed for 11 nines of durability (99.999999999%); in practice it will not lose your data.
- Availability
- Can you access it right now? S3's availability SLA is 99.9% — occasional brief windows where you can't reach it.
- Consistency
- If two readers ask, do they get the same answer? Strong consistency = always; eventual consistency = "soon."
Backups & disaster recovery — RPO, RTO, the "tested" distinction
A team that says "we have backups" without being specific is one bad day from learning what they actually have. The vocabulary:
- RPO (Recovery Point Objective)
- How much data can you afford to lose? An RPO of 1 hour means backups every hour, worst case losing 60 minutes of writes. RPO of 0 means continuous replication.
- RTO (Recovery Time Objective)
- How long can you be down? An RTO of 4 hours means you must be able to restore service within 4 hours.
- Tested backups
- Backups you have actually restored from. Untested backups are a hope, not a plan.
- 3-2-1 rule
- Three copies, on two different media, one off-site. Old advice, still good.
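The "tested" distinction in miniature: back up, restore somewhere else, and prove the copies match. This sketch uses `tar` and a local directory as stand-ins for your real backup tool and target:

```shell
mkdir -p data && echo "important row" > data/file.txt
tar czf backup.tgz data                             # take the backup
mkdir -p restore && tar xzf backup.tgz -C restore   # restore it elsewhere
diff -r data restore/data && echo "restore verified"   # prove it matches
rm -rf data restore backup.tgz                      # clean up the drill
```

The same shape scales up: restore last night's database dump into a scratch instance and run a sanity query. If you've never done it, you don't have backups — you have hope.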
Capacity planning & autoscaling
Two questions: how much do you need on average, and how much do you need at peak. Most systems run idle most of the time and break under spikes.
Autoscaling solves the average-vs-peak problem by adjusting capacity based on a metric. The triggers you'll meet:
- CPU-based — "If CPU > 70% for 5 minutes, add an instance." The default. Often fine, sometimes deceptive (a fast crash leaves CPU low while everything is broken).
- Request-rate-based — scale on requests per second. Lags slightly but more directly tied to user load.
- Queue-depth-based — "If the SQS queue has > 500 messages, add a worker." The right move for async workloads.
- Schedule-based — "Scale up before the morning rush; scale down at midnight." Works when traffic is predictable.
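The queue-depth trigger above is simple arithmetic: ceiling-divide the backlog by what one worker can clear. A sketch with made-up numbers:

```shell
queue_depth=720      # messages waiting (hypothetical)
per_worker=100       # messages one worker clears per scaling interval
current_workers=5

# Ceiling division: workers needed to drain the backlog this interval.
desired=$(( (queue_depth + per_worker - 1) / per_worker ))
if [ "$desired" -gt "$current_workers" ]; then
  echo "scale up: $current_workers -> $desired workers"
fi
# prints: scale up: 5 -> 8 workers
```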
FinOps — the cloud bill is a behavior
FinOps is the practice of treating cloud cost as an engineering concern, not just a finance concern. The cloud bill is the consequence of thousands of small architectural decisions; engineers write the checks, even though finance signs them.
The line items that surprise teams:
- Egress (data leaving the cloud)
- Inbound is free; outbound costs money. A chatty integration shipping gigabytes daily is a hidden ongoing tax. This is the famous cloud "exit tax."
- Idle resources
- The dev environment that's been "temporarily" running for two years. The orphan EBS volume left behind by a deleted instance. The over-provisioned RDS instance. Idle and over-provisioned resources commonly account for 20–40% of a cloud bill.
- NAT gateway charges
- Charged per GB processed. A talkative app sitting behind one can quietly accrue a four-figure monthly line.
- Logging & metrics
- CloudWatch, Datadog, similar. Verbose logging sends gigabytes per day. Tame log volume early.
- Cross-AZ traffic
- Within the same region but different availability zones — still costs egress. Architectures that bounce traffic between AZs unnecessarily pay for it.
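A back-of-envelope for the NAT gateway line item. The rates here are illustrative placeholders, not current pricing — the point is that per-GB charges compound quietly:

```shell
gb_per_day=1000   # a chatty service pushing ~1 TB/day through the NAT
awk -v gb="$gb_per_day" 'BEGIN {
  per_gb = 0.045   # hypothetical $/GB processed
  per_hr = 0.045   # hypothetical $/hour the gateway exists
  monthly = gb * 30 * per_gb + 24 * 30 * per_hr
  printf "~$%.0f/month\n", monthly
}'
# prints: ~$1382/month
```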
The tools: AWS Cost Explorer, GCP's billing reports, third parties like Vantage, CloudHealth, Cloudability. The discipline: tag everything (so cost can be attributed), review the bill monthly, set budgets and alerts.
Compliance at concept-level
You don't need to be a lawyer; you need to recognize what each acronym is for and which apply to a given app.
| Standard | Applies to | The summary |
|---|---|---|
| SOC 2 | SaaS companies, especially B2B | Audited assurance that you handle customer data responsibly. Type I = "you have controls"; Type II = "you've followed them over an audit period, typically 3–12 months." Gate to enterprise sales. |
| ISO 27001 | Companies selling internationally, especially to enterprises | Information security management standard; the rough international counterpart of SOC 2. Often required by non-US enterprise customers. |
| GDPR | Anyone serving EU residents | EU data protection law. Right to access, delete, export. Real fines. |
| CCPA / CPRA | Anyone serving California residents | US analog of GDPR. Less aggressive but real. |
| HIPAA | US healthcare-adjacent apps | Protected health information. Specific encryption, audit, access requirements. |
| PCI DSS | Apps that touch credit cards | Payment card industry data security. Most teams use Stripe to avoid handling card data themselves and stay out of scope. |
| FedRAMP, IRAP, similar | Apps selling to specific governments | Heavy. Specific to the customer. |
On-call hygiene
On-call is the practice of having someone responsible for responding to incidents 24/7. Healthy on-call is a discipline; unhealthy on-call burns out engineers fast.
The structure:
- Rotation — engineers take turns. Weekly or monthly. Length matters; too short = no continuity, too long = burnout.
- Primary / secondary — primary gets paged first; secondary backs them up if primary doesn't acknowledge in N minutes.
- Runbook — a document for each known alert: what it means, how to diagnose, how to resolve. Reduces the cognitive load when paged at 3am.
- Handoff — at the end of a shift, a written summary of what happened. So the next on-call starts informed.
- Time-in-lieu — pages outside business hours earn time off. Compensation for sleep disruption. Healthy teams take this seriously.
The tools you'll meet:
- PagerDuty — the category leader. Schedules, escalations, integrations with everything.
- Opsgenie (Atlassian) — strong alternative.
- incident.io, FireHydrant, Rootly — newer, incident-focused. Some include retro/postmortem tooling.
- Slack + Statuspage — many smaller teams build their own out of these.
Vulnerability management
You ship far more code than you write. A typical Node project has ~1,500 transitive dependencies. Every one of them is a potential vulnerability — and the only way to keep up is automation.
The tools:
- Dependabot (GitHub, free) — opens PRs to bump vulnerable dependencies. The default for projects on GitHub.
- Renovate — like Dependabot, more configurable. Self-hostable.
- Snyk — commercial. Code, dependencies, container images, IaC — all in one dashboard.
- Trivy — open-source scanner for container images and IaC.
- GitHub Advanced Security — secret scanning, code scanning, dependency review. Worth turning on early.
Patch cadence
The tension: patch fast for security versus patch slow for stability. Mature teams settle on:
- Critical CVEs: patch within 24–72 hours.
- High: within a week or two.
- Medium / low: roll into a regular cadence (monthly).
The supply-chain awareness: SBOM (Software Bill of Materials) — a manifest of every component your app contains. US executive order EO 14028 pushed suppliers to the federal government toward providing them. The formats the industry settled on are SPDX and CycloneDX; tools like Trivy and Syft can generate them.
Wrap-up
Jargon recap
- Shell / process / signal
- What you type into / running program / message sent to it.
- systemd / journalctl / cron
- Linux service manager / its log viewer / the scheduler.
- VPC / subnet / security group
- Cloud network / divisions of it / firewall on instances.
- L4 / L7 load balancer
- Operates on TCP packets vs. HTTP requests.
- ALB / NLB / Nginx / HAProxy
- AWS L7 / AWS L4 / general-purpose / high-end.
- Block / object / file storage
- EBS / S3 / EFS shape. Pick by access pattern.
- RPO / RTO
- Recovery point / time objective. How much data, how much downtime.
- Autoscaling triggers
- CPU / requests / queue depth / schedule.
- FinOps
- Cloud cost as engineering. Egress, idle, NAT.
- SOC 2 / ISO 27001 / GDPR / HIPAA / PCI
- The compliance acronyms by what they cover.
- On-call rotation / runbook / handoff
- The structure of healthy on-call.
- PagerDuty / Opsgenie
- The alerting category leaders.
- CVE / SBOM
- Common Vulnerabilities and Exposures / Software Bill of Materials.
- Dependabot / Trivy / Snyk
- Vulnerability scanning across code, deps, containers.
You should now be able to
Mini-exercise
Read a public outage postmortem (Cloudflare, GitHub, AWS, OpenAI all publish them). Identify which of L8's topics show up — DNS, load balancing, autoscaling, capacity, runbooks. Notice which were the cause and which were the saving grace. Real outages are the best textbook for SysOps.