Level 08 · 120 → 140 · Yardmaster path
SysOps
What's actually happening on the metal. Linux, networking, DNS deeper, load balancing, storage, backups & DR, capacity planning, FinOps, compliance at concept-level, on-call hygiene, vulnerability management. The deeper layer that lets you reason about a real outage instead of just reading dashboards.
What SysOps actually is
SysOps (systems operations) is the practice of keeping the infrastructure underneath your application alive, performant, secure, and cost-effective. Where DevOps is about how you ship, SysOps is about where the shipped thing actually runs.
For most modern teams on managed platforms (Cloudflare, Vercel, Render), the SysOps work is mostly outsourced to the platform. You don't tune the kernel. But the day something breaks, the gap between fluent and lost is huge — and it's all SysOps fluency. This level builds that.
Linux fundamentals — what to recognize in a terminal
Almost every server you'll ever touch runs Linux. You don't need to master it; you need to read it. The vocabulary:
- Shell
- The program that takes commands and runs them. Default is bash or zsh. The thing you type into.
- Process
- A running program. Has a PID (process ID), an owner, a state. `ps aux` lists them; `top`/`htop` shows them live.
- Signal
- A message sent to a process. `SIGTERM` ("please stop"), `SIGKILL` ("stop now"), `SIGHUP` ("reload config"). `kill` sends them.
- Permissions
- Every file has an owner, a group, and a mode (rwxrwxrwx). `chmod` changes the mode; `chown` changes the owner.
- Package manager
- Installs software. `apt` on Ubuntu/Debian, `yum`/`dnf` on RHEL/Fedora, `brew` on macOS, `apk` on Alpine.
- systemd
- The init system on most modern Linux. Manages services. `systemctl status nginx`, `systemctl restart nginx` — you'll do this in real life.
- cron
- The scheduler. Runs jobs at intervals. `crontab -e` opens the editor. Reading a cron line is a rite of passage.
- Pipes & redirection
- `|` sends the output of one command into another. `>` writes to a file. `cat access.log | grep 500 | wc -l` = "count 500-error lines."
- SSH
- Secure shell. The way you log into remote servers. Keys (not passwords) for authentication.
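Reading a cron line, as the list above promises, is mostly knowing the five time fields: minute, hour, day-of-month, month, day-of-week. A sketch with a hypothetical backup job:

```shell
# ┌ minute  ┌ hour  ┌ day of month  ┌ month  ┌ day of week (0 = Sunday)
# 30        2       *               *        1     /usr/local/bin/backup.sh
# = "02:30 every Monday". A quick decoder for the time fields:
line='30 2 * * 1 /usr/local/bin/backup.sh'
echo "$line" | awk '{printf "runs at %02d:%02d, day-of-week %s -> %s\n", $2, $1, $5, $6}'
# prints: runs at 02:30, day-of-week 1 -> /usr/local/bin/backup.sh
```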
```shell
# A real day-of-incident scratch — no need to memorize.
# What's running and using CPU?
top -b -n 1 | head -20
# Tail the app logs in real time
sudo journalctl -u tasklane.service -f
# Find recent error lines
grep -i "error\|exception" /var/log/tasklane/app.log | tail -50
# How much disk space is left?
df -h
# Restart the service
sudo systemctl restart tasklane
# Open ports?
sudo ss -tlnp
```
Networking — TCP/IP, VPCs, security groups
The mental model: every network communication is one machine talking to another over an IP address on a specific port, using a protocol (TCP or UDP).
- IP address — the machine's network address. IPv4 (`192.0.2.45`) or IPv6 (`2001:db8::1`).
- Port — the socket on that machine. HTTP is 80, HTTPS is 443, SSH is 22, Postgres is 5432.
- TCP — reliable, ordered, connection-oriented. Used by HTTP, SSH, most things.
- UDP — fire-and-forget, fast, unreliable. Used by DNS lookups, video streaming.
- Firewall — rules controlling which traffic is allowed through. "Allow port 443 from anywhere; allow port 22 only from the office IP."
- NAT — Network Address Translation. Your laptop's local IP isn't visible on the internet; your router rewrites packets so the world sees one shared IP.
- VPN — Virtual Private Network. A tunnel that makes your traffic appear to come from somewhere else, encrypted end-to-end.
Cloud networking — VPCs, subnets, security groups
Inside any cloud (AWS, GCP, Azure), you build a virtual network called a VPC (Virtual Private Cloud). You divide it into subnets — typically a public subnet (load balancers, bastion hosts) and private subnets (databases, app servers) per availability zone. Security groups are stateful firewalls attached to instances, controlling inbound and outbound traffic at the level of "this instance can talk to that instance on port 5432."
DNS, deeper — recursion, TTLs, glue, propagation
L4 introduced DNS as the internet's phone book. At this level, you understand how the lookup actually works — because when DNS is the cause of an outage (it often is), you'll need to read what's happening.
The recursion
When a browser asks for tasklane.example, the lookup walks a tree:
- Root nameservers — the `.` at the top. They know who handles `.example`.
- TLD nameservers — for `.example`, `.com`, `.io`. They know who handles `tasklane.example`.
- Authoritative nameservers — for the specific domain. They have the actual record (the A, CNAME, MX, etc.).
The browser doesn't do this walk itself — it asks a resolver (your ISP's, or 1.1.1.1, or 8.8.8.8) which does it on its behalf and caches the answer.
TTLs and propagation
Every DNS record has a TTL (time to live) in seconds. Resolvers cache the answer for that long before re-checking. A TTL of 300 means changes take up to 5 minutes to be visible everywhere; 86400 means up to a day.
"DNS propagation" is just resolvers getting a fresh answer as their cached one expires. There's no global flag-flip; it's millions of independent caches updating at different times.
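The caching math is why teams lower TTLs before a planned cutover: the worst case is bounded by whatever TTL the *old* record had. A back-of-envelope sketch (the times are illustrative):

```shell
old_ttl=86400   # a day-long TTL on the record you're about to change
# A resolver that cached the old answer just before your change
# may serve the old IP for up to the full TTL:
echo "worst-case staleness: $(( old_ttl / 3600 )) hours"

# The standard move: drop the TTL to 300 well before the cutover,
# wait out the old TTL so every cache has picked up the short one,
# then change the record. Worst case is now:
lowered=300
echo "after lowering: $(( lowered / 60 )) minutes"
```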
Common failure modes
- Stale cache — a resolver still has the old IP. Solution: wait, or flush DNS locally (`sudo dscacheutil -flushcache` on macOS).
- Wrong nameservers at the registrar — you set up Cloudflare DNS but never updated your domain's nameservers at the registrar. Records exist in Cloudflare but no one looks there.
- Glue records missing — the registrar needs to know the IP of your nameservers if they're inside the same domain.
- SPF / DKIM / DMARC misconfigured — email-related TXT records. Get these wrong and your emails go to spam.
Tools you'll reach for: dig, nslookup, online tools like dnschecker.org for "what does this record look like from 30 places worldwide?"
Load balancing — L4 vs L7, ALB vs NLB
A load balancer sits in front of multiple instances of your service and distributes incoming requests. It also handles failover (a sick instance is removed from rotation), TLS termination, and often other things.
| Layer | Operates at | Use for | AWS name |
|---|---|---|---|
| L4 (transport) | TCP/UDP packets — doesn't understand HTTP | Anything non-HTTP, ultra-low latency, raw TCP | NLB (Network Load Balancer) |
| L7 (application) | HTTP requests — sees URLs, headers, methods | Web apps, path-based routing, header rules | ALB (Application Load Balancer) |
You'll mostly meet L7. It can route tasklane.com/api/* to one service and tasklane.com/* to another, all through the same DNS name.
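That path-based split can be sketched as an Nginx config. The upstream names and addresses here are hypothetical, and TLS termination is omitted for brevity:

```nginx
# Hypothetical pools — the L7 split described above.
upstream api_pool { server 10.0.2.10:3000; server 10.0.2.11:3000; }
upstream web_pool { server 10.0.2.20:3000; }

server {
    listen 80;
    server_name tasklane.com;

    # Longest-prefix match: /api/* goes to the API pool...
    location /api/ { proxy_pass http://api_pool; }
    # ...everything else goes to the web pool.
    location /     { proxy_pass http://web_pool; }
}
```

A request to `tasklane.com/api/users` lands on `api_pool`; `tasklane.com/dashboard` lands on `web_pool` — one DNS name, two services.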
Self-hosted alternatives:
- Nginx — the workhorse. Reverse proxy + load balancer + web server. Configuration is its own art form.
- HAProxy — even faster, even more configurable, reputation for absolute reliability.
- Caddy — modern, automatic HTTPS, simpler config. Newer; growing fast.
- Traefik — cloud-native, integrates with Docker and Kubernetes, auto-discovers services.
- Envoy — the proxy underneath service meshes (Istio, Linkerd). High-end use cases.
Health checks
The load balancer polls each instance (e.g. GET /health) on an interval. Instances that fail the health check are removed from rotation. What the endpoint actually checks matters: shallow health checks (does the process answer?) versus deep health checks (can it reach the database?). Both have failure modes: a shallow check can keep a broken instance in rotation, while a deep check that fails when a shared database blips can pull every instance out at once. Use them deliberately.
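The removal-from-rotation logic, in miniature. This is a toy: the instance IPs are hypothetical, and the `healthy` function stands in for a real GET /health probe:

```shell
instances="10.0.1.5 10.0.1.6 10.0.1.7"
healthy() { [ "$1" != "10.0.1.6" ]; }   # pretend .6 fails its health check

# Keep only instances that pass — the load balancer's core loop.
live=""
for i in $instances; do healthy "$i" && live="$live $i"; done
echo "in rotation:$live"
# prints: in rotation: 10.0.1.5 10.0.1.7
```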
Storage — block, object, file
Three categories of cloud storage, with different jobs:
- Block storage
- Acts like a hard drive. Mounted to one instance at a time. Low latency, suitable for databases. AWS: EBS. GCP: Persistent Disk.
- Object storage
- Files identified by keys, accessed over HTTP. Massively scalable, cheap, durable. Suitable for user uploads, backups, static assets. AWS: S3. GCP: GCS. Azure: Blob Storage. Cloudflare: R2.
- File storage
- Network file system shared across many instances. POSIX-compatible. Slower than block. AWS: EFS. GCP: Filestore.
The decision tree most teams use: databases on block, user files and backups on object, almost nothing on file.
Durability vs availability vs consistency
Three different guarantees that get casually confused:
- Durability
- Once written, will it stay there? S3 is designed for 11 nines of durability (99.999999999%); in practice it will not lose your data.
- Availability
- Can you access it right now? S3's availability SLA is 99.9% — occasional brief windows where you can't reach it.
- Consistency
- If two readers ask, do they get the same answer? Strong consistency = always; eventual consistency = "soon."
Backups & disaster recovery — RPO, RTO, the "tested" distinction
A team that says "we have backups" without being specific is one bad day from learning what they actually have. The vocabulary:
- RPO (Recovery Point Objective)
- How much data can you afford to lose? An RPO of 1 hour means backups every hour, worst case losing 60 minutes of writes. RPO of 0 means continuous replication.
- RTO (Recovery Time Objective)
- How long can you be down? An RTO of 4 hours means you must be able to restore service within 4 hours.
- Tested backups
- Backups you have actually restored from. Untested backups are a hope, not a plan.
- 3-2-1 rule
- Three copies, on two different media, one off-site. Old advice, still good.
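The "tested" distinction in miniature: back up, restore somewhere else, and prove the copies match. This sketch uses `tar` and a local directory as stand-ins for your real backup tool and target:

```shell
mkdir -p data && echo "important row" > data/file.txt
tar czf backup.tgz data                             # take the backup
mkdir -p restore && tar xzf backup.tgz -C restore   # restore it elsewhere
diff -r data restore/data && echo "restore verified"   # prove it matches
rm -rf data restore backup.tgz                      # clean up the drill
```

The same shape scales up: restore last night's database dump into a scratch instance and run a sanity query. If you've never done it, you don't have backups — you have hope.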
Capacity planning & autoscaling
Two questions: how much do you need on average, and how much do you need at peak. Most systems run idle most of the time and break under spikes.
Autoscaling solves the average-vs-peak problem by adjusting capacity based on a metric. The triggers you'll meet:
- CPU-based — "If CPU > 70% for 5 minutes, add an instance." The default. Often fine, sometimes deceptive (a fast crash leaves CPU low while everything is broken).
- Request-rate-based — scale on requests per second. Lags slightly but more directly tied to user load.
- Queue-depth-based — "If the SQS queue has > 500 messages, add a worker." The right move for async workloads.
- Schedule-based — "Scale up before the morning rush; scale down at midnight." Works when traffic is predictable.
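The queue-depth trigger above is simple arithmetic: ceiling-divide the backlog by what one worker can clear. A sketch with made-up numbers:

```shell
queue_depth=720      # messages waiting (hypothetical)
per_worker=100       # messages one worker clears per scaling interval
current_workers=5

# Ceiling division: workers needed to drain the backlog this interval.
desired=$(( (queue_depth + per_worker - 1) / per_worker ))
if [ "$desired" -gt "$current_workers" ]; then
  echo "scale up: $current_workers -> $desired workers"
fi
# prints: scale up: 5 -> 8 workers
```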
FinOps — the cloud bill is a behavior
FinOps is the practice of treating cloud cost as an engineering concern, not just a finance concern. The cloud bill is the consequence of thousands of small architectural decisions; engineers write the checks, even though finance signs them.
The line items that surprise teams:
- Egress (data leaving the cloud)
- Inbound is free; outbound costs money. A chatty integration shipping gigabytes daily is a hidden ongoing tax. This is the famous cloud "exit tax."
- Idle resources
- The dev environment that's been "temporarily" running for two years. The orphan EBS volume left behind by a deleted instance. The over-provisioned RDS instance. Idle and over-provisioned resources commonly account for 20–40% of a cloud bill.
- NAT gateway charges
- Charged per GB processed. A talkative app sitting behind one can quietly accrue a four-figure monthly line.
- Logging & metrics
- CloudWatch, Datadog, similar. Verbose logging sends gigabytes per day. Tame log volume early.
- Cross-AZ traffic
- Within the same region but different availability zones — still costs egress. Architectures that bounce traffic between AZs unnecessarily pay for it.
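A back-of-envelope for the NAT gateway line item. The rates here are illustrative placeholders, not current pricing — the point is that per-GB charges compound quietly:

```shell
gb_per_day=1000   # a chatty service pushing ~1 TB/day through the NAT
awk -v gb="$gb_per_day" 'BEGIN {
  per_gb = 0.045   # hypothetical $/GB processed
  per_hr = 0.045   # hypothetical $/hour the gateway exists
  monthly = gb * 30 * per_gb + 24 * 30 * per_hr
  printf "~$%.0f/month\n", monthly
}'
# prints: ~$1382/month
```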
The tools: AWS Cost Explorer, GCP's billing reports, third parties like Vantage, CloudHealth, Cloudability. The discipline: tag everything (so cost can be attributed), review the bill monthly, set budgets and alerts.
Compliance at concept-level
You don't need to be a lawyer; you need to recognize what each acronym is for and which apply to a given app.
| Standard | Applies to | The summary |
|---|---|---|
| SOC 2 | SaaS companies, especially B2B | Audited assurance that you handle customer data responsibly. Type I = "you have controls"; Type II = "you've followed them over an audit period, typically 3–12 months." Gate to enterprise sales. |
| ISO 27001 | Companies selling internationally, especially to enterprises | Information security management standard; the rough international counterpart of SOC 2. Often required by non-US enterprise customers. |
| GDPR | Anyone serving EU residents | EU data protection law. Right to access, delete, export. Real fines. |
| CCPA / CPRA | Anyone serving California residents | US analog of GDPR. Less aggressive but real. |
| HIPAA | US healthcare-adjacent apps | Protected health information. Specific encryption, audit, access requirements. |
| PCI DSS | Apps that touch credit cards | Payment card industry data security. Most teams use Stripe to avoid handling card data themselves and stay out of scope. |
| FedRAMP, IRAP, similar | Apps selling to specific governments | Heavy. Specific to the customer. |
On-call hygiene
On-call is the practice of having someone responsible for responding to incidents 24/7. Healthy on-call is a discipline; unhealthy on-call burns out engineers fast.
The structure:
- Rotation — engineers take turns. Weekly or monthly. Length matters; too short = no continuity, too long = burnout.
- Primary / secondary — primary gets paged first; secondary backs them up if primary doesn't acknowledge in N minutes.
- Runbook — a document for each known alert: what it means, how to diagnose, how to resolve. Reduces the cognitive load when paged at 3am.
- Handoff — at the end of a shift, a written summary of what happened. So the next on-call starts informed.
- Time-in-lieu — pages outside business hours earn time off. Compensation for sleep disruption. Healthy teams take this seriously.
The tools you'll meet:
- PagerDuty — the category leader. Schedules, escalations, integrations with everything.
- Opsgenie (Atlassian) — strong alternative.
- incident.io, FireHydrant, Rootly — newer, incident-focused. Some include retro/postmortem tooling.
- Slack + Statuspage — many smaller teams build their own out of these.
Vulnerability management
You ship far more code than you write. A typical Node project has ~1,500 transitive dependencies. Every one of them is a potential vulnerability — and the only way to keep up is automation.
The tools:
- Dependabot (GitHub, free) — opens PRs to bump vulnerable dependencies. The default for projects on GitHub.
- Renovate — like Dependabot, more configurable. Self-hostable.
- Snyk — commercial. Code, dependencies, container images, IaC — all in one dashboard.
- Trivy — open-source scanner for container images and IaC.
- GitHub Advanced Security — secret scanning, code scanning, dependency review. Worth turning on early.
Patch cadence
The tension: patch fast for security versus patch slow for stability. Mature teams settle on:
- Critical CVEs: patch within 24–72 hours.
- High: within a week or two.
- Medium / low: roll into a regular cadence (monthly).
The supply-chain awareness: SBOM (Software Bill of Materials) — a manifest of every component your app contains. US executive order EO 14028 pushed suppliers to the federal government toward providing them. The formats the industry settled on are SPDX and CycloneDX; tools like Trivy and Syft can generate them.
Wrap-up
Jargon recap
- Shell / process / signal
- What you type into / running program / message sent to it.
- systemd / journalctl / cron
- Linux service manager / its log viewer / the scheduler.
- VPC / subnet / security group
- Cloud network / divisions of it / firewall on instances.
- L4 / L7 load balancer
- Operates on TCP packets vs. HTTP requests.
- ALB / NLB / Nginx / HAProxy
- AWS L7 / AWS L4 / general-purpose / high-end.
- Block / object / file storage
- EBS / S3 / EFS shape. Pick by access pattern.
- RPO / RTO
- Recovery point / time objective. How much data, how much downtime.
- Autoscaling triggers
- CPU / requests / queue depth / schedule.
- FinOps
- Cloud cost as engineering. Egress, idle, NAT.
- SOC 2 / ISO 27001 / GDPR / HIPAA / PCI
- The compliance acronyms by what they cover.
- On-call rotation / runbook / handoff
- The structure of healthy on-call.
- PagerDuty / Opsgenie
- The alerting category leaders.
- CVE / SBOM
- Common Vulnerabilities and Exposures / Software Bill of Materials.
- Dependabot / Trivy / Snyk
- Vulnerability scanning across code, deps, containers.
You should now be able to
Mini-exercise
Read a public outage postmortem (Cloudflare, GitHub, AWS, OpenAI all publish them). Identify which of L8's topics show up — DNS, load balancing, autoscaling, capacity, runbooks. Notice which were the cause and which were the saving grace. Real outages are the best textbook for SysOps.