Operations Without K8s
I’ll share our experience operating production infrastructure without Kubernetes at Kinescope. We moved away from K8s about three years ago and built a simple, reliable, and efficient system based on standard Linux tools: systemd, package managers, and Ansible.
This article describes the reasons for moving away from Kubernetes, system requirements, solution choices, and practical experience operating more than 50 servers across 4 data centers with one administrator.
TL;DR
Key takeaways from our experience operating without Kubernetes:
- Simplicity over complexity — standard Linux tools (systemd, package managers) solve most tasks without additional abstraction layers
- One binary — one service — all services are built into a single binary file, simplifying delivery and operations
- Automation via Ansible — all infrastructure is described as code, developers don’t depend on admins
- Scaling via server groups — simple horizontal scaling without orchestrators
- Fault tolerance via DNS and BGP — traffic routing at the network level, not application level
- Fewer components — fewer problems — in three years, nothing has failed due to systemd or package managers
Context: What Was and What Became
We currently don’t have Kubernetes or Docker in production. They existed before, but we didn’t specifically remove them — the transition happened naturally. We have one admin, one DevOps engineer, operate in 4 data centers (two in Russia, one in Europe, one in the US, and we’re planning in Asia), and run a bit more than 50 servers.
We used to run Kubernetes, and we've been operating without it for about three years (maybe a bit less). It ran, consumed resources, and sometimes broke in ways we didn't understand. Everything seemed to work, but fixing it was often very difficult — an admin handled it, and back then we even had two admins.
I really didn’t like what was happening because there was no control. Having two data centers meant several Kubernetes clusters, which created additional complexity.
We didn’t want to give up anything, but we wanted more stability in the system.
What We Wanted from the System
Before tackling the problems with Kubernetes as such, we wanted to ask: what do we actually want from our systems?
Operations Requirements
An engineer who handles operations is not a developer. They don’t know what’s in a service, and they don’t need to know what the service does or how it works. We have about 40 services, maybe more.
Operations wanted a service to:
- Expose its ports externally (whether it speaks HTTP or something else is not that important)
- Have a metrics port (mandatory)
- Take its parameters via environment variables (all configuration through env vars, avoiding config files wherever possible)
- Not differ much from other services (a zoo of services that are each configured and operated differently is complex, expensive, and inefficient)
Development Requirements
Development, naturally, wanted:
- Not to depend on operations — not to be held hostage by it
- Not to have to ask someone to add a service or deploy it somewhere
- Not to beg for logs, monitoring, and the rest — everyday problems that remain problems in many teams
General Requirements
What we wanted from the system overall:
- Automation — it existed, but we wanted more
- Scaling — the system should scale
- Fault tolerance — the system should be fault-tolerant
- Unified package format and delivery — this is important: we deploy every day, sometimes 10–20 releases
- Build immutability — we want to build a service, test it in exactly that state, and then deploy the same build to production
- Resource isolation — more than one service can run on one machine
- Security — this is important
- Performance — we had performance problems with Kubernetes; it broke, and it was expensive in terms of hardware
Solution: One Binary — One Service
We took a slightly different path: instead of building a platform for services that could do anything, we brought the services themselves to the needed state.
Service Requirements
One service should be one binary, with standard metrics and everything else. Our services written in Ruby didn't fit the "one service — one binary" model, so we rewrote the remaining Ruby services in Go¹ so they could be built as single binaries.
We build more than traditional backend services this way. For example, the player (JS files), the website, and the admin panel are each packed into a single binary together with their static assets. From an operations perspective they're all the same — player, API, or DNS server, they're all delivered and operated identically.
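The article doesn't show how the static assets end up inside the binary; one common way in Go (since 1.16) is the standard embed package. A minimal sketch, with hypothetical paths and port:

package main

import (
    "embed"
    "io/fs"
    "log"
    "net/http"
)

//go:embed static/*
var staticFiles embed.FS // player/admin JS and other assets compiled into the binary

func main() {
    // Strip the "static/" prefix so files are served from the root path.
    sub, err := fs.Sub(staticFiles, "static")
    if err != nil {
        log.Fatal(err)
    }
    http.Handle("/", http.FileServer(http.FS(sub)))
    log.Fatal(http.ListenAndServe(":8080", nil))
}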
Unified Metrics Format
Metrics should be standardized, and every service must expose at least one metric describing the application itself: version, build time, and name. That way we can always see exactly what is running.
Ultimately, all services became the same — from an operations perspective they don't differ at all. We ended up with one binary per service, which made operations much easier.
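The article doesn't prescribe a specific metrics library; a minimal sketch of such a "describing" metric using prometheus/client_golang (metric and label names are illustrative):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Filled in at build time, e.g.:
//   go build -ldflags "-X main.version=v1.2.3 -X main.buildTime=2024-01-01T00:00:00Z"
var (
    version   = "dev"
    buildTime = "unknown"
)

func main() {
    buildInfo := prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "app_build_info",
        Help: "Service name, version, and build time.",
    }, []string{"service", "version", "build_time"})
    prometheus.MustRegister(buildInfo)
    buildInfo.WithLabelValues("my-service", version, buildTime).Set(1)

    // The mandatory metrics port.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}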
Why Not Docker?
Being able to run a binary by hand isn't enough — we still need to deliver and start services. For this, many say Docker is needed: Docker is convenient, Docker "packages," Docker has resource isolation and other capabilities.
But Docker has overhead, and networking and diagnostics become more complicated. We needed a proven, reliable solution, and Docker didn't fit: not because of specific problems, but because it's too large for our needs, unnecessary, and adds performance overhead.
Solution: Package Manager and systemd
To deliver something reliably and start it, everything needed is in the operating system. We run Linux (Ubuntu everywhere), and like any operating system, it has a package manager (apt in our case) and systemd, which manages applications.
These tools are mature, reliable, and proven. No additional systems need to be installed — we avoid extra complexity and have fewer parts that can fail. This is a significant advantage.
Practical Note. systemd and package managers are standard Linux tools that are time-tested and used everywhere. Their advantage isn’t just reliability, but also that any Linux administrator knows how to work with them. This lowers the barrier to entry for new team members and simplifies problem diagnostics. Source: experience operating more than 50 servers without Kubernetes.
The package manager doesn’t have to be apt: it could be yum or any other, depending on your operating system. We use Ubuntu, so apt. Overall there’s no difference.
Delivery Process: From Build to Launch
The full delivery pipeline looks like this:
[1] Development
|
v
[2] Build (go build)
|
v
[3] Create package (kinectl deb)
| |
| +---> Read spec.yml
| +---> Generate deb/rpm via nfpm
|
v
[4] Package repository (apt repo)
| |
| +---> Versioning
| +---> Store old versions (for rollback)
|
v
[5] Ansible playbook (kinescope-service.yml)
| |
| +---> Prepare: load spec.yml
| +---> Deploy: install package on host group
| | |
| | +---> apt update && apt install
| | +---> Create systemd unit
| | +---> Configure environment variables
| | +---> Configure secrets
| | +---> systemctl start service
| | +---> Health check (check /metrics)
| |
| +---> Monitoring: add to Prometheus
| +---> Notify: annotation in Grafana
|
v
[6] Service running
|
+---> Metrics available
+---> Logs collected
+---> Alerts configured
Deployment Process Details
Stage 1: Preparation (Prepare)
- The playbook reads spec.yml from the service repository
- Parses the specification and prepares variables
- Loads encrypted data from Ansible Vault
- Merges variables from different sources (common, environment, service-specific)
- Generates the template for environment variables (including encrypted secrets)
Stage 2: Deployment (Deploy)
- Updates the package cache on target hosts (apt update)
- Installs the package at the specified version (apt install package=version)
- Pins the package version (dpkg --set-selections hold) to prevent automatic updates
- Creates the configuration directory (/etc/${service_name}/)
- Writes environment variables (including encrypted secrets) to /etc/${service_name}/environment
- Generates the systemd unit file from a template with all limits and settings
- Reloads the systemd daemon (systemctl daemon-reload)
- Starts the service (systemctl start ${service_name})
- Enables autostart (systemctl enable ${service_name})
- Runs the health check: makes an HTTP request to /metrics up to 12 times at 5-second intervals, until it gets a 200 response
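The playbook itself isn't shown in the article; a condensed sketch of these Deploy steps as Ansible tasks might look like this (the modules are standard Ansible, while variable and template names are illustrative):

- name: Install the pinned package version
  ansible.builtin.apt:
    name: "{{ service_name }}={{ app_version }}"
    update_cache: true

- name: Hold the package to prevent automatic upgrades
  ansible.builtin.dpkg_selections:
    name: "{{ service_name }}"
    selection: hold

- name: Write the environment file (including decrypted Vault variables)
  ansible.builtin.template:
    src: environment.j2
    dest: "/etc/{{ service_name }}/environment"
    mode: "0640"

- name: Render the systemd unit from a template
  ansible.builtin.template:
    src: service.unit.j2
    dest: "/etc/systemd/system/{{ service_name }}.service"

- name: Reload systemd, enable and start the service
  ansible.builtin.systemd:
    name: "{{ service_name }}"
    state: started
    enabled: true
    daemon_reload: true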
Stage 3: Monitoring (Monitoring)
- Adds service to Prometheus configuration (file_sd)
- Reloads Prometheus configuration
- Creates annotation in Grafana about deployment (version, host, user, time)
Stage 4: Cleanup (Cleanup)
- Removes temporary files created during deployment
Rollback
Rollback to a previous version is as simple as deployment:
ansible-playbook playbooks/kinescope-service.yml \
-i inventories/production \
-e app_version=v1.2.3 # Previous version
The playbook will install the specified package version, restart the service, and run the health check. Old package versions are stored in the repository, so rollback is always possible.
Canary Deployment
For gradual deployment, you can use the serial parameter in spec.yml:
service:
deploy:
serial: 1 # Deploy one host at a time
Or limit deployment to a specific host:
ansible-playbook playbooks/kinescope-service.yml \
-i inventories/production \
-e app_version=v2.0.0 \
--limit host-01 # Only on one host
After checking on one host, you can deploy to the rest.
Health Check and Availability Verification
After starting the service, the playbook checks its availability via the health check endpoint. This is important because:
- Early problem detection: if the service didn’t start or crashed right after startup, we learn about it before traffic goes to it
- Automatic check: no need to manually check that the service is working
- Integration with load balancers: the health check endpoint is used not only by Ansible, but also by load balancers for traffic routing
If the health check doesn’t pass after 12 attempts (60 seconds), the playbook exits with an error, and you can figure out what’s wrong.
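A sketch of how such a check can be expressed in Ansible, assuming the uri module and hypothetical variable names (the article only states the 12 attempts × 5 seconds behaviour):

- name: Wait for the service health check
  ansible.builtin.uri:
    url: "http://{{ ansible_host }}:{{ probe_port }}{{ probe_path }}"
    status_code: 200
  register: probe_result
  retries: 12
  delay: 5
  until: probe_result.status == 200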
Everything with standard OS tools: package manager for delivery, systemd for starting and management, Ansible for automation.
Building Packages: NFPM
The only extra tool we use for building packages is NFPM ("NFPM is Not FPM"). We didn't want to hand-write and maintain packaging files for every build, and there's a great tool for this — NFPM from the GoReleaser project: a simple package builder (deb, rpm, apk, ipk, Arch Linux) written in Go, with no dependencies. You describe in YAML what should end up in the package, build it for the target OS, and then run the result via systemd.
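A minimal nfpm configuration for such a service might look like the following (fields and paths are illustrative; see the nfpm documentation for the full schema):

# nfpm.yaml
name: my-service
arch: amd64
platform: linux
version: v1.2.3
maintainer: "Kinescope <ops@example.com>"
description: "My service"
contents:
  - src: ./bin/my-service
    dst: /usr/bin/my-service

A deb can then be built with nfpm package --packager deb; in Kinescope's case this step is wrapped by kinectl, which uses nfpm as a library.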
systemd Advantages
System Visibility
First, the service is visible to the system, and monitoring it is easy. We put exporters on every machine, so even if a developer forgot to add monitoring (no metrics endpoint, or it wasn't added to Prometheus), the service is still visible through the standard system metrics — we pick it up by name, and whether it's up or down is visible as well.
Resource Isolation
If someone says Docker is needed for resource isolation, that's not quite right: systemd lets you set the same limits — CPU, memory, and whatever else you need — flexibly enough.
Security
Security is covered too: if a service, for example, has no business touching certain devices, it simply can't reach them. All of this is written in the same place — the unit file — where it's plainly visible.
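A sketch of the kind of unit file this produces. The directives are standard systemd; the specific values and paths are illustrative, not Kinescope's actual template:

# /etc/systemd/system/my-service.service
[Unit]
Description=My service
After=network-online.target
Wants=network-online.target

[Service]
User=kinescope
Group=kinescope
EnvironmentFile=/etc/my-service/environment
ExecStart=/usr/bin/my-service
Restart=on-failure

# Resource isolation
MemoryMax=500M
CPUQuota=200%
LimitNOFILE=5000
TasksMax=1000

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
PrivateDevices=true

[Install]
WantedBy=multi-user.target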
Scheduled Tasks
We use systemd for scheduled tasks as well — if you remember cron, it's the same idea. We have several tasks that run rarely, so there's no point keeping a service running the rest of the time. With systemd we just define a timer: the task starts, runs, and finishes.
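A minimal timer pair for such a task (unit names and schedule are illustrative):

# cleanup.timer
[Unit]
Description=Run cleanup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# cleanup.service
[Unit]
Description=One-off cleanup task

[Service]
Type=oneshot
ExecStart=/usr/bin/cleanup

systemctl enable --now cleanup.timer activates it, and systemctl list-timers shows when it will fire next.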
Performance and Reliability
If there's nothing extra, there's nothing to break and nothing burning resources. systemd works stably. Yes, I've heard complaints that systemd supposedly consumes a lot of resources, but in our practice its overhead is practically unnoticeable. It's reliable: if systemd doesn't work, nothing works for us — it's a base system that everyone watches.
Automation via Ansible
Usually everyone talks about implementing something. We did the opposite — we removed things and got results almost without doing anything. We didn’t even install any additional software. We ended up with a unified package format, delivery works well, there’s versioning, isolation — almost a silver bullet.
However, we still had automation questions to address.
Everything Needs to Be Automated
This might be obvious, but everything needs to be automated, even if you have one server. If it fails, you won’t be able to restore it manually — you won’t remember the exact configuration, environment, or what it was doing.
Why Ansible?
For automation, we chose Ansible. We already had it, so the choice was logical — "Why Ansible?" Well, what else? We used Puppet years ago. Ansible can be extended with plugins and maintained, but "easy to maintain" needs an asterisk: in practice that isn't always true.
If you use Ansible, you inevitably get “garbage” and technical debt. So from time to time we come back and clean: rewrite playbooks, remove unnecessary roles/services and everything that accumulated.
How Ansible Works for Us
Our setup is simple. We have two teams — developers and engineers who support the system (admins, to put it simply).
Operations handles the base: when a server appears and network is connected, a base role is rolled out on it. Then the server is added to inventory, root access is disabled, access is configured, and monitoring/logging is installed: for example, node exporter and an agent for log collection. We collect all logs in the system — not only our services, but everything that exists. This is rarely needed, but when it is, you can see exactly what happened on the host.
For developers, Ansible is also important: playbooks are in the repository, and both sides can commit to it — developers and admins. Developers are fully responsible for their services. Since our services are unified, most often it’s enough to copy a “skeleton” and set the needed settings. Routing rules in nginx/proxy are also handled by development — operations rarely touches this.
A developer can add their service “from commit,” build it, and deploy it — no admin needed. This lets admins sleep better and worry less.
“Manifest Style” Without Kubernetes: Unified spec File Per Service
Human error still remains: you can forget memory limits, port for monitoring, some required parameter or dependency. To reduce the probability of errors, we use a unified specification file — spec.yml — that describes the entire service lifecycle: from building a package to configuring systemd unit and monitoring.
Simple idea: one file in the service repository describes everything needed for its operations. This resembles a Kubernetes manifest, but without Kubernetes itself. From this file, the following is automatically generated:
- deb/rpm package (via the kinectl utility, which uses nfpm as a library)
- Ansible playbook for deployment
- systemd unit file with correct limits and settings
- monitoring configuration
spec.yml Structure
The spec.yml file describes a service in YAML format. Here’s a minimal example:
service:
name: my-service
group: applications
description: "My service"
deploy:
probe:
type: "http"
port: 9090
path: /metrics
limits:
mem: 500M
no_file: 5_000
environments: |
HTTP_ADDRESS=:8080
LOG_LEVEL=info
A more complete example with dependencies and secrets:
service:
name: my-api-service
group: applications
description: "API service"
deploy:
serial: 1 # Deploy one host at a time
probe:
type: "http"
port: 9090
path: /metrics
limits:
mem: 1G
no_file: 10_000
nproc: 1000
resources:
- postgres
- sentinel
- memcached
- nats
environments: |
HTTP_ADDRESS=:{{ api_port }}
LOG_LEVEL={{ log_level }}
SECRET_KEY={{ vault_secret_key_encrypted }}
API_TOKEN={{ vault_api_token_encrypted }}
exec:
start: "/usr/bin/my-api-service"
stop: "/bin/kill -s SIGINT $MAINPID"
security:
owner: "kinescope"
group: "kinescope"
What spec.yml Describes
Basic parameters:
- name — service name (must match the package name)
- group — host group for deployment (applications, encoders, storage, etc.)
- description — service description
Deployment:
- deploy.serial — how many hosts to deploy to simultaneously (for canary or gradual deployment)
- deploy.probe — health check settings (type, port, path)
- deploy.numa_netdevs — list of network interfaces for NUMA binding (if needed)
Resource limits:
- limits.mem — memory limit (e.g., 500M, 1G, unlimited)
- limits.no_file — maximum number of open files
- limits.nproc — maximum number of processes
- limits.nice — nice value for the process
- limits.allowed_cpus — allowed CPUs (cpuset, e.g., "0-3")
Dependencies:
- resources — list of dependencies on other services (postgres, redis, nats, etc.). Environment variables for connecting to these services are added automatically
Configuration:
- environments — environment variables (including encrypted secrets). Sensitive data is encrypted via Ansible Vault and encoded in base64; a decryption password is embedded when building the binary (via the gitlab.kinescope.dev/go/vault library), and on startup the application automatically decrypts the environment variables
- required_packages — system packages that need to be installed
Management commands:
- exec.start — start command (default /usr/bin/${service.name})
- exec.stop — stop command (default SIGINT)
- exec.reload — reload command (if supported)
Security:
- security.owner and security.group — user and group for running the service
- capabilities — Linux capabilities (e.g., CAP_NET_BIND_SERVICE)
Common Mistakes and How to Avoid Them
Mistake 1: Forgot to specify memory limit
# Bad: no limit
service:
name: my-service
# limits missing
# Good: limit specified
service:
name: my-service
limits:
mem: 500M
Mistake 2: Wrong port for health check
# Bad: port doesn't match actual
deploy:
probe:
port: 8080 # But service listens on 9090
# Good: port matches
deploy:
probe:
port: 9090 # Service really listens on 9090
Mistake 3: Forgot to specify dependencies
# Bad: service uses PostgreSQL, but it's not specified
service:
name: my-service
# resources missing, but code uses POSTGRES_DSN
# Good: dependencies specified
service:
name: my-service
resources:
- postgres # Will automatically add POSTGRES_DSN
Mistake 4: Secrets in plain text
# Bad: secrets in plain text
environments: |
SECRET_KEY=super-secret-key # Insecure!
# Good: secrets encrypted via Ansible Vault and encoded in base64
environments: |
LOG_LEVEL=info
SECRET_KEY={{ vault_secret_key_encrypted_base64 }} # Encrypted via Ansible Vault, then base64
# When building binary, decryption password is embedded via gitlab.kinescope.dev/go/vault
# On startup, application automatically decrypts environment variables
How this works in practice:
1. Encrypting the secret (when preparing the configuration via Ansible Vault):

# Secret is encrypted via Ansible Vault and encoded in base64
ansible-vault encrypt_string "my-secret-key" --vault-password-file vault-pass | base64

2. In spec.yml the secret is stored encrypted:

environments: |
  SECRET_KEY={{ vault_secret_key_encrypted_base64 }}

3. In the application code the gitlab.kinescope.dev/go/vault library is used:

package action

import "gitlab.kinescope.dev/go/vault"

var (
    vaultPassword  = "" // Embedded at build time via -ldflags
    skipDecrypt, _ = strconv.ParseBool(os.Getenv("SKIP_DECRYPT"))
)

var Flags = []cli.Flag{
    vault.StringFlag{
        Name:        "secret-key",
        EnvVar:      "SECRET_KEY",
        Value:       "",            // Value from environment variable (encrypted)
        Password:    vaultPassword, // Embedded at build time via Makefile
        SkipDecrypt: skipDecrypt,   // For local development
    },
}

4. At build time the decryption password is embedded via the Makefile:

VAULT_PASSWORD ?= ""
LDFLAGS := -X '$(GO_PACKAGE)/cmd/service/action.vaultPassword=$(VAULT_PASSWORD)'

build:
	go build -ldflags "$(LDFLAGS)" -o service cmd/service/main.go

5. On startup the application automatically decrypts the environment variables. If the SKIP_DECRYPT=true variable is set, the value is used as-is (for local development).
Thus, encrypted secrets can be safely stored in spec.yml and environment variables without exposing them in plain text on servers. The decryption password is embedded in the binary at build time, which allows the application to automatically decrypt secrets on startup.
Approach Benefits
- Single source of truth: all service information in one place
- Fewer errors: the utility can check required fields and set defaults
- Automation: everything needed for deployment is generated from one file
- Versioning: spec.yml is stored in Git with code, changes can be tracked
- Simplicity: a developer doesn’t need to know Ansible or systemd details — just describe the service in spec.yml
This isn’t a complete replacement for Kubernetes manifests, but for our case it’s sufficient. We get many benefits of the “manifest style” approach without orchestrator complexity.
Scaling and Fault Tolerance
What Is Scaling?
Scaling was one of our goals: the system should scale and be fault-tolerant. In Kubernetes, autoscaling means load arrives and a new service instance “appears.” Creating service instances on the fly seemed questionable to us: it adds complexity and magic, and we wanted predictability.
If a service needs to handle more traffic, we wanted it to handle that traffic through normal horizontal scaling by hosts/groups. We didn’t want a separate “magical” entity deciding when and how many instances to create. A single load balancer is also a failure point — once you have a master, you have a failure risk, and we didn’t want that.
Simple Solution: Server Groups
We solved scaling simply — “like in the old days.” Servers are divided into groups: application servers (simple, cheap machines), CDN servers (we work with video — lots of traffic and disk), transcoding servers (different configuration: network/CPU/GPU for video processing), etc.
Servers have different configurations because they handle different tasks. It’s simple: each group has more than one server. If one isn’t enough, we add more. This is most relevant for CDN (more traffic means more servers) and transcoding (we add machines to the group, they pull tasks from the queue).
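In Ansible terms this maps onto plain inventory groups. An illustrative layout (hostnames are made up):

# inventories/production/hosts
[applications]
app-01
app-02

[cdn]
cdn-msk-01
cdn-ams-01

[encoders]
enc-01
enc-02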
Fault Tolerance via DNS and BGP
Fault tolerance comes from simple mechanisms. For background tasks: if one server drops out, tasks remain in the queue and are picked up by others — everything continues working.
For HTTP traffic we use a two-tier scheme. First, DNS directs users to the appropriate region (e.g., users from Russia go to Russian servers). Then within the region, load balancing by server groups routes users to a data center, and then to a specific server in the group.
If a server fails, BGP is disabled and it drops out of distribution in about one and a half seconds — fast enough for us. When it comes back up, it’s automatically added back to routing. This works well.
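The article doesn't show the routing configuration; since the diagnostics section later mentions birdc, here is a toy BIRD sketch of the idea: announce a service address over BGP, and withdraw it (by stopping the daemon or removing the route) to drop the server out of rotation. ASNs and addresses are from documentation ranges, not real values:

# /etc/bird/bird.conf — illustrative only
protocol device {
}

protocol static announce {
  ipv4;
  route 198.51.100.10/32 blackhole;  # service/anycast address
}

protocol bgp uplink {
  local as 65001;
  neighbor 192.0.2.1 as 65000;
  ipv4 {
    import none;
    export where source = RTS_STATIC;  # announce only our static route
  };
}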
Custom DNS Server
We went further: we built our own DNS servers and can configure routing very precisely. Yes, we “reinvented the wheel” and wrote our own DNS server — but it lets us set traffic processing rules exactly as needed.
This helps a lot with CDN scaling. For large clients, we can optimize infrastructure, and DNS is key to this. No one offered such a solution out of the box — we had to build it ourselves.
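Kinescope's DNS server isn't public, so the following is only a toy sketch of the idea in Go using the widely used github.com/miekg/dns library; the address-selection logic is a hypothetical stub:

package main

import (
    "log"
    "net"

    "github.com/miekg/dns"
)

// pickAddress is a stand-in for the real routing logic
// (GeoIP lookup, server health, load, per-client rules).
func pickAddress(clientIP net.IP) string {
    return "198.51.100.10" // documentation-range address
}

func handle(w dns.ResponseWriter, r *dns.Msg) {
    m := new(dns.Msg)
    m.SetReply(r)
    client, _, _ := net.SplitHostPort(w.RemoteAddr().String())
    for _, q := range r.Question {
        if q.Qtype == dns.TypeA {
            rr, err := dns.NewRR(q.Name + " 60 IN A " + pickAddress(net.ParseIP(client)))
            if err == nil {
                m.Answer = append(m.Answer, rr)
            }
        }
    }
    w.WriteMsg(m)
}

func main() {
    dns.HandleFunc(".", handle)
    log.Fatal((&dns.Server{Addr: ":53", Net: "udp"}).ListenAndServe())
}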
Solution Benefits
Benefits from what we got:
- Almost nothing had to be done — usually everyone talks about what they “implemented,” but we removed a lot instead
- Reliable — we removed complexity rather than adding it. We use what the operating system provides out of the box, without installing anything extra
- Simple enough — no overhead: software runs in the environment, just install and start
- Excellent observability — not because we wrote something clever, but because we standardized our approach to metrics, logs, and everything else. All logs are in one place — thanks to Loki, which we switched to at some point. A lot is now available "out of the box" and free for operations
Drawbacks
For us, there are basically no drawbacks. Others might see them, but the result works for us. There’s not much to “improve” here: we have fewer components because we removed a lot.
If we think about it, of course they exist:
- No autoscaling — which I talked about in the Kubernetes context. We run our own hardware, so we can't "snap our fingers" and get additional resources. In the cloud this is simpler: most providers offer autoscaling — traffic goes up, a virtual machine spins up; traffic drops, the machine is deleted and money is saved
- Can break — for example, a repository might disappear or require manual package management. So we package almost all additional software ourselves (Grafana, Loki, exporters, VictoriaMetrics, Prometheus, etc.) and keep it in our repository. We use external repositories minimally (e.g., nginx)
- Can’t locally spin up “the whole system with one command” — for newcomers this would sometimes be convenient. But it’s not needed: there are many services, and to work on a specific service, you don’t need to spin up everything. Usually a database, nginx, and the service itself are enough
- Can’t google a ready Helm chart and close the task — sometimes this is convenient, but it has a downside: you can “install by default” and not understand how it works. I had a real case: Kafka was set up with default paths in /tmp, and this only came to light when problems started. So we prefer people to understand what they’re installing and how it will run in production
Observability: Metrics, Logs, Alerts
Observability doesn’t depend on Kubernetes. We use a standard stack: Prometheus for metrics, Loki for logs, Grafana for visualization, AlertManager for alerts.
Minimum Metrics Set for a Service
Each service should expose at minimum:
- Version and build time — to always know which version is running
- Health check endpoint — /metrics or /health for availability checks
- Go runtime metrics (if the service is in Go) — GC activity, memory usage, number of goroutines
Additionally useful:
- Business logic metrics (request count, errors, latency)
- Dependency metrics (DB response time, cache, external APIs)
- Resource metrics (CPU, memory, disk, network)
Centralized Logs
All services write logs to stdout/stderr, systemd collects them in journald, and Promtail sends to Loki. This allows:
- Seeing logs of all services in one place
- Searching by time, host, service
- Setting alerts on log patterns
- Analyzing incidents after they’re resolved
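A minimal Promtail scrape configuration for this journald-to-Loki pipeline could look like the following (the Loki URL and label names are illustrative):

# promtail.yml
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal__hostname']
        target_label: host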
Alerts and Response
Alerts are configured in Prometheus and sent to AlertManager, which groups them and sends to Telegram. Main alert types:
- Service unavailable — health check doesn’t respond
- High memory usage — exceeding limit or threshold
- Disk space shortage — free space < 25%
- High temperature — for physical servers
- Dependency problems — DB, cache, queues unavailable
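Two of these alert types expressed as Prometheus rules (alert names and timings are illustrative; the 25% disk threshold matches the one above):

groups:
  - name: service-alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} does not respond to health checks"
      - alert: RootFSLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 25% free space on / at {{ $labels.instance }}"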
Checklist: What Each Service Should Have
For operations to be predictable, each service should have:
- ✅ Health check endpoint — /metrics or /health on a separate port
- ✅ Version metrics — version, build time, service name
- ✅ Logging — structured logs to stdout/stderr
- ✅ Resource limits — memory, file descriptors, processes (in spec.yml)
- ✅ Graceful shutdown — correct handling of SIGTERM/SIGINT (see the sketch after this list)
- ✅ Environment variables — all configuration via env, not via files
- ✅ Dependencies — specified in spec.yml via resources
- ✅ Documentation — description in spec.yml via description
If a service matches this checklist, its operations will be simple and predictable.
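For the graceful shutdown item, a minimal Go sketch (port and timeout are illustrative):

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // systemd sends SIGTERM on systemctl stop; the spec.yml above uses SIGINT by default.
    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
    <-stop

    // Finish in-flight requests before exiting.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("shutdown: %v", err)
    }
}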
Common Problems and Their Solutions
Service Doesn’t Start
Symptoms: systemctl status shows failed, logs show an error.
Diagnostics:
# Check status
systemctl status my-service
# View logs
journalctl -u my-service -n 50
# Check configuration
cat /etc/my-service/environment
# Check file permissions
ls -la /usr/bin/my-service
Common causes:
- Wrong environment variables (check /etc/my-service/environment)
- Missing dependencies (DB, cache unavailable)
- Wrong file permissions
- Port already occupied by another process
Solution: Fix configuration and restart the service.
High Memory Usage
Symptoms: MemoryUsed alert, service works slowly, possible OOM kill.
Diagnostics:
# Check memory usage
free -h
htop
# Find processes with high usage
ps aux --sort=-%mem | head
# Check systemd limits
systemctl show my-service | grep Memory
# Check Go metrics (if service is in Go)
curl http://localhost:9090/metrics | grep go_memstats
Common causes:
- Memory leak in code
- Limit too small for load
- Too many goroutines/processes (check go_goroutines)
Solution: Increase limit in spec.yml (short-term) or fix leak (long-term).
Disk Space Shortage
Symptoms: RootFS or DataFS alert, write errors.
Diagnostics:
# Check disk usage
df -h
# Find large files
du -sh /* | sort -h
# Check systemd logs
journalctl --disk-usage
# Clean old logs
journalctl --vacuum-time=7d
Common causes:
- Accumulated logs (journald, applications)
- Temporary files not deleted
- Data growth (DB, cache)
Solution: Clean logs, delete temporary files, increase disk or add new one.
Network Problems
Symptoms: High latency, packet loss, service unavailability.
Diagnostics:
# Check network interfaces
ip addr
ip link
# Check routing
ip route
birdc show protocols # If BGP is used
# Check availability
ping host
traceroute host
# Check ports
ss -tulpn | grep port
Common causes:
- Physical connection problems
- Wrong routing
- Firewall blocking traffic
- BGP session problems
Solution: Check physical connection, routing, firewall rules, BGP sessions.
Dependency Problems
Symptoms: Service can’t connect to DB, cache, queues.
Diagnostics:
# Check environment variables
cat /etc/my-service/environment | grep POSTGRES
cat /etc/my-service/environment | grep REDIS
# Check dependency availability
nc -zv db-host 5432
redis-cli -h cache-host ping
# Check service logs
journalctl -u my-service | grep -i "connection\|timeout\|error"
Common causes:
- Dependency unavailable (down, network problems)
- Wrong credentials (check encrypted environment variables)
- Connection limit exceeded
Solution: Check dependency availability, fix credentials, increase connection limit.
Non-Obvious Benefits
There are also these non-obvious benefits:
- Became simple and reliable — in three years, nothing has failed due to systemd or package managers. If something broke, it was our mistakes (usually on the service side). We can always roll back. The scariest thing, as usual, is deployment — that’s where you can really break something
- Admins relaxed — before, admins lived in 24/7 mode: many alerts, constant investigations. Now there are fewer alerts, which sometimes leads to “relaxation” — someone can be on a train and respond “I’ll be available in half an hour.” This is good (less burnout) but requires discipline
Conclusion
We’ve shown that you can successfully operate production infrastructure without Kubernetes and Docker, using standard Linux tools: systemd, package managers, and Ansible. This solution turned out simpler, more reliable, and more efficient for our case.
Key points:
- Simplicity over complexity — standard tools solve most tasks without additional abstraction layers
- One binary — one service — simplifies delivery, versioning, and operations
- Unified spec.yml — one file describes the entire service lifecycle, reducing error probability
- Automation via Ansible — all infrastructure as code, developers don’t depend on admins
- Scaling via server groups — simple horizontal scaling without orchestrators
- Fault tolerance via DNS and BGP — traffic routing at the network level
- Observability doesn’t depend on orchestrator — metrics, logs, and alerts work independently of Kubernetes
In three years of operation, nothing has failed due to systemd or package managers. This shows that a simple solution can be more reliable than a complex one.
It’s important to understand: this doesn’t mean Kubernetes is bad or shouldn’t be used. It means that for our case (own hardware, specific requirements, small team) a simple solution turned out more efficient.
Similar topic with a different focus (requirements, tool choice, economics) — see Why We Didn’t Choose K8s.
Useful Resources
- systemd: Documentation and Examples — official documentation on systemd unit files, resource limits, and security
- Ansible: Documentation and Best Practices — official Ansible documentation with examples of playbooks and roles
- NFPM: Creating Linux Packages — tool for creating deb/rpm packages from YAML description
- systemd: Resource Isolation and Security — how to configure CPU, memory limits, and other restrictions via systemd
- Debian Packaging Guide — guide to creating deb packages (useful for understanding package structure)
- systemd timers instead of cron — how to use systemd for scheduled tasks instead of cron
- Ansible Vault: Storing Secrets — secure storage of passwords and secrets in Ansible playbooks
- BGP and Anycast for Fault Tolerance — how to use BGP/anycast for traffic routing and ensuring fault tolerance
- DNS for Traffic Routing — DNS basics and how to use it for geo-routing
- Simplicity vs Complexity in Infrastructure — examples of how excessive complexity can lead to problems (using Cloudflare as an example)
Practical Note. When choosing infrastructure tools, it’s important to understand: the more components, the more failure points. Standard Linux tools (systemd, package managers) are time-tested and have less “magic,” which simplifies problem diagnostics. Source: experience operating more than 50 servers without Kubernetes.
Footnotes
1. Go’s single-binary deployment model and minimal dependencies make it ideal for our infrastructure approach. Similar topic with a different focus — see Why We Didn’t Choose K8s. ↩︎