company-nix/plan-add-employee-helper-cli-and-home-manager-integration-for-safe-vps-provisioning.md
2026-03-18 02:44:54 +01:00

Plan: Add Employee Helper CLI and Home Manager Integration For Safe VPS Provisioning

Summary

Build a packaged helper CLI that is exported from the existing root flake and installed through the existing Home Manager module. The helper suite will wrap the current manual provisioning flow into validated, guided commands that:

  • inspect a live VPS over SSH
  • scaffold hosts/<name>/ files in this repo
  • generate or update disko.nix safely from probed facts
  • create the OpenBao policy/AppRole/bootstrap files for Tailscale enrollment
  • generate and optionally execute the nixos-anywhere install command
  • run post-install checks
  • generate the corresponding colmena host stanza instructions

The helper CLI will be the supported employee interface. Raw manual commands remain possible, but the helper commands become the documented and safest path.

Chosen defaults:

  • helper surface: packaged CLI installed by Home Manager
  • mutation mode: interactive confirm for risky operations
  • scope: full VPS workflow
  • repo writes: yes, write directly into this repo after validation
  • OpenBao auth: use the employee's existing bao login/session
  • host fact collection: SSH probe live host
  • topology: implement as root-flake exports rather than a nested subflake, because either option was marked acceptable and root-flake exports are lower-friction

Goals

User-facing goal

Employees should be able to provision a new VPS with a small number of safe commands and without manually editing Nix files or remembering the OpenBao/Tailscale bootstrap sequence.

Success criteria

A new employee with:

  • a checkout of this repo
  • a working local bao login
  • SSH access to a target VPS

can run helper commands that:

  1. probe the target host and derive safe defaults
  2. create or update the host files in hosts/<name>/
  3. create the OpenBao AppRole bootstrap material for that host
  4. print or run the exact nixos-anywhere command
  5. verify first boot and tailnet enrollment
  6. avoid footguns through validation, confirmations, and idempotency checks

Architecture

Repo additions

Add a helper package and Home Manager module to the existing root flake.

New files/directories:

  • pkgs/helpers/
  • pkgs/helpers/cli.py or equivalent main entrypoint
  • pkgs/helpers/templates/
  • modules/helpers/home.nix
  • optional pkgs/helpers/lib/ for command submodules
  • optional pkgs/helpers/README.md if internal tool docs grow beyond the top-level README

Root flake exports to add:

  • packages.<system>.nodeiwest-helper
  • apps.<system>.nodeiwest-helper
  • homeManagerModules.helpers

Existing modules/home.nix should import or include the helper Home Manager module so employees automatically get the command suite.

Packaging choice

Use a packaged Python CLI with stdlib only unless a strong reason appears otherwise.

Why:

  • safer structured argument parsing than shell aliases
  • simpler file templating and repo mutation than Bash
  • easier SSH probing output parsing and validation
  • no need to depend on a persistent external runtime beyond python3
  • clean packaging through pkgs.writeShellApplication or python3Packages.buildPythonApplication

The CLI should still shell out to:

  • ssh
  • bao
  • nix
  • git
  • optionally colmena

These tools are already part of the repo's workflow.

Home Manager integration

Existing behavior

Current modules/home.nix only installs:

  • openbao
  • colmena

New behavior

Employees should also get:

  • nodeiwest helper CLI command
  • any runtime dependencies not guaranteed by the base environment, if needed

Recommended command name:

  • nodeiwest

Reason:

  • short
  • organization-scoped
  • extensible subcommands
  • does not collide with upstream tools

CLI command surface

The CLI should be subcommand-based and decision-complete from day one.

1. nodeiwest host probe

Purpose:

  • SSH into the target host
  • collect disk and boot facts
  • output normalized machine facts

Inputs:

  • --ip <ip>
  • optional --user <user> default root

Behavior:

  • run the exact discovery commands already documented in the README:
    • lsblk -o NAME,SIZE,TYPE,MODEL,FSTYPE,PTTYPE,MOUNTPOINTS
    • boot mode probe via /sys/firmware/efi
    • root mount source via findmnt -no SOURCE /
    • swap via cat /proc/swaps
  • parse into structured output
  • determine:
    • primary disk candidate
    • root partition
    • boot mode
    • whether current disk naming is sda / vda / nvme
    • whether current system appears UEFI or BIOS

Output:

  • human-readable summary
  • optional --json machine-readable output

Failure conditions:

  • no SSH connectivity
  • multiple ambiguous disk candidates
  • missing required commands on remote host
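The derivation step above can be sketched as follows. This is a minimal illustration, not the planned implementation: the function names are hypothetical, the partition-naming patterns cover only the sda / vda / nvme style devices the plan mentions (plus xvd and mmcblk as an assumption), and the list of disks is assumed to have been extracted from the lsblk output already.

```python
import re

# Hypothetical helper names; the plan does not fix an internal API.
_PARTITION_RE = re.compile(
    r"^(?P<disk>/dev/(?:nvme\d+n\d+|mmcblk\d+))p\d+$"  # nvme0n1p2 -> nvme0n1
    r"|^(?P<disk2>/dev/(?:sd|vd|xvd)[a-z]+)\d+$"       # sda2 -> sda, vda1 -> vda
)


def parent_disk(root_source: str) -> str:
    """Map the output of `findmnt -no SOURCE /` to its parent disk device."""
    match = _PARTITION_RE.match(root_source.strip())
    if match is None:
        # Covers LVM, device-mapper, and anything else this sketch cannot parse.
        raise ValueError(f"cannot derive a disk from root source {root_source!r}")
    return match.group("disk") or match.group("disk2")


def boot_mode(efi_dir_exists: bool) -> str:
    """/sys/firmware/efi is present only when the running system booted via UEFI."""
    return "uefi" if efi_dir_exists else "bios"


def pick_primary_disk(disks: list[str], root_source: str) -> str:
    """Return the disk holding the root filesystem; abort on ambiguity."""
    candidate = parent_disk(root_source)
    if candidate not in disks:
        raise ValueError(f"root disk {candidate} not among probed disks {disks}")
    if len(disks) > 1:
        raise ValueError(f"multiple disk candidates {disks}; pass --disk explicitly")
    return candidate
```

Raising on anything the patterns do not recognise matches the plan's preference for stopping rather than guessing.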

2. nodeiwest host init

Purpose:

  • create or update hosts/<name>/configuration.nix
  • create or update hosts/<name>/disko.nix
  • create placeholder hardware-configuration.nix if missing
  • print exactly what changed
  • ask for confirmation before writing

Inputs:

  • --name <host>
  • --ip <ip>
  • optional --user <ssh-user> default root
  • optional overrides:
    • --disk /dev/sda
    • --boot-mode uefi|bios
    • --swap-size 4GiB
    • --timezone UTC
    • --tailscale-openbao on|off default on
  • optional --write or --apply
    • without it: dry-run plan only
    • with it: write after interactive confirmation

Behavior:

  • call host probe implicitly unless overrides are provided
  • derive safe defaults from live facts
  • validate host name format
  • validate that target files do not already contain contradictory data
  • create backup copies before overwriting existing tracked files
  • update configuration.nix with:
    • hostName
    • boot loader config matching boot mode
    • Tailscale AppRole bootstrap enabled by default
  • update disko.nix with:
    • correct disk device
    • GPT
    • correct UEFI/BIOS boot partition shape
    • ext4 root
    • swap partition using chosen/default size
  • create placeholder hardware-configuration.nix if absent
  • detect if host is missing from flake.nix and report exact block to add

Safety checks:

  • if boot mode is BIOS but host config would still emit EFI loader, abort
  • if probed disk differs from existing disko.nix, require explicit confirmation
  • if the exact target host files have uncommitted changes in the repo, warn and stop unless --force
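A minimal sketch of the validation and safety checks listed above. The plan does not define the host name format, so the RFC 1123 style label rule used here is an assumption, and all function names are hypothetical.

```python
import re
from typing import Optional

# Assumed rule: lowercase letters, digits, and hyphens; 1 to 63 characters;
# no leading or trailing hyphen. The plan itself only says "validate host name format".
_HOST_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def validate_host_name(name: str) -> str:
    if not _HOST_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid host name {name!r}")
    return name


def check_boot_consistency(probed_mode: str, config_emits_efi_loader: bool) -> None:
    """Abort when a BIOS host would still get an EFI boot loader config."""
    if probed_mode == "bios" and config_emits_efi_loader:
        raise ValueError("host boots via BIOS but configuration.nix emits an EFI loader")


def disk_change_needs_confirmation(probed_disk: str, existing_disk: Optional[str]) -> bool:
    """True when an existing disko.nix names a different disk than the probe found."""
    return existing_disk is not None and existing_disk != probed_disk
```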

3. nodeiwest openbao init-host

Purpose:

  • create the OpenBao policy and AppRole for a host
  • generate bootstrap files locally for install-time injection

Inputs:

  • --name <host>
  • optional --namespace it
  • optional --secret-path tailscale
  • optional --field auth_key
  • optional --auth-path auth/approle
  • optional --policy-name tailscale-<host>
  • optional --role-name tailscale-<host>
  • optional --out ./bootstrap
  • optional --kv-mount-path <actual-policy-path>
  • optional --cidr <cidr> repeatable if later needed
  • optional --apply
    • without it: show policy/AppRole plan and output paths
    • with it: execute after interactive confirmation

Behavior:

  • verify bao is available
  • verify employee is already authenticated:
    • e.g. bao token lookup or equivalent harmless auth check
  • generate policy content from inputs
  • write policy via bao policy write
  • write AppRole via bao write auth/approle/role/...
  • fetch role_id
  • generate secret_id
  • create local bootstrap directory:
    • bootstrap/var/lib/nodeiwest/openbao-approle-role-id
    • bootstrap/var/lib/nodeiwest/openbao-approle-secret-id
  • chmod both 0400
  • print exact next-step command to install with nixos-anywhere

Defaults:

  • namespace it
  • auth path auth/approle
  • secret path tailscale
  • field auth_key
  • role name tailscale-<host>
  • policy name tailscale-<host>

Important validation:

  • if a read probe of the secret path (bao kv get or equivalent) fails for the chosen namespace/path, abort with a clear explanation that the KV mount path or policy path likely needs adjustment
  • if OpenBao is unreachable or auth is missing, fail fast
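The policy rendering and bootstrap file steps might look roughly like the sketch below. It assumes a KV version 2 mount (where reads go through `<mount>/data/<path>`), which the plan does not state; the function names and the `kv` mount used in the usage note are hypothetical. The bootstrap file names and the 0400 mode come from the plan.

```python
import os
from pathlib import Path


def render_policy(kv_mount_path: str, secret_path: str) -> str:
    """Render a read-only policy for one secret (assumes a KV v2 mount layout)."""
    return (
        f'path "{kv_mount_path}/data/{secret_path}" {{\n'
        '  capabilities = ["read"]\n'
        "}\n"
    )


def write_bootstrap_files(out_dir: str, role_id: str, secret_id: str) -> list[Path]:
    """Write role_id and secret_id under <out_dir>/var/lib/nodeiwest with mode 0400."""
    target_dir = Path(out_dir) / "var" / "lib" / "nodeiwest"
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for file_name, value in (
        ("openbao-approle-role-id", role_id),
        ("openbao-approle-secret-id", secret_id),
    ):
        path = target_dir / file_name
        if path.exists():
            path.unlink()  # an existing 0400 file cannot be reopened for writing
        # Create with restrictive permissions from the start, never world-readable.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
        with os.fdopen(fd, "w") as handle:
            handle.write(value)
        os.chmod(path, 0o400)  # enforce the mode regardless of umask
        written.append(path)
    return written
```

For example, `render_policy("kv", "tailscale")` yields a policy granting read on `kv/data/tailscale`; the real mount path has to match the deployment, which is what the --kv-mount-path option is for.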

4. nodeiwest install plan

Purpose:

  • assemble and print the exact nixos-anywhere command
  • validate all required local files exist before install

Inputs:

  • --name <host>
  • optional --ip <ip> default from flake/host inventory if available
  • optional --bootstrap-dir ./bootstrap
  • optional --copy-host-keys on|off default on
  • optional --generate-hardware-config on|off default on

Behavior:

  • validate:
    • hosts/<name>/configuration.nix exists
    • hosts/<name>/disko.nix exists
    • bootstrap role_id/secret_id files exist
    • target host is present in flake.nix or warn with exact missing stanza
  • print the exact install command
  • print preflight checklist:
    • provider snapshot taken
    • app/data backup taken
    • public SSH reachable
    • host keys may change

No mutation beyond optional temp checks.
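As an illustration of the command assembly, the sketch below builds the nixos-anywhere invocation as an argument list. The function name is hypothetical, and the flake attribute `.#<name>` and the use of the bootstrap directory as `--extra-files` are assumptions about how this repo would wire things up; the plan does not spell out the final command.

```python
import shlex


def build_install_command(
    name: str,
    ip: str,
    user: str = "root",
    bootstrap_dir: str = "./bootstrap",
    copy_host_keys: bool = True,
    generate_hardware_config: bool = True,
) -> list[str]:
    """Assemble the nixos-anywhere command for a host (assumed flake attr: .#<name>)."""
    command = [
        "nix", "run", "github:nix-community/nixos-anywhere", "--",
        "--flake", f".#{name}",
        "--target-host", f"{user}@{ip}",
        "--extra-files", bootstrap_dir,  # carries the AppRole files onto the target
    ]
    if copy_host_keys:
        command.append("--copy-host-keys")
    if generate_hardware_config:
        command += [
            "--generate-hardware-config", "nixos-generate-config",
            f"./hosts/{name}/hardware-configuration.nix",
        ]
    return command


def format_command(command: list[str]) -> str:
    """Shell-quoted form for printing, so the printed and executed commands match."""
    return shlex.join(command)
```

Keeping the command as a list and only quoting it for display means install plan and install run can share one builder.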

5. nodeiwest install run

Purpose:

  • run the validated nixos-anywhere command

Inputs:

  • same as install plan
  • requires explicit --apply

Behavior:

  • run the same validation as install plan
  • print command and require interactive confirmation
  • execute nix run github:nix-community/nixos-anywhere -- ...
  • stream logs
  • on success, print post-install verification commands

Safety:

  • refuse to run without confirmation unless --yes
  • refuse to run if bootstrap files are missing
  • refuse to run if target IP is not reachable over SSH
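The interaction between --apply, --yes, and the interactive prompt could be gated by a small function like this one. It is a sketch with a hypothetical name; the prompt wording is also an assumption.

```python
from typing import Callable


def should_execute(apply: bool, assume_yes: bool, ask: Callable[[str], str] = input) -> bool:
    """Decide whether a mutating command may run.

    Without --apply the command stays a dry run. With --apply it needs an
    interactive "y"/"yes" unless --yes was passed.
    """
    if not apply:
        return False
    if assume_yes:
        return True
    answer = ask("Proceed with the command shown above? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Passing the prompt function as a parameter keeps the gate testable without a terminal.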

6. nodeiwest verify host

Purpose:

  • verify first boot and Tailscale/OpenBao bootstrap

Inputs:

  • --name <host>
  • --ip <ip>
  • optional --user root

Behavior:

  • SSH to host and run:
    • systemctl status vault-agent-tailscale
    • systemctl status nodeiwest-tailscale-authkey-ready
    • systemctl status tailscaled-autoconnect
    • tailscale status
  • summarize health
  • print actionable failures grouped by likely cause:
    • missing AppRole files
    • OpenBao auth failed
    • wrong secret path / field
    • Tailscale autoconnect blocked

7. nodeiwest colmena plan

Purpose:

  • ensure the host is ready for ongoing deploys
  • print or verify the Colmena target block

Inputs:

  • --name <host>
  • optional --ip <ip>

Behavior:

  • check that flake.nix has colmena.<host>.deployment.targetHost
  • if missing, print exact snippet to add
  • print the post-install deploy command:
    • nix run .#colmena -- apply --on <host>

Templates and file generation

Use explicit templates stored under pkgs/helpers/templates/.

Templates to maintain:

  • configuration.nix.j2 equivalent
  • disko-uefi-ext4.nix
  • disko-bios-ext4.nix if BIOS support is intended
  • hardware-configuration.placeholder.nix
  • OpenBao policy template

Rendering rules:

  • do not overwrite files blindly
  • preserve existing CA key list and obvious host-specific overrides when possible
  • if parsing existing files is too risky, switch to a guarded “abort and instruct user” policy rather than clever rewrites

Chosen default:

  • for v1, only support the current ext4+swap single-disk pattern cleanly
  • if the existing host directory contains custom structure outside the supported template shape, abort with a clear message instead of trying to merge arbitrarily

Public interfaces to add

Flake outputs

Add:

  • packages.<system>.nodeiwest-helper
  • apps.<system>.nodeiwest-helper
  • homeManagerModules.helpers

The root Home Manager module should include or re-export the helper package, so employees get it automatically.

Home Manager module contract

modules/helpers/home.nix should:

  • install packages.${pkgs.system}.nodeiwest-helper
  • optionally ensure helper runtime dependencies are present if not embedded by packaging

CLI contract

Stable executable:

  • nodeiwest

Stable subcommands for v1:

  • nodeiwest host probe
  • nodeiwest host init
  • nodeiwest openbao init-host
  • nodeiwest install plan
  • nodeiwest install run
  • nodeiwest verify host
  • nodeiwest colmena plan
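The subcommand tree above maps naturally onto nested argparse subparsers from the standard library. The sketch below is one possible layout, not a committed design: only `host probe` has its flags filled in, and the `COMMAND_TREE` name is hypothetical.

```python
import argparse

# Subcommand tree taken from the v1 CLI contract in this plan.
COMMAND_TREE = {
    "host": ("probe", "init"),
    "openbao": ("init-host",),
    "install": ("plan", "run"),
    "verify": ("host",),
    "colmena": ("plan",),
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="nodeiwest")
    groups = parser.add_subparsers(dest="group", required=True)
    leaves = {}
    for group, actions in COMMAND_TREE.items():
        sub = groups.add_parser(group).add_subparsers(dest="action", required=True)
        for action in actions:
            leaves[(group, action)] = sub.add_parser(action)

    # Only `host probe` is fleshed out; other leaves would get their flags the same way.
    probe = leaves[("host", "probe")]
    probe.add_argument("--ip", required=True)
    probe.add_argument("--user", default="root")
    probe.add_argument("--json", action="store_true")
    return parser
```

With this layout, `build_parser().parse_args(["host", "probe", "--ip", "203.0.113.10"])` yields a namespace with `group="host"`, `action="probe"`, and `user="root"`.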

Error-handling and safety policy

Global rules:

  • every mutating command supports dry-run first
  • every mutating command requires confirmation unless --yes
  • every repo write command creates a timestamped backup copy of the target file
  • every external command failure is surfaced with:
    • the exact subcommand
    • stdout/stderr summary
    • the next likely fix
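One way to surface external command failures in the shape described above is a thin wrapper around `subprocess.run`. The wrapper and exception names are hypothetical, and the 500-character output summary is an arbitrary choice for this sketch.

```python
import shlex
import subprocess


class ExternalCommandError(RuntimeError):
    """Raised when ssh, bao, nix, git, or colmena fails."""


def run_external(argv: list[str], hint: str) -> str:
    """Run a command and return stdout; on failure report command, output, and a hint."""
    printable = shlex.join(argv)
    try:
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
    except FileNotFoundError as exc:
        raise ExternalCommandError(
            f"command not found: {printable}\nnext step: {hint}"
        ) from exc
    if result.returncode != 0:
        # Keep only the tail of the output so the message stays readable.
        summary = (result.stderr or result.stdout).strip()[-500:]
        raise ExternalCommandError(
            f"command failed with exit code {result.returncode}: {printable}\n"
            f"output: {summary}\n"
            f"next step: {hint}"
        )
    return result.stdout
```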

Abort conditions:

  • current repo is not the expected flake root
  • bao is not authenticated for OpenBao actions
  • target SSH host is unreachable
  • disk facts are ambiguous
  • target host files already exist in a shape the helper cannot safely reason about

Out of scope for v1

Not included in the first helper bundle:

  • multi-disk or ZFS disko generation
  • non-AppRole Tailscale bootstrap flows
  • OpenBao response wrapping automation
  • automatic flake.nix AST rewriting beyond a tightly controlled supported block shape
  • full Colmena block auto-editing if the flake layout drifts from the current structure
  • vps2/OpenBao server provisioning

For unsupported cases, commands should fail with a precise message and the manual fallback step.

Implementation details

Packaging

Recommended implementation:

  • Python CLI in pkgs/helpers/
  • build with python3 from nixpkgs
  • keep dependencies stdlib-only if possible

Why not shell aliases:

  • too weak for idempotent file writes and validation
  • poorer error reporting
  • weaker SSH probe parsing

File writing strategy

For repo-tracked file mutation:

  • read current file if it exists
  • compare against generated output
  • if unchanged, report no-op
  • if changed, write a .bak.<timestamp> sibling first, then overwrite target after confirmation
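The write strategy above can be sketched as a single function. The name and the returned status strings are hypothetical. One deliberate simplification: this sketch creates the backup only after confirmation, so a declined write leaves no stray .bak files behind.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def write_with_backup(
    target: Path, new_text: str, confirmed: bool, timestamp: Optional[str] = None
) -> str:
    """Write new_text to target, backing up any existing file first.

    Returns "no-op" when the content is unchanged, "planned" when a change
    exists but was not confirmed, and "written" after a successful write.
    """
    target = Path(target)
    if target.exists() and target.read_text() == new_text:
        return "no-op"
    if not confirmed:
        return "planned"
    if target.exists():
        stamp = timestamp or time.strftime("%Y%m%dT%H%M%S")
        backup = target.with_name(f"{target.name}.bak.{stamp}")
        shutil.copy2(target, backup)  # sibling backup, preserving metadata
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(new_text)
    return "written"
```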

Chosen default:

  • direct repo writes are allowed because the user explicitly asked for helpers that make the workflow hard to get wrong
  • commands remain conservative and stop on unsupported customizations

OpenBao auth assumptions

Helpers assume:

  • user already has a valid bao login/session locally
  • BAO_ADDR is already set by Home Manager or can be inferred
  • namespace defaults to it

If auth is missing:

  • fail fast
  • print the exact bao command that must work before retrying
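The fail-fast auth check could be as small as the sketch below, which uses the `bao token lookup` command the plan suggests. The function name is hypothetical, and the runner is passed in as a parameter so the check can be exercised without a real OpenBao server.

```python
import subprocess


def bao_is_authenticated(run=subprocess.run) -> bool:
    """Return True when `bao token lookup` succeeds with the current session."""
    try:
        result = run(["bao", "token", "lookup"], capture_output=True, text=True)
    except FileNotFoundError:
        # bao itself is not installed or not on PATH.
        return False
    return result.returncode == 0


# Message to print when the check fails, naming the command that must work first.
AUTH_HINT = "run `bao token lookup` and make sure it succeeds before retrying"
```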

SSH probing assumptions

Helpers use:

  • ssh root@<ip> by default
  • override via --user

The CLI should probe live host facts by default and allow manual overrides for exceptional cases.

README integration

After the helper CLI exists, the top-level README.md should be revised to:

  • keep the underlying manual process documented
  • make the helper commands the recommended flow
  • reduce the manual sequence to a fallback / advanced section

Test cases and scenarios

Static/evaluation tests

  1. Flake exposes:
    • packages.<system>.nodeiwest-helper
    • apps.<system>.nodeiwest-helper
    • homeManagerModules.helpers
  2. Home Manager module installs the helper package without breaking current modules/home.nix.
  3. Generated commands reflect current repo paths:
    • hosts/<name>/configuration.nix
    • hosts/<name>/disko.nix
    • hosts/<name>/hardware-configuration.nix

CLI behavior tests

  1. host probe on a UEFI /dev/sda VPS returns:
    • boot mode UEFI
    • disk /dev/sda
    • root partition /dev/sda2
  2. host init dry-run for a new host:
    • prints file plan
    • does not write files
  3. host init --apply:
    • creates host files
    • uses probed disk and boot mode
    • creates backups when overwriting
  4. openbao init-host with valid local bao auth:
    • writes policy and AppRole
    • creates bootstrap files with mode 0400
  5. openbao init-host without local bao auth:
    • fails before attempting writes
  6. install plan:
    • refuses to proceed if bootstrap files are missing
    • prints exact nixos-anywhere command otherwise
  7. install run:
    • requires confirmation
    • executes the printed command
  8. verify host after successful install:
    • reports agent healthy
    • reports tailscaled autoconnect healthy
    • reports Tailscale joined

Failure scenarios

  1. Multiple candidate disks on target host:
    • helper aborts and requires explicit --disk
  2. BIOS host with UEFI template:
    • helper aborts before writing
  3. Existing custom disko.nix not matching supported template:
    • helper aborts with “manual intervention required”
  4. OpenBao secret path exists but field auth_key is missing:
    • helper fails during validation with a precise message
  5. SSH probe works but nixos-anywhere later loses connectivity:
    • helper reports the exact command that failed and reminds user to recover via provider console/public SSH

Assumptions and defaults

Chosen assumptions:

  • keep one root flake, not a nested helper flake
  • helper package is installed through Home Manager
  • command name is nodeiwest
  • direct repo writes are acceptable with backups and confirmation
  • employees already authenticate to OpenBao manually before using helper OpenBao commands
  • live SSH probing is preferred over manual fact entry
  • first release only supports the current single-disk ext4+swap provisioning shape cleanly
  • helper commands should be conservative and stop rather than guess in unsupported cases

If these assumptions remain acceptable, the implementation can proceed without further design decisions.