Netgate DNS Cutover Toolkit

I ran a read-only safety check before migrating LAN DNS onto the firewall. It came back NO-GO. Good, that's the gate doing its job. What it found underneath was more interesting than a simple outage: a routing trap quietly breaking a box's own network access, plus a one-line bug in the toolkit's own API client mislabeling a working endpoint as broken. Both got fixed. The gate came back clean. Then the actual cutover ran, live, against a real client.

Preflight returns NO-GO on a real outage. Root causes traced to a Tailscale routing trap and an API bug, both fixed. Preflight returns GO. DHCP cutover applied and verified live.

real outage caught

root causes fixed

unconfirmed writes

Why this needed a gate at all

A Pi-hole VM had been doing double duty: ad-blocking, and DNS resolution for every internal hostname. The instability that came with that wasn't "the firewall can't do DNS well." It was that DNS policy had no single owner. Depending on which device you asked, the answer to "what server resolves this name" could be the firewall, the old DNS box, a hardcoded public resolver, or whatever Tailscale happened to be injecting on that client. The fix is one enforced rule at the network layer, not a setting repeated per device. This toolkit makes that provable through the pfSense REST API, under one constraint: nothing destructive happens without explicit confirmation, and every write is preceded by a read-only check that can veto it.
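The constraint above (a read-only check that can veto, plus explicit confirmation before any write) can be sketched as a small decision function. This is a minimal illustration, not the toolkit's actual code: the name gate_write and its two arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the gate; names are illustrative, not from the toolkit.

# gate_write PREFLIGHT_RC CONFIRM
#   PREFLIGHT_RC: exit code of the read-only preflight (0 = no blockers)
#   CONFIRM:      "1" when the operator passed an explicit confirmation flag
# Prints the decision and returns non-zero only when the write is vetoed.
gate_write() {
  local preflight_rc=$1 confirm=${2:-0}
  if [ "$preflight_rc" -ne 0 ]; then
    echo "blocked"   # the read-only check vetoes the write
    return 1
  fi
  if [ "$confirm" != "1" ]; then
    echo "dry-run"   # no explicit confirmation: describe the change, do not make it
    return 0
  fi
  echo "apply"
}
```

The ordering is the point of the design: the veto is evaluated before the confirmation flag, so confirming cannot override a failed preflight.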

The NO-GO

./preflight-netgate-dns-cutover.sh --save-snapshot

One blocker: the DNS fallback host was completely unreachable. Not a script error. Independent ping, raw TCP, and DNS queries against it all timed out identically. The gate stopped before writing anything against a network that wasn't in the state the plan assumed.
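Checking one host three independent ways is what separates "the host is down" from "one tool or one port is misbehaving". A rough sketch of that cross-check follows; the function names are hypothetical, and port 53 and the query name example.com are assumptions rather than values taken from the toolkit.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not the toolkit's actual code.

# Each probe returns 0 on success and non-zero on failure or timeout.
probe_icmp() { ping -c 2 "$1" >/dev/null 2>&1; }
probe_tcp()  { nc -z -w 2 "$1" "${2:-53}" >/dev/null 2>&1; }
probe_dns()  { dig +time=2 +tries=1 +short "@$1" "${2:-example.com}" >/dev/null 2>&1; }

# classify_host ICMP_RC TCP_RC DNS_RC
# Combines three exit codes into a single verdict.
classify_host() {
  local icmp=$1 tcp=$2 dns=$3
  if [ "$icmp" -eq 0 ] && [ "$tcp" -eq 0 ] && [ "$dns" -eq 0 ]; then
    echo "reachable"
  elif [ "$icmp" -ne 0 ] && [ "$tcp" -ne 0 ] && [ "$dns" -ne 0 ]; then
    echo "unreachable"   # every method fails: treat as a real outage
  else
    echo "partial"       # mixed results: suspect a service or a filter, not the host
  fi
}

# check_host HOST: run all three probes, then classify.
check_host() {
  local host=$1 icmp=0 tcp=0 dns=0
  probe_icmp "$host" || icmp=$?
  probe_tcp "$host" 53 || tcp=$?
  probe_dns "$host" || dns=$?
  classify_host "$icmp" "$tcp" "$dns"
}
```

Calling check_host with the fallback host's address would print one of the three verdicts. Only "unreachable" corresponds to the situation described above, where all three methods timed out identically.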

That host turned out to be alive. Console access confirmed the OS was up, the interface had the right IP, link state UP. But it couldn't ping its own gateway, in either direction, while staying reachable over its own Tailscale connection. Alive on the tailnet, invisible on the LAN it was physically sitting on.

The routing trap

That split is a specific, well-known Tailscale failure mode. The host had previously been this network's subnet router, advertising the LAN range it physically lived on, and it also had route-accepting enabled (RouteAll: true). The firewall, mid-migration to taking over the subnet-router role, then started advertising the same subnet. With route-accepting on, this host installed a policy route sending its own LAN traffic, including packets to its own gateway, into the Tailscale interface instead of out its physical NIC. The tunnel only delivers to tailnet peers, not to "the LAN I'm bolted to," so every packet got silently dropped.

sudo tailscale set --advertise-routes=

sudo tailscale set --accept-routes=false

Confirmed with a route-table check before and after: the self-routing entry disappeared, the gateway ping went from 100% loss to 0%, and DNS/HTTP/SSH from the LAN all came back immediately.
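One way to make that before-and-after route check repeatable on a Linux host is to ask the kernel which interface it would use to reach the gateway and flag the answer if it is a Tailscale interface. The sketch below assumes iproute2's ip route get and the default tailscale0 interface name; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume Linux with iproute2 and an interface named tailscale*.

# route_dev: read `ip route get <addr>` output on stdin, print the outgoing interface.
route_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# is_tunnel_trapped: succeed (0) when the route on stdin leaves via a Tailscale
# interface, meaning local LAN traffic would be sent into the tunnel.
is_tunnel_trapped() {
  local dev
  dev=$(route_dev)
  case "$dev" in
    tailscale*) return 0 ;;
    *)          return 1 ;;
  esac
}
```

Usage would look like `ip route get "$GATEWAY_IP" | is_tunnel_trapped && echo "LAN gateway is routed into the tunnel"`, where GATEWAY_IP is a placeholder for the LAN gateway's address. Run before and after the two tailscale set commands, the result should flip from trapped to not trapped.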

The lesson generalizes past this one box: any device physically on a LAN should not accept Tailscale subnet routes for that same LAN, regardless of which machine is advertising them. The moment anyone advertises a subnet, every tailnet member sitting on it with route-accepting enabled risks routing its own local traffic into a tunnel that can't deliver it.

The bug in the gate itself

With that resolved, preflight still reported a second blocker: "cannot read DHCP config via API," even though the same endpoint answered fine from a plain curl. The difference: the shared API helper sent Content-Type: application/json on every request, including bodyless GETs. When that header is present, pfSense's REST API looks for request parameters such as id in the JSON body rather than the query string, finds nothing, and fails with MODEL_REQUIRES_ID, even though ?id=lan in the query string works fine without the header.

curl -H "Content-Type: application/json" "$URL/api/v2/services/dhcp_server?id=lan" # 400, MODEL_REQUIRES_ID

curl "$URL/api/v2/services/dhcp_server?id=lan" # 200, full config

One-line fix: only attach Content-Type when there's an actual JSON payload, on PATCH and POST, never on a plain GET. Reverified across several runs before trusting it again. With both issues resolved, the now-working DHCP read also surfaced the real misconfiguration this project exists to fix: the firewall's DHCP scope was handing clients a public resolver plus the now-unreachable fallback host, never its own resolver at all.
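A helper with that fix applied might look roughly like the sketch below. It is not the toolkit's actual helper: the function name api_request, the CURL override used for testing, the PFSENSE_API_KEY variable, and the X-API-Key header are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical API helper; names and auth header are assumptions, not the toolkit's code.

# api_request METHOD URL [JSON_BODY]
# Attaches Content-Type only when there is a JSON body to send.
api_request() {
  local method=$1 url=$2 body=${3-}
  local -a args=(-sS -X "$method" -H "X-API-Key: ${PFSENSE_API_KEY:-}")
  if [ -n "$body" ]; then
    # Only requests with a payload (PATCH, POST) declare a JSON content type.
    args+=(-H 'Content-Type: application/json' --data "$body")
  fi
  # CURL can be overridden with a stub to inspect the arguments without a network call.
  "${CURL:-curl}" "${args[@]}" "$url"
}
```

A bodyless call such as `api_request GET "$URL/api/v2/services/dhcp_server?id=lan"` then goes out with no Content-Type header, matching the plain curl that returned 200.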

Clean GO, then the real cutover

Re-running preflight: six sections green, zero blockers, two informational notes describing exactly what the apply step would change. Then the actual write, gated behind an explicit flag:

./apply-netgate-dns-policy.sh --dhcp-only --confirm --skip-unchanged

[OK] DHCP lan -> dnsserver <firewall-ip> only

[OK] DHCP server applied

[OK] DHCP lan dnsserver verified: <firewall-ip>

Verified against a live client immediately after, not just trusted on the API's word. DNS server, internal hostname resolution, and public resolution all checked on a real machine:

networksetup -getdnsservers Wi-Fi # firewall-ip only

nslookup internal-host.lab.example.net # resolves correctly

nslookup google.com # public resolution still works

All three passed. The firewall is now the DNS server DHCP hands out, internal hostnames resolve through it, and public resolution still works.
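For a repeatable version of the client-side check on macOS, the resolver list can also be read from scutil --dns, which reports the resolvers the system is actually using, whereas networksetup -getdnsservers, as far as I know, reflects manually configured servers. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it treats any second resolver address as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; parse `scutil --dns` output supplied on stdin.

# dns_servers: print the unique nameserver addresses, one per line.
dns_servers() {
  awk '/nameserver\[[0-9]+\]/ { print $NF }' | sort -u
}

# only_resolver_is EXPECTED: succeed only if exactly one distinct resolver
# address is in use and it equals EXPECTED.
only_resolver_is() {
  local expected=$1 servers
  servers=$(dns_servers)
  [ "$servers" = "$expected" ]
}
```

Usage would be `scutil --dns | only_resolver_is "$FIREWALL_IP"`, with FIREWALL_IP standing in for the firewall's LAN address. A client with Tailscale or a VPN active may list additional resolvers, in which case this strict check fails by design and the output of dns_servers shows why.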

The network after

LAN and remote Tailscale clients get DHCP and DNS from the firewall only. The firewall's Unbound resolver handles internal host overrides and forwards everything else upstream. Pi-hole is demoted to ad-blocking only, off the DNS authority path.

One device decides routing, DHCP, and DNS policy. The DNS box that used to make that decision quietly, alongside Tailscale and a hardcoded public resolver, now does exactly one job.

What's next

○ Credential hygiene. A dedicated least-privilege API user, an access list scoping which addresses can call the API, key rotation if a key has ever touched a place it shouldn't have.

○ Closing the DNS bypass path. A firewall rule blocking outbound DNS to anything but the firewall itself, so a device can't quietly opt out of DHCP policy. Not live yet; easy to write, easy to lock yourself out with.

○ Static DHCP mappings and firewall aliases as code, extending the same dry-run-first pattern.

○ Config backups before any automated write.

What's staying manual on purpose: WAN-facing firewall policy stays human-reviewed, and ad-blocking stays with a dedicated box doing only that, once it's no longer also trying to be a router and a DNS authority at the same time.

Status

DHCP-only cutover is live and verified against a real client. Full-policy apply (system DNS, resolver, DHCP, and host overrides together) is dry-run clean and ready, held until a full DHCP lease-renewal cycle has been observed network-wide without intervention. The fallback DNS host stays up as a safety net until then.

Stack

pfSense API, Unbound, Bash, DHCP, DNS, Tailscale
