DNS troubleshooting playbook: diagnosing propagation, DNSSEC, and record misconfigurations


Alex Mercer
2026-05-05
23 min read

A step-by-step DNS troubleshooting playbook for propagation, DNSSEC, registrar vs. nameserver issues, and record misconfigurations.

DNS is one of those systems that feels simple until it breaks. A hostname resolves, then it does not; one region sees the new record, another still answers with the old one; a registrar change looks correct, but the site is still offline. This playbook is built for admins and developers who need a fast, structured way to isolate failures across registrars, authoritative nameservers, caches, and DNSSEC validation. If you also maintain broader operational docs, pair this guide with our notes on documentation guardrails and workflow templates, workflow automation selection, and data contracts and traceability so your DNS procedures stay repeatable and auditable.

The goal here is not to memorize every record type. The goal is to identify where the failure lives, prove it with command-line evidence, and fix the smallest thing that restores resolution. In practice, DNS incidents usually fall into four buckets: propagation and caching confusion, incorrect records or delegation, DNSSEC validation errors, and domain-registrar versus nameserver mismatches. For teams that document everything from onboarding to runbooks, this same method mirrors how strong internal guides are built in developer-friendly internal tutorials and how operational teams think about recovery planning.

1) Start with the symptom, not the tool

Determine what is actually broken

Before running dig or changing anything, define the failure in one sentence. Is the hostname not resolving at all, resolving to the wrong IP, resolving inconsistently across networks, or only failing for some resolvers? The answer tells you whether you are dealing with delegation, caching, record data, or DNSSEC. This distinction matters because a web outage may look like “DNS down” when the real issue is a broken A record, an expired glue record, or a bad registrar nameserver entry.

Start by asking four questions: does the domain resolve from a public resolver, does it resolve from the authoritative server, does it resolve for IPv4 and IPv6, and does it fail only for validating resolvers. If you need a broader mental model for troubleshooting under uncertainty, the same habit shows up in scenario analysis and in end-to-end validation workflows: isolate variables, test one layer at a time, and avoid assuming the visible symptom is the root cause.
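
As a first pass, those four questions map onto four quick queries. A minimal sketch, treating example.com as the affected name and ns1.yourdnsprovider.net as a placeholder for one of its authoritative servers:

dig @1.1.1.1 example.com A +short                    # public resolver view
dig @ns1.yourdnsprovider.net example.com A +short    # authoritative view
dig @1.1.1.1 example.com AAAA +short                 # IPv6 answer, often forgotten
dig @8.8.8.8 example.com A +dnssec                   # validating path; a broken chain tends to show up as SERVFAIL here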

Identify the affected lookup path

A DNS lookup can fail in several places: the recursive resolver can have a stale cache, the authoritative server can serve the wrong zone, the registrar may point to old nameservers, or the DNSSEC chain can fail validation. A disciplined first pass is to compare answers from multiple resolvers, such as your local resolver, 1.1.1.1, and 8.8.8.8. If one resolver sees the right answer and another does not, propagation or cache behavior is usually involved; if none of them can find the name, the issue is often delegation or authoritative zone data. This same comparative approach is the backbone of testing and validation strategies and telemetry-based diagnosis.

Confirm impact and scope

Check whether the issue is universal, regional, or provider-specific. If a site works from your VPN but not from a public ISP, a recursive resolver cache or negative caching issue may be present. If one subdomain fails while the apex works, a record-specific misconfiguration is more likely than a domain-wide delegation problem. If all records under the zone fail only for DNSSEC-validating clients, the zone may be unsigned incorrectly or the DS/DNSKEY chain may be broken. For teams that want to standardize incident notes, you can model this as a checklist much like vendor diligence or source triage: capture scope first, then move to root cause.

2) Use the right command-line tools in the right order

dig is your primary verifier

dig is the fastest way to inspect record answers, authority, TTLs, and DNSSEC flags. Start with a simple query and progressively add detail. For example, dig example.com A tells you what the resolver returns, while dig @8.8.8.8 example.com A +nocmd +noall +answer +ttlid exposes TTL and answer state in a clean format. For delegation problems, query the NS records explicitly, then ask the authoritative servers directly to compare their answers.

dig example.com NS +short

dig @8.8.8.8 example.com A +noall +answer +ttlid

dig @ns1.yourdnsprovider.net example.com SOA +noall +answer

dig +trace example.com

Use +trace when you suspect a broken chain from root to authoritative nameserver. It bypasses the recursive resolver and shows each referral step, which is ideal for diagnosing registrar or delegation mistakes. If your ops stack relies on structured runbooks, this kind of repeatable command sequence resembles the discipline described in programmatic vetting workflows and measuring operational signals.

host and nslookup still have value

host is useful for quick, terse lookups when you only need the final answer. nslookup remains common on legacy systems and Windows-based workflows, though its output is less consistent than dig. In mixed environments, use the tool that is available, but prefer dig for authoritative debugging. If you need to check IPv6, remember to query AAAA explicitly, because some “site down” reports are really IPv6-only failures hidden behind a working A record.

Use these commands when you need a fast local triage pass: host example.com, host -t ns example.com, and nslookup -type=soa example.com. They are not replacements for authoritative tests, but they are helpful for confirming whether the problem appears in your resolver path or only beyond it. For teams that work across multiple systems and levels of maturity, that pragmatic layering is similar to hands-on examples: a small command often reveals the next best test.

Use whois and registrar checks for delegation clues

When DNS seems “stuck,” the problem may be outside the zone entirely. Query the registrar or WHOIS data to confirm the current nameserver set, expiration status, and domain lock state. If the registrar still points to old nameservers, the correct zone contents will never matter until delegation is fixed. If the domain is expired or in redemption, some registrars may still show records while public resolution has degraded or stopped.
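
A minimal registrar-side check, assuming a standard whois client is installed and example.com stands in for your domain (field names vary by registry and registrar):

whois example.com | grep -iE 'name server|status|expiry'
dig example.com NS +short    # compare the registrar's NS set with what actually resolves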

Think of registrar data as the routing table for your domain: it tells the world where to ask next. If you are standardizing this type of operational verification, the same due-diligence mindset used in procurement checklists and live risk feeds applies well here: verify the control plane before editing the payload.

3) Diagnose propagation and TTL behavior correctly

Propagation is really cache expiry plus delegation updates

DNS “propagation” is often misunderstood as a broadcast that slowly reaches everyone. In reality, once you change a record at the authoritative source, the new value becomes available immediately there, but recursive resolvers may continue serving cached responses until TTL expires. If the change also involves nameserver delegation at the registrar, then some resolvers may still query old and new authoritative servers during a transition window. The phrase propagation is useful shorthand, but operationally you should think in terms of cache lifetime, stale answers, and referral consistency.

This is where TTL becomes critical. A record with a 3600-second TTL can remain cached for up to an hour after the authoritative change, and an 86400-second TTL can take a day to settle. If you are planning a migration, lower TTL values well in advance, not minutes before the change. That planning mentality is similar to the sequencing used in route disruption planning and timing-sensitive optimization: prepare the system before you need the transition.
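
To see how much cache tail you are working with, read the TTL column from the authoritative server and from a public resolver; the authoritative value is the configured TTL, while the resolver's value counts down toward expiry. A quick sketch with the same placeholder names as above:

dig @ns1.yourdnsprovider.net example.com A +noall +answer    # configured TTL
dig @8.8.8.8 example.com A +noall +answer                    # seconds remaining in that resolver's cache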

Negative caching can mislead you

If a name did not exist when a resolver last asked, the resolver may cache the NXDOMAIN response for the zone’s negative caching TTL, which is often derived from the SOA record. That means creating a new record does not guarantee instant visibility to everyone. People testing from the same network may repeatedly see “it still does not exist” even after the record is live, because the resolver is honoring a cached negative answer. To debug this, query authoritative servers directly and compare that result with a public resolver.
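
One hedged way to estimate the negative-caching window is to read the zone's SOA, since the window is commonly derived from the smaller of the SOA record's TTL and its final (minimum) field:

dig @ns1.yourdnsprovider.net example.com SOA +noall +answer    # last field is the negative-caching value
dig @ns1.yourdnsprovider.net new.example.com A +noall +answer  # confirm the new name exists at the source (new.example.com is a placeholder)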

When a new subdomain is added during an incident, negative caching can make it look like the change failed when it actually succeeded. This is why a change window should include a verification matrix: authoritative lookup, public recursive lookup, and client-side lookup. The idea resembles streaming analytics and auditable data foundations—don’t trust one metric if another layer may be caching stale state.

Practical TTL strategy for migrations

For routine changes, lower the TTL of affected records 24 to 48 hours before the migration. This helps shrink the cache tail and reduces the time users are exposed to stale values after cutover. Then make the actual DNS update, confirm the authoritative answer, and watch resolver responses over time. After the migration stabilizes, raise TTL back to a value that balances agility and query volume. Extremely low TTLs can increase resolver load and make troubleshooting noisier, so avoid leaving emergency TTLs in place indefinitely.

Problem pattern | Most likely layer | Best first command | What to look for
New record not visible everywhere | Recursive cache / negative cache | dig @8.8.8.8 name A | TTL still high or NXDOMAIN cached
Domain resolves to old IP | Resolver cache | dig +trace name A | Authoritative answer differs from public resolver
Nothing resolves at all | Delegation / registrar | dig +trace domain NS | Broken referral or wrong nameservers
Only validating clients fail | DNSSEC | dig +dnssec name A | RRSIG/DS mismatch or bogus chain
Subdomain fails, apex works | Record misconfiguration | dig sub.domain A | Missing zone entry, wrong CNAME, or typo

4) Hunt record misconfigurations systematically

Check A, AAAA, CNAME, and MX records separately

A zone can look healthy while a single record type is broken. Start with A and AAAA for web traffic, then MX for mail, then CNAME chains for aliases and service endpoints. A common mistake is creating a CNAME at the apex, which many DNS systems do not support because the apex already needs SOA and NS records. Another frequent issue is a CNAME pointing to a hostname that itself has no A or AAAA record. If mail is failing, compare MX targets and make sure their hostnames resolve cleanly.
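
A per-type pass keeps those checks separate. In the sketch below, example.com, www, and mail.example.com are placeholders; substitute whatever your MX records actually point to:

dig example.com A +short
dig example.com AAAA +short
dig www.example.com CNAME +short
dig example.com MX +short
dig mail.example.com A +short    # every MX target must itself resolve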

For web hosting specifically, verify that the hostname used by the application matches the record in DNS and the certificate on the origin. A DNS change can be technically correct but still fail if the app expects a different host header or the CDN expects a different origin configuration. This sort of service-chain dependency is also why teams benefit from guides like policy-impact documentation and workflow onboarding guides: the visible layer is only part of the system.

Watch for typos, trailing dots, and zone cuts

Simple spelling mistakes are still a leading cause of DNS incidents. A missing character in a hostname, a wrong target in a CNAME, or a forgotten trailing dot in zone-file style configurations can redirect lookups unexpectedly. In zone files, an omitted trailing dot often turns an intended absolute name into a relative one, causing the DNS software to append the zone origin and create a broken target. Likewise, zone cuts between parent and child zones can hide a record if you update the wrong side of the delegation boundary.
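
As an illustration of the trailing-dot trap, here is a hypothetical BIND-style zone fragment; the hostnames are made up, and behavior may differ in provider UIs that manage dots for you:

www   IN  CNAME  cdn.provider-example.net.   ; absolute target, resolves as intended
www   IN  CNAME  cdn.provider-example.net    ; missing dot: expands to cdn.provider-example.net.example.com. and breaks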

Use a checklist: compare the intended record against the actual zone file, verify the record class and type, confirm the exact owner name, and test the target hostname independently. If you manage multiple environments, create a diff-based review process so changes are auditable. That habit aligns with how teams approach productivity measurement.

Zone file and provider UI mismatches

Many outages happen because a provider’s web console and exported zone file do not represent the same configuration state. A UI may hide apex flattening, automatic CNAME handling, or records created by an external API. Always confirm the authoritative state as served on the wire, not just what the dashboard displays. If your DNS provider supports API changes, keep the source of truth in version control and compare the rendered zone with live answers regularly.

This principle is especially important during migrations from one DNS vendor to another. The nameserver switch can appear complete in the registrar while the new zone is missing a critical record or contains a stale TTL. To avoid silent failures, test each crucial hostname from the authoritative server before changing delegation. It is the same “prove before promote” approach seen in internal training design and readiness planning.
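
One way to prove before promoting is to query every crucial hostname against the incoming provider's nameserver while the old delegation is still live. A minimal sketch, where ns1.newprovider-example.net and the hostnames are placeholders for your own values:

for name in example.com www.example.com mail.example.com; do
  dig @ns1.newprovider-example.net "$name" A +noall +answer
done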

5) Debug DNSSEC without guessing

Know the moving parts: DNSKEY, DS, RRSIG, and chain of trust

DNSSEC introduces cryptographic signatures that let resolvers validate authenticity. The zone publishes DNSKEY records; the parent zone publishes DS records that point to the child’s key; and RRSIG records sign the responses. If any link in that chain breaks, validating resolvers may return SERVFAIL even when non-validating resolvers still answer. That makes DNSSEC failures confusing unless you intentionally test both validating and non-validating paths.

Start with dig +dnssec example.com or dig +multi +dnssec example.com SOA to inspect signatures and flags. If the zone was recently re-signed, check whether the DS record at the registrar matches the current DNSKEY. If a key rollover was not completed cleanly, validation can fail for some resolvers while others still cache older state. This pattern is similar to staged cutovers in hardware transitions: each step must be synchronized or the end system rejects the result.
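
To check that agreement directly, compare the key tag the parent publishes in DS with the keys the child zone serves. A sketch with placeholder names; delv is BIND's validation debugger and may not be installed everywhere:

dig example.com DS +short        # parent side: key tag, algorithm, digest
dig example.com DNSKEY +multi    # child side: key ids appear in the comments
delv example.com A +vtrace       # optional: walks the validation chain step by step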

Common DNSSEC failure patterns

The most common DNSSEC mistakes are stale DS records at the parent, expired RRSIGs on the child zone, and mismatched algorithm or key tags. Another frequent issue is enabling DNSSEC at the registrar before the child zone is actually signed, or disabling signing without removing the DS record. Either mistake breaks validation because the parent and child no longer agree about trust. Also remember that signed zones require correct NSEC/NSEC3 behavior, so malformed auto-generated responses from some providers can trigger errors only under validation.

When troubleshooting, compare validating and non-validating queries side by side. If a non-validating resolver returns data but a validating resolver returns SERVFAIL, you are almost certainly dealing with a DNSSEC chain problem rather than a simple record typo. For teams used to troubleshooting complex systems, this is like separating “service is down” from “policy blocks access,” a distinction emphasized in risk-aware monitoring and automation strategy.
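
dig's checking-disabled flag gives you both views through the same resolver, which keeps the comparison clean:

dig @8.8.8.8 example.com A        # validation on: a bogus chain returns SERVFAIL
dig @8.8.8.8 example.com A +cd    # checking disabled: returns the data if it exists at all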

Safe operational steps for DNSSEC changes

Do not rotate keys casually without a plan. Lower TTLs before a rollover, confirm that the new DNSKEY is published, update the DS record at the parent if needed, and verify validation from multiple public resolvers. After the rollover, remove stale keys only when you are sure old caches have aged out. If you are unsure whether DNSSEC is the culprit, temporarily testing without validation is a diagnostic step, not a permanent fix.

Pro Tip: If a domain works on your laptop but fails for customers, test one resolver that validates DNSSEC and one that does not. A split result is one of the clearest signals that the chain of trust is broken rather than the record data itself.

6) Separate registrar problems from nameserver problems

Registrar controls delegation, nameservers serve the zone

The registrar decides which nameservers the parent zone points to, but the authoritative nameservers store and serve the actual DNS records. If the registrar has the wrong NS set, all the correct zone data in the world is irrelevant because resolvers will ask the wrong place. If the registrar is correct but the authoritative zone is missing or stale, the delegation works and the zone data does not. This distinction is the difference between control plane and data plane, and it is essential for fast incident response.

Use dig +trace to see where the chain stops. If the trace reaches the parent but lands on dead or unexpected nameservers, fix registrar delegation. If the trace reaches the proper authoritative servers but the answers are wrong, fix the zone contents. This division of responsibility is similar to how you would isolate whether a problem is in infrastructure or application logic, a pattern reflected in long-term build workflows and growth-stage automation decisions.

Glue records and in-bailiwick nameservers

If your nameservers are inside the same domain they serve, glue records become part of the story. For example, if ns1.example.com is itself a nameserver for example.com, the parent zone must provide an address so resolvers can reach the nameserver before they can resolve the domain. Missing or stale glue can create resolution loops or dead ends. This is especially common during migrations or when moving from one provider to another without updating both the nameserver and glue layers.

Verify that the registrar shows the correct NS set and glue IPs, and confirm that those IPs answer authoritatively from the internet. Do not assume that because the registrar UI says “updated” the changes are live everywhere. In practice, you should validate this the way high-discipline teams validate supply-chain or vendor transitions, like the methods in enterprise risk playbooks and recovery planning guides.
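
To see the delegation and glue exactly as the parent serves them, ask a TLD server directly with recursion off. For a .com domain the sketch looks like this (a.gtld-servers.net is one of the .com servers; example.com is a placeholder):

dig @a.gtld-servers.net example.com NS +norecurse
# AUTHORITY lists the delegated NS set; ADDITIONAL carries any glue addresses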

7) Build a step-by-step diagnostic flow you can reuse

Flow 1: domain not resolving

When a domain does not resolve at all, start with a trace. Use dig +trace example.com and check whether the root referrals, TLD referrals, and authoritative nameservers all appear. If the trace fails at the TLD or parent step, the registrar delegation may be wrong or the domain may be expired. If the trace reaches authority but returns no answer, the zone may be empty, misconfigured, or signed incorrectly. If the trace works but your recursive resolver does not, cache or filtering is the likely issue.

Once you have the first failing hop, stop and focus there. Chasing the web server, the CDN, or the application before proving DNS is correct usually wastes time. This is the same triage principle found in systematic vetting and auditable infrastructure: identify the first broken link, not the loudest symptom.

Flow 2: changes not visible after a DNS update

If you changed a record and some users still see the old value, compare authoritative and recursive answers. Query the authoritative nameserver directly and then query several public resolvers. If authoritative shows the new record, the problem is almost certainly cache-related. If authoritative still shows the old record, the update either did not land or landed on the wrong zone/provider. Also check whether the record’s TTL was previously very high; that determines how long stale data may persist.
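
Comparing SOA serials is a quick way to confirm whether the update landed on the zone the world is actually using; if the authoritative serial never changed, the edit went to the wrong zone or provider:

dig @ns1.yourdnsprovider.net example.com SOA +noall +answer    # serial on the server you believe you edited
dig @1.1.1.1 example.com SOA +noall +answer                    # serial a public resolver is currently serving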

For edge cases, look for multi-layer caching. A CDN, reverse proxy, local OS resolver cache, browser cache, and recursive DNS cache can all mask each other. You may be changing DNS correctly while an adjacent layer continues to serve old behavior. For organizations that care about repeatability, the lesson is the same as in analytics baselines and operational control loops: know where each cache sits and how long it lasts.
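
Client-side caches can at least be cleared while you test. The exact command depends on the operating system; common examples include:

ipconfig /flushdns                                              # Windows
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder   # macOS
sudo resolvectl flush-caches                                    # Linux with systemd-resolved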

Flow 3: SERVFAIL on validating resolvers

If users report intermittent SERVFAIL, check DNSSEC first. Compare a validating resolver and a non-validating resolver for the same query. If the non-validating answer succeeds but validation fails, inspect DS and DNSKEY consistency, key expiration, and signature timing. If signatures are expired, re-sign the zone; if DS mismatches the key, update the parent or restore the previous key. Do not waste time editing A records until the validation chain is healthy.

If you need to explain this to non-DNS specialists, describe it as a trust failure rather than a data failure. The records may be present, but the resolver refuses to accept them as authentic. That distinction is important in any environment where security controls can look like outages, a theme echoed in policy-change analysis and signal authenticity checks.

8) Prevent repeat incidents with better DNS operations

Keep a DNS change checklist

A strong change checklist prevents most DNS incidents. At minimum, require: exact hostname, record type, intended target, TTL, authoritative verification, registrar delegation check, rollback plan, and validation against DNSSEC if enabled. For changes involving nameserver moves, add glue record verification, DS record review, and a confirmation window for cache expiry. If your team uses tickets or runbooks, store the checklist alongside the change so future responders can compare the intended and actual state.

Consider version-controlling zone files or exporting provider configuration on a schedule. This gives you a known-good baseline when a provider UI changes or when a colleague asks what was altered last week. The same discipline underlies guides like organization systems and compliance traceability: if you cannot reconstruct the change, you cannot trust the environment.

Standardize low-risk migration patterns

For any planned migration, lower TTLs early, verify the new zone before switching delegation, monitor public resolvers during the cutover, and keep the old provider active until caches have had time to age out. Make the migration plan explicit about which hostname owners need to sign off and what success looks like. A clean migration is not just a “DNS change”; it is a controlled switchover across recursive caches, authoritative servers, and any dependent services such as CDNs, mail, and SSO.

Teams that standardize this process reduce time spent on repeat incidents and increase confidence in future changes. This is the operational equivalent of the planning frameworks used in scenario planning and disruption handling: predictable systems require predictable steps.

Monitor the signals that matter

Set up lightweight checks for authoritative availability, recursive resolution, DNSSEC validation, and name-specific health for your most important domains. Alert on unexpected NS changes, expired domains, and SOA serial anomalies if you operate your own DNS. For cloud-managed DNS, include provider status pages and API health in your observability runbook. You do not need a massive monitoring stack to catch most DNS issues; you need the right few signals and a response path that is documented and tested.
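
Even a small cron-driven script covers the basics. The sketch below is illustrative only; the domain and nameserver names are placeholders, and a real check would feed an alerting system rather than echo to stdout:

#!/bin/sh
# alert if an authoritative server stops answering or public resolution fails
for ns in ns1.yourdnsprovider.net ns2.yourdnsprovider.net; do
  dig @"$ns" example.com SOA +noall +answer | grep -q SOA || echo "ALERT: $ns not answering for example.com"
done
dig @8.8.8.8 example.com A +short | grep -q . || echo "ALERT: public resolution failing for example.com"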

Operationally, this is the same mindset as tracking performance and risk in other domains: choose a handful of metrics that map directly to user impact. For broader thinking on that principle, see KPI design, interval baselines, and risk reporting. In DNS, the metrics are resolution success, validation success, and authoritative correctness.

9) Quick reference: what to check first by symptom

Symptom-to-test mapping

Use this as the first five minutes of any DNS incident. If the domain does not resolve, trace delegation. If the record is stale, compare authoritative versus recursive responses. If only secure resolvers fail, inspect DNSSEC. If email fails, check MX and the target hostnames. If only one region is impacted, suspect resolver caches or network policy rather than a zone-wide failure.

The fastest responders treat DNS like a layered system, not a magical black box. That approach saves hours because it narrows the search immediately. When the team has a written playbook, every future incident becomes easier to solve than the last one.

10) FAQ and closing guidance

DNS problems are rarely solved by random edits. They are solved by understanding which layer owns the truth, which layer is caching an old truth, and which layer refuses to trust the truth you published. When you diagnose systematically, you can move from “the site is down” to “the registrar still points at old nameservers” or “the DS record is stale” in minutes instead of hours. For more operationally grounded reading, review our guides on network reliability, end-to-end transitions, and incident response sequencing.

FAQ: DNS troubleshooting playbook

How do I tell if the problem is propagation or a bad record?

Query the authoritative nameserver directly and compare that answer with a public resolver. If the authoritative server shows the new data but public resolvers do not, you are dealing with cache lifetime or negative caching. If the authoritative server still shows the wrong data, the record itself is misconfigured or the change was applied to the wrong zone. This is the single most useful distinction in DNS troubleshooting.

Why does my domain work for me but not for customers?

Your local resolver may have a different cache state than your customers’ resolvers. You may also be behind a VPN, corporate DNS filter, or ISP resolver that caches differently. Check multiple public resolvers and a validating resolver to see whether the failure is universal or path-specific. Regional differences usually point to cache behavior, delegation inconsistency, or resolver policy.

What is the most common DNSSEC mistake?

The most common mistake is leaving a DS record at the parent that no longer matches the child zone’s DNSKEY. Another frequent issue is forgetting to re-sign records after a key rollover. DNSSEC failures often appear as SERVFAIL on validating resolvers while non-validating resolvers still work, which is why side-by-side testing matters.

Can a CNAME be used at the root of a domain?

Usually not in standard DNS because the apex must host SOA and NS records. Some providers offer alias or ANAME-style flattening to simulate this behavior, but that is provider-specific. If you try to place a plain CNAME at the apex in a system that does not support it, resolution can fail or the provider may silently replace it with another mechanism.

How long should I wait after a DNS change before escalating?

It depends on the TTLs involved. If the previous TTL was 300 seconds, many clients should pick up the change within minutes after caches expire. If the TTL was 86400 seconds, stale answers can last much longer. Always check the live TTL on the old record before making assumptions about how quickly the internet should converge.

What command is best for beginners?

dig +trace is the best all-around starting point because it shows the delegation path from root to authoritative server. If you only need a quick answer, use dig name type. If you suspect DNSSEC, add +dnssec. The important habit is not the exact syntax, but the sequence: trace, compare, verify authority, then inspect caching and signing.


Related Topics

#dns #networking #troubleshooting

Alex Mercer

Senior Technical Editor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
