Troubleshooting Common DNS and Domain Issues: A Practical Checklist for IT Admins
A practical DNS troubleshooting checklist for IT admins: records, TTL, propagation, delegation, DNSSEC, and diagnostics.
Why DNS Troubleshooting Still Breaks “Simple” Launches
DNS failures are deceptively small problems with outsized blast radius. A single miswired record can take down a website, break email delivery, stall SaaS verifications, or keep a new app environment invisible to users. That is why experienced admins treat DNS troubleshooting like a checklist-driven incident response process rather than a guessing game. If you need a broader operational mindset for recurring issues, see our guide on effective audit techniques for small DevOps teams and the broader approach in post-mortem resilience practices.
In practice, most DNS incidents fall into a few predictable buckets: bad record values, stale caching, delegation mistakes, DNSSEC validation errors, or mismatches between record type and target behavior. The fastest way to restore service is to isolate which layer is failing: the authoritative zone, recursive resolvers, network path, browser cache, or the application itself. This guide walks through each layer in order, with commands, checks, and examples you can use on live systems.
For teams maintaining websites and SaaS properties, this is part of the same discipline as technical operations elsewhere. The same rigor that helps with technical SEO at scale and privacy-safe operational workflows also helps prevent DNS mistakes from turning into user-facing outages.
Start With the Symptom: What Exactly Is Broken?
Map the failure to a DNS layer
Before you query anything, identify the symptom carefully. “The site is down” could mean the domain does not resolve, resolves to the wrong IP, resolves only for some users, or loads in some browsers but not others. These are different failures with different causes. DNS troubleshooting gets much easier when you separate resolution issues from hosting, SSL, and application problems.
A good triage pattern is to ask: does the authoritative answer exist, does recursive resolution return it, and does the browser actually use it? If the site is reachable by IP but not by hostname, you are probably dealing with records, delegation, caching, or DNSSEC. If it resolves everywhere except one office, VPN, or ISP, suspect cache or resolver inconsistencies. If the browser is the only place failing, look at browser DNS cache, HSTS, or mixed-content behavior, not only the zone file.
Confirm the hostname, zone, and expected target
Administrators often discover the real issue is not DNS at all, but ambiguity in the request. Make sure you know the exact hostname, the expected record type, the provider that hosts the zone, and the intended target. “Root domain” and “www” often point to different infrastructure, and the wrong assumption can waste hours.
Documenting the expected state matters, especially in multi-team environments. If you are standardizing internal runbooks, pair this checklist with your documentation habits from behind-the-scenes documentation practices and audit cadence planning so the fix does not vanish after the incident. The best DNS runbooks list the hostname, provider, record type, TTL, and rollback path in one place.
Use a known-good resolver for baseline testing
Start with a public resolver such as 1.1.1.1 or 8.8.8.8 to separate authoritative DNS problems from local resolver issues. If a public resolver returns the right answer and your corporate resolver does not, the issue is likely cache, filtering, split-horizon DNS, or forwarding. If both fail, focus on the zone and delegation. If one resolver works and another doesn’t, compare TTLs and caching timeframes before changing the record again.
Pro Tip: Always test with at least two resolvers and one direct authoritative lookup. That three-point comparison catches most cache and delegation problems faster than browsing the zone editor.
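A minimal sketch of that three-point check, assuming a Unix shell with dig installed; example.com stands in for your hostname:

```bash
# Three-point comparison: two public resolvers plus one authoritative server.
HOST=example.com
dig +short @1.1.1.1 "$HOST" A            # public resolver 1
dig +short @8.8.8.8 "$HOST" A            # public resolver 2
NS=$(dig +short "$HOST" NS | head -n1)   # first authoritative nameserver
dig +short +norecurse @"$NS" "$HOST" A   # direct authoritative answer
```

If the authoritative answer is right but a public resolver disagrees, you are waiting out cache. If all three disagree with what the zone editor shows, the change never landed where you thought it did.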
Verify the DNS Records First: Type, Value, and Intent
A and AAAA records: confirm the destination IP
For simple web hosting, A records map to IPv4 addresses and AAAA records map to IPv6 addresses. A very common failure is updating only one of them while the environment expects both, or leaving an old AAAA record in place after moving hosting. If IPv6 is enabled at the edge but misconfigured behind the load balancer, some clients may fail while others succeed, which looks random until you test both address families.
Use dig example.com A and dig example.com AAAA to compare answers. If the A record points to the correct host but AAAA points to an obsolete address, clients that prefer IPv6 may fail first. For hosting migrations, treat the A/AAAA pair as a linked configuration item, not two independent entries.
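A quick way to compare both address families side by side, with example.com as a placeholder:

```bash
# Compare IPv4 and IPv6 answers for the same hostname.
dig +noall +answer example.com A
dig +noall +answer example.com AAAA
# A correct A record next to a stale AAAA record explains "random" failures:
# clients that prefer IPv6 hit the obsolete address first.
```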
CNAME vs A record: know what each can and cannot do
The CNAME versus A decision is one of the most common DNS mistakes during site launches and platform changes. A CNAME points a hostname to another hostname, while an A record points directly to an IP address. Root/apex domains traditionally cannot be CNAMEs in standard DNS, which means you usually need an A record, ALIAS/ANAME support, or a provider-specific apex flattening feature.
Use a CNAME for services like www that must follow a vendor hostname, and use A/AAAA when you control the destination IP directly. If a SaaS platform asks you to “point the domain to us,” verify whether it expects a CNAME or an A record. Mixing them up is a classic cause of validation failures and broken redirects. For more infrastructure context, the playbook in NextDNS at scale is useful when you need to understand resolver-side behavior too.
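To see which mechanism a hostname actually uses, query both record types directly; a sketch with placeholder names:

```bash
dig +short www.example.com CNAME   # non-empty: www follows a vendor hostname
dig +short example.com CNAME       # should be empty; the apex needs A/AAAA
dig +short example.com A           # or provider-side ALIAS/ANAME flattening
```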
MX, TXT, and verification records: do not overlook the non-web paths
Even when your website looks fine, mail or verification workflows can still fail. MX records control email delivery, while TXT records often carry SPF, DKIM, DMARC, and domain ownership verifications. These records are frequently edited during migrations and then accidentally truncated, duplicated, or placed at the wrong hostname. A missing TXT value can cause onboarding to fail in SaaS platforms long before users notice the website issue.
Check for accidental quote changes, extra spaces, and record collisions. A TXT record used for domain verification should be preserved when adding a second TXT for SPF or a new vendor. If your org has many service integrations, a disciplined vendor inventory approach like the one in vendor due diligence for analytics helps prevent these hidden dependencies from surprising you later.
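Checking the non-web paths takes seconds and is worth doing in the same pass. A sketch, noting that selector1 is a hypothetical DKIM selector name; the real selector varies by mail vendor:

```bash
dig +short example.com MX                         # mail routing: priority and host
dig +short example.com TXT                        # SPF and verification strings
dig +short _dmarc.example.com TXT                 # DMARC policy record
dig +short selector1._domainkey.example.com TXT   # DKIM key (selector varies by vendor)
```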
Propagation, TTL, and Caching: Why “It’s Fixed” Still Looks Broken
What DNS propagation really means
“Propagation” is usually shorthand for recursive resolvers refreshing cached data. DNS changes are often visible immediately at the authoritative server, but users still see the old answer until caches expire. That is why propagation can appear inconsistent across regions, ISPs, and devices. There is no single global switch that flips all at once.
The practical question is not “has DNS propagated?” but “which resolvers still cache the old answer, and for how long?” Confirm the TTL on the old record before making changes, because that TTL dictates how long stale answers may persist. If you lowered TTL too late, the old value may remain in resolvers until the previous TTL window ends. This is why change planning matters as much as the fix itself.
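Reading the TTL column before you touch anything tells you how long stale answers can persist; a sketch assuming dig:

```bash
# The second column of the answer section is the TTL in seconds.
dig +noall +answer example.com A          # via your default resolver: remaining
                                          # cache time, counting down per query
NS=$(dig +short example.com NS | head -n1)
dig +noall +answer @"$NS" example.com A   # via the authoritative server:
                                          # the configured TTL itself
```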
How TTL shapes rollback speed and outage impact
TTL, or time to live, tells resolvers how long to cache a record. Lower TTLs make changes spread faster, but they also increase query volume and can expose flapping configurations more quickly. Higher TTLs reduce lookup traffic but make reversals slower during incidents. The right TTL depends on how often a hostname changes and how critical a rapid rollback is.
A useful operational pattern is to lower TTL 24 to 48 hours before a planned migration, then change the record, verify, and restore a more stable TTL afterward. Common web records often sit in the 300 to 3600 second range, though some providers or CDNs recommend different values. If your team also manages content or delivery infrastructure, lessons from delivery reliability and container design may sound unrelated, but the operational theme is the same: the delivery layer affects the user experience more than people expect.
How to test cache behavior in the wild
Use multiple viewpoints. Query public resolvers, corporate resolvers, and if possible a host on another network. Then compare the authoritative answer to the recursive answer. If the authoritative server already shows the new record but recursive resolvers do not, you are dealing with caching, not a bad zone entry. If the authoritative server still shows the old value, the update did not land where you thought it did.
Use dig @authoritative-ns example.com to bypass recursion and compare directly. Add the +trace option when you suspect delegation problems, because it exposes each hop from the root to the authoritative server. You can also test with browser-based DNS checkers, but command-line results are usually more trustworthy for incident work.
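A small comparison script makes the cache-versus-zone question explicit. This is a sketch assuming bash and a hypothetical example.com, not a definitive tool:

```bash
#!/usr/bin/env bash
# Does the recursive answer match the authoritative one yet?
HOST=example.com
NS=$(dig +short "$HOST" NS | head -n1)
auth=$(dig +short @"$NS" "$HOST" A | sort)
cached=$(dig +short @1.1.1.1 "$HOST" A | sort)
if [ "$auth" = "$cached" ]; then
  echo "converged: resolver matches authoritative"
else
  printf 'still caching old data\nauthoritative:\n%s\ncached:\n%s\n' "$auth" "$cached"
fi
```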
| Issue area | Typical symptom | Best first command | What “good” looks like |
|---|---|---|---|
| Wrong A record | Site resolves to old IP | dig example.com A | Expected IP returned |
| Stale cache | Some users see old site | dig @1.1.1.1 example.com | Matches authoritative answer after TTL |
| Delegation issue | Domain does not resolve at all | dig +trace example.com | Trace reaches correct NS set |
| DNSSEC failure | SERVFAIL on validating resolvers | dig +dnssec example.com | RRSIG/DS chain validates |
| CNAME misuse | Apex record won’t save or verify | Inspect zone file/provider UI | Proper apex method used |
Run the Core Diagnostic Commands in the Right Order
dig: the primary tool for authoritative truth
For most administrators, dig should be the default DNS diagnostic tool. It exposes the query path, answer section, authority section, and TTL data in a way that is easier to trust than a browser or OS UI. Start with the exact hostname and record type, then expand into authoritative and trace queries. If you need a broader operational comparison mindset, the methodology in prioritizing technical fixes at scale is surprisingly similar: isolate one layer at a time and confirm each assumption.
Helpful commands include dig example.com A, dig example.com AAAA, dig example.com MX, and dig example.com TXT. Add +short when you want a fast answer, +trace when delegation is suspect, and @server when testing a specific resolver or authoritative host. Save output during incidents so you can compare pre-change and post-change states later.
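One way to save comparable before/after output during an incident, sketched for a Unix shell with a hypothetical snapshot filename:

```bash
# Snapshot the common record types with a timestamp for later diffing.
HOST=example.com
OUT="dns-$HOST-$(date +%Y%m%dT%H%M%S).txt"
for t in A AAAA CNAME MX TXT NS SOA; do
  dig +noall +answer "$HOST" "$t"
done > "$OUT"
echo "saved $OUT"   # diff two snapshots to see exactly what changed
```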
nslookup: useful, but interpret carefully
nslookup still has value, especially for quick cross-platform checks, but its output is less precise and can be misleading if you do not know which resolver it used. On some systems, it may hide TTL and DNSSEC context that matter during incidents. Use it when you need a fast sanity check or when an older admin workflow depends on it, but prefer dig for the canonical record inspection.
A practical habit is to verify the resolver listed by nslookup before trusting the answer. If the answer seems odd, repeat the test with a known public resolver or the authoritative NS. That simple step avoids false conclusions caused by local DNS configuration, VPN tunneling, or split-horizon behavior.
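In practice, the verification habit looks like this; ns1.example-dns.net is a hypothetical authoritative server name:

```bash
nslookup example.com                       # first lines name the resolver that answered
nslookup example.com 8.8.8.8               # repeat against a known public resolver
nslookup example.com ns1.example-dns.net   # then against the authoritative NS
```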
Browser and OS checks: clear caches, inspect names, and test more than one client
Browsers can cache DNS differently from the operating system, and some platforms cache aggressively. If a host works in terminal tests but not in Chrome or Safari, clear the browser DNS cache and retest in a private window after restarting the browser. Also verify the URL itself, because an apparent DNS issue may actually be a redirect loop, HSTS conflict, or certificate mismatch after the hostname resolves successfully.
On Windows, macOS, and Linux, OS-level cache flushing varies by version. Instead of memorizing only one command, keep platform-specific snippets in your runbook. If your team has mixed environments, treat browser tests the same way you would any other endpoint verification workflow, much like the practical checks in protecting systems from environmental hazards: eliminate local variables before assuming the upstream service is broken.
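Typical flush commands for each platform, hedged because exact syntax varies by OS version; verify against your fleet before adding them to a runbook:

```bash
# Windows (run in cmd or PowerShell)
ipconfig /flushdns

# macOS (recent releases)
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

# Linux with systemd-resolved
sudo resolvectl flush-caches
```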
DNS Delegation: The Hidden Cause Behind Domains That “Just Don’t Work”
Check the nameserver chain from root to authoritative
Delegation issues occur when the parent zone points to the wrong nameservers, when glue records are missing, or when stale NS values linger after a change. In these cases, the child zone may be perfectly configured, but the world cannot find it because the path to it is broken. Use dig +trace to walk the delegation chain and see exactly where resolution stops. This is one of the fastest ways to tell whether the problem is inside your zone or above it.
If the registrar NS settings differ from the actual authoritative nameservers, you have a mismatch that can persist even after the zone itself is fixed. In a migration, this often happens when a domain is moved to a new DNS provider but the registrar was never updated, or the old provider was decommissioned too early. Delegation problems are especially painful because they can look like total outages while the underlying records appear correct in the provider dashboard.
Understand glue records and when they matter
Glue records become relevant when the nameserver itself lives under the delegated domain. For example, if ns1.example.com is a nameserver for example.com, the parent zone may need a glue record, the nameserver's IP address published in the parent, to avoid a circular dependency. Without correct glue, resolvers may not be able to reach the authoritative server in the first place. This is a classic “I changed everything and it still fails” scenario.
When troubleshooting, compare the parent-side NS data with the authoritative zone’s NS set. If the parent shows old nameservers, fix the registrar or parent delegation. If the parent is correct but the child zone returns different NS records, check provider sync, zone transfer, or split-brain configurations. For teams used to planning operational handoffs, this is similar to the discipline behind retention-friendly operating environments: the transition only works if every dependent layer is updated together.
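Comparing the parent-side delegation with the zone's own NS set can be done directly. A sketch for a hypothetical .com domain; the TLD server returns the delegation as a referral in the authority section:

```bash
# What the parent (.com) delegates to:
TLD_NS=$(dig +short com NS | head -n1)
dig +noall +authority +norecurse @"$TLD_NS" example.com NS

# What the zone itself claims:
CHILD_NS=$(dig +short example.com NS | head -n1)
dig +noall +answer @"$CHILD_NS" example.com NS
# Any difference between the two lists is a delegation mismatch to fix
# at the registrar or parent zone.
```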
Watch for split-horizon and internal DNS exceptions
Some organizations intentionally serve different DNS answers internally and externally. That is valid for private apps, internal SaaS, or migration staging, but it creates confusion during incident response because one engineer sees the internal answer while another sees the public one. Make sure you know which resolver a workstation is using and whether VPN policies or corporate filters alter the view.
If you suspect split-horizon behavior, test from outside your network and from a public vantage point. Comparing results from a laptop on mobile hotspot versus a machine on the office network often reveals whether the problem is truly global. This same “compare segments, not just averages” mindset appears in pricing and distribution strategy analysis, where the answer depends on the channel you inspect.
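To confirm a split-horizon suspicion, compare the default (internal) view against a public one; app.internal.example.com is a hypothetical internal-only name:

```bash
dig +short app.internal.example.com A            # office or VPN resolver view
dig +short @1.1.1.1 app.internal.example.com A   # public view; an empty answer or
                                                 # NXDOMAIN is expected for
                                                 # internal-only names
```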
DNSSEC: When Security Controls Break Valid Resolution
Recognize the signs of DNSSEC validation failure
DNSSEC is powerful, but when misconfigured it can cause resolvers to return SERVFAIL even though the zone itself looks healthy. Common causes include an expired signature, mismatched DS records at the parent, incorrect key rollovers, or incomplete signing after a DNS provider change. If the domain fails only on validating resolvers and works on non-validating ones, DNSSEC is a prime suspect.
Use dig +dnssec example.com to inspect DNSSEC-related records, including RRSIG and DNSKEY. If the parent DS record does not match the child’s current key, the chain of trust is broken. In migration windows, this often happens when a new provider signs the zone but the registrar still publishes the old DS record, or the DS was removed too soon.
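A compact validation triage, assuming dig against resolvers that validate DNSSEC (1.1.1.1 and 8.8.8.8 both do):

```bash
dig @1.1.1.1 example.com A       # SERVFAIL here...
dig @1.1.1.1 +cd example.com A   # ...but an answer with checking disabled (+cd)
                                 # points squarely at DNSSEC validation
dig +short example.com DS        # what the parent publishes
dig +short example.com DNSKEY    # what the child zone currently serves
# The DS digest must correspond to one of the child's current KSK DNSKEY records.
```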
Plan key rollovers carefully
Key rollovers should be scheduled and documented, not improvised during an outage. Treat the rollover as a controlled change with prechecks, overlap periods, and verification on multiple validating resolvers. Keep a rollback path ready, because a bad DNSSEC change can be more disruptive than a simple record typo. Teams that already maintain formal security routines, like those in security audit techniques, usually adapt well to DNSSEC governance.
After any DNS provider migration, confirm both signature validity and parent DS state. A zone may “look fine” in the provider UI while resolvers still reject it because the trust chain is stale. Your checklist should always include the validation step, not just the record update.
Decide when to disable DNSSEC temporarily
In some emergency migrations, teams choose to disable DNSSEC temporarily to restore service, then re-enable it after stabilization. This can be the right call when user impact is severe and the team cannot quickly repair the trust chain. However, disabling DNSSEC should be treated as a controlled fallback, not a casual fix, because it reduces protection against spoofing. Document the decision, the justification, and the re-enable deadline.
Pro Tip: If a domain fails only on validating resolvers, check DNSSEC before changing A or CNAME records. Many teams waste time “fixing” the wrong layer while the real issue is a DS mismatch.
Browser-Level and Hosting-Level Checks You Should Never Skip
Confirm HTTP and TLS after DNS resolves
DNS may be healthy while the site still appears broken. Once resolution works, test the actual service with curl -I https://example.com or a browser hard refresh. Look for certificate mismatch, redirect loops, stale CDN targets, or origin downtime. This is especially important when DNS points to a load balancer, CDN, or managed hosting provider, because the record can be correct while the edge configuration is not.
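Two curl checks cover most of this layer; 203.0.113.10 below is a placeholder documentation address standing in for your new origin:

```bash
curl -sI https://example.com | head -n 5   # status, redirect, and server headers

# Pin the hostname to a candidate IP to test an origin before DNS converges:
curl -sI --resolve example.com:443:203.0.113.10 https://example.com | head -n 5
```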
If your domain is tied to a web build system or hosting platform, verify that both the DNS target and the platform’s domain verification are complete. The checklist mindset used in hybrid simulation workflows applies well here: the control plane and runtime both have to agree before the service behaves correctly.
Verify apex, www, and redirect behavior together
Many incidents are actually caused by inconsistent canonicalization. The apex domain might resolve one way, www another, and redirects may point back and forth. That can confuse browsers, search engines, and users, while making the issue look like a DNS error. Check both hostnames and confirm where each one lands after the browser follows redirects.
Best practice is to define a single canonical hostname and document the redirect policy. Then configure DNS and hosting to support that policy consistently. If you are doing content, product, or site restructuring, the transition planning concepts in cadence-based audit workflows help keep those hostnames aligned over time.
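A quick way to see where each hostname actually lands, sketched with curl:

```bash
# -L follows redirects; the status and Location headers show the full chain.
curl -sIL http://example.com | grep -iE '^(HTTP|location)'
curl -sIL http://www.example.com | grep -iE '^(HTTP|location)'
# Both chains should terminate at the same canonical hostname with a 200.
```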
Use DNS tools and browser tools together
Developer tools in modern browsers can reveal whether the page is failing before network requests begin or after the DNS lookup finishes. Combine that with command-line tools so you can narrow the failure scope quickly. A browser saying “DNS_PROBE_FINISHED_NXDOMAIN” means something different from a browser timing out after a correct DNS answer. Those are not interchangeable signals.
Where possible, test from at least two devices and two networks. That extra effort often reveals whether the local machine, corporate resolver, or upstream zone is responsible. Think of it as the same style of practical validation that good operators use in network-level filtering deployments: the client, resolver, and policy layers all matter.
A Practical Incident Checklist for IT Admins
Pre-change checklist
Before making a DNS change, record the current state. Capture the exact record values, TTL, authoritative nameservers, and any DNSSEC settings. Lower TTLs in advance for migrations, confirm registrar access, and identify who owns rollback decisions. If the change touches verification records, email, or third-party platforms, notify those owners before the change window.
Also verify the target service is ready to accept traffic. DNS can be perfect and the cutover still fail if the application, origin server, or CDN is not prepared. Teams that use structured procurement and vendor review habits, like those described in vendor due diligence for analytics, tend to have fewer surprises because dependencies are documented ahead of time.
During-change checklist
Change one variable at a time whenever possible. Update the record, verify from authoritative and recursive resolvers, then confirm in a browser and from the hosting platform. If you change NS records or DNSSEC at the same time as web records, troubleshooting becomes much harder. Keep screenshots or exported zone data so you can compare old versus new states.
When a record does not behave as expected, stop and compare the live authoritative answer to the intended configuration. If those differ, the problem is within the provider or delegation path. If those match, look downstream at cache, browser, TLS, or hosting. That disciplined separation will save more time than any single tool.
Post-change checklist
After the change, confirm TTL decay, resolver convergence, and application health. Check whether the old record still appears anywhere after the expected TTL window. Confirm analytics, email delivery, and monitoring alerts are back to normal. Finally, update the runbook so the next admin does not repeat the same investigation from scratch.
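A simple convergence watch, assuming a Linux host with the watch utility installed; on other platforms, rerun the dig manually:

```bash
# Re-query a public resolver every 30 seconds until the TTL column resets
# and the answer matches the new record. Ctrl-C to stop.
watch -n 30 "dig +noall +answer @1.1.1.1 example.com A"
```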
Good operational notes are part of resilience. If a fix required an unusual workaround, document it in the same way you would document a notable operational event in post-mortem planning. The goal is not just to restore service, but to reduce repeat incidents.
Common Failure Patterns and How to Diagnose Them Fast
The domain resolves, but the site is still dead
This usually points to hosting, TLS, redirect, or CDN misconfiguration rather than DNS. Verify the hostname reaches the expected origin and that the certificate matches the domain. If the app is behind a load balancer, confirm health checks and backend pools. DNS has done its job once it resolves correctly; at that point the problem is usually elsewhere.
Some users see the new site and some see the old site
That is almost always cache or resolver variance. Check TTLs, stale recursive resolvers, and browser cache behavior. If you recently changed an apex record or moved between providers, compare the old and new authoritative answers carefully. In many cases, the fix is simply waiting out the previous TTL plus a small buffer, but confirming that is better than guessing.
The domain fails only in certain regions or offices
Regional or office-specific failure often indicates split-horizon DNS, VPN policy, ISP issues, or local resolver filtering. Compare answers from the affected network with answers from public resolvers and external probes. If the office uses a managed DNS product, review policy rules and forwarding behavior before touching the zone itself. Pattern recognition matters here, much like the analysis in exclusive content distribution, where access depends on where and how the request is made.
Quick Reference Table: Commands, Purpose, and What to Look For
| Command | Purpose | Key output to inspect | Common takeaway |
|---|---|---|---|
| dig example.com A | Check IPv4 destination | Answer section, TTL | A record points to expected host |
| dig example.com AAAA | Check IPv6 destination | Answer section, TTL | IPv6 target is correct or absent by design |
| dig example.com CNAME | Check alias behavior | CNAME target | Hostname chains correctly |
| dig +trace example.com | Inspect delegation | Root-to-authoritative path | Shows where resolution breaks |
| dig +dnssec example.com | Inspect validation state | RRSIG, DNSKEY, DS | Chain of trust is intact or broken |
| nslookup example.com | Quick sanity check | Resolver used, answer | Useful, but less detailed than dig |
| curl -I https://example.com | Verify app/HTTP layer | Status code, redirect chain | DNS may be fine; app may not |
FAQ
Why does DNS still show the old IP after I changed it?
Most likely because a recursive resolver or browser cache still has the old answer until TTL expires. Check the authoritative server directly with dig @authoritative-ns to confirm the new value exists. If the authoritative zone is correct, wait for cache expiry and retest from multiple resolvers.
Should I use CNAME or A records for my main website?
Use A/AAAA for direct IP targets and CNAME for hostnames that should follow another hostname. For apex domains, standard DNS does not allow a CNAME at the root, so you need A/AAAA or a provider feature like ALIAS/ANAME flattening. Choose based on your hosting architecture, not preference.
How do I know if DNSSEC is the problem?
If a domain works on non-validating resolvers but fails with SERVFAIL on validating ones, DNSSEC is a strong suspect. Check DS, DNSKEY, and RRSIG alignment using dig +dnssec. A mismatched DS record at the parent is one of the most common failure modes after provider changes.
What is the fastest way to check delegation?
Use dig +trace domain.tld. This shows the full path from the root servers through the TLD to the authoritative nameservers. If the trace stops early, the issue is usually registrar or nameserver delegation, not the zone contents.
Why does the site work on mobile data but not in the office?
That usually means the office resolver, VPN, or security policy is altering DNS responses. Compare the office answer to a public resolver and check whether the office uses split-horizon DNS or filtered forwarding. If only one network fails, do not assume the zone itself is broken.
When should TTL be lowered?
Lower TTL before a planned migration, ideally 24 to 48 hours in advance. That gives resolvers time to cache the shorter value before the cutover. Lowering TTL after the change helps less, because caches may still hold the older, longer TTL.
Final Checklist: The Order That Saves the Most Time
When an incident hits, follow this order: confirm the exact hostname, check A/AAAA/CNAME values, verify authoritative answers, inspect delegation with dig +trace, review TTL and cache behavior, validate DNSSEC, and then test browser and HTTP behavior. That sequence keeps you from changing random records before you understand the failure layer. It also reduces the chance of turning a simple DNS fix into a broader outage.
For teams building durable operational knowledge, combine this guide with your internal standards for audits, vendor management, and incident notes. A strong DNS runbook is not just a list of commands; it is a repeatable decision tree. If you maintain that rigor, DNS stops being a mysterious fire drill and becomes a manageable operational routine. For additional adjacent operational reading, see security audits for small DevOps teams, DNS filtering at scale, and resilience-focused post-mortems.
Related Reading
- Navigating Security: Effective Audit Techniques for Small DevOps Teams - A practical framework for repeatable operational checks.
- NextDNS at Scale: Deploying Network-Level DNS Filtering for BYOD and Remote Work - Useful when resolver policy is part of the problem.
- Post‑Mortem 2.0: Building Resilience from the Year’s Biggest Tech Stories - Good patterns for documenting incident learnings.
- Prioritizing Technical SEO at Scale: A Framework for Fixing Millions of Pages - Helpful for structured triage and change sequencing.
- Vendor Due Diligence for Analytics: A Procurement Checklist for Marketing Leaders - Strong model for dependency tracking and ownership clarity.
Daniel Mercer
Senior Technical Editor
Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.