A robots.txt file is one of the simplest crawl-control tools on a website, but it is also one of the easiest to misuse. This beginner-friendly robots.txt guide explains what the file does, what it does not do, how to allow or block crawling safely, and how to test changes before they cause indexing problems. Use it as a reusable checklist whenever you launch a site, redesign sections, move platforms, create a staging environment, or troubleshoot unexpected crawl behavior.
Overview
What you will get from this guide: a practical mental model, copyable examples, and a checklist you can return to before editing robots.txt.
The robots.txt file lives at the root of a site, typically at https://example.com/robots.txt. Its job is to give crawl instructions to bots. In plain terms, it tells compliant crawlers which paths they should avoid or where they are allowed to go. It is useful for reducing wasted crawl activity, keeping low-value sections out of the crawl path, and helping search engines discover your sitemap.
Just as important is understanding what robots.txt does not do:
- It does not guarantee a page will stay out of search results.
- It is not a security tool.
- It does not replace authentication, noindex controls, or access restrictions.
- It does not remove URLs that are already known elsewhere.
That distinction matters because many site owners try to use robots.txt to hide private areas. If content is truly private, use proper access controls. Robots.txt is a public file and anyone can view it.
Here is the basic syntax most beginners need to know:
- User-agent: identifies the crawler the rule applies to.
- Disallow: tells that crawler not to request matching paths.
- Allow: can permit a more specific path inside a blocked area.
- Sitemap: points crawlers to an XML sitemap.
A simple starting example looks like this:
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
This example allows all crawling because Disallow: is blank. It also provides a sitemap location.
A basic block example looks like this:
User-agent: *
Disallow: /private/
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
That tells compliant crawlers to avoid URLs that start with /private/ or /tmp/.
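One way to sanity-check the basic block example is Python's standard-library `urllib.robotparser`. This is a minimal sketch rather than a model of any particular search engine: the sample URLs are made up for illustration, and this parser does plain prefix matching only.

```python
from urllib.robotparser import RobotFileParser

# The basic block example, held as a string (nothing is fetched).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
"""

def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Illustrative URLs; replace them with real URLs from your own site.
for url in (
    "https://example.com/",
    "https://example.com/private/report.html",
    "https://example.com/tmp/cache.txt",
):
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(f"{verdict:8} {url}")
```

Swapping in your own file text and a handful of real URLs gives a quick first check before you reach for a search platform's testing tools.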
If you run WordPress, e-commerce software, a static site, or a custom application, the principles stay the same: only block what you are confident should not be crawled, and test every rule against real URLs before publishing. If you are making wider site changes, it also helps to test in a safe environment first. See How to Create a Staging Site for WordPress and Test Changes Safely.
Checklist by scenario
What you will get in this section: scenario-based guidance for the most common robots.txt decisions, with examples you can adapt.
Scenario 1: A normal public website that wants search visibility
If your site is meant to be found in search, keep robots.txt minimal. Do not start by blocking folders unless you have a clear reason.
Use this checklist:
- Allow the main site to be crawled.
- Add your sitemap URL.
- Block only truly low-value utility areas if needed.
- Check that key sections such as product, service, blog, and documentation pages are not blocked.
Example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
For many small sites, even this may be more than necessary. A clean file that only lists the sitemap is often enough.
Scenario 2: You want to block admin or system paths from crawling
This is common on CMS-driven sites. The goal is usually to reduce crawler activity in areas that are not useful in search results.
Checklist:
- Block system paths that should not be crawled.
- Do not block assets your pages need to render unless you are certain they are safe to block.
- Verify that blocked paths are not linked as important public pages.
Example:
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://example.com/sitemap.xml
Be careful with broad patterns. Blocking /assets/, /js/, or /css/ can make it harder for crawlers to understand your pages properly.
Scenario 3: You need a staging site hidden from crawling
This is where many teams make risky decisions. Blocking a staging site in robots.txt is better than leaving it open, but it should not be your only protection.
Checklist:
- Use password protection or IP restrictions if possible.
- Add a restrictive robots.txt file as a secondary measure.
- Prevent internal links from pointing users and crawlers there.
- Check that the staging site is not accidentally included in your sitemap.
Example:
User-agent: *
Disallow: /
This tells crawlers not to request any path on the site. Again, this is not security. Treat it as a backup instruction, not your primary protection.
Scenario 4: You want to block search result pages or filtered URLs
On larger sites, internal search pages, layered navigation, and parameter-heavy URLs can generate large volumes of low-value crawl targets.
Checklist:
- Identify patterns that create near-duplicate or thin pages.
- Block crawl paths only after confirming they do not contain important indexable content.
- Keep one consistent version of important category and landing pages crawlable.
Example:
User-agent: *
Disallow: /search
Disallow: /filter/
Disallow: /*?sort=
Disallow: /*?sessionid=
Pattern support can vary by crawler, so test your rules carefully. If your platform creates many URL combinations, document exactly which patterns are safe to limit.
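To see what a wildcard rule from Scenario 4 would and would not catch, the sketch below translates a pattern into a regular expression. It assumes the common wildcard semantics, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function names and sample paths are hypothetical, and a given crawler may interpret patterns differently.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters and a
    trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern: str, path_and_query: str) -> bool:
    """True if the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path_and_query) is not None

# Hypothetical paths checked against the Scenario 4 patterns.
checks = [
    ("/*?sort=", "/shoes?sort=price"),
    ("/*?sort=", "/shoes?color=red&sort=price"),
    ("/search", "/search-results"),
    ("/filter/", "/filter"),
]
for pattern, path in checks:
    print(f"{pattern:10} {path:32} {matches(pattern, path)}")
```

The second check is the one to watch: `/*?sort=` only matches when `sort` is the first query parameter, so a URL with `&sort=` later in the query string is not covered by that rule.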
Scenario 5: You need to allow one folder inside a blocked section
This is where Allow becomes useful. You may want to block an entire directory except for a public subfolder or file.
Example:
User-agent: *
Disallow: /media/
Allow: /media/public/
Sitemap: https://example.com/sitemap.xml
Use this only when your path logic is clear. If you are not sure how your crawler rules are interpreted, simplify the structure rather than building a complicated exception tree.
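The sketch below shows one common way an Allow exception is resolved: the longest matching rule wins, and Allow wins a tie. This is the precedence described in RFC 9309, but some parsers apply the first matching rule in file order instead and would give a different answer for this file. The `Rule` type and `is_allowed` function are hypothetical names, and wildcards are deliberately left out.

```python
from typing import List, Tuple

# (directive, path prefix), for example ("disallow", "/media/")
Rule = Tuple[str, str]

def is_allowed(rules: List[Rule], path: str) -> bool:
    """Apply longest-match precedence to plain prefix rules (no wildcards).

    The longest matching prefix wins; on a tie, 'allow' wins. A path
    that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty Disallow matches nothing
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

# The Scenario 5 rules, with hypothetical file paths.
RULES = [("disallow", "/media/"), ("allow", "/media/public/")]
for path in ("/media/public/logo.png", "/media/raw/source.psd", "/about/"):
    print(path, is_allowed(RULES, path))
```

Because different parsers can disagree here, it is worth confirming the result with the testing tool of the crawler you care about most.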
Scenario 6: WordPress-specific crawl control
WordPress sites often contain admin, login, search, tag, attachment, and parameterized archive URLs that deserve review. But avoid copying a generic robots.txt template without understanding it.
Checklist:
- Do not block /wp-content/ by default if your theme or media assets live there.
- Review whether internal search URLs should be crawled.
- Check plugin-generated paths after installing SEO, membership, or e-commerce tools.
- After changing permalink structure or moving hosts, test robots.txt again.
If your site is mid-migration or cleanup, these guides may help: How to Migrate a WordPress Site to a New Host Without Downtime, How to Fix the WordPress 404 Error, and How to Speed Up a WordPress Site.
Scenario 7: A static site or Jamstack deployment
Static sites usually have simpler crawl patterns, which makes robots.txt easier to manage.
Checklist:
- Allow all public pages unless there is a clear reason not to.
- Block draft, preview, or build-output paths only if they are exposed publicly.
- Confirm the production robots.txt file is different from any preview or test deployment file.
If you publish through a Git-based deploy workflow, keep robots.txt in version control and review changes as part of deployment. For related setup decisions, see How to Deploy a Static Website and Git Cheat Sheet for Everyday Commands.
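One way to keep production and preview files from being mixed up is to generate robots.txt at build time from the deployment environment. The sketch below assumes a hypothetical `render_robots_txt` helper and an environment label of "production"; your pipeline may use different names.

```python
def render_robots_txt(environment: str, sitemap_url: str) -> str:
    """Return robots.txt content for a deployment environment.

    The "production" label is an assumption for this sketch; use
    whatever labels your deploy pipeline provides.
    """
    if environment == "production":
        lines = ["User-agent: *", "Disallow:", f"Sitemap: {sitemap_url}"]
    else:
        # Preview, staging, and test builds: ask crawlers to stay out.
        # This is a crawl instruction, not access control.
        lines = ["User-agent: *", "Disallow: /"]
    return "\n".join(lines) + "\n"

print(render_robots_txt("production", "https://example.com/sitemap.xml"))
print(render_robots_txt("preview", "https://example.com/sitemap.xml"))
```

Defaulting every non-production environment to the restrictive file means a new or misnamed environment ends up blocked rather than open.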
What to double-check
What you will get here: a pre-publish review list to reduce the chance of accidental deindexing or crawl waste.
Before saving a robots.txt change, work through these checks:
1. Confirm the file is in the right place
It should be available at the root of the host you want to control. A file on one subdomain does not automatically control another subdomain. If your setup includes multiple hosts, review each one separately. This often matters when choosing between subdomains and subdirectories; see Subdomain vs Subdirectory for SEO and Site Structure.
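Because each host has its own file, it can help to derive the robots.txt location from a page URL instead of assuming it. A minimal sketch using only the standard library; `robots_url_for` is a hypothetical helper name and the hostnames are examples.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

for page in (
    "https://example.com/products/widget?ref=home",
    "https://blog.example.com/posts/hello/",
    "https://staging.example.com/",
):
    print(robots_url_for(page))
```

Each of the three pages resolves to a different file, which is why a rule added on one host has no effect on the others.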
2. Test real URLs, not just patterns in your head
Pick examples from each important section of the site:
- Homepage
- Main category pages
- Blog posts
- Product or service pages
- Media files if they matter
- Admin, cart, account, and search pages
Compare what should be crawlable with what your rules would block.
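That comparison can be written down as a small table of expectations and checked automatically. The sketch below uses Python's `urllib.robotparser`, which handles plain path prefixes only, so rules with wildcards need a different tool. The `find_mismatches` helper and every URL in the table are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def find_mismatches(robots_txt: str, expectations: dict, user_agent: str = "*") -> list:
    """Return URLs whose crawlability differs from what was expected."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parses text; no network request
    return [
        url
        for url, should_be_crawlable in expectations.items()
        if parser.can_fetch(user_agent, url) != should_be_crawlable
    ]

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
"""

# Hypothetical URLs: True means "should be crawlable".
EXPECTATIONS = {
    "https://example.com/": True,
    "https://example.com/blog/first-post/": True,
    "https://example.com/products/widget/": True,
    "https://example.com/admin/settings": False,
    "https://example.com/cart/": False,
}

print(find_mismatches(ROBOTS_TXT, EXPECTATIONS) or "All sample URLs behave as expected")
```

Keeping the expectations table alongside the file makes it easy to re-run the same check after every edit.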
3. Check for environment mistakes
Many indexing disasters come from moving a restrictive staging robots.txt file into production. After launches, migrations, redesigns, or host changes, verify the live file immediately.
4. Make sure the sitemap URL is current
If your domain changed, your SSL setup changed, or you switched SEO plugins, your sitemap location may have changed too. An outdated sitemap line is easy to overlook.
If you recently enabled HTTPS, also check for related URL consistency issues with How to Fix Mixed Content Errors After Enabling HTTPS.
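A stale sitemap line is easy to detect mechanically. The sketch below pulls the Sitemap lines out of a robots.txt string and flags any whose scheme or host differs from what you expect. Both helper names are hypothetical, and the expected scheme and host are values you supply.

```python
from urllib.parse import urlsplit

def sitemap_lines(robots_txt: str) -> list:
    """Return the URLs listed on Sitemap lines."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

def stale_sitemaps(robots_txt: str, expected_scheme: str, expected_host: str) -> list:
    """Return sitemap URLs that do not use the expected scheme and host."""
    stale = []
    for url in sitemap_lines(robots_txt):
        parts = urlsplit(url)
        if parts.scheme != expected_scheme or parts.hostname != expected_host:
            stale.append(url)
    return stale

# Hypothetical file left over from before an HTTPS and domain change.
OLD_FILE = "User-agent: *\nDisallow:\nSitemap: http://old.example.com/sitemap.xml\n"
print(stale_sitemaps(OLD_FILE, "https", "example.com"))
```

This only checks that the line points at the right scheme and host; confirming that the sitemap itself loads is a separate step.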
5. Review interactions with noindex and canonical logic
Robots.txt, noindex, canonicals, and internal linking each solve different problems. If your strategy is “this page can be crawled but should not appear in search,” robots.txt may be the wrong tool. If your strategy is “this URL should never be requested by bots,” robots.txt may be appropriate.
6. Check server response and formatting
Your robots.txt file should load cleanly and not return an error page, redirect loop, or authentication prompt on the public site unless that is intentional for a non-public environment. Keep formatting simple. Small syntax mistakes can create confusing outcomes.
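The response status matters because crawlers can draw very different conclusions from it. The sketch below summarizes the handling described in RFC 9309 as a rough guide; individual crawlers document their own variations, and the function name and wording of the messages are hypothetical.

```python
def interpret_robots_status(status: int) -> str:
    """Summarize how a crawler may react to the robots.txt HTTP status.

    A rough guide based on RFC 9309, not a statement about any
    specific crawler.
    """
    if 200 <= status < 300:
        return "parse the file and follow its rules"
    if 300 <= status < 400:
        return "follow the redirect, then re-evaluate"
    if 400 <= status < 500:
        return "file unavailable: crawler may treat the site as unrestricted"
    if 500 <= status < 600:
        return "file unreachable: crawler may treat the site as fully disallowed"
    return "unexpected status: investigate"

for code in (200, 301, 401, 404, 503):
    print(code, interpret_robots_status(code))
```

Under this reading, an authentication prompt on robots.txt (a 401) does not block crawling by itself, while a server error on the file can pause crawling of the whole host.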
7. Keep a rollback copy
Because the file is short, it is easy to version and compare. Save a known-good copy before changing it. This is especially useful during redesigns and CMS updates.
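Comparing the saved copy with a proposed change takes only a few lines with Python's standard `difflib` module. This is a minimal sketch; `robots_diff` is a hypothetical helper and the two file versions are invented for illustration.

```python
import difflib

def robots_diff(known_good: str, proposed: str) -> str:
    """Return a unified diff between two versions of robots.txt."""
    diff = difflib.unified_diff(
        known_good.splitlines(),
        proposed.splitlines(),
        fromfile="robots.txt (known good)",
        tofile="robots.txt (proposed)",
        lineterm="",
    )
    return "\n".join(diff)

KNOWN_GOOD = "User-agent: *\nDisallow: /tmp/\n"
PROPOSED = "User-agent: *\nDisallow: /\n"

print(robots_diff(KNOWN_GOOD, PROPOSED) or "No changes")
```

A one-character difference such as `/tmp/` versus `/` is easy to miss by eye and obvious in a diff.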
How to test robots.txt changes in practice
A safe testing workflow looks like this:
- List the URLs you want to allow and block.
- Write the simplest possible rules.
- Validate the file manually for path accuracy and obvious syntax errors.
- Place the file in staging or a safe preview environment where possible.
- Use available crawler testing tools in your search platform or SEO toolkit to check sample URLs.
- Deploy to production.
- Re-test the live file at /robots.txt.
- Monitor crawl behavior and indexing trends over the next few days and weeks.
The key principle is not to treat robots.txt as “set once and forget forever.” It is closer to infrastructure configuration: small changes can have wide effects.
Common mistakes
What you will get in this section: the errors that cause the most avoidable trouble, so you can spot them quickly.
Blocking the entire site by accident
The classic error is publishing this on a live site:
User-agent: *
Disallow: /
That line is appropriate for a private staging site, not for a normal production website. Always confirm environment before deployment.
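A pre-deployment guard can catch this mistake before it reaches production. The sketch below looks for a bare `Disallow: /` in the group that applies to all crawlers. It is deliberately simple: `blocks_entire_site` is a hypothetical helper that ignores Allow exceptions and crawler-specific groups, so treat a positive result as a prompt to look, not a verdict.

```python
def blocks_entire_site(robots_txt: str) -> bool:
    """Return True if the 'User-agent: *' group contains 'Disallow: /'."""
    agents = []             # user agents of the group being read
    reading_agents = False  # True while consecutive User-agent lines continue
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if not reading_agents:
                agents = []  # a new group starts here
            agents.append(value)
            reading_agents = True
        elif key in ("allow", "disallow"):
            reading_agents = False
            if key == "disallow" and value == "/" and "*" in agents:
                return True
    return False

print(blocks_entire_site("User-agent: *\nDisallow: /\n"))
print(blocks_entire_site("User-agent: *\nDisallow: /private/\n"))
```

Run against the file that is about to ship to production, a check like this can fail the build when the staging rules have slipped through.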
Using robots.txt to hide sensitive content
Because robots.txt is public, it can reveal paths you would rather not advertise. Use authentication and server-level access control for anything private.
Blocking pages you actually want indexed
Sometimes teams block category pages, blog folders, or media paths while trying to tidy up crawl activity. Review your revenue pages, lead pages, documentation, and high-value content before making broad disallow rules.
Blocking CSS, JavaScript, or image paths without a good reason
If crawlers cannot fetch resources needed to render or understand a page, evaluation of that page may be weaker or less accurate.
Copying a template from another site
Robots.txt files are highly context-specific. A pattern that makes sense for one platform can be harmful on another.
Forgetting that each host is separate
www.example.com, example.com, blog.example.com, and staging.example.com can each need their own review.
Overcomplicating the file
If you need many layered exceptions, it may be a sign your information architecture or URL design needs cleanup. Simpler path structures are easier to crawl, maintain, and troubleshoot.
When to revisit
What you will get here: a practical schedule for reviewing robots.txt so it stays accurate as your site changes.
Revisit your robots.txt file whenever the underlying structure or publishing workflow changes. In practice, that usually means reviewing it at these moments:
- Before a new site launch or redesign
- After moving to a new host or CDN
- After changing domain, subdomain, or HTTPS configuration
- When installing or removing major CMS plugins or commerce features
- When creating a staging, preview, or headless environment
- When crawl reports show unexpected blocked URLs or low-value crawl activity
- Before seasonal planning cycles if content sections are added, retired, or reorganized
- When your deployment workflow changes
A simple recurring checklist is enough:
- Open the live /robots.txt file.
- Confirm the domain and sitemap lines are correct.
- Scan for broad disallow rules.
- Test five to ten important URLs from across the site.
- Check staging and subdomains separately.
- Save a dated copy of the current file.
If you do only one thing after reading this guide, do this: open your live robots.txt file today and verify that it matches your current site, not an older version of your workflow. That small review can prevent a surprising number of crawl and indexing problems.