Creating a website is a monumental task. On top of creating engaging content and showcasing all your wonderful services, you also have to think about elements like design and media. However, when a user searches for your content, you might not want all your media files to appear individually on the search engine results pages (SERPs). That’s where robots.txt files come in. What is a robots.txt file, you ask? Let’s dive into it.
What Is Robots.txt?
Robots.txt serves as a kind of instruction manual for crawlers. It lets web crawlers know which pages and files can be crawled for indexing and which ones can’t. Generally, robots.txt is used to control crawler traffic to your website so it doesn’t become overwhelmed with crawler requests. The best use for robots.txt files is keeping website elements like audio or script files from appearing on Google. It’s important to note that robots.txt files are not meant to be used as a way to hide pages from Google. If your goal is to keep a page out of Google’s search results entirely, you’ll want to use the noindex directive, but we’ll get into that later.
The Robots.txt Format
When it comes to robots.txt formatting, Google has pretty strict guidelines. Every website is allowed only one robots.txt file, and that file has to follow a specific format. The highest priority when creating a robots.txt file is making sure it’s placed at the root of your domain. For example, the Markitors robots.txt file lives at https://markitors.com/robots.txt. Here, the robots.txt file sits directly under the domain rather than being tucked inside a section of the website like our blog or services pages.
You’ll want a clear list of what content you do and don’t want crawlers to find before creating your robots.txt file. Once you’ve decided on that, the next step is formatting the text in your robots.txt file. Each file consists of a set of rules for crawlers to follow when they find your website. Each rule, or group, includes a user agent and a command. In most cases, the user agent will be a crawler, like Googlebot, and the command will allow or disallow that crawler access to certain files on your site. Your website’s one robots.txt file should house all the rules you have for your site. An example of what your file might include:
Group #1
User-agent: [user agent name, like "Googlebot"]
Disallow: [path to the file you don’t want crawled, like "/SEO-video"]
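Putting it together, here’s a minimal sketch of what a complete robots.txt file with two groups might look like. The user agent names are real crawlers, but the paths are hypothetical placeholders for illustration:

# Keep Googlebot out of the video file
User-agent: Googlebot
Disallow: /SEO-video

# Keep every other crawler out of the scripts folder, but allow everything else
User-agent: *
Disallow: /scripts/
Allow: /

Each group starts with a User-agent line and applies until the next blank line, and lines beginning with # are comments that crawlers ignore.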
How To Use Robots.txt Commands
We’ve figured out how to format our robots.txt file, but what exactly do all the commands mean? Well, they’re just your way of letting Google know what content its crawlers have access to and where you’re placing a “No Trespassing” sign. There are three major commands to know, and we’ll be covering them all.
Disallow
As mentioned before, there may be certain files you don’t want Google to crawl. The robots.txt disallow command was created for this purpose. The disallow command lets crawlers know that the page or file you’ve specified is off-limits. Use it to keep features such as multimedia and design elements from being crawled and appearing on Google. And remember: you don’t want to use a robots.txt file to hide pages.
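For instance, a sketch of a group that keeps all crawlers away from a hypothetical audio folder and a single design file might look like this:

# Applies to every crawler
User-agent: *
Disallow: /audio/
Disallow: /assets/styles.css

The trailing slash on /audio/ blocks everything inside that folder, while the second line blocks just one file.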
Allow
While the primary purpose of a robots.txt file is to inform crawlers what not to scan, it can also be beneficial to let them know what they should scan. The robots.txt allow command is typically used when a page has been given a disallow command but contains certain elements you still want crawled. For example, you might disallow crawlers from scanning your blog section as a whole but want them to scan one specific post, as shown in the sketch below.
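As a sketch with made-up paths, a group that blocks an entire directory while carving out one page could look like this:

User-agent: *
Disallow: /blog/
Allow: /blog/seo-basics

The more specific allow rule wins for that one URL, so crawlers can reach /blog/seo-basics even though the rest of /blog/ stays off-limits.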
Noindex
A noindex command follows its own set of rules. Instead of being placed in the robots.txt file, it is embedded in a meta tag in the page’s HTML. When Google crawls the page, it sees the noindex command and knows to leave that page out of its search results. For that reason, a noindexed page shouldn’t also be disallowed in robots.txt; if crawlers can’t reach the page, they’ll never see the tag. Adding a noindex command to a page can be particularly helpful if someone has backlinked to that content, since a disallow rule alone won’t keep a page with backlinks out of the index.
There are two ways you can go about adding a noindex command to your pages. The first is to tell all crawlers not to index a page by including <meta name="robots" content="noindex"> in your page’s head section. The second is to single out a specific crawler. If you’re looking to keep content off of Google specifically, you can include <meta name="googlebot" content="noindex"> instead.
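Here’s a minimal sketch of where the tag sits in a page’s HTML; the title and body content are placeholders:

<!DOCTYPE html>
<html>
<head>
  <title>Private thank-you page</title>
  <!-- Tells all crawlers not to index this page -->
  <meta name="robots" content="noindex">
</head>
<body>
  <p>Thanks for signing up!</p>
</body>
</html>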
Robots.txt files can be a little confusing, but we here at Markitors hope we’ve helped clarify them for you. Contact us today to find out how a robots.txt file can improve your website and overall SEO quality.