What is it? How important is it?
You want to learn ‘how rocks form’, so you open your encyclopedia – but where do you look? The index of course! Search engines “crawl” the internet and make copies of websites to create their own index.
“Crawlers look at webpages and follow links on those pages, much like you would if you were browsing content on the web. They go from link to link and bring data about those webpages back to Google’s servers. We take note of key signals — from keywords to website freshness — and we keep track of it all in the Search index.”
Now that we know how Google crawls and indexes, let’s take a look at:
The Basics of Optimising Crawling
1. Site & Link Structure
Internal: Crawlers try to visit every page of your website. To make their job easier, link your pages to each other in a web-like fashion so that each page is easier to discover.
External: Crawlers also look for evidence that the pages on your website are of sufficient quality. This is best provided by external links to your pages, which also aid discoverability.
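One way to reason about internal discoverability is to treat the site as a graph and measure how many clicks each page is from the home page. The following is a minimal sketch, not a description of how any real crawler works: the `site` dictionary, its URLs and both function names are hypothetical and exist only for illustration.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search over an internal link graph.

    links maps each page to the list of pages it links to.
    Returns {page: minimum number of clicks needed to reach it from start}.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def orphan_pages(links, start):
    """Pages that appear in the graph but cannot be reached from start."""
    all_pages = set(links)
    for targets in links.values():
        all_pages.update(targets)
    return sorted(all_pages - set(click_depths(links, start)))


# Hypothetical site structure, for illustration only.
site = {
    "/": ["/blog", "/products"],
    "/blog": ["/blog/how-rocks-form", "/"],
    "/products": ["/"],
    "/blog/how-rocks-form": [],
    "/old-landing-page": ["/"],  # no other page links here
}

print(click_depths(site, "/"))
print(orphan_pages(site, "/"))
```

Pages with a large click depth, or pages that show up as orphans, are the ones a link-following crawler is least likely to discover on its own.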
- UX design, usability & navigation. Use a breadcrumb trail to leave links reflecting the structure of your website. According to Google’s documentation, XML sitemaps also help crawlers find your pages.
- Avoid pages that are too similar in content to each other. Near-duplicate pages make it harder for Google to figure out which page is most important.
- Fresh content. Keep publishing fresh content and remove outdated and redundant content. (Content: products, blogs, videos etc…).
- Categorising & using tags
- Avoid overusing keywords.
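The XML sitemap mentioned in the list above can be generated with a few lines of standard-library Python. This is a minimal sketch that assumes the sitemaps.org 0.9 schema; the `build_sitemap` helper and the example URLs and dates are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a sitemap document from (url, last_modified) pairs.

    last_modified is expected to be an ISO date string such as "2019-07-01".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, for illustration only.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2019-07-01"),
    ("https://example.com/blog/how-rocks-form", "2019-06-15"),
])
print(sitemap_xml)
```

Keeping `lastmod` accurate gives crawlers a hint about which pages have fresh content.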
A 2019 Moz case study of 27,000 competitive keywords found a clear indication that the nature and quality of links on a site influence page ranking. Total links, Domain Authority and Page Authority, as measured by the MozBar Chrome extension, were found to have correlation factors of 0.293, 0.327 and 0.321 respectively, where a correlation factor above 0.2 is taken to indicate a strong relationship to page ranking.
When it comes to fresh content, the following factors matter:
- The initial date of the post
- How recently it was updated
- Changes made in the core content of the page matter the most
- The rate of changes in the content
- The addition of the pages & fresh links
- The traffic to one’s site
- Anchor text should remain fairly constant
2. Looped Redirects & Server Errors
The first thing a crawler looks at is the HTTP response header for each URL on your website. There it finds a status code such as 404, 305 or 202, which Googlebot uses to determine the health of your page. The goal is to keep every page returning a healthy status code and to avoid redirect loops and server errors. You can view what each status code means at HTTPstatuses.com and track the health of a link using MozBar.
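To make the idea of a looped redirect concrete, the sketch below walks a redirect chain and reports where it ends. It works on a hand-written table of responses rather than live HTTP requests so that it runs offline; the `responses` table, its URLs and the `max_hops` limit are assumptions for illustration, not values used by Googlebot.

```python
def classify(status):
    """Group a numeric HTTP status code into a coarse health bucket."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "other"


def follow_redirects(responses, url, max_hops=10):
    """Follow redirects through a {url: (status, location)} table.

    Returns (chain_of_urls, verdict). URLs missing from the table
    are treated as 404 responses.
    """
    chain = [url]
    seen = {url}
    while len(chain) <= max_hops:
        status, location = responses.get(url, (404, None))
        kind = classify(status)
        if kind != "redirect" or location is None:
            return chain, kind
        if location in seen:
            # We have been here before: the chain loops back on itself.
            return chain + [location], "redirect loop"
        seen.add(location)
        chain.append(location)
        url = location
    return chain, "too many redirects"


# Hypothetical responses, for illustration only.
responses = {
    "/a": (301, "/b"),
    "/b": (302, "/a"),
    "/old": (301, "/new"),
    "/new": (200, None),
    "/broken": (500, None),
}

for start in ("/a", "/old", "/broken", "/missing"):
    print(start, follow_redirects(responses, start))
```

The same loop could be driven by real responses from an HTTP client; the table simply keeps the example self-contained.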
3. Scripts and Technology Factors
- Enhance and validate Accelerated Mobile Pages (AMP) using Google AMP.
- Build, test and release your structured data using Google’s Structured Data Testing Tool.
- Use the WebP image format.
4. Blocking Web Crawler Access
Robots.txt is a file that tells crawlers which pages they should not crawl. It can be used to hide certain parts of a website, to steer a crawler towards fresh content, and for many other purposes. On July 1st, 2019, Google announced that it is working towards making the robots.txt protocol an Internet standard.
An early 2019 case study by tl:dr SEO found that Google treats disallow as a directive that stops Googlebot from crawling the page. An important point to remember is that “disallowing” a URL via robots.txt will keep the URL in the index, whereas using “noindex” will remove the URL from the index entirely, although the page can still be crawled.
In an announcement on the Google Webmaster blog, Google stated that as of September 1st 2019 it stopped supporting unsupported and unpublished rules in the Robots Exclusion Protocol. Best practice is to update robots.txt files that use noindex, nofollow or crawl-delay rules. The most up-to-date validations of robots.txt can be found here.
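A robots.txt file can be checked offline with Python’s standard `urllib.robotparser` module. The sketch below does two things: it asks whether a given user agent may fetch a URL, and it flags rules that Google no longer honours in robots.txt. The sample file, the `example.com` URLs and the helper names are hypothetical, and Python’s parser is only an approximation of how Googlebot interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Rules that Google stopped honouring in robots.txt on September 1st 2019.
UNSUPPORTED_BY_GOOGLE = {"noindex", "nofollow", "crawl-delay"}


def unsupported_rules(robots_txt):
    """Return (line_number, rule) pairs for rules Google ignores."""
    found = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        rule = line.split("#", 1)[0].strip()  # drop trailing comments
        field = rule.split(":", 1)[0].strip().lower()
        if field in UNSUPPORTED_BY_GOOGLE:
            found.append((number, rule))
    return found


def can_crawl(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt, for illustration only.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 10
Noindex: /drafts/
"""

print(can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/private/page"))
print(can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/blog/"))
print(unsupported_rules(ROBOTS_TXT))
```

Any rule reported by `unsupported_rules` is a candidate for replacement, for example with a noindex robots meta tag on the page itself.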
What is Caffeine?
- Test how Google crawls your website. You can do this by using Google’s URL Inspection tool or the Mobile-Friendly Test.
- Avoid user permission requests. Let’s say you have a microphone API on your page. Googlebot can’t provide a microphone, hence you should provide an alternative route.
- Avoid data persistence to serve content. There are two key points to take away from this. When Googlebot crawls your website:
1. Local Storage and Session Storage data are cleared across page loads.
2. HTTP Cookies are cleared across page loads.
This means your website should not rely on stored user data to decide which content to serve.
- Feature detection. The web application should have a fallback behaviour or polyfill, if applicable. More about Implementing feature detection.
- Describe your page with unique titles and snippets.
1. Make sure every page on your site has a title specified in the <title> tag.
2. Keep titles descriptive and concise. Avoid generic titles like “Home” and “Profile”.
3. Avoid keyword stuffing.
4. Avoid repeated or boilerplate titles. This means avoiding the same title and description on page after page.
5. Brand your titles. Give it some unique catchy additional information.
6. The noindex directive. Google can still find your page through external links even if you block it in robots.txt. Using ‘noindex’ will prevent it from appearing in the index.
- Write compatible code.
- Use meaningful HTTP status codes. As previously mentioned, using the correct HTTP status will ensure crawlability.
- Fix images and lazy-loaded content. Images are taxing on bandwidth and hence the use of the WebP image format, as well as “lazy loading”, will improve load times. To ensure Googlebot recognises all the lazy loading content on your page, you can look at these guides:
1. IntersectionObserver API and a polyfill
Always make sure to test your implementation using Puppeteer.
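The title and noindex guidance in the list above can be spot-checked before a page ever reaches a crawler. The following sketch uses Python’s built-in `html.parser` to pull out each page’s `<title>` and robots meta tag, then flags missing, generic and duplicate titles. The `GENERIC_TITLES` set, the sample pages and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed list of titles considered too generic; extend as needed.
GENERIC_TITLES = {"home", "profile", "untitled"}


class HeadScanner(HTMLParser):
    """Collects the <title> text and any robots noindex meta tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.noindex = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.noindex = "noindex" in (attrs.get("content") or "").lower()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan_head(html):
    """Return (title, noindex) for one HTML document."""
    scanner = HeadScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.title.strip(), scanner.noindex


def audit_titles(pages):
    """pages maps url -> html. Returns {url: [problems]} for flagged pages."""
    titles = {url: scan_head(html)[0] for url, html in pages.items()}
    counts = Counter(t.lower() for t in titles.values() if t)
    report = {}
    for url, title in titles.items():
        problems = []
        if not title:
            problems.append("missing title")
        elif title.lower() in GENERIC_TITLES:
            problems.append("generic title")
        if title and counts[title.lower()] > 1:
            problems.append("duplicate title")
        if problems:
            report[url] = problems
    return report


# Hypothetical pages, for illustration only.
pages = {
    "/": "<html><head><title>Home</title></head></html>",
    "/a": "<html><head><title>How Rocks Form | Example Co</title></head></html>",
    "/b": "<html><head><title>How Rocks Form | Example Co</title>"
          "<meta name='robots' content='noindex'></head></html>",
    "/c": "<html><head></head><body>No title here</body></html>",
}

print(audit_titles(pages))
print(scan_head(pages["/b"]))
```

This only inspects static HTML. Titles or meta tags injected by JavaScript would need a rendering step first, which is where a headless browser comes in.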
To learn more, or to explore any other SEO-related topics, check out SEO Company Melbourne – the experts in anything and everything SEO