What is the difference between crawling, rendering, indexing, and ranking? These are the four steps of search that every SEO should understand.
As fundamental as it may appear, it is not unusual for some practitioners to mix up the basic steps of search and thoroughly muddle the process.
This article will review how search engines function and go through each stage of the process.

What prompted it was a legal case in which I was involved: in his expert assessment, the opposing side's expert made many vital errors in defining Google's operations, claiming that:
- Web crawling was used for indexing.
- Search bots would tell search engines how to rank pages in search results.
- Search bots might potentially be “taught” to index pages based on specific keywords.
An essential defense in litigation is to try to exclude a testifying expert's conclusions, which a court may do if one can show that the expert lacks the minimum credentials required to be taken seriously.
Because their expert was plainly unqualified to testify on SEO topics, I submitted his inaccurate explanations of Google’s procedure as proof that he lacked the necessary qualifications.
This may sound harsh, but this untrained expert made several fundamental and apparent errors in presenting facts to the court. He erroneously represented my client as engaging in unfair trade practices using SEO while disregarding the plaintiff’s inappropriate activity (who was blatantly using black hat SEO, whereas my client was not).
This misunderstanding of the phases of search utilized by the top search engines is not unique to the opposing expert in my legal case.
There have been instances where renowned search marketers have confused the stages of the search process, resulting in inaccurate diagnoses of underperformance in the SERPs.
I’ve had people say things like, “I assume Google penalised us, therefore we can’t be in search results!” when, in reality, they had overlooked a critical setting on their web servers that rendered their site content unavailable to Google.
Automated penalties would fall under the ranking stage, but in actuality, these websites had crawling and rendering flaws that made indexing and ranking difficult.
When there are no manual action notices in Google Search Console, one should first check for common faults in each of the four stages that determine how search works.
The four stages of searching
There are several procedures involved in getting online material into your search results. In some ways, saying there are only a few separate processes to make it happen is an oversimplification.
Each of the four steps I discuss here has several subprocesses that can occur inside it.
Furthermore, there are significant processes that can be asynchronous to these, such as:
- Types of spam policing
- Incorporating elements into the Knowledge Graph and updating knowledge panels with the information
- Image optical character recognition processing
- Processing audio and video data to convert them to text.
- PageSpeed data evaluation and application
- And even more.
The following are the initial phases of search necessary to get web pages to appear in search results.
Crawling

Crawling happens when a search engine requests web pages from websites' servers.
Imagine Google and Microsoft Bing sitting at a computer, typing or clicking a link to a webpage in a browser window.
As a result, the search engines' computers visit webpages in much the same way that you do. When a search engine visits a webpage, it saves a copy of that page and records all of the links it discovers. After collecting that webpage, the search engine will visit the next link on its list of links yet to be visited.
This is described as “crawling” or “spidering,” which is appropriate given that the web is metaphorically a massive, virtual web of interconnected links.
Search engine data-gathering systems are known as “spiders,” “bots,” or “crawlers.”
“Googlebot” is Google’s principal crawling software, whereas “Bingbot” is Microsoft Bing’s. Each has its own set of specialized bots for accessing advertising (such as GoogleAdsBot and AdIdxBot), mobile sites, and so on.
This level of the search engines’ analysis of web pages appears easy, yet a lot is going on in this stage alone.
Consider how many web server systems may exist, each running a different operating system and version, as well as diverse content management systems (e.g., WordPress, Wix, Squarespace), and then each website’s unique modifications.
Many flaws can prevent search engine crawlers from crawling sites, which is why it is critical to understand the specifics involved in this stage.
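The crawl loop described above can be sketched in a few lines: fetch a page, store a copy, collect its links, and queue any unseen links for later. This is a minimal illustration of the concept, not how any search engine is actually implemented; the URLs and the `fetch` callback are hypothetical.

```python
# Minimal sketch of a breadth-first crawl: fetch a page, save a copy,
# record its links, and queue unseen links to visit later.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=10):
    """Crawl from start_url; `fetch(url)` must return the page's HTML."""
    frontier = [start_url]   # links still to be visited
    seen = {start_url}
    saved = {}               # url -> the stored copy of the page
    while frontier and len(saved) < max_pages:
        url = frontier.pop(0)
        html = fetch(url)
        saved[url] = html    # the "copy" a search engine keeps
        parser = LinkCollector(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved
```

A real crawler adds politeness delays, robots.txt checks, deduplication, and prioritization on top of this basic loop.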
Before the search engine can request and view a page, it must first discover a link to it somewhere. (Under some conditions, search engines have been known to detect otherwise concealed pages, such as by looking one step up in the link structure at a subdirectory level or through some limited website internal search forms.)
The links to web pages can be discovered by search engines using the following methods:
- When a website operator sends the link directly to the search engine or makes a sitemap available to the search engine.
- When other websites provide a link to the page.
- If the website already has some pages indexed, through links to the page from inside its website.
- Posts on social media.
- Links contained within documents.
- URLs discovered in written text but not hyperlinked
- Through the metadata of many types of files.
- And even more.
Sometimes, a website will advise search engines not to crawl one or more pages via its robots.txt file, which is located at the root of the domain on the web server.

Robots.txt files can contain multiple directives that tell search engines not to crawl particular pages, subdirectories, or the entire website.
Instructing search engines not to crawl a page or segment of a website does not rule out the possibility of such pages appearing in search results. Preventing sites from being crawled in this manner can significantly influence their potential to rank highly for their keywords.
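To illustrate how these directives work, here is a minimal sketch using Python's standard-library robots.txt parser. The rules and URLs are hypothetical examples, not taken from any real site.

```python
# Checking whether a hypothetical robots.txt permits crawling a URL,
# using Python's built-in parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Pages under /private/ are disallowed for all user agents.
print(parser.can_fetch("Googlebot", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/products/widget.html")) # True
```

Note that a well-behaved bot checks these rules before fetching; robots.txt is a request, not an enforcement mechanism.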
In some circumstances, search engines may have difficulty crawling a page if the site automatically prohibits bots. This can occur when the website’s systems identify that:
- The machine requests more pages in less time than a person could.
- The bot requests numerous pages simultaneously.
- The server IP address of a bot is geolocated inside a zone that the website has been set to exclude.
- The bot’s and/or other users’ page requests overload the server’s resources, causing page serving to slow down or fail.
When a server struggles to keep up with demand, search engine bots are built to automatically adjust the delay between requests.
For more prominent websites and those regularly changing the information on their pages, “crawl budget” can influence whether search bots scan all pages.
The web is an unlimited space of web pages with different update frequencies. Because search engines may not be able to visit every website on the internet, they prioritize which pages they will crawl.
Websites with many pages, or those that respond slowly, may exhaust their crawl budget before all of their pages are crawled if they carry a lower ranking weight than other websites.
Rendering

Rendering is when the search engine takes a crawled webpage and parses its content to determine how it would display in a browser. If the extra resources that contribute to composing the page are inaccessible to the search engine, that can alter how the search engine understands the content.

Search engines treat rendering as a subprocess within the crawling stage, but I categorize it as a separate stage because fetching a webpage and then parsing its material to determine how it would appear in a browser are two distinct operations.
Google's web rendering service is based on an evergreen version of the open-source Chromium browser engine, the same engine that underlies the Google Chrome browser.
Google keeps compressed copies of the pages in its repository. Microsoft Bing appears to do the same (but I have not found documentation confirming this). Some search engines may save a simplified version of a webpage that contains only the viewable text and no formatting.
I’ve also seen infinite-scrolling category pages on ecommerce websites perform poorly in search engines because the search engine couldn’t see as many product links.
Other factors can also obstruct rendering. For example, if one or more JavaScript or CSS files are unavailable to search engine bots because they sit in subdirectories disallowed by robots.txt, the page cannot be processed properly.
Googlebot and Bingbot will primarily ignore pages that need cookies. Pages that conditionally provide essential components dependent on cookies may also fail to render entirely or effectively.
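As a small illustration of why rendering matters, consider a hypothetical page whose raw HTML contains no product links at all; the links exist only after JavaScript runs. A crawler that stops at the raw HTML sees an empty container, while a renderer sees the links:

```html
<!-- Hypothetical page: the raw HTML a crawler fetches contains no
     product links; JavaScript injects them at load time. Only after
     rendering does the search engine discover these links. -->
<div id="products"></div>
<script>
  const products = [{name: "Widget", url: "/widget"}];
  document.getElementById("products").innerHTML = products
    .map(p => `<a href="${p.url}">${p.name}</a>`)
    .join("");
</script>
```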
Indexing

Once a page has been crawled and rendered, search engines process it further to determine whether it will be stored in the index and to grasp what the page is about.
The search engine index functions similarly to a word index found at the end of a book.
The book’s index will contain all the essential terms and subjects found in the book, alphabetically, along with a list of the page numbers where the words/topics will be located.
A search engine index comprises many keywords and keyword sequences, as well as a list of all the web pages where the keywords may be discovered.
The index is conceptually similar to a database lookup table, which may have been the initial structure used for search engines. However, the main search engines will likely utilize something a couple of generations more advanced to look up a term and return all URLs relevant to the word.
Using an index to look up all pages associated with a term is a time-saving architecture, since scanning every webpage for a keyword in real time, each time someone searches, would take an impractically long amount of time.
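The index concept described above can be illustrated with a toy inverted index in a few lines of Python. The URLs and page text are made up, and real search engine indexes are vastly more sophisticated, but the lookup-instead-of-scan idea is the same.

```python
# Toy inverted index: map each term to the set of URLs whose text
# contains it, so answering a query is a lookup, not a full scan.
from collections import defaultdict

pages = {
    "https://example.com/coffee": "fresh roasted coffee beans",
    "https://example.com/tea":    "loose leaf tea and coffee mugs",
}

index = defaultdict(set)
for url, text in pages.items():
    for term in text.split():
        index[term].add(url)

# A query for "coffee" is now a single dictionary lookup.
print(sorted(index["coffee"]))
# → ['https://example.com/coffee', 'https://example.com/tea']
```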
For various reasons, not all crawled pages will be retained in the search index. For example, if a page carries a robots meta tag with a “noindex” directive, the search engine is instructed not to include the page in the index.
Similarly, a web page’s HTTP header may include an X-Robots-Tag instructing search engines not to index the page.
In some cases, a web page’s canonical tag may inform a search engine that a different page from the current one is to be regarded as the primary version of the page, resulting in the removal of other, non-canonical versions of the page from the index.
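For reference, these indexing directives look like this in practice (the canonical URL here is a hypothetical example):

```html
<!-- robots meta tag in the page's <head>: keep this page out of the index -->
<meta name="robots" content="noindex">

<!-- canonical tag: tell search engines a different URL is the primary version -->
<link rel="canonical" href="https://example.com/primary-page/">
```

The same noindex instruction can instead be sent as an HTTP response header, which is useful for non-HTML files such as PDFs:

```
X-Robots-Tag: noindex
```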
Google has also said low-quality web pages may be removed from the index (duplicate content pages, thin content pages, and pages containing all or too much irrelevant content).
There is also a long history of websites with low aggregate PageRank not having all of their webpages indexed, implying that larger websites with insufficient external links may not be indexed thoroughly.
A website may not have all of its pages indexed if its crawl budget is insufficient.
Diagnosing and correcting non-indexed pages is a critical component of SEO, so it is advisable to thoroughly investigate the many issues that can impede webpage indexing.
Ranking

Ranking is the stage of search engine processing that likely receives the most attention.
Once a search engine has a list of all the web pages related to a specific keyword or keyword phrase, it must decide how to organize those pages when a search for the keyword is performed.
If you work in the SEO sector, you are probably already aware of some aspects of the ranking process. An “algorithm” is another term for the search engine ranking process.
The complexity associated with the ranking step of the search warrants several articles and books to discuss.
Several factors might influence a webpage’s ranking in search results. According to Google, their algorithm employs more than 200 ranking variables.
Within several of those criteria, there may be up to 50 “vectors” – items that can modify the impact of a single ranking signal on ranks.
PageRank was Google’s first ranking algorithm, developed in 1996. It was based on the idea that links to a webpage, as well as the relative relevance of the sources of the links leading to that webpage, could be assessed to establish the page’s ranking strength compared to all other sites.
A metaphor for this is that links are viewed much like votes, and pages with the most votes will rank higher than those with fewer links/votes.
In 2022, much of the original PageRank algorithm’s DNA remains in Google’s ranking algorithm. That link analysis technique also impacted many other search engines that developed similar approaches.
Google's original approach required repeatedly processing the web's link graph, passing the PageRank value across pages hundreds of times before the ranking process was complete. This iterative calculation cycle could take nearly a month to complete across millions of pages.

Nowadays, new page links are introduced every day, and Google calculates rankings incrementally, allowing pages and changes to be weighed much more quickly without a month-long link calculation procedure.
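The iterative calculation can be sketched with a toy version of the original PageRank idea: each page repeatedly passes its score along its outbound links until the scores settle. The three-page link graph and damping factor here are illustrative only, not Google's actual implementation.

```python
# Toy iterative PageRank over a made-up three-page link graph.
links = {
    "a": ["b", "c"],   # page a links to b and c
    "b": ["c"],
    "c": ["a"],
}

damping = 0.85
ranks = {page: 1.0 / len(links) for page in links}  # start evenly

for _ in range(50):  # repeat until the scores settle
    new_ranks = {}
    for page in links:
        # Each linking page splits its score across its outbound links.
        incoming = sum(
            ranks[src] / len(outs)
            for src, outs in links.items()
            if page in outs
        )
        new_ranks[page] = (1 - damping) / len(links) + damping * incoming
    ranks = new_ranks

print({page: round(score, 3) for page, score in ranks.items()})
```

Page "c" ends up ranked highest because it collects links (votes) from both of the other pages, matching the links-as-votes metaphor above.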
Furthermore, links are evaluated in a sophisticated manner, with purchased links, exchanged links, spam links, and links lacking editorial endorsement being discounted or reduced in ranking power.
A wide range of criteria beyond links impacts the rankings, including:
- E-A-T, which stands for Expertise, Authoritativeness, and Trustworthiness.
- Personal search history.
- Whether pages are delivered encrypted over HTTPS (using Secure Sockets Layer, or SSL) or unencrypted.
- Page loading speed.
- And even more.
Understanding the essential stages of search is a must if you want to work as an SEO professional.
Some social media influencers believe that not employing a candidate because they don’t understand the differences between crawling, rendering, indexing, and ranking is “going too far” or “gate-keeping.”
It’s a good idea to understand the differences between these procedures. However, I would not consider having a fuzzy knowledge of such words to be a deal-breaker.
SEO specialists come from a range of backgrounds and degrees of expertise. What matters is that they are trainable enough to learn and achieve an essential degree of knowledge.