Here is the short version of the actual mechanics of a search engine. This is good background to have if you’re working on SEO, but in practice it isn’t strictly necessary: you can rank a site well in a search engine without understanding how it crawls and indexes the web.
Google has an automated program, called Googlebot, which crawls the web. Googlebot has a list of every URL that it knows of and it regularly checks back with each of those URLs. For each URL on the list, it downloads the entire page from the server and passes that along to the Google Indexer. Then Googlebot looks over the page for links and adds any new links to its list of URLs to visit.
For URLs that it has discovered before, Googlebot has a schedule of how often to revisit the page. In general, the more links there are to a page, and the more often the page changes, the more frequently Googlebot will visit the page. I’ve had blogs that received a constant stream of comments where Googlebot visited every five minutes to check for updates. I’ve worked on other sites where Googlebot visits most pages just once a month. The decision is made on a page-by-page basis: one page on a site could get visited every day, while another is visited only once a month.
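The crawl-and-discover loop described above can be sketched in a few lines of Python. This is a toy model, not Googlebot: the “web” here is a hard-coded dictionary standing in for real HTTP downloads, and all the URLs in it are made up for the example.

```python
from collections import deque
from html.parser import HTMLParser

# Toy "web": URL -> HTML body. A stand-in for fetching pages over HTTP.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a> <a href="http://example.com/b">B</a>',
    "http://example.com/b": "no links here",
}

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(seed):
    """Breadth-first crawl: download each known URL once, queue any new links."""
    frontier = deque([seed])
    seen = {seed}
    pages = {}
    while frontier:
        url = frontier.popleft()
        html = FAKE_WEB.get(url, "")  # real Googlebot downloads this from the server
        pages[url] = html             # this is what gets handed to the Indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:      # only brand-new URLs join the list to visit
                seen.add(link)
                frontier.append(link)
    return pages

pages = crawl("http://example.com/")
```

Starting from the single seed URL, the crawler discovers and fetches all three pages, because each page’s links lead it to the next.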
In general you can expect most pages to be visited at least once every 30 days. If the page is substantially different when Googlebot visits again, it will make a note of that and maybe consider visiting sooner next time. If the page is mostly the same as the previous crawl, Googlebot might consider waiting longer before the next crawl (though 30 days seems a common cap even for pages that never change, as long as they have decent links pointing to them). In this way Google tries to balance conserving crawling resources with keeping the most up-to-date index possible.
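A minimal sketch of that kind of revisit schedule, assuming a simple double-or-halve policy. The specific numbers and the policy itself are illustrative guesses, not Google’s actual algorithm; only the 30-day cap comes from the discussion above.

```python
MIN_DAYS, MAX_DAYS = 1, 30  # 30-day cap as discussed; the floor is an assumption

def next_interval(current_days, page_changed):
    """Adaptive revisit schedule: crawl sooner when a page has changed,
    back off (up to the cap) when it hasn't."""
    if page_changed:
        return max(MIN_DAYS, current_days // 2)
    return min(MAX_DAYS, current_days * 2)
```

So a page crawled every 8 days that keeps changing drifts toward daily crawls, while a static page drifts out toward the 30-day cap and stays there.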
As an interesting aside, Google deliberately slows down Googlebot and prevents it from crawling pages as fast as it can, because it doesn’t want to stress website servers. In talking about their crawling capacity, a Google engineer once mentioned in passing that they probably have the ability to take down the entire internet if they really unleashed their crawler. Google is filled with simultaneously fun and terrifying facts like that.
Once Googlebot has downloaded a page, it passes it along to the Indexer. The Indexer looks through the page and makes a note of just about every word of text it finds (converting them all to lowercase as it goes).
These words all get stored in the index database: each word could be a keyword in a search, so the database stores every word of the document (except stop words like and, of, or, the, etc.) with a note of what page it was found on, and where in the page it was found.
Thus the database of the Indexer includes a massive list of every word there is (and many that aren’t, given the number of typos on the web). Each word is associated with a list of every web page on the internet that includes that word. This is a really, really big database.
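As a rough illustration, here is what such an inverted index might look like in Python: lowercase every word, skip the stop words, and record each remaining word with the page and position where it occurred. The stop-word list and page names are made up for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"and", "of", "or", "the", "a"}  # tiny illustrative list

def build_index(pages):
    """Inverted index: word -> list of (url, position) pairs."""
    index = defaultdict(list)
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            if word not in STOP_WORDS:
                index[word].append((url, position))
    return index

index = build_index({
    "page1.html": "The quick brown fox",
    "page2.html": "Quick thinking and a quick fox",
})
```

Looking up `index["quick"]` now returns every page (and position within the page) where “quick” appears, which is exactly the lookup a search for that keyword needs.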
It is possible to build an entire interactive website in Flash, but then the only thing that shows up in the HTML source is a call to a single Flash file. Sites like these have no chance of ranking, because they literally don’t have a single word of text to be stored in the Indexer’s database.
Images have a similar problem. Either through a desire to have more control over their graphic design or through simple ignorance, many sites use image files to display text on their site. Rather than actually having HTML with CSS-formatted text, they type into a graphics program and save the result as an image file. The Google Indexer can see that you have an image there, and it can see what the image name is, but it has only a limited idea of what text is on the image. Thus any text on that image doesn’t get included in its database. Google has made some progress on this front, and now has some ability to actually read text on images, but this appears not to carry the same weight as text on the page – it’s not clear if this text makes it into the database, or is used elsewhere in the algorithms (possibly only for image search). For all practical SEO purposes, text on images does not exist.
To get an idea of what Google’s Indexer sees, you can directly view the source code of any webpage. Depending on your browser, you can view the source code of a page by pressing Control-U, by clicking View > View Source, or by right-clicking on a page and selecting View Source.
Alternatively you can use a text-based browser like Lynx (or the online version, Lynx Viewer).
The final part of the search engine is where the magic happens. A user enters a search query into the search box, then Google’s Query Processor goes to the index and retrieves every page that seems to match the query. Then some magic happens where Google decides in what order to rank the results.
It’s not magic of course, and what happens at this stage isn’t even a complete unknown. Google says they use over 200 ranking factors to determine how to rank search results. We’ll discuss the major factors later. In fact, this entire site is really mostly an explanation of what the “magic happens” part of Google really is.
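To make the retrieval step concrete, here is a toy query processor in Python. It looks up each query word in an inverted index of word-to-(page, position) lists like the one described earlier, then ranks pages by a single crude factor (raw count of matching occurrences). Google’s real ranking blends those 200+ factors, so treat this purely as a sketch of the data flow; the index contents below are made up.

```python
from collections import Counter

def search(query, index):
    """Look up each query word in the inverted index, then rank pages by how
    many query-word occurrences each page contains (a stand-in for Google's
    200+ ranking factors)."""
    scores = Counter()
    for word in query.lower().split():
        for url, _position in index.get(word, []):
            scores[url] += 1
    return [url for url, _score in scores.most_common()]

# Toy index: word -> list of (page, position) pairs.
index = {
    "quick": [("page1.html", 1), ("page2.html", 0), ("page2.html", 4)],
    "fox":   [("page1.html", 3), ("page2.html", 5)],
}
results = search("quick fox", index)
# page2.html ranks first (3 matching occurrences vs 2 for page1.html)
```

The expensive part in practice isn’t this logic; it’s doing the lookup and scoring across an index of the entire web fast enough to answer in a fraction of a second.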
The truly impressive thing about the Query Processor is its speed. Using only normal desktop computers (but lots and lots and lots of them) Google is able to take your query, check the index, get a list of millions of pages that include the same words, figure out which are probably best for you, and rank every one of the millions of results in order. And Google does this in a fraction of a second.
Google is obsessed with speed, by the way. Early experiments at Google showed them that delays of even a fraction of a second resulted in users making fewer searches. Google believes that speed is crucial to a good user experience — and it thinks that’s true of your site as well.