Google’s own vice-president for core search was recently quoted as saying that it takes a good engineer two years to understand search. This brief explanation therefore can only be the simplest possible overview of search algorithms.
Before Google, there were many different attempts at producing useful search engine results: rankings by visitors, traffic, and users; paid placement; human categorization; keyword analysis; authoritativeness; and many others. The primary revolution that Google brought to search was to treat the web as a social network.
Imagine for a moment that you want to hire a gardener to maintain the front yard of your home. For the sake of argument, you do not have an Internet connection or a phone book. How do you find a gardener who is reputable, reliable and affordable?
<p>I like <a href="http://dave.com">Dave</a>, I guess.
He’s a pretty good <a href="http://dave.com/teaching">teacher</a>.</p>
Versus:
<p>I really don’t like <a href="http://dave.com">Dave</a>’s
<a href="http://dave.com/teaching">classes</a>.</p>
The obvious strategy is to start asking people: your neighbours and friends. Walking through your neighbourhood you might see a particularly fetching arrangement of flowers and a well-maintained lawn and ask the owners who their gardener is. Over time, one name will likely come up in conversation more than others. That’s likely to be the gardener you will first approach.
In other words, your social network creates its own recommendations through word-of-mouth. It is not perfect: there will be prejudices, information asymmetry, and biases. But for all its faults the system is useful, and very powerful.
People still talk about each other on the web, but the actual connection is supplied by a link: the <a> tag, with its associated href attribute value. At the simplest level, then, one can tell that a website is popular simply by counting the number of outside links directed to it. This method of search is successful because it works like a real social network: it is robust, redundant, and difficult to trick, since the links have to come in from outside sources, and it is difficult and expensive to pay people to hype your website.
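As a rough illustration of the link-counting idea, here is a minimal Python sketch. It is not Google's actual algorithm: the function name count_inbound_links, the list of (source, target) pairs, and the example.com-style addresses are all assumptions made up for this example. It simply tallies how many links arrive at each site from a different site, ignoring links a site makes to itself.

```python
from collections import Counter
from urllib.parse import urlparse


def count_inbound_links(links):
    """Count links pointing at each host, ignoring a site's links to itself.

    `links` is a hypothetical list of (source_url, target_url) pairs,
    standing in for whatever a crawler might have collected.
    """
    counts = Counter()
    for source_url, target_url in links:
        source_host = urlparse(source_url).netloc
        target_host = urlparse(target_url).netloc
        # Only links arriving from a different host count as "outside" links.
        if source_host and target_host and source_host != target_host:
            counts[target_host] += 1
    return counts


# Made-up sample data: two outside links to dave.com, one internal link.
links = [
    ("http://alice.example/blog", "http://dave.com/"),
    ("http://bob.example/links", "http://dave.com/teaching"),
    ("http://dave.com/", "http://dave.com/teaching"),  # internal, ignored
    ("http://carol.example/", "http://alice.example/blog"),
]

ranking = count_inbound_links(links).most_common()
print(ranking)
```

In this toy data set dave.com comes out ahead with two outside links, while its link to its own teaching page is not counted.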
The context of the link is derived by understanding the content between the opening and closing <a> tags. By looking at the words used, we can surmise what the link is about. For example, if the majority of links to this blog used the words “Dudley Storey”, “web development” and “teacher” as content, we might be able to guess that thenewcode.com is a site written by Dudley Storey, a web development teacher.
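The word-association idea can be sketched with Python's standard html.parser module. This is only an illustration under stated assumptions: the class name AnchorTextCollector and the three sample pages are invented for the example, and a real search engine would weigh far more than raw word counts. The sketch collects the words found between opening and closing <a> tags and tallies them for each link target.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Tally the words used as link text, grouped by the link's href."""

    def __init__(self):
        super().__init__()
        self.words_by_target = defaultdict(Counter)
        self._current_href = None  # href of the <a> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

    def handle_data(self, data):
        # Only text that appears inside a link counts as anchor text.
        if self._current_href:
            for word in re.findall(r"[a-z']+", data.lower()):
                self.words_by_target[self._current_href][word] += 1


# Hypothetical pages that all link to the same site.
pages = [
    '<p>I learned CSS from <a href="http://thenewcode.com">Dudley Storey</a>.</p>',
    '<p>A good <a href="http://thenewcode.com">web development teacher</a>.</p>',
    '<p>See <a href="http://thenewcode.com">Dudley Storey, teacher</a>.</p>',
]

collector = AnchorTextCollector()
for page in pages:
    collector.feed(page)
collector.close()

word_counts = collector.words_by_target["http://thenewcode.com"]
print(word_counts.most_common(3))
```

Words outside the links, such as “CSS” in the first sample page, are ignored, so the tally reflects only what people chose to use as link text.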
(Note that this word association technique can be used for nefarious purposes, too. From late 2003 until early 2007, typing the words “miserable failure” into Google would bring up www.whitehouse.gov as the first search result. This was achieved through a blogging campaign: any time George W. Bush was mentioned in a blog post or comment, participants would use the term “miserable failure” and link those words to the White House website. Over time, Google associated “miserable failure” with the Bush White House. This technique was known as “Google-bombing”, and the company took active steps to counter it, such as punishing sites that leave comment spam in unrelated pages.)
Google cannot track every incoming link: content and links hidden behind paywalls and login pages are invisible to its crawler.