Sunday, August 22, 2010

Search Engine Basics

What do you do when you need to find something on the Internet?
In most cases, you pop over to one of the major search engines
and type in the term or phrase that you’re looking for and then
click through the results, right? But of course search engines weren’t always
around.
In its infancy, the Internet wasn’t what you think of when you use it now. In
fact, it was nothing like the web of interconnected sites that’s become one of
the greatest business facilitators of our time. Instead, what was called the
Internet was actually a collection of FTP (File Transfer Protocol) sites that
users could access to download (or upload) files.
To find a specific file in that collection, users had to navigate through each
file. Sure, there were shortcuts. If you knew the right people — that would be
the people who knew the exact address of the file you were looking for — you
could go straight to the file. That’s assuming you knew exactly what you were
looking for.
The whole process made finding files on the Internet a difficult, time-consuming
exercise in patience. But that was before a student at McGill
University in Montreal decided there had to be an easier way. In 1990,
Alan Emtage created the first search tool used on the Internet. His creation,
an index of files on the Internet, was called Archie.
If you’re thinking Archie, the comic book character created in 1941, you’re
a little off track (at least for now). The name Archie was used because the
file name Archives was too long. Later, Archie’s pals from the comic book
series (Veronica and Jughead) came onto the search scene, too, but we’ll get
to that shortly.
Archie wasn’t actually a search engine like those that you use today. But at the time, it was a program
many Internet users were happy to have. The program basically downloaded directory listings for all
of the files that were stored on anonymous FTP sites in a given network of computers. Those listings
were then plugged into a searchable database of file listings.
The search capabilities of Archie weren’t as fancy as the natural language capabilities you’ll find in
most common search engines today, but at the time it got the job done. Archie indexed computer
files, making them easier to locate.
In 1991, however, another student named Mark McCahill, at the University of Minnesota, decided
that if you could search for files on the Internet, then surely you could also search plain text for
specific references in the files. Because no such application existed, he created Gopher, a program
that indexed the plain-text documents that later became the first web sites on the public Internet.
With the creation of Gopher, there also needed to be programs that could find references within
the indexes that Gopher created, and so Archie’s pals finally rejoined him. Veronica (Very Easy
Rodent-Oriented Net-wide Index to Computerized Archives) and Jughead (Jonzy’s Universal
Gopher Hierarchy Excavation and Display) were created to search the files that were stored in
the Gopher Index System.
Both of these programs worked in essentially the same way, allowing users to search the indexed
information by keyword.
From there, search as you know it began to mature. The first real search engine, in the form that we
know search engines today, didn’t come into being until 1993. It was developed by Matthew Gray,
and it was called Wandex. Wandex was the first program to both index and search the index of
pages on the Web. It was also the first program to crawl the Web, a technique that later became
the basis for all search crawlers. And from there, search engines took on a life of their own. From 1993
to 1998, the major search engines that you’re probably familiar with today were created:
Excite — 1993
Yahoo! — 1994
WebCrawler — 1994
Lycos — 1994
Infoseek — 1995
AltaVista — 1995
Inktomi — 1996
Ask Jeeves — 1997
Google — 1997
MSN Search — 1998


Today, search engines are sophisticated programs, many of which allow you to search all manner of
files and documents using the same words and phrases you would use in everyday conversations.
It’s hard to believe that the concept of a search engine is less than two decades old. Especially
considering what you can use one to find these days!
What Is a Search Engine?
Okay, so you know the basic concept of a search engine. Type a word or phrase into a search box
and click a button. Wait a few seconds, and references to thousands (or hundreds of thousands) of
pages will appear. Then all you have to do is click through those pages to find what you want. But
what exactly is a search engine, beyond this general concept of “seek and ye shall find”?
It’s a little complicated. On the back end, a search engine is a piece of software that uses automated
applications to collect information about web pages. The information collected usually includes keywords or
phrases that are possible indicators of what is contained on the web page as a whole, the URL of
the page, the code that makes up the page, and links into and out of the page. That information
is then indexed and stored in a database.
On the front end, the software has a user interface where users enter a search term — a word or
phrase — in an attempt to find specific information. When the user clicks a search button, an
algorithm then examines the information stored in the back-end database and retrieves links to
web pages that appear to match the search term the user entered.
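To make that back-end/front-end split concrete, here is a minimal sketch, in Python, of a toy
in-memory index. The add_page and search functions, the whitespace tokenizing, and the example
URLs are all invented for illustration; a real engine stores far more data points per URL and
applies proprietary ranking.

# A toy illustration of the back end (the index) and the front end (the query).
# Hypothetical throughout; real engines store far more per URL and rank results.

from collections import defaultdict

# Back end: map each keyword to the set of URLs it appears on.
index = defaultdict(set)

def add_page(url, text):
    """Collect keywords from a page and store them in the index."""
    for word in text.lower().split():
        index[word].add(url)

def search(term):
    """Front end: look up a term and return links that appear to match."""
    return sorted(index.get(term.lower(), set()))

add_page("http://example.com/puppies", "puppies for sale cute puppies")
add_page("http://example.com/kittens", "kittens and cats")
print(search("puppies"))  # ['http://example.com/puppies']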
CROSS-REF: You can find more information about web crawlers, spiders, and robots in Chapter 14.
The process of collecting information about web pages is performed by an agent called a crawler,
spider, or robot. The crawler moves from URL to URL across the Web, collecting the keywords and
phrases on each page, which are then included in the database that powers a search engine.
Considering that the number of sites on the Web went over 100 million some time ago and
is increasing by more than 1.5 million sites each month, that’s like your brain cataloging every
single word you read, so that when you need to know something, you think of that word and
every reference to it comes to mind.
In a word . . . overwhelming.
Anatomy of a Search Engine
By now you probably have a fuzzy picture of how a search engine works. But there’s much more to it
than just the basic overview you’ve seen so far. In fact, search engines have several parts. Unfortunately,
it’s rare that you find an explanation for just how a search engine is made — and that information is
vitally important to succeeding with search engine optimization (SEO).
Query interface
The query interface is what most people are familiar with, and it’s probably what comes to mind
when you hear the term “search engine.” The query interface is the page that users see when they
navigate to a search engine to enter a search term.
There was a time when the search engine interface looked very much like the early Ask.com page:
a simple page with a search box and a button to activate the search.
Today, many search engines on the Web have added much more personalized content in an attempt
to capitalize on the real estate available to them. For example, Yahoo! Search
allows users to personalize their pages with a free e-mail account, weather information, news, sports,
and many other elements designed to make users want to return to that site to conduct their web
searches.
One other option users have for customizing the interfaces of their search engines is a capability
like the one Google offers. The Google search engine has a customizable interface to which users
can add different gadgets. These gadgets allow users to add features to their customized Google
search home that meet their own personal needs or tastes.
When it comes to search engine optimization, Google’s user interface offers you the most
opportunity to reach your target audience, because it goes beyond search alone: if there is a
useful tool or feature available on your site, you can give users access to that tool or feature
through the Application Programming Interface (API) that Google makes available. This puts your
name in front of users on a daily basis.
CROSS-REF: You can find more information about Google APIs in Appendix A, in the section
“Optimization for Google.”
For example, a company called PDF24.org has a Google gadget that allows users to turn their documents
into PDF files, right from their Google home page once the gadget has been added. If the
point of search engine optimization is ultimately to get your name in front of as many people as
possible, as often as possible, then making a gadget available for addition to Google’s personalized
home page can only further that goal.
Crawlers, spiders, and robots
The query interface is the only part of a search engine that the user ever sees. Every other part of
the search engine is behind the scenes, out of view of the people who use it every day. That doesn’t
mean it’s not important, however. In fact, what’s in the back end is the most important part of the
search engine.
CROSS-REF: There’s more in-depth information about crawlers, spiders, and robots in Chapter 14.
If you’ve spent any time on the Internet, you may have heard a little about spiders, crawlers, and
robots. These little creatures are programs that crawl around the Web, cataloging data so that
it can be searched. In the most basic sense all three programs — crawlers, spiders, and robots — are
essentially the same. They all “collect” information about every web URL they visit.
This information is then cataloged according to the URL on which it was found and stored in a
database. Then, when a user uses a search engine to locate something on the Web, the references
in the database are searched and the search results are returned.
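To give you a feel for that crawl-and-catalog loop, here is a bare-bones crawler sketch using
only Python’s standard library. The seed URL, the page limit, and the LinkAndTextParser helper
are all hypothetical; a production crawler also respects robots.txt, throttles its requests, and
handles broken HTML far more gracefully.

# A bare-bones crawler: fetch a page, record its words under its URL,
# follow its links, repeat. Purely illustrative.

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkAndTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())

def crawl(seed, max_pages=10):
    catalog, queue, seen = {}, [seed], {seed}
    while queue and len(catalog) < max_pages:
        url = queue.pop(0)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkAndTextParser()
        parser.feed(html)
        catalog[url] = parser.words  # cataloged by the URL it was found on
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return catalog

# catalog = crawl("http://example.com/")  # hypothetical seed URL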
Databases
Every search engine contains or is connected to a system of databases, where data about each URL
on the Web (collected by crawlers, spiders, or robots) is stored. These databases are massive storage
areas that contain multiple data points about each URL.
The data might be arranged in any number of different ways, and is ranked and retrieved according
to a method that is usually proprietary to the company that owns the search engine.
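As a rough illustration of what such storage might look like, the following sketch uses SQLite to
hold a few data points per URL. The pages table and its columns are invented for this example;
each engine’s actual schema and ranking method are proprietary. It also previews the SQL-style
searching discussed in the next section.

# One hypothetical way to persist crawler output: a keyword table keyed by URL.
# The schema is invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")  # would live on disk in any real system
conn.execute("""
    CREATE TABLE pages (
        url       TEXT,
        keyword   TEXT,
        frequency INTEGER
    )
""")
conn.execute("INSERT INTO pages VALUES ('http://example.com/puppies', 'puppies', 5)")
conn.execute("INSERT INTO pages VALUES ('http://example.com/kittens', 'kittens', 3)")

# Retrieval: find URLs associated with a keyword, best matches first.
rows = conn.execute(
    "SELECT url FROM pages WHERE keyword = ? ORDER BY frequency DESC",
    ("puppies",),
).fetchall()
print([url for (url,) in rows])  # ['http://example.com/puppies']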
Search algorithms
All of the parts of the search engine are important, but the search algorithm is the cog that makes
everything work. It might be more accurate to say that the search algorithm is the foundation on
which everything else is built. How a search engine works is based on its search algorithm, that
is, the way data is discovered and returned to the user.
In very general terms, a search algorithm is a problem-solving procedure that takes a problem, evaluates
a number of possible answers, and then returns the solution to that problem. A search algorithm
for a search engine takes the problem (the word or phrase being searched for), sifts through a database
that contains cataloged keywords and the URLs those words are related to, and then returns
pages that contain the word or phrase that was searched for, either in the body of the page or in a
URL that points to the page.
This neat little trick is accomplished differently according to the algorithm that’s being used. There are
several classifications of search algorithms, and each search engine uses algorithms that are slightly
different. That’s why a search for one word or phrase will yield different results from different search
engines. Some of the most common types of search algorithms include the following:
List search: A list search algorithm searches through specified data looking for a single
key. The data is searched in a very linear, list-style method. The result of a list search is
usually a single element, which means that searching through billions of web pages this way
would be very time-consuming, though it would yield a small, precise result. (A short sketch
contrasting list and tree searches follows this list.)
Tree search: Envision a tree in your mind. Now, examine that tree either from the roots out
or from the leaves in. This is how a tree search algorithm works. The algorithm searches a
data set from the broadest to the most narrow, or from the most narrow to the broadest.
Data sets are like trees; a single piece of data can branch to many other pieces of data, and
this is very much how the Web is set up. Tree searches, then, are more useful when conducting
searches on the Web, although they are not the only searches that can be successful.
SQL search: One of the difficulties with a tree search is that it’s conducted in a hierarchical
manner, moving from one point to another according to the ranking of the data being
searched. A SQL (pronounced “sequel”) search allows data to be searched in a
non-hierarchical manner, which means that any subset of the data can be queried directly.
Informed search: An informed search algorithm looks for a specific answer to a specific
problem in a tree-like data set. The informed search, despite its name, is not always the
best choice for web searches because of the general nature of the answers being sought.
Instead, informed search is better used for specific queries in specific data sets.
Adversarial search: An adversarial search algorithm looks for all possible solutions to a
problem, much like finding all the possible solutions in a game. This algorithm is difficult
to use with web searches, because the number of possible solutions to a word or phrase
search is nearly infinite on the Web.
Constraint satisfaction search: When you think of searching the Web for a word or
phrase, the constraint satisfaction search algorithm is most likely to satisfy your desire to
find something. In this type of search algorithm, the solution is discovered by meeting a
set of constraints, and the data set can be searched in a variety of different ways that do not
have to be linear. Constraint satisfaction searches can be very useful for searching the Web.
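Here is the sketch promised above, contrasting a textbook list (linear) search with a tree-style
binary search over sorted data. Both functions are simplified classroom versions, not anything a
search engine actually ships.

# List search vs. tree search, in miniature. A list search checks elements
# one by one; a tree search repeatedly narrows a sorted data set.

def list_search(items, key):
    """Linear scan: O(n). Fine for small data, hopeless at web scale."""
    for item in items:
        if item == key:
            return item
    return None

def tree_search(sorted_items, key):
    """Binary search over sorted data: O(log n), broad to narrow."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == key:
            return sorted_items[mid]
        if sorted_items[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

words = sorted(["aardvark", "kitten", "puppy", "zebra"])
print(list_search(words, "puppy"))  # puppy
print(tree_search(words, "puppy"))  # puppy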
These are only a few of the various types of search algorithms that are used when creating search
engines. And very often, more than one type of search algorithm is used, or as happens in most
cases, some proprietary search algorithm is created. The key to maximizing your search engine
results is to understand a little about how each search engine you’re targeting works. Only when
you understand this can you know how to maximize your exposure to meet the search requirements
for that search engine.
Retrieval and ranking
For a web search engine, the retrieval of data is a combined effort of the crawler (or spider or
robot), the database, and the search algorithm. Those three elements work in concert to retrieve the
word or phrase that a user enters into the search engine’s user interface. And as noted earlier, how
that works can be a proprietary combination of technologies, theories, and coding whizbangery.
The really tricky part comes in the results ranking. Ranking is also what you’ll spend the most time
and effort trying to affect. Your ranking in a search engine determines how often people see your page,
which affects everything from revenue to your advertising budget. Unfortunately, how a search engine
ranks your page or pages is a tough science to pin down.
The most that you can hope for, in most cases, is to make an educated guess as to how a search
engine ranks its results, and then try to tailor your page to meet those criteria. But keep in mind
that, although retrieval and ranking are listed as separate subjects here, they’re actually part of the
search algorithm. The separation is to help you better understand how search engines work.
Ranking plays such a large part in search engine optimization that you’ll see it frequently in this book.
You’ll look at ranking from every possible facet before you reach the last page. But for now, let’s look
at just what affects ranking. Keep in mind, however, that different search engines use different ranking
criteria, so the importance each of these elements plays will vary. (A toy scoring sketch follows this list.)
Location: Location doesn’t refer here to the location (as in the URL) of a web page. Instead,
it refers to the location of key words and phrases on a web page. So, for example, if a user
searches for “puppies,” some search engines will rank the results according to where on the
page the word “puppies” appears. Obviously, the higher the word appears on the page, the
higher the rank might be. So a web site that contains the word “puppies” in the title tag will
likely appear higher than a web site that is about puppies but does not contain the word in
the title tag. What this means is that a web site that’s not designed with SEO in mind will
likely not rank where you would expect it to rank. The site www.puppies.com is a good
example of this. In a Google search, it appears ranked fifth rather than first, potentially
because it does not contain the key word in the title tag.
Frequency: The frequency with which the search term appears on the page may also affect
how a page is ranked in search results. So, for example, on a page about puppies, one that
uses the word five times might be ranked higher than one that uses the word only two or
three times. When word frequency became a factor, some web site designers began using
hidden words hundreds of times on pages, trying to artificially boost their page rankings.
Most search engines now recognize this as keyword spamming and ignore or even refuse to
list pages that use this technique.
Links: One of the more recent ranking factors is the type and number of links on a web
page. Links that come into the site, links that lead out of the site, and links within the site
are all taken into consideration. It would follow, then, that the more links you have on your
page or leading to your page the higher your rank would be, right? Again, it doesn’t necessarily
work that way. More accurately, the number of relevant links coming into your page,
versus the number of relevant links within the page, versus the number of relevant links
leading off the page will have a bearing on the rank that your page gets in the search results.
Click-throughs: One last element that might determine how your site ranks against others
in a search is the number of click-throughs your site has versus click-throughs for other
pages that are shown in page rankings. Because the search engine cannot monitor site
traffic for every site on the Web, some monitor the number of clicks each search result
receives. The rankings may then be repositioned in a future search, based on this interaction
with the users.
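Here is the toy scorer mentioned above, blending the four factors into a single number. The
weights and caps are pure invention; every real engine’s blend is proprietary, which is exactly
why ranking is so hard to pin down from the outside.

# A toy ranking score combining the four factors above. The weights are
# invented for illustration only.

def toy_rank(position_of_keyword, frequency, relevant_links, click_throughs):
    """Higher is better. position_of_keyword: 0 = in the title tag, larger = lower on the page."""
    location_score  = 1.0 / (1 + position_of_keyword)   # earlier on the page is better
    frequency_score = min(frequency, 10) / 10           # capped, to blunt keyword spamming
    link_score      = min(relevant_links, 50) / 50      # relevant links into, within, and out of the page
    click_score     = min(click_throughs, 1000) / 1000  # observed clicks on the search result
    return (0.4 * location_score + 0.2 * frequency_score
            + 0.3 * link_score + 0.1 * click_score)

# A page with "puppies" in its title tag beats one that buries the word:
print(toy_rank(0, 5, 20, 300) > toy_rank(12, 5, 20, 300))  # True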
Page ranking is a very precise science. And it differs from search engine to search engine. To create
the best possible SEO for your site, it’s necessary to understand how these page rankings are made
for the search engines you plan to target. Those factors can then be taken into consideration and used
to your advantage when it’s time to create, change, or update the web site that you want to optimize.
