Lesson #33
INFORMATION RETRIEVAL - SEARCHING TEXT
This lesson will
introduce you to the advanced techniques of searching
online textual information. It will also address some
basic principles of Information Retrieval Science as
they relate to Website optimization.
WHERE DID I PUT THOSE INSTRUCTIONS?
If you're like me, you're
always looking for something. I jot things down on
scraps of paper that get left around everywhere. I
create little tornados of paper scraps as I move around
the house hunting the one scrap of paper that I need at
the moment. When I started using computers, I carried
this bad habit over into my Cyberworld. I pull up
notepad files and jot things down when in a hurry and
save them to whatever location seems logical to me at
the time. The only problem is that, when it comes time to find
a note, I can never remember where I saved it.
Fortunately, it is easier
to find things on a computer than it is in the physical
world, but you have to know the fundamentals of text
searching to make it work.
Sometimes it is easy. If
you are searching for the notes of your last phone call
with a specific prospect, you can search through all the
files on your computer for the prospect's name (using
the "Search" selection on your Start button of Windows
and then selecting "for files or folders"), and it will
likely come up. Sometimes it is more difficult. If the
name doesn't bring it up, you will have to remember
something about the conversation and come up with an
exact phrase likely to appear in your notes.
(If you use Outlook
or Outlook Express as your e-mail client, there is also
a "find" feature that allows you to search through the
text of your e-mail to isolate that specific message
that you need to find.)
Whether you are searching
your own computer or the Internet, if a unique word is
involved, you can often search just for that word and
find what you need. Other times, you have to think about
combinations of more common words likely to appear in
the desired document. Then you compose a search to find
that combination of words.
Some people who do a lot
of searching install more sophisticated search programs
on their computers. Because words may show up in
different forms, there are advanced strategies that can
be used with these programs, such as "wildcards" and
synonyms, to make your searches more effective. For
example, if you are looking for the watering
instructions for a new exotic plant in your garden,
those instructions may use the word "water" or
"watering," but you can't remember which. By placing a
wildcard (usually an asterisk - * ) after the "r" in
water, in programs that will allow this type of
searching, you can pick up either word in a single
search. In other cases, you may have to search for all
possible synonyms of a word to make your search
exhaustive. For example, if you are searching for
information related to cars, various sources may use the
word "car" or the word "automobile" or the word
"vehicle." To get everything, you would have to search
for all these words. To do this in a single search, the
search program needs to have a synonyms operator.
Otherwise, you will need to use Boolean connectors,
which we will discuss below.
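For readers comfortable with a little programming, here is a minimal sketch in Python of the wildcard and synonym ideas just described. It is only an illustration, not how any particular search program works. The function names and the sample notes are made up for this example, and I am assuming that a wildcard stands for any run of letters at the end of a word.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a search term such as 'water*' into a whole-word regular expression."""
    # Escape the term, then let each '*' stand for any run of word characters.
    body = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(rf"\b{body}\b", re.IGNORECASE)

def matches_any(text, terms):
    """True if the text contains at least one of the terms (an OR search)."""
    return any(wildcard_to_regex(term).search(text) for term in terms)

# Hypothetical notes, standing in for files saved around the computer.
notes = [
    "Watering schedule: twice a week in summer.",
    "Give the plant water only when the soil is dry.",
    "Oil change for the automobile is due in May.",
]

# One wildcard search picks up both "water" and "watering".
plant_notes = [n for n in notes if matches_any(n, ["water*"])]

# Without a synonyms operator, every synonym has to be listed by hand.
car_notes = [n for n in notes if matches_any(n, ["car", "automobile", "vehicle"])]

print(len(plant_notes), len(car_notes))
```

The word-boundary check keeps a search for "car" from matching longer words such as "caretaker."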
In our last lesson we
discussed databases. In doing so, we used the example of
creating database tables to keep up with our prospects.
Recall that we created a primary table containing name,
address, phone, etc. for each prospect. Then we created
a secondary table to keep up with the contact history
for our prospects. Let's now take that example a little
further.
Since relationship
building is so crucial to marketing, it is important to
remember the personal things that prospects tell us
about themselves in the course of our conversations with
them. The issues that these prospects may talk about
will be diverse, varying from prospect to prospect. Some
may tell you about their relationship issues, others may
speak of health issues, while others may tell you about
their hobbies. Many will talk about their children or
grandchildren. Thus, the personal "information" that
prospects may share with you will not be "similar" or
"structured." Therefore, it would appear difficult to
design database fields to keep track of this
information.
However, one "data type"
available in most databases that will be useful in this
situation is the "memo" field. A memo field is an
alphanumeric field that will accept a very large amount
of data in text format. Thus, in the "contact history"
table that we created in our last lesson, we may want to
add a memo field for "personal notes." In this field we
could type free form textual data about whatever we
discussed with the prospect. The next time we contact
that prospect, we can pull up our prior notes from this
field and review them.
However, what if we are
trying to remember with whom we discussed a certain
subject? We remember the subject but not the person. How
do we find the person when we can only remember the
subject? Simple database queries are not so helpful
here. This would require the same type of word searching
that we discussed above for finding files on your
computer. You will have to think of word combinations
likely to appear in your notes and then search the text
of these memo fields for that combination.
Even though you are
searching within a database in this example, you are
using the techniques of text searching.
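As a rough illustration, here is how such a search of a notes field might look using Python's built-in sqlite3 module. This is a sketch under my own assumptions: the table name, the column names, and the sample prospects are all invented for the example, and your own database program will have its own way of searching memo fields.

```python
import sqlite3

# An in-memory database standing in for the "contact history" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contact_history (prospect TEXT, contact_date TEXT, personal_notes TEXT)"
)
conn.executemany(
    "INSERT INTO contact_history VALUES (?, ?, ?)",
    [
        ("Pat Jones", "2005-03-01", "Talked about her grandson's baseball season."),
        ("Lee Smith", "2005-03-04", "Restoring an old bass guitar; plays in a band."),
        ("Kim Brown", "2005-03-09", "Going bass fishing at the lake next month."),
    ],
)

def find_prospects(conn, words):
    """Return prospects whose notes contain every one of the words."""
    # One LIKE test per word, joined with AND; the words are passed as parameters.
    where = " AND ".join("personal_notes LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    rows = conn.execute(
        f"SELECT prospect FROM contact_history WHERE {where} ORDER BY prospect",
        params,
    )
    return [row[0] for row in rows]

print(find_prospects(conn, ["bass", "guitar"]))
```

Note that LIKE matches parts of words as well as whole words, so a search for "fish" would also find "fishing."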
When you want to find
information on the World Wide Web, you use the search
engines to search the text of Web pages.
As we have discussed
elsewhere in this course, search engines search more
than just the displayed text of a Web page. Some search
the contents of the meta tags (such as "title,"
"description," and "keywords"). Some engines search the
anchor text of incoming links (i.e. they search the
words used on the other Web page that links to the
page). Most also search the "alt" text that accompanies
the graphics on a Web page.
Understanding how these
searches work is crucial to both finding the stuff you
need with the search engines as well as designing your
site to be found by others on the search engines.
As we mentioned in Lesson
31, the LexisNexis service, which began in the 1970s,
was the first popular commercial document retrieval
service based on text searching. It indexes most major
newspaper articles, magazine articles, statutory laws,
and court opinions. Rather than going to the library and
sweating over indexes, you can simply type a word search
into the LexisNexis system and it will bring up all
the newspaper articles, magazine articles, and/or court
cases using that particular combination of words. The
service is very expensive, however, and thus only
affordable to a few conducting important research. I
mention it here because it was a precursor for many of
the techniques that search engines now use to find
documents on the World Wide Web. As stated earlier,
Information Retrieval Science is not new. It has been
evolving for some time and will continue to evolve in
the future.
Wherever you may be
searching, the more refined your search, the fewer
documents or pages you will have to sort through to find
what you need. The more information you provide to the
search program, the fewer results you will get and the
more likely they will be on target.
When you go to the Yahoo!
search engine, there is a link next to the search box
titled "Advanced." At Google, there is a link titled
"Advanced Search." These links provide a user-friendly
way to employ a Boolean search. (You won't see the word
"Boolean" at Google or Yahoo!, but that is the
traditional Information Retrieval Science term to
describe searches that allow you to search for
combinations of words using "connectors," such as AND,
OR, etc.)
Both Google's and
Yahoo!'s advanced search features give you the option to
use multiple words and then decide how the search will
treat them. At the time of this writing, those options
on both search engines are to show results with: all of
the words, the exact phrase, any of the words, and none
of the words.
In Boolean terminology,
the first option available on the advanced search (to
search for documents with all of the words) is
equivalent to placing the AND connector between the
words. Thus, if your search words are "bass guitar
players," this search will only bring up Websites
containing all three words. In Boolean terms, the
underlying search is read as: "Find me all documents
containing the word 'bass' AND the word 'guitar' AND the
word 'players.'" The search will not bring up documents
containing one or two of the words. This is quite handy.
Otherwise, your search would bring up sites that
contained any of the words. You would have to filter
through sites about bass fish, other types of guitars,
and all sorts of stuff that would come up with the word
"players." Using the Boolean AND connector helps to
narrow your search significantly. (Both Yahoo! and
Google allow you to type the AND connector directly into
the basic search box, rather than using the advanced
search page. Google calls this AND an "operator" rather
than a "connector.")
This search for pages
containing all three of the words will still bring up
some irrelevant documents, however. Some pages will have
all three words (bass, guitar, players), but they may be
far apart on the page and unrelated to each other. For
example, a page about a baseball player who enjoys bass
fishing and 12-string guitar playing might come up with
this search. Thus, this search can still be somewhat
inefficient. The second advanced search option will help
with this problem.
The second option Yahoo!
gives you is the "exact phrase" option. This narrows the
search even more than the AND connector. With this
option, all three of the words have to be right next to
each other and in the exact same order. In other words,
your entire search is treated as a single entity, and
documents are retrieved only if they contain that exact
phrase. This would eliminate pages that happen to have
all the words but apart from and unrelated to each other
(such as our bass fishing, 12-string guitar playing
baseball celebrity). In old-fashioned Boolean searches,
this type of search was created by typing quotation
marks around the words you wanted to be treated as an
exact phrase. In a Yahoo! advanced search, however, you
simply type the phrase into the indicated search box for
that type of search; quotation marks are not necessary.
(In both the Yahoo! and Google basic search forms, you
can create a phrase search by using quotation marks
without having to use the advanced search page.)
The third option in the
Yahoo! advanced search is the same as the basic search.
It searches for "any of these words." In Boolean terms,
this search uses the OR connector between the words.
Using our bass guitar players example, it would search
for sites that contain either the word "bass" OR the
word "guitar" OR the word "players." (This option is
included in the advanced search even though it is the
default in the basic search option because it can be
used here in combination with the other advanced search
criteria.)
The fourth box in the
Yahoo! advanced search is "none of these words." Here,
you can enter words you want to exclude in your search.
For example, if you wanted to search for bass (the
musical concept), you may want to exclude pages that
also contain the words "fish" or "fishing" to narrow your
search and eliminate sites discussing bass fishing.
Thus, to search for bass music but not bass fishing
pages with the advanced search form, you would put the
words "fish" and "fishing" in this fourth box.
In Boolean terms,
excluded words are usually designated with the BUT NOT
connector or with the dash or minus sign (-). Thus, to
search for bass music but not bass fishing in the basic
search form, you could search with "bass -fishing
-fish." Note that you must put a space before the dash
in the basic search box. It is not necessary to use the
dash in the advanced search form. Any words typed into
the fourth text box of the advanced form will be used to
exclude pages that contain those words.
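To make the four options concrete, here is a small Python sketch that applies the same four tests to a handful of made-up pages. This is my own simplified model of the logic, not the actual code used by Yahoo! or Google, and the function names and sample pages are assumptions for illustration only.

```python
import re

def words_of(text):
    """Lowercase the text and split it into a list of words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def advanced_match(text, all_words=(), exact_phrase="", any_words=(), none_words=()):
    """Apply the four advanced-search boxes to one page of text."""
    words = words_of(text)
    present = set(words)
    # Box 1: every word must appear (AND).
    if not all(w.lower() in present for w in all_words):
        return False
    # Box 3: at least one word must appear (OR), if any were given.
    if any_words and not any(w.lower() in present for w in any_words):
        return False
    # Box 4: none of these words may appear (BUT NOT).
    if any(w.lower() in present for w in none_words):
        return False
    # Box 2: the words must appear side by side, in order.
    if exact_phrase:
        phrase = words_of(exact_phrase)
        n = len(phrase)
        if not any(words[i:i + n] == phrase for i in range(len(words) - n + 1)):
            return False
    return True

# Hypothetical pages keyed by a short label.
pages = {
    "band": "Auditions for bass guitar players are held on Friday.",
    "athlete": "The players say he loves bass fishing and his 12-string guitar.",
    "lake": "Bass fishing tips for the spring season.",
}

def search(**criteria):
    """Return the labels of the pages that satisfy the criteria."""
    return [label for label, text in pages.items() if advanced_match(text, **criteria)]

and_hits = search(all_words=["bass", "guitar", "players"])
phrase_hits = search(exact_phrase="bass guitar players")
music_hits = search(all_words=["bass"], none_words=["fish", "fishing"])
print(and_hits, phrase_hits, music_hits)
```

The AND search still lets the "athlete" page through because all three words appear somewhere on it, while the exact phrase search and the exclusion search both narrow the results to the "band" page.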
To clarify, on the
advanced search pages you do not have to type any of
these connectors we have mentioned. You use the
connectors (or operators) only if you are submitting an
advanced search using the basic search form. Thus, if
you just want to use one of these advanced features and
can remember the connector to use, you can use the basic
form. If you can't remember the connector to use or you
want to combine two or more of the advanced features for
a very refined search, you will need to use the advanced
search page. The advanced search form puts the
connectors in for you behind the scenes. You do not have
to type them in.
Google also has a synonym
operator, which is called the tilde (~). Putting a tilde
before a word will cause the engine to search for the
word and all of its synonyms. Thus, the search "online
~opportunity" would bring up pages with the keywords
"jobs" and "employment" in addition to the keyword
"opportunity," because Google considers these words to
be synonyms for the word "opportunity."
Interestingly, there are
limitations to the intelligence used in these searches.
For example, one would think that a search for "~online
opportunity" would include the word "Internet" as a
synonym for the word "online." I could see no evidence
that it did, however. Google does not seem to recognize
synonyms for the keyword "online." Thus, in this
particular situation, you would have to use the OR
connector and type in both words to create this search
in the basic search form (or use the advanced search
form and type both these words in the third box).
Despite minor
limitations, advanced searches are quite powerful.
Referring again to the Yahoo! advanced search page, you
can combine all of these search criteria into a single
search. You can search for pages that include all of the
words in the first box and include the exact phrase you
have typed into the second box, but exclude any sites
that contain any of the words you type in the fourth
box. With a little effort, you can create a very
exacting search to find just what you need while
excluding irrelevant sites that incidentally contain
many of the same words.
In both Yahoo! and Google
advanced searches, you can also eliminate outdated
sites, limiting your results to only pages that have
been updated within the time period that you specify.
You can also limit your results to just certain file
types. You can also use the "+" to force the search
engine to search for a word it would normally exclude
from the search. (Many common words are automatically
excluded because they appear on so many pages.) The
search engines also have many other features,
preferences, and shortcuts that you can use. You can
read about all of these features on the help pages of
the search engine you are using.
Given the number of
Websites on the World Wide Web today, there will likely
be more than one Web page that matches the search
criteria in any given search, regardless of how advanced
that search may be. Thus, the search engines have the
task of ordering the pages that do match one's search
criteria. That is, the search engines have to guess
which page you want to see first, second, third, and so
on among the many pages that match your search. This
becomes really important in less refined searches. Most
searchers do not take the time to create sophisticated
advanced searches. Most searches, therefore, result in a
very large number of pages that match the search
criteria (often in the millions)! The search engines are
left to guess what might be important to the searcher.
The method used by a particular search engine to make
this guess controls the order in which the search engine
displays the results of a search. Whether a page is
first in a list of 98,000 page results for a particular
search or last in that list becomes of extreme
importance to both the searcher and the Webmasters of
the pages that match the search criteria.
How the search engines
determine the order in which results will be displayed
for a given search is probably the most discussed issue
in Internet marketing. The exact methods used by the
search engines are closely guarded industrial secrets.
They are so closely guarded because the search engines
do not want anyone figuring out how to manipulate the
results unfairly.
Notwithstanding the
secrecy, there is a great deal that can be learned from
studying Information Retrieval Science and then
observing the behavior of the search engines in the
context of those scientific principles. An entire
industry of SEOs (Search Engine Optimizers) has arisen
to assist businesses in designing their Websites to be
ranked high in the results for a particular search.
While many of these experts do more harm than good,
there is a growing body of ethical and competent
companies in this new industry.
There has also been good
success by some do-it-yourselfers. Because most of the
legitimate SEOs charge large fees and work for the
larger companies, small businesses and home-based
entrepreneurs usually have little choice but to take the
do-it-yourself route with respect to Website
optimization.
In order to successfully
optimize your own Website, it helps to have some basic
familiarity with Information Retrieval Science. You also
need to carefully observe the search engines as they
change their methods and strategies from time to time
(to stay ahead of the manipulators). It also helps to
frequent the Websites, blogs, and discussion boards
where search engine methods are knowledgeably discussed.
It is beyond the scope of
this lesson and even this course to provide an in-depth
discussion of the science of Information Retrieval.
It will be helpful in laying the foundation for
future lessons on Website optimization, however, to
introduce a few of the basic terms and concepts. (First,
let me explain that while I have made a distinction
between database and text searching in the last two
lessons, the terminology of Information Retrieval does
not necessarily do so. In general, it often refers to
the collection of information to be searched as "the
database" regardless of how that collection is
structured.)
Here are some of the
common terms:
Term Frequency (often represented mathematically as "tf_i") is
the number of times that a word appears in a document.
Document Length (often represented mathematically as
"L_i") is the total number of words in a document.
Document Frequency of a word or term (often
represented mathematically as "df_i") is the
number of documents containing the specific word or term
in question, within a collection of documents.
The total number of documents in a collection
(whether or not they contain the word in question) is
often represented mathematically as "D."
Term Vector Theory
provides a couple of useful formulas, using the above
definitions. The first one determines Term Density
(a/k/a Keyword Density). The second one
determines Term Weight.
Let's start with the
Keyword Density Formula:
Keyword Density = KD_i = tf_i / L_i
That is, Keyword
Density equals the Term Frequency divided by the total
number of words in the document. Despite the
scary looking math, this is really quite
straightforward. It is just a simple measure of the
concentration of a word in a document—the relative
frequency of that word to the total number of words in
the document. For example, if you have a document with
1,000 words total that uses the word "bass" 100 times,
the Keyword Density for that document for the word
"bass" is 100 divided by 1,000 or 1/10 or 0.1.
Many believe that search
engines measure the Keyword Density of your Web page in
ranking your page for a particular keyword.
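If you would like to check the Keyword Density of your own pages, the calculation is easy to automate. Here is a minimal Python sketch. The function name and the way I split text into words are my own assumptions, and the search engines may well count words differently.

```python
import re

def keyword_density(text, keyword):
    """Term frequency divided by document length (word count) for one keyword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

# A stand-in for the example above: 1,000 words, 100 of which are "bass".
document = " ".join(["bass"] * 100 + ["filler"] * 900)
print(keyword_density(document, "bass"))  # 0.1
```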
Another formula brought
to us by Term Vector Theory is a little more
complicated.
Term Weight = w_i = tf_i * log(D / df_i)
That is, Term
Weight equals the number of times a word appears in a
document times the logarithm of a number calculated from
the total number of documents in the collection divided
by the number of documents in the collection containing
the word. (And you swore to your high school
algebra teacher that you would never have any practical
use for this stuff!)
Term Weight is used to
determine which of the words used in a search phrase
should be given the most weight in the search results.
For example, if you search for "a good fishing hole" on
Yahoo! or Google, the words "a" and "good" will not help
to order the results because they will appear in almost
all of the pages indexed in the search engines. These
words are too common to be useful in a search (unless
they are treated as part of an exact phrase containing
other less common words). Thus, for the search results
to be meaningful, these words have to be identified and
given very little, if any, weight in how the search
results are displayed. On the other hand, very rare
words used in a search phrase will be given high weight
because they are useful in identifying a small number of
documents containing the rare word which will then
appear near the top of the results.
Said another way, term
weight can be used to choose between words in a multiple
word (OR connected) search as to how those words will
affect the ordering of the results. If you are searching
for Web pages including any of the multiple words you
put in the search form (which is what the basic search
form does), the search engine will have to decide how to
display the results. In doing so, it will have to
determine which one of the words in your search is most
important, which one is next most important, and so on.
In doing that, search engines are believed to use
something similar to the "term weight" formula above.
It is really less
complicated than you may think. Basically, very common
words are given less weight in general. Rare words are
given more weight in general. If a document contains a
high frequency of a rare word, that document will be
given great weight with respect to that word. If a
document contains a low frequency of a common word, it
will have very little or no weight with respect to that
word.
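The following Python sketch works the Term Weight formula for a common word and a rare word. The collection size and the document counts are numbers I made up purely for illustration, and I am assuming a base-10 logarithm, since the formula above does not say which base to use.

```python
import math

def term_weight(tf, total_docs, docs_with_term):
    """tf * log(D / df): term frequency scaled by how rare the term is."""
    if docs_with_term == 0:
        return 0.0  # the term appears nowhere in the collection
    return tf * math.log10(total_docs / docs_with_term)

D = 1_000_000  # hypothetical number of documents in the collection

# A common word: used 20 times on the page, found in 990,000 documents.
common = term_weight(20, D, 990_000)

# A rare word: used only 5 times on the page, found in just 100 documents.
rare = term_weight(5, D, 100)

print(round(common, 2), round(rare, 2))
```

Even though the common word appears four times as often on the page, its weight comes out at less than 0.1, while the rare word comes out at about 20.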
The formula is most
helpful, however, when searches use words that are
neither particularly common nor particularly rare. The
formula may be used to help the search engines make
tedious choices in ranking the results from among the
millions of Web pages on the Internet that match the
search.
Another basic concept of
Information Retrieval Science is that of Term
Co-Occurrence (a/k/a Keyword Co-Occurrence).
Term Co-Occurrence
has to do with how often two words show up in the same
place together. The "place" referred to could be, among
other things: a sentence, a paragraph, a document, or a
Web page. For simplification, we will assume a Web page
in our discussion. Term Co-Occurrence is a factor in
"Semantic Connectivity"—how words relate to one another.
You can measure term
co-occurrence for two keywords fairly easily. Go to
Google and search for "bass." When I did that just now,
it came up with 18,800,000 pages. Now run a separate
search for "fishing." I came up with 25,600,000 pages.
These two searches each measure the number of pages
containing a single term, "bass" in the first and
"fishing" in the second. Now run a search for pages that
contain both "bass" and "fishing." This search, at the
time of this writing, resulted in 2,280,000 pages. You
can assign a value to the co-occurrence with the
following formula:
c = n12/(n1 + n2 - n12)
Where n1 is the number of
results containing the first word ("bass") and n2 is the
number of results containing the second word ("fishing")
and n12 is the number of results containing both words.
Plugging the numbers in,
we get c = 2,280,000 / (18,800,000 + 25,600,000 - 2,280,000)
= 2,280,000 / 42,120,000.
If my math is correct, the answer should be
approximately .05 (which sounds about right because the
answer should be between 0 and 1).
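If you plan to run this calculation for many pairs of keywords, a few lines of Python will save you some arithmetic. This is just a sketch of the formula above. The function name is my own, and the counts are the ones I got from Google at the time of this writing, so yours will differ.

```python
def co_occurrence(n1, n2, n12):
    """Pages containing both words, divided by pages containing either word."""
    either = n1 + n2 - n12  # pages with both words would otherwise be counted twice
    return n12 / either if either else 0.0

# "bass" alone, "fishing" alone, and both words together.
c = co_occurrence(18_800_000, 25_600_000, 2_280_000)
print(round(c, 3))  # 0.054
```

The value is 0 when the two words never share a page and 1 when they always appear together.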
Let's compare this with
the term co-occurrence for the keywords "bass" and
"guitar." Following the same procedure as above, I get
approximately 0.15 as the value for c in the above
formula.
Analyzing this, we see
that, while neither of these numbers is very close to 1
(the largest possible value for co-occurrence), "bass"
and "guitar" have a higher co-occurrence value than do
"bass" and "fishing" in the Google search engine. That
is, the word "bass" has a greater co-occurrence with the
word "guitar" than it does with the word "fishing"
within the Web pages indexed by Google.
This particular example
is not very useful in itself; it simply shows
how to work the formula. Does term
co-occurrence have any relevancy to page ranking? The
honest answer is that we don't know. But, we do know
that search engines hire Information Retrieval
scientists to help develop their ranking
algorithms, and that Term Co-Occurrence is a concept
that such scientists use in their research. Perhaps they
use this measurement in some fashion to evaluate the
relevancy of a page to a particular keyword.
Here is one way that the
search engines might use this. They could develop a set
of words that have a high co-occurrence with each of
many of the popular keywords. When ranking a page for a
keyword, they could look to see how many of these high
co-occurrence words are also prominently used on the
page. If they find a lot of words that have a high
co-occurrence with the keyword in issue, they may give
this page a higher ranking. The reasoning would be
something along these lines. When people are writing
naturally about a certain subject that is tied to the
keyword, they will use a high incidence of these other
semantically related words. If these other semantically
related words are not present in sufficient numbers, the
writing may be artificial, i.e. designed to manipulate
the search engines with respect to the keyword in
question.
If there are several
other words in prominent places on the page that have a
high co-occurrence with the keyword ranked for, perhaps
they give that page a higher ranking. If all the other
keywords on the page have a low co-occurrence value with
the keyword ranked for, perhaps they give it a lower
ranking.
Regardless of whether the
search engines actually use term co-occurrence in the
rankings, this concept does have some utility in
optimizing your pages for particular keywords. With some
investigation and some calculations, you can find words
that have a high co-occurrence with the keyword you are
targeting within a particular search engine. By using
these words liberally in your meta tags and content, you
can increase the likelihood of your page coming up for
someone interested in that subject who searches with one
of the other words. You will have determined other words
they are likely to use and place them in your page, all
due to your co-occurrence research!
Some familiarity with
basic concepts of Information Retrieval Science can help
you to perform better searches as well as help you
better optimize your own Website. You should explore and
experiment with the advanced search features of the
major search engines, such as Yahoo! and Google.
Identify the keywords you are targeting with your
Website and experiment with different searches using
these keywords. Examine the pages that rank high in the
results of your searches. See if you can calculate
keyword density, term weight, and keyword co-occurrence
for the pages that rank high in your searches.
Remember that keyword
density is a measure of the concentration of the use of
a word on a page. Term Weight is a measure of the use of
a word in one document relative to how many documents
in the collection contain that word. Term Weight can be
used to rank documents with relation to a particular
search. Term Co-occurrence is a measure of how often
two words appear together on the same pages relative
to how often either word appears at all. It is another
way to test the relevancy of a
document to a particular keyword used in a search.
Understanding how
searches work behind the scenes is important to Internet
Marketing.
Stay tuned to upcoming
lessons in the Internet Income Course for detailed
discussions of timely and important topics in Internet
Marketing.
by George Little
Copyright (year) Panhandle On-Line, Inc.
License granted to Carson Services, Inc. for
distribution to SFI affiliates. No part of this work may
be republished, redistributed, or sold without written
permission of the author.
For more information on the Internet Income
Course and other works and courses by George Little, see
www.profitpropulsion.com.
For Web Hosting services specially designed for
SFI affiliates, see www.profitpropulsion.com.