by
Frederick Rustam
Part Two, THE DELIGHT OF BEING TOGETHER
_____________________________________________________________________
Questor
Institute is a new, experimental technical school where
bright-but-poor
high-school graduates on full scholarships spend
two
years seeking to become wizards of Internet sorcery by studying
the
science and philosophy of information retrieval from textual
databases
such as the World Wide Web. In Part One, "A School for
Internet
Sorcery," two students, Kevin and Marylou, have their
aptitudes
tested, strike up a friendship, and attend the school's
first
Assembly, where they're welcomed by the Rector in a speech
setting
forth the unusual educational goals of Questor Institute.
_____________________________________________________________________
Co-occurrence, Correlation, Context
Kevin
and Marylou had arrived early and were experimenting with their
high-speed
workstations when the teacher arrived. He was a tweedy man
in
his forties who wore a bowtie and parted his hair conspicuously
in
the middle.
"I'll
be giving you search examples from my personal experiences
in
Internet searching, mostly from the Web," the teacher began.
"Use
your computers to search my examples as I discuss them, if you
wish,
but don't forget to pay attention to what I'm saying. Okay?
Some
of the examples I discuss may seem trivial, especially by
comparison
with the complex subjects that professional searchers
have
to retrieve. But at this stage in your instruction, I prefer
to
use curiosity-satisfying examples which are easier to understand
and
which'll give us some enjoyment to pursue. You'll be searching
for
hard-to-find subjects you've never heard of, soon enough.
"There're
many search issues for us to deal with. Some will arise
as
we proceed, but I'll be unable to pursue them right then, and
I'll
say, 'We'll deal with that in detail, later.' If I stopped my
flow
of instruction and sidetracked us into every new search issue
which
reared its ugly head, I'd subvert your learning process.
"First,
let's define some basic terms. A textword is a word
used
in the text of a webpage or a Usenet posting. Textwords are
copied
to create index words. A searchword is a word
we
formulate in our minds to make a subject search, without knowing
for
certain that it exists as a textword. Practically speaking then,
a
searchword is a 'probable textword,' viewed from a searcher's
perspective.
Because the Web and Usenet are such immense databases,
our
searchwords will almost always be found as textwords somewhere
on
the Internet.
"A
term you'll often see on the Internet, in literature about the
Internet,
and spoken by most people is 'keyword.' This term is
overused---like
the much overused term, 'homepage.' As students
seeking
to be Questors, you'll mostly avoid 'keyword,' except to
understand
how others use it so that you can properly communicate
with
them. A keyword is, broadly, a word which is a key to finding
information.
It's our searchword of choice, it's the index word which
matches
our searchword, and it's the textword we seek in a document.
People
use 'keyword' to refer to all those things."
The
teacher smirked, anticipating his students' reactions to his
terminology.
"Three
other terms I'll use are offered as jargon for us Questors.
'Gold'
is a search-result item which is relevant to our search and
useful
to our purpose.... 'Chicken feed' is a search result which
is
technically relevant to our searchword but not useful for our
purposes.
A mere mention of something we're seeking---that's the
usual
textual form chicken feed takes.
"'Garbage'
is a collective term for search-result items which
aren't
at all relevant to our purposes, but which show up anyway.
"Let's
begin our study of the AND operator with a nice sentiment:
'The
Delight of Being Together.' This sentiment has served lovers
for
countless generations of human existence. But it also serves
those
of us who seek information from textual databases. When we put
together
several searchwords, we hope to retrieve relevant text where
our
words are found together in the same meaningful relationship they
were
in our minds when we chose them for searching. The delight of
being
together throughout the entire infotrieval process is not
easily
experienced, though.
"There
are three 'C's which we must understand: co-occurrence,
correlation,
context.
These are three fundamental realities
of
retrieval using the AND operator, by far the most-often used
logical
search operator. When we use several searchwords in most
search
engines, our words are ANDed to each other by default---
that
is, even if we don't actually type the word AND between them.
In
this way, complex, more-specific search subjects are expressed
by
using an increasing number of single words as building blocks,
just
as natural language phrases are constructed from words.
"To
illustrate these three realities, here's a retrieval situation
which
sprang from one of my casual curiosities:
I heard a know-it-all radio talkshow host
mention Charles
Martel, a medieval French leader, and he
added as an aside,
'That's Charlemagne.' I thought he was
wrong: Charles Martel
and Charlemagne (Charles the Great)
weren't the same man.
How
can I easily use the Web to prove my assumption?"
A
student said, "Search either guy's name, and do a page-search for
the
other name."
"Possible,
but not quick enough. I could spend a lot of time checking
webpages
about one man for a mention of the other. Let's construct a
logical
word relationship before we search." He turned and wrote on
the
easily-erasable whiteboard a search strategy:
<charles martel charlemagne>
"Use
your computers now to make this exact search."
The
students pounded on their keyboards. This is kid stuff,
thought
Kevin. I know the point he's making, mused Mary Lou.
"When
we search this way, what do we retrieve?... The results may
surprise
you."
"Webpages
with both names on them," offered a girl, who was reading
the
search results as she spoke.
----------------------------------------------------------------------
Charles
Martel - Wikipedia
...
turned the tide of Islamic advance, and the unification of the
Frankish
kingdom under Charles Martel, his son Pepin the short,
and
his grandson Charlemagne ...
www.wikipedia.org/wiki/Charles_Martel - 12k - Cached -
----------------------------------------------------------------------
"Right.
We've used the search engine's default AND operator to
retrieve
webpages which have both names on them. We should have
searched
Charles Martel's entire name as a quoted phrase---set off
with
quotation marks---to retrieve his forename and surname only
when
they were 'juxtaposed,' next to each other in webpage text.
But
I wanted you to input his forename and surname separately to
illustrate
our second 'C,' correlation.... So what do we have in
the
search results for our three words?"
Mary
Lou was ahead of the pack. "Six of the ten items on the first
page
of search results state in their annotations either that Charles
Martel
was Charlemagne's grandfather, or that Charlemagne was Charles
Martel's
grandson. We don't even have to click on the links and read
the
webpages to find that out."
"Right.
These are very good search results---and there's a reason
for
it. When two words are searched with the AND operator, there
has
to be a co-occurrence of them on a webpage for the page to be
returned
in the search results. However, two co-occurring words
aren't
necessarily correlated, that is, semantically related to
each
other. Our three searchwords are highly correlated
in
the page's text.... Why?"
Silence.
The students weren't sure how to answer this question.
"Because
six webpage authors used all three of our searchwords
as
textwords in the same sentence!... Notice that in each of those
six
webpage annotations, each of our three searchwords is rendered
in
boldface by the search engine software to show us where they are.
Also,
some other words that occurred on either side of these boldfaced
words
in the webpage text have been excerpted from the page to show
the
'context' of our searchwords---the way they were used in the text
of
those webpages.
"If
each of our three searchwords had been uncorrelated with the
others,
each word's 'contextual excerpt' would have been isolated and
separated
from the other two excerpts by ellipses, those three dots
which
represent textwords not excerpted. You can see this in the other
result-items
where our three searchwords didn't occur so close to each
other
in the text. In six of the annotations, our three words are found
close
together in single sentences because, on those six webpages, the
page
authors wrote it that way.
"This
example shows us that simple facts can be teased from the Web
by
tickling it with its own words, so to speak. Knowing how to do
this
is a Questor skill, a skill you'll be glad you've learned.
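The difference between co-occurrence and correlation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three page texts are invented, "correlated" is approximated crudely as "all searchwords in the same sentence," and the function names are hypothetical rather than any real search engine's interface.

```python
import re

# Invented mini-corpus: page name -> page text (assumption, not real webpages).
PAGES = {
    "dynasty": "The Frankish kingdom was unified under Charles Martel, "
               "his son Pepin, and his grandson Charlemagne.",
    "rulers": "Charles the Bald reigned in the west. Geoffrey Martel was "
              "Count of Anjou. Charlemagne was crowned emperor in 800.",
    "schools": "Charlemagne founded schools.",
}


def words(text):
    """Lowercased set of the textwords in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def and_search(pages, query):
    """Co-occurrence: every searchword appears somewhere on the page."""
    terms = set(query.lower().split())
    return sorted(name for name, text in pages.items()
                  if terms <= words(text))


def same_sentence_search(pages, query):
    """Rough stand-in for correlation: all searchwords share one sentence."""
    terms = set(query.lower().split())
    return sorted(name for name, text in pages.items()
                  if any(terms <= words(s) for s in sentences(text)))


if __name__ == "__main__":
    query = "charles martel charlemagne"
    print(and_search(PAGES, query))            # ['dynasty', 'rulers']
    print(same_sentence_search(PAGES, query))  # ['dynasty']
```

Both pages satisfy the AND search, but only on "dynasty" do the three words sit together in one sentence. The "rulers" page is the kind of false correlation the teacher warns about.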
Sites, Pages, Indexes
"Let
me ask you a question: what do we retrieve with a Web search
engine?"
A
confident student piped up, "Websites."
"No.
We retrieve webpages. Believe it or not, websites don't
exist
in the physical world. They're a mental construct, a way of
looking
at a single webpage or a collected group of webpages. The
webpage
does exist, physically, as a single file. It's the basic
retrieval-unit
of Web information. Webpages, not websites, are what
are
stored on servers. Even the so-called 'homepage' of a website---
the
main page which may have no discrete filename, just the site's
domain
name---is a single webpage file chosen to visually present
the
site when we first access it by its domain name.
"How
many webpages are there?... Nobody knows for certain. It's been
claimed
that there are currently about 36,000,000 registered websites
with
uncounted billions of pages. Some webpages don't have any text---
not
even captions, just graphics. Those pages are indexable only when
their
authors provide HTML 'Title,' 'Keyword,' or 'Description'
metatags, which are part of the page
but which are not normally
displayed
by Web browsers. We'll discuss metatags and image indexing
and
retrieval, later.
"An
index is a representation of the webpages it indexes. It's a
very
'deep' representation of webpages because it often contains all
the
words on the pages. Yes, I said all the words, even those termed
'nonsignificant,'
such as 'the,' 'of,' and 'in.'" He turned to the
whiteboard.
"If you doubt this, search a general engine for:
<"the war of the pacific">
"If
that search engine doesn't index the 'nonsignificant' word, 'of'
(or
'the'), it can only search for:
<war pacific>
Then,
your search results will mostly be about 'the war in
the
Pacific'---World War II---and the few items about the 19th-century
war
between Chile, Peru, and Bolivia will be scattered among the many
result-items.
It's because the better general search engines now index
these
little words that we can search for exact phrases and sentences
and
retrieve them precisely.
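Why indexing the little words matters for exact-phrase retrieval can be shown with a toy positional index. The sketch below is an assumption-laden illustration, not a description of how any particular engine works: the two page texts, the stopword list, and all function names are invented for the example.

```python
import re
from collections import defaultdict

# Assumed list of 'nonsignificant' words, taken from the teacher's examples.
STOPWORDS = {"the", "of", "in"}

# Invented page texts for illustration only.
PAGES = {
    "chile": "The War of the Pacific pitted Chile against Peru and Bolivia.",
    "ww2": "The war in the Pacific ended in 1945.",
}


def tokenize(text, drop_stopwords=False):
    """Split text into lowercase textwords, optionally dropping stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def build_index(pages, drop_stopwords=False):
    """Map each word to {page name: [positions of the word on that page]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in pages.items():
        for pos, token in enumerate(tokenize(text, drop_stopwords)):
            index[token][name].append(pos)
    return index


def phrase_search(index, phrase, drop_stopwords=False):
    """Return pages where the phrase's words occur at consecutive positions."""
    terms = tokenize(phrase, drop_stopwords)
    if not terms:
        return []
    hits = []
    for name, starts in index.get(terms[0], {}).items():
        for start in starts:
            if all(start + i in index.get(term, {}).get(name, [])
                   for i, term in enumerate(terms)):
                hits.append(name)
                break
    return sorted(hits)


if __name__ == "__main__":
    phrase = "the war of the pacific"
    full = build_index(PAGES)
    reduced = build_index(PAGES, drop_stopwords=True)
    print(phrase_search(full, phrase))                          # ['chile']
    print(phrase_search(reduced, phrase, drop_stopwords=True))  # ['chile', 'ww2']
```

With every word indexed, the quoted phrase singles out the 19th-century war. With the stopwords discarded, the query collapses to "war pacific" and the World War II page comes back as well.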
"A
webpage index represents webpages much 'deeper' than a few subject
headings
represent the book they catalog. But a cataloger's subject
headings
are a form of concept indexing. They're the cataloger's
conception
of what a book or other textual work is about. A textword
index
is just a 'deconstructed' collection of the words on a webpage,
copied
from the page by a computer program called a 'crawler' or
'spider.'
"Indexes
compiled from textwords index webpages much more deeply,
but
in a much dumber way than concept indexing. Textwords supply
the
raw material of retrieval; we have to supply the intelligence.
Online,
we only get help from concept indexing when a webpage author
chooses
to get involved in the indexing process by putting meaningful
words
and phrases from his mind into his page's metatag fields.
"Okay...
From our example of a highly-correlated co-occurrence
of
searchwords which retrieved highly-successful search results,
we'll
proceed down the garden path toward examples of searchword
co-occurrence
which plunge us into morasses of chicken feed and
garbage.
This is the greater reality of textword information
retrieval."
Confidence
Kevin
and Mary Lou headed for the cafeteria. "I knew all that stuff.
I
just didn't know it in the terms he used," declared Kevin. "So did
you...
right?"
"I
knew not what I knew," agreed Mary Lou. "Search principles do
seem
more obvious when someone presents them in an organized way
and
in elegant terminology such as 'chicken feed' and 'garbage.'"
"Yeah,
but infotrieval is easier than we're supposed to think it is.
I've
been doing it since I was a freshman."
"You
are a freshman, here. And wait 'til the teacher starts giving us
tough
retrieval problems to solve. We'll both feel like freshpersons."
"Hey,
you aren't a wild-eyed feminist, are you?"
"Only
when I have to be."
Phrases and Sentences
"Okay.
We've learned how the AND operator can deliver the ball
right
down the alley to the kingpin. In our Charlemagne example,
we
searched for naturally-correlated textwords, and they appeared
smack-dab
on the first page of our search results.
"The
AND operator can be used for the retrieval of natural-language
phrases and even whole sentences, but successfully only
phrases and even whole sentences, but successfully only
when
the words in those search-phrases and search-sentences are
identical
with those on the webpages we seek. Despite what you may
have
heard about 'fuzzy logic,' we're usually dependent upon our own
searchword
choices and upon elementary search logic which predates
the
computer by many years.
"There's
only one automatic adjustment for 'synonymy' in textword
searching.
Synonymy is the problem given us when a concept can be
expressed
equally by two or more ways of writing it. A convenient
adjustment
for synonymy occurs when the search engine searches our
words
as both 'whole words' and as 'fragments' of other words.
This
handy fragment-inclusion search feature allows us to retrieve
most
nouns in their plural forms by simply searching for their
singular
forms. But 'fragment inclusion' can be counter-productive
if
it picks up nonrelevant words when our smaller searchwords,
searched
also as fragments by the search engine, happen to occur
within
those nonrelevant words."
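Fragment inclusion, as described above, amounts to substring matching instead of whole-word matching. The following Python sketch contrasts the two under stated assumptions: the page texts are invented and the function names are hypothetical.

```python
import re

# Invented page texts for illustration only.
PAGES = {
    "fleet": "Two ships left the harbor.",
    "solo": "One ship stayed behind.",
    "crafts": "Good workmanship matters to a carpenter.",
}


def whole_word_search(pages, word):
    """Match the searchword only where it stands alone as a textword."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sorted(name for name, text in pages.items()
                  if pattern.search(text))


def fragment_search(pages, word):
    """Match the searchword anywhere, including inside longer words."""
    needle = word.lower()
    return sorted(name for name, text in pages.items()
                  if needle in text.lower())


if __name__ == "__main__":
    print(whole_word_search(PAGES, "ship"))  # ['solo']
    print(fragment_search(PAGES, "ship"))    # ['crafts', 'fleet', 'solo']
```

The fragment search picks up the plural "ships" for free, which is the convenience, and also "workmanship," which is the nonrelevant retrieval the teacher mentions.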
The
teacher paused to allow that to sink into the open minds before
him.
He guessed they were familiar with such retrieval concepts as
fragment
inclusion, but he feared his "elegant" language might be
overwhelming
them. The public high-schools from which they had
recently
graduated did not excel in vocabulary building.
"There
are more kinds of synonymy and 'near-synonymy' than are
found
in singular/plural variations or in word-stem variations like
'history'
and 'historical.' A search for a scientific term such as
'columbium,'
retrieves only those few webpages which use that older
name
for 'niobium.' If we search for 'Saint' abbreviated as 'St.'
we
won't retrieve pages which spell out Saint. We must be precise
with
our searchword choices, and we should try word 'variants'
if
at first we don't retrieve anything at all.
"I
hate to keep saying this, but later... we'll discuss the handy
process
of 'stemming,' or 'wildcarding,' which some search engines
offer
us to include some near-synonyms---words like 'history' and
'historical,'
which mean nearly the same thing. And we'll study the
OR
operator, which allows us to methodically include synonyms and
near-synonyms
in our searches.
"There's
a troublesome truth about ANDing: generally, the fewer
words
we use, the more chicken feed and garbage we retrieve. And
the
more words we use, the less likely we are to retrieve anything.
"That
sounds like we have everything working against us. But it's
a
principle which combines two realities. Somewhere between using
too
few and using too many searchwords is where we achieve our
best
retrieval. There are two exceptions to this principle.
"One:
if we're searching something unique, a rare word or name,
we
can use that word with great retrieval effectiveness, and we
won't
be overcome with garbage.... Two: if we're searching a small
Web
database, such as the past two weeks of recent news which is
archived
on many news websites, then a single, fairly-uncommon word
or
name usually won't return much garbage because the database is
so
small that false correlations don't occur as often as they do
in
large databases."
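The trade-off between too few and too many searchwords can be watched directly by ANDing one more word at a time. This is a minimal sketch on an invented four-page corpus. The page texts, the query, and the function names are assumptions chosen to make the effect visible.

```python
# Invented page texts for illustration only.
PAGES = {
    "health": "orange juice in the morning may raise blood lipids",
    "recipe": "fresh orange juice recipe for brunch",
    "paint": "we painted the kitchen orange this morning",
    "sunrise": "the morning sky turned orange",
}


def and_search(pages, query):
    """Return pages containing every word of the query (default ANDing)."""
    terms = set(query.lower().split())
    return sorted(name for name, text in pages.items()
                  if terms <= set(text.lower().split()))


def narrowing(pages, query):
    """Run the query with 1, 2, 3, ... of its words and collect the results."""
    terms = query.lower().split()
    steps = []
    for n in range(1, len(terms) + 1):
        partial = " ".join(terms[:n])
        steps.append((partial, and_search(pages, partial)))
    return steps


if __name__ == "__main__":
    for partial, hits in narrowing(PAGES, "orange juice morning elderly"):
        print(f"{partial!r}: {len(hits)} page(s) {hits}")
```

On this toy corpus the counts fall from four pages to two, to one, and finally to none: one word returns mostly garbage, three words isolate the useful page, and the fourth word, which no page author happened to use, retrieves nothing at all.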
Orange Juice First Thing in the Morning
"Here's
an example of searching a subject by using a descriptive
natural
language phrase, not just a couple of meaningful words:
Drinking a glass of orange juice first
thing in the morning
may not be a good idea for older people.
Why? The Web can
tell us, but not so quickly as in our
Charlemagne example.
To find out, we search for... what?"
"'Orange
juice first thing in the morning,'" chanted a student.
"Right.
Don't be afraid to search a lengthy natural-language phrase.
Remember
that we're searching Web text, indirectly. Phrases and
sentences
are the stuff of natural-language text. But it's better
for
us to quote our phrase or sentence than to rely on a simple
ANDing
of its words by the search engine. By quoting, we request
that
our words be 'juxtaposed' or 'contiguous' where they occur on
webpages."
The teacher wrote a search example on the whiteboard:
<"orange juice first thing in the morning">
"Search
that now and you'll retrieve about twenty webpages which might
discuss
a little-known physiological phenomenon. Some weeks ago, I did
just
that, and I retrieved a single item which had the answer to my
question.
But yesterday, I repeated my clever search to prepare for
today's
lesson, and the item that I retrieved before is now missing,
leaving
me only chicken feed!" He groaned.
"That's
one of the frustrations of Web searching. One day, a relevant
item
is retrieved in a good, precise search---and the next month,
it's
missing from the same good search. But I've decided to use the
OJ
example anyway. I call this technique of searching with a long,
natural-language
phrase, 'searching long.' Using that strategy, we'll
retrieve
only twenty-or-so chicken feed items about drinking orange
juice
first thing in the morning---all technically relevant to our
searchwords,
but not relevant to our search for unhealthfulness.
"We've
searched the same way some Web authors wrote their text---I
said
'some,' not 'all'---but none of the items we retrieved actually
gave
us the answer I previously found. Sometimes, there are multiple
answers
to a query.... So what are we left with if searching long
doesn't
produce?"
"Searching
short?" a student said, hesitantly.
"Searching
shorter. That's our only alternative in an AND search:
to
cut back on our searchwords. We'll purge our long, precise phrase
of
all its words except the three most-meaningful ones." He wrote:
<"orange juice" morning>
"I've
quoted orange juice as a phrase to reduce false co-occurrences
of
its two words. Even so, we still retrieve 92,900 items! Amazing!
Can
you imagine any other database which would retrieve so much darn
stuff
for those three words?!... Try this search now and see if
there's
any gold in the first few pages of results."
The
students searched, then examined the results while the teacher
walked
about, observing them. The silence of cogitative labor was
broken
by the clacking sounds of keyboards being furiously used.
"I
found one!" exclaimed an eager searcher. "But it wasn't on the
first
page of results. It says that the fruit sugar in orange juice
acts
to 'elevate blood lipid levels'---whatever that means."
"Some
might claim that a college graduate would know what it means.
But
I doubt that many of them would. Lipids are fat compounds, and
elevated
blood lipids can be a factor in heart attacks for those
vulnerable
to them.... It helps to know that before you search.
"There's
a principle at work here, and it's the first principle of
subject
searching on the Web. You know it from personal experience:
the
Web is such a vast database that almost any few common words
we
AND together will return a flood of information, most of it
not
relevant to our search intent. How can this be?" He gestured
like
a sawdust evangelist.
There
was only silence, as the class awaited his explanation for
this
seemingly-unknowable principle.
"It's
not carved into granite anywhere, but it's the big reality of
textword
retrieval that the larger the database searched, the more
likely
co-occurrences of searchwords will return webpages with false
correlations
retrieved by those co-occurrences. This reality occurs
in
offline databases, as well. Textual databases are very different
from
structured databases with their discrete data fields, defined
data
types, and system query language. In natural-language text,
subject
'data' are jumbled together. We have to retrieve from that
jumble
by what amounts to guesswork searching.
"The
key phrase here is 'natural language.' On the Internet, we
search
for subjects which are expressed in the language of English
prose,
even where the text is formatted with its words or numbers
in
tables instead of in paragraphs. Tabular data in textual
databases
aren't really divided-up and put into discrete 'fields'
by
data type, although they may be displayed that way. A webpage
is
a single field for everything on it, unless it's divided into
two
or more separate display 'frames,' each with its own URL.
"There
are, of course, a page's hidden-but-independently-searchable,
metatag
fields. In these, the webpage author can catalog his page
with
'metadata'---that is, bibliographic data such as title, date
of
preparation or modification, and some descriptive words. Later,
we'll
learn how to directly search these metatags, for what they're
worth.
Even when these fields are filled, however, they may be less
useful
than the visible text. Webpage authors are good HTMLers but
poor
catalogers. The same is true of scientists.
"Okay.
Access my homework webpage to see your homework assignment:
A famous scientist once said, 'The
universe is not only queerer
than we know, it's queerer than we can
know.' Use either
the Web or Usenet to discover who
originally said that.
"Although
this sentence is occasionally quoted by astronomers and
others
today, they usually change the word 'queerer' to something
more
appropriate to today's linguistic reality. I've also seen the
quote
attributed to the wrong scientist. If you find such a misquote,
let
us know about it.
"I'm
giving you these hints to demonstrate a troublesome problem with
Internet
searching: sometimes we may unknowingly use the wrong words
to
express a desired subject. Wrong words---poor retrieval, or no
retrieval.
Yet the Internet is so enormous that variations of a
memorable
phrase or sentence may get put up there, and some of these
may
provide us with clues for finding the genuine stuff.... Good luck."
"Thanks
a lot," hissed Kevin, sarcastically.
"Don't
you think we need luck to retrieve from the Internet?"
asked
Marylou, tongue-in-cheek.
"Luck
is for gamblers. The Internet isn't Las Vegas. It's a queerer
place
than we can imagine."
"True.
But chance on the Web does seem to favor the house."
THE END OF PART TWO
Next: Part Three, "The Nearness of You"
_______________________________________________________________________
©
2002 by Frederick Rustam. Frederick
Rustam is a retired civil
servant.
He formerly indexed technical reports for the Department of
Defense.
He writes science fiction for Web ezines as a hobby. He
studies
and enjoys the Internet as a hobby.