by Frederick Rustam
Part Three, THE NEARNESS OF YOU
_____________________________________________________________________
Questor
Institute is a new, experimental technical school where
bright-but-poor
high-school graduates on full scholarships spend
two
years seeking to become wizards of Internet sorcery by studying
the
science and philosophy of information retrieval from textual
databases
such as the World Wide Web. In Part Two, "The Delight of
Being
Together," Kevin and Marylou learned about the AND logical
operator
and about co-occurrence, correlation, and context. They
also
studied the usefulness of natural-language phrases for the
retrieval
of more than just interesting quotations.
_____________________________________________________________________
Subject Qualification
"We've
devoted several days to the AND operator because it's the most
important
of the Boolean search operators. There are two reasons for
its
importance. It's the one we most-often use for retrieval because
it
allows us to put two or more words together to search subjects more
complex
and more specific than we can express with one word. And it's
often
the only operator we'll find at many websites' internal search
engines.
Be thankful that some larger, more-versatile search engines
offer
us three other logical operators.
"After
we've constructed a complex search-subject by using some
natural-language
words, we might have to use one or more additional
words
to 'qualify' our complex subject by what amounts to an 'aspect'
of
it, that is, a special way of looking at it---by place, time,
or
bibliographic format, for example.
"When
we seek a complex subject such as 'the history of stream
pollution
in West Virginia by coal mining,' for example, our basic
subjects
can be viewed as 'stream pollution' and 'coal mining.'
'West
Virginia' and 'history' are the place and time aspects of
our
complex subject. When we seek webpages which treat those two
aspects,
we add the qualifying searchwords 'West Virginia' and
'history'
to our search input:
<stream pollution "coal mining" "west virginia" history>
"I've
used 'stream pollution' without quotes because it may also
be
written as 'pollution of streams.' My two unquoted words will
pick
up both forms. The word 'streams' is actually an inadequate
searchword,
though; I should use the names of all the kinds of
streams
which might be polluted, but for now we'll simplify this
part
of our subject's complexity.... So is this the way to search
for
it?... Maybe.
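The search input above can be sketched in Python. This is only a minimal illustration, assuming a toy matcher in which unquoted words may occur anywhere on a page and quoted phrases must occur as adjacent whole words. The function names and the sample pages are invented for the example and do not describe any real search engine.

```python
import re

def parse_query(query):
    """Split a query into quoted phrases and single words, all lowercase."""
    parts = re.findall(r'"([^"]+)"|(\S+)', query.lower())
    return [phrase or word for phrase, word in parts]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric textwords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all(query, text):
    """True if every search term (word or quoted phrase) occurs in the text."""
    joined = " " + " ".join(tokenize(text)) + " "
    for term in parse_query(query):
        needle = " " + " ".join(tokenize(term)) + " "
        if needle not in joined:
            return False
    return True

QUERY = 'stream pollution "coal mining" "west virginia" history'

# Hypothetical page texts, invented for this illustration.
PAGE_A = "The history of stream pollution from coal mining in West Virginia"
PAGE_B = "Coal mining and the pollution of every stream in West Virginia: a history"
PAGE_C = "Stream pollution and coal mining in Kentucky"
```

In this sketch PAGE_A and PAGE_B both match QUERY, because the unquoted words may appear in either order, while PAGE_C lacks two of the terms. Quoting "stream pollution" would lose PAGE_B. Note that the toy matcher does no stemming, so a page that says only "streams" would be missed, which is the inadequacy the teacher mentions.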
"'West
Virginia' is a bang-on qualifier, but 'history' is tricky.
It
involves dates we don't know, and which may not be found on
webpages
about the subject. The qualifier, 'history,' doesn't
always
appear as a textword on relevant webpages, either, because
historical
treatments often don't label themselves as such. An
author
may write from a historical perspective, but may not title
his
work, 'A History of...'---or even use the words 'history' or
'historical'
anywhere in the body of his work. The nature of a
subject's
treatment in a document is the kind of slippery concept
which
human concept-indexers perceive in their subject analysis,
but
which Web indexing crawlers can't.
"Qualification,
when we can use it, is a positive way of concentrating
on
the aspects of a subject which we want, and conversely, of excluding
---'disqualifying'---other
aspects of it which we don't want. Subject
qualification
is a more difficult retrieval process than the complex
interaction
of concrete subjects, but it's an important means of
zeroing-in
on a subject which may be written about in many aspects.
You
must learn how to qualify subject searchwords with aspect words
---and
to abandon your aspect words if they don't retrieve well.
You'll
find today's homework to be a very challenging problem in
subject
qualification by place."
Hot Vinegar
"Single
words or short phrases must be rare or unique for us to
achieve
quick and obvious retrieval success with them.... When
English
mystery writer Ellis Peters wrote her Brother Cadfael
series
of mystery novels, she chose 'Cadfael' as the name of her
medieval
monk-detective because it's a very rare name, even in
Wales.
So if we want to retrieve webpages about Brother Cadfael,
we
can just search for:
<cadfael>
"And
we get only Brother Cadfael. Rarity is built into that subject,
so
to speak, and we don't even have to use the word 'Brother' with
the
name Cadfael to retrieve him. In fact, it's not a good idea to
search
for 'Brother Cadfael' because he's been written about on the
Web
often as simply 'Cadfael.' This retrieval situation illustrates
the
negative face of subject qualification.
"When
we search with several searchwords, and one of our words is not
a
textword in a relevant document, we've 'disqualified' that document
from
retrieval. In this sense, 'disqualification' is the opposite of
qualification.
We use more searchwords to qualify---to specify---a
subject,
but too many searchwords can 'overspecify' a search and
exclude
relevant documents.
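Overspecification can be shown with a very small sketch. It assumes a toy ANDed search over a three-page collection; the page titles and texts are hypothetical and were invented for this illustration.

```python
def and_search(searchwords, documents):
    """Return the titles of documents whose text contains every searchword."""
    hits = []
    for title, text in documents.items():
        textwords = set(text.lower().split())
        if all(word.lower() in textwords for word in searchwords):
            hits.append(title)
    return hits

# Hypothetical three-page collection, invented for this illustration.
PAGES = {
    "fan page": "cadfael solves another mystery at shrewsbury abbey",
    "book review": "brother cadfael is a medieval monk and detective",
    "tv listing": "tonight cadfael investigates a death at the abbey",
}

one_word = and_search(["cadfael"], PAGES)
two_words = and_search(["brother", "cadfael"], PAGES)
```

Here the one-word search finds all three pages, while adding "brother" disqualifies the two relevant pages that never use that word.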
"In
one episode of the Brother Cadfael mysteries on TV, Cadfael had
to
use mos teutonicus on a corpse to view its bones. Anybody
here
have enough Latin to translate that phrase?... It means 'the
German
practice'---of boiling a corpse in vinegar to remove its
flesh.
I was curious to see if the Web had anything on this rare
subject,
so I searched for:
<"mos teutonicus">
"Despite
its presumed rarity, I quoted this phrase in my search
because...
why?"
Marylou
had the answer. "Because 'mos' might be found in words as a
fragment
of them, and they might falsely co-occur with 'teutonicus.'"
"Right.
Be careful with searchwords like 'mos' which often appear
in
text as fragments of other words.... I retrieved two items. They
were
two postings from a Web discussion forum about mos teutonicus.
One
posting inquired about it. A follow-up posting offered a short
bibliography
of books on medieval burial practices in which that
subject
was treated. In this manner, the Web can lead us to other
information
sources, even when it can't provide anything substantive
about
a subject. But as you know, the Web isn't all the Internet.
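The fragment problem with "mos" can be sketched as follows. This is an assumption about how a fragment-matching engine might behave, contrasted with a whole-word phrase match; both function names and both sample sentences are invented for the example.

```python
import re

def fragment_match(fragment, text):
    """Substring test: 'mos' also matches inside 'most' or 'almost'."""
    return fragment.lower() in text.lower()

def phrase_match(phrase, text):
    """Whole-word test: the phrase's words must be complete and adjacent."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical sentences, invented for this illustration.
FALSE_DROP = "Teutonicus was almost the most common name in the register."
RELEVANT = "A forum posting asks about mos teutonicus."
```

With fragment matching, FALSE_DROP contains both "mos" and "teutonicus" and would co-occur falsely. The quoted-phrase test accepts only RELEVANT.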
"Much
Internet information can also be found in the 'news groups'
of
the Usenet. Years of Usenet discussions have been archived and
textword-indexed
by some of the search engines. But you'll probably
use
this resource only if Web information proves insufficient for
your
purposes. We'll deal specifically with the Usenet archives
as
an information source, later, and we'll also search there for
mos
teutonicus."
"Somebody
here will do that tonight, I'll bet," whispered Kevin,
who
knew how classroom suck-ups operated. "So I gotta do it, too."
"Your
reasoning escapes me," replied Marylou, haughtily. But she
made
a mental note to make the search, also.
The Biggest Ocean Wave Ever Measured
"We
learned in our previous 'orange juice' example that searching
for
a subject with a long, quoted phrase can sometimes retrieve gold,
and
at the same time greatly diminish chicken feed and garbage....
Another
way to achieve pinpoint retrieval is to search for an ANDed
combination
of known words which don't form a natural-language phrase
or
sentence but whose words collectively profile our desired subject
and
act to filter-out nonrelevant items. Note---I said 'known words,'
not
'guessed words.' We have to know enough specific facts about our
subject
so that we can express those facts with their own words,
so
to speak. An example of this:
I read that the largest ocean wave ever
seen was measured in
1933 in the Pacific Ocean. It was 112 feet
high. I wondered,
'Wow! How did they measure that one and
yet survive it?' So,
with faith in the Web's great ocean of
information, I searched
for a combination of known words:
<wave pacific 1933 112>
"Numbers
are considered as words in text. They can be written-out
also---for
example, 'nineteen thirty three'---but I guessed that
these
particular numbers wouldn't be.... Make this search now,
and
see what you retrieve."
As
usual, Marylou was racing ahead. "Only a few result-items. The
first
one is right on the money: a miniature webpage which succinctly
sets
forth the whole story about this huge wave. It's a little gem."
"It
is, indeed. It tells you how the Navy oiler, USS Ramapo,
measured
the wave and why they survived it. The World Wide Web is
truly
fruitful when we know some of the words that're almost certain
to
be on a webpage somewhere. Note that the order, the 'syntax,' of
our
known words can affect our search results in some engines. So
put
them in their 'natural order' for searching, that is, the order
in
which they would most-likely occur in text. Some search engines
use
that order as one of their 'secret' techniques for ranking
search
results."
A Streetcar Named Jette
"Now
that we've had great success by using unique single words and
names,
phrases, and topic-profiling combinations of descriptive
words,
let's turn to a subject which seems quite-specifically named
but
isn't all that easy to find. Many search-combinations contain
the
fragmentary seeds of their own retrieval difficulty. Here's one:
I was watching CNN. They ran a brief promo
for their service,
a montage of international images that
showed how widespread
their newsgathering service is. In the
last few seconds of this
filmed promo, there was a street scene
which showed a streetcar
approaching the camera. Just before the
promo ended, the trolley
came close enough to the camera for me to
read the destination
sign on the front, above the windshield.
It read, '94 | Jette.'
Right away, I wondered if I could search
out this streetcar line
on the Web and discover the city where
that portion of the promo
film was shot.
"This
retrieval may seem like grasping at straws, but that's how
we
sometimes find what we seek. Even three years ago, when I made
my
search, the World Wide Web had become a database of astronomical
proportions.
Did any computer scientist ever envision a distributed
database
of such vast dimensions that a search for almost anything
conceivable
retrieves something relevant?
"Back
to the streetcar: I searched for the two known parts of its
destination
sign:
<94 AND jette>
"This
returned an astounding 8878 items!... It was at the sixth
item
on the sixth page of search results that I found a webpage at
the
website of 'Planitram,' the outfit that runs the streetcars in
Brussels,
Belgium. The 94 line to the district of Jette was listed,
and
some info was given about the line's 'headway'---how long we'll
have
to wait for the next streetcar after we've just missed one.
Success
on the sixth results page may seem like rough retrieval,
but
it's better than no success or success on the twelfth page.
"Jette
is a personal name, by the way. And anytime somebody named
Jette
and the number 94 appear somewhere on a webpage---with 94 as
a
fragment of the date '1994,' for example---we retrieve that page
in
an ANDed search, even though it's not relevant to the Brussels
streetcar.
There were plenty of these items in my results.
"But
there's a better way of matching two pieces of data than by
simply
doing an ANDed search of them. What is it?"
"Use
the NEAR operator," ventured Marylou.
"Right
you are. Here's another search principle: the closer the
'proximity'---nearness---of
two textwords, the more likely they are
to
be correlated. If we can specify that our searchwords be found
close
to each other in text, we increase the chances they'll be
correlated
in the retrieved webpages. The NEAR operator which does
this
was used in proprietary online information systems years before
the
World Wide Web was born.
"Unfortunately,
many Web search engines don't offer us 'proximity
searching,'
as the use of the NEAR operator is called, so we may have
to
abandon our favorite engine to use another one which has it. If we
put
the NEAR operator between two searchwords, we'll retrieve only
those
documents in which the two words are fairly close to each other
---no
more than ten words apart in one Web search engine I used to
search
for '94 | Jette'.
"By
the way, some commercial online databases allow us to specify the
number
of words in the text separating any two NEARed searchwords, and
some
textual database management programs simply allow us to retrieve
any
two words occurring in the same sentence or in the same paragraph
of
text.
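A proximity test of this kind can be sketched in a few lines of Python. The sketch assumes that "no more than ten words apart" means at most ten intervening words, and it lets the caller change that limit, as some commercial systems are said to allow. The function names and the `max_gap` parameter are hypothetical, not the interface of any real engine.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric textwords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(word, tokens):
    """Return every index at which the word occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def and_match(a, b, text):
    """True if both words occur anywhere in the text."""
    tokens = tokenize(text)
    return a in tokens and b in tokens

def near_match(a, b, text, max_gap=10):
    """True if a and b occur with at most max_gap words between them."""
    tokens = tokenize(text)
    return any(abs(i - j) - 1 <= max_gap
               for i in positions(a, tokens)
               for j in positions(b, tokens))

# Hypothetical page texts, invented for this illustration.
TRAM_PAGE = "Line 94 to Jette runs every ten minutes."
UNRELATED = "94 " + "filler " * 20 + "jette"
```

Both texts satisfy the ANDed test, but only TRAM_PAGE satisfies the NEAR test, because twenty words separate the two searchwords in UNRELATED. This sketch matches whole words only, so it would not reproduce the false drops in which 94 is a fragment of 1994.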
"So
my revised streetcar search was:
<94 NEAR jette>
"This
strategy returned only 474 items, a gross retrieval reduction
of
ninety-five percent!... And the same Planitram webpage I found
in
my previous AND search was now the fourth item on the first page
of
results!... I looked further through these results, and I found
another
page with a Planitram table which listed all their bus and
streetcar
lines, including old No. 94.
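The size of that reduction can be checked with simple arithmetic, using only the two result counts given in the lesson. The function name is invented for the example.

```python
def percent_reduction(before, after):
    """Percentage by which a result count shrank from before to after."""
    return 100.0 * (before - after) / before

reduction = percent_reduction(8878, 474)
print(f"{reduction:.1f} percent")
```

Going from 8878 items to 474 is a reduction of about 94.7 percent, which rounds to the ninety-five percent quoted above.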
"Now's
the time to remind you that the Planitram table of transit
data
I retrieved was indexed by a general search engine because it
was
put up on a webpage in HTML format. If their data had been in
a
separate, non-HTML database---even one searchable via a webpage
gateway---their
data couldn't have been indexed by a Web indexing
crawler....
A familiar example of a non-Web-indexed, non-HTMLed
database
is the public library's catalog, which we can search
from
a page on the library's website. But we'll never find any
of
that catalog's entries directly by using a general Web engine.
"Okay.
Maybe you're thinking about searching my streetcar line as
a
quoted phrase. Well, I did that for comparison. My search results
differed
markedly from those where I used the AND or NEAR operator!
The
Web is full of nasty surprises. I searched one engine for:
<"94 jette">
"It
returned sixteen items. Only one chicken-feed item was relevant,
and
that page was a humongous list of European 'tramcar' types and
the
lines they ran on.
"The
same phrase search on another engine retrieved twenty-four items.
Five
of those on the first page were Planitram webpages, but none of
these
had the 94 line listed, even though these items 'dropped' on a
search
for the 94!" The teacher sighed, "Woe is me."
"After
a webpage is indexed, its text may be changed, but textwords
no
longer there may still be in the search engine's index file, and
they'll
remain there until the page is crawled again. However, you
may
be able to view the originally-indexed text if the search engine
caches
a 'snapshot' of the original page and offers you that page
as
an alternative to viewing the current page. This is a helpful
feature.
In one engine, a link to the originally-indexed page appears
at
the bottom of each item-annotation as the underlined link-word,
'Cached.'
Click on that link and look for your searchwords which are
missing
from the currently-retrieved version of the page.
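The stale-index situation can be sketched with a toy engine. This is a simplified model, assuming that the index and the cached snapshot are both refreshed only when a page is crawled; the class, its method names, and the example URL are hypothetical.

```python
import re

class TinyEngine:
    """Toy search engine that indexes a page only when crawl() is called."""

    def __init__(self):
        self.index = {}   # textword -> set of URLs
        self.cache = {}   # URL -> page text as it stood at crawl time

    def crawl(self, url, live_pages):
        text = live_pages[url]
        # Drop the page's old index entries before indexing the fresh text.
        for urls in self.index.values():
            urls.discard(url)
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            self.index.setdefault(word, set()).add(url)
        self.cache[url] = text

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))

    def cached(self, url):
        return self.cache[url]

URL = "http://example.org/lines"           # hypothetical address
live = {URL: "Tram lines: 92, 93, 94 Jette"}
engine = TinyEngine()
engine.crawl(URL, live)
live[URL] = "Tram lines: 92, 93"           # the site edits its page later
```

After the edit, a search for "jette" still returns the page even though the live text no longer contains that word, and the cached snapshot shows where the word used to be. A second crawl removes the stale entry.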
"Even
a better logical operator may not much improve your retrieval
from
the Web's vast universe of text if you have a tough subject to
search.
Although my search for <94 NEAR jette> proved more fruitful
than
<94 AND jette>, many items not relevant to my purpose were still
returned
by the search engine. The principle I previously mentioned
about
the relationship between textual database size and the number
of
nonrelevant retrievals never fails us. And there's no database
larger
than the World Wide Web."
The
teacher decided to inflate his students a bit.
"Today's
lesson revisits the awful reality of textword retrieval
from
the Web. It's often difficult and frustrating because there're
so
many possibilities for false co-occurrence among the textwords
of
webpages. But the way I see it is this: some people just gotta be
skilled
at Web textword retrieval, and some of those skilled people
will
be Questor graduates. You. That's why you'll be in this class
for
your whole two years, studying the problems of infotrieval and
learning
the solutions to them, where there are any solutions.
"Before
we wrap it up for today, a brief word about the relevance-
sorting
of search results. With all the search engines, exactly how
this
is done is mostly proprietary info. But some of the techniques
are
mentioned on their Help pages. They usually count the number of
times
our searchwords occur as textwords in the retrieved webpages.
And
those pages which have our searchwords in their HTML metatag
'Title,'
'Keyword,' or 'Description' fields are ranked higher than
those
where our words are only found in the body of the text.
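Since the real methods are proprietary, the following is only a guess at the general idea: count searchword occurrences, and weight those in a title field more heavily than those in the body. The weights, the page records, and the function names are all assumptions made for this sketch.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric textwords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(page, searchwords, title_weight=3, body_weight=1):
    """Weighted count of searchword occurrences; the weights are assumed."""
    title = tokenize(page["title"])
    body = tokenize(page["body"])
    total = 0
    for word in searchwords:
        total += title_weight * title.count(word)
        total += body_weight * body.count(word)
    return total

def rank(pages, searchwords):
    """Return URLs sorted by descending score, dropping zero-score pages."""
    scored = [(score(p, searchwords), p["url"]) for p in pages]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [url for s, url in ordered if s > 0]

# Hypothetical pages, invented for this illustration.
PAGES = [
    {"url": "holiday", "title": "Holiday photos",
     "body": "Jette visited in 94 and Jette loved it."},
    {"url": "tram", "title": "Brussels tram line 94",
     "body": "Line 94 runs to Jette."},
    {"url": "recipes", "title": "Recipes", "body": "Soup for a cold day."},
]
```

With these assumed weights, the tram page scores 5 and the holiday page scores 3 for the searchwords 94 and jette, so the tram page is listed first and the recipes page is not listed at all.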
"Some
engines also rank the sorted items by the number of hyperlinks
to
them from other webpages, on the working theory that those pages
which
are most linked-to are the most relevant ones. But this ranking
is
valid for an individual search only after the retrieved webpages
are
first sorted by an examination of the number and position in them
of
our searchwords. A webpage is usually linked-to for its main
subject,
and this first has to be determined---guessed, really---by
the
search engine before a page's link statistics can be used to give
it
a boost upward in the results. This means that when we retrieve a
page
for a minor subject on it---a subject which other webpages haven't
linked
to the page for---the page's link statistics are of less value
in
ranking it for us.
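One way to picture the link boost is as a multiplier applied to a text score that has already been computed. The formula below is an invented illustration, not a description of any engine's method; the `link_weight` value and the logarithmic damping are assumptions.

```python
import math

def combined_score(text_score, inbound_links, link_weight=0.5):
    """Boost a nonzero text score by a damped count of inbound links."""
    if text_score == 0:
        return 0.0  # links alone never make a page relevant to this search
    return text_score * (1.0 + link_weight * math.log1p(inbound_links))

# Hypothetical pages: (text score for our searchwords, inbound links).
MAIN_SUBJECT_PAGE = (5, 2)       # treats our subject as its main topic
MINOR_SUBJECT_PAGE = (3, 1000)   # heavily linked-to, but for another topic

main = combined_score(*MAIN_SUBJECT_PAGE)
minor = combined_score(*MINOR_SUBJECT_PAGE)
```

In this sketch the heavily linked page outranks the page that treats our subject as its main topic, even though its links were earned for something else. That is the sense in which link statistics are of less value when we retrieve a page for a minor subject on it.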
"These
complex relevance-judgment techniques are, you understand,
a
dumb-computer substitute for human evaluation. But they are
useful;
they generally cause the more-relevant pages to bubble up
in
the search results---I said 'generally.' But relevance is a
tricky,
subjective attribute of text. Counting words and 'links-to'
doesn't
always put what we want on the first page of search results.
"A
search engine is like any other computer program: it does what we
tell
it to, not what we want it to. If it were really 'intelligent,'
it
would go beyond our bare statement of searchwords. It would add
some
relevant terms that we didn't anticipate---and join all these
words
with the correct logical operators. Someday, maybe they'll do
that....
But second-guessing about searchwords can be destructive
as
well as constructive, especially if it's done by 'artificial
intelligence'
programming. You're here to learn how to use your minds
to
retrieve, and how to use proven computer programs as tools for
doing
it. Remember that the old truism, 'garbage in, garbage out,'
applies
to our search strategies as well as to the documents we seek.
Don't
be garbagey with your searchwords.
"Okay.
Your homework assignment is to retrieve a photograph which
will
allow us to judge for ourselves, but not to verify, an assertion
that's
difficult to prove unless we travel to a far place and do some
historical
research there. In this case, the Web will give us no
informational
verification, but it will help us to draw our own
conclusions.
Here's your retrieval situation:
Some say that the Paramount Pictures name
and logo, which were
created by the company's founder---a man
from Ogden, Utah---
was inspired by Mt. Ben Lomond, a peak in
the Wasatch Range
near his hometown.
"I
want you to find a photograph of Mt. Ben Lomond and see how
inspirational
you find it to be for the Paramount logo. You can
search
any of the Web's text-and-image or image-only databases.
Those
of you with the most-inspiring photos will achieve the
most
success.
"This
homework assignment may seem like a simple retrieval problem,
but
it isn't. There are three troublesome complications.
"First,
there are four Mt. Ben Lomonds in the world; I don't want
pictures
of the other three. So you'll have to qualify your basic
search-name
by ANDing or NEARing it with a qualifier which 'localizes'
the
putative Paramount Mt. Ben Lomond and more-or-less excludes the
others.
We'll learn about the exclusion of nonrelevant textwords with
the
NOT operator, later. For now, you'll use the positive process of
qualification
in your attempt to emphasize Utah's Mt. Ben Lomond
in
your search results.
"The
second complication is that you'll find pages which mention
Mt.
Ben Lomond, but few of them will illustrate it for you---
chicken
feed.... The third complication is that Web image databases
retrieve
a lot of really weird garbage because these images must be
indexed
indirectly by crawling their filenames, their webpage captions,
or
nearby text. Web images are not concept-indexed by human indexers
whose
minds grasp the identity and meaning of images.
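Indirect image indexing can be sketched like this. The sketch assumes a crawler that knows an image only through its filename, its caption, and nearby text; the image records and function names are hypothetical.

```python
import re

def image_textwords(filename, caption="", nearby_text=""):
    """Collect the only words a crawler can associate with an image."""
    combined = " ".join([filename, caption, nearby_text]).lower()
    return set(re.findall(r"[a-z0-9]+", combined))

def image_search(searchwords, images):
    """Return filenames of images whose associated words include every searchword."""
    hits = []
    for img in images:
        words = image_textwords(img["filename"], img["caption"], img["nearby"])
        if all(w.lower() in words for w in searchwords):
            hits.append(img["filename"])
    return hits

# Hypothetical image records, invented for this illustration.
IMAGES = [
    {"filename": "ben_lomond_utah.jpg", "caption": "View from Ogden", "nearby": ""},
    {"filename": "img0042.jpg", "caption": "",
     "nearby": "Our dog Ben at Loch Lomond"},
    {"filename": "dsc101.jpg", "caption": "", "nearby": ""},
]
```

A search for ben and lomond returns the first two files, one of which is garbage about a dog. Adding the qualifier utah keeps only the first. The third file could be a fine photo of the peak, but nothing in its associated text lets a search find it.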
"Okay.
Gentlemen and gentlewomen, start your search engines!"
In
the hallway, the class joker began singing,
You take the high road,
And I'll take the low road,
And I'll get Ben Lomond before ye.
Kevin
scowled. "I hope that guy craps out on the homework."
"That's
not a very nice sentiment," chided Marylou. "He may find
the
best picture of Mt. Ben, and you may be the one to 'crap out.'"
"No
way. I'll stay up all night if I have to."
"That's
the spirit."
THE END OF PART THREE
Next: Part Four, "Ordinary Citizens as Scholars"
_______________________________________________________________________
© 2002 by Frederick Rustam. Frederick
Rustam is a retired civil
servant.
He formerly indexed technical reports for the Department of
Defense.
He writes science fiction for Web ezines as a hobby. He
studies
and enjoys the Internet as a hobby.