Computing & Information Services Dept.
November 9, 1999, 1:30-4 p.m., Ham-Smith 3.
Instructor: Jim Cerny
1. Overview.
Published description:
This hands-on class demonstrates how to use Internet
search engines and other search tools effectively.
Includes developing a strategy, choosing a search
engine, evaluating results, and searching the campus
Intranet.
2. Search engines, directories, and portals.
3. Syntax.
- Natural language queries.
Just type a sentence or phrase, e.g., "Why is
the sky blue?"
- Pattern matching on one or more keywords in the query.
Treatment varies among search engines, e.g., "bus
taxi limo airport reservation"
Use of relevance scores.
- Search engine math -- compound (Boolean) queries.
AND, OR, NOT; also minus and plus signs.
"+fall +spring +summer +winter"
- Special search restrictions.
Date, proximity, language, error tolerance.
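
To make the plus and minus signs concrete, here is a minimal sketch in Python of how a query such as "+fall +spring +summer +winter" might be interpreted. The function names parse_query and matches are hypothetical, and real search engines differ in the details; the assumption here is that + means "must appear", - means "must not appear", and unmarked words are treated as an OR.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(query, text):
    """Return True if text satisfies the +/- constraints of query."""
    required, excluded, optional = parse_query(query)
    words = set(text.lower().split())
    if not required <= words:      # every + term must be present
        return False
    if excluded & words:           # no - term may be present
        return False
    # Assumption: with no + terms, at least one plain term must appear (OR).
    if not required and optional:
        return bool(optional & words)
    return True
```

For example, matches("+airport -taxi", "airport limo reservation") is true, while the same query against "airport taxi limo" is false because of the excluded word.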
4. How search engines work.
- Discovery and collection.
Spiders or crawlers search the Web. Frequency
varies.
Trace links without looping. Respect robot exclusion.
Parse material found.
- Selection (indexing and categorizing) features.
Building the underlying index and features it supports
for forming queries.
META keywords, titles, text. Use of stop lists.
Identification of images.
By URL.
Capitalization. Proximity. Stemming and word
boundaries.
Handling numbers. Date restrictions. Compound
(Boolean) expressions.
Fuzzy matching.
Use external linkage patterns to supplement internal text.
- Delivery (retrieval front-end and presentation).
The interface you see. Detail. Relevance score
or rank. Title. URL. Description.
Size. Date. Related topics. Related links.
Cluster same site matches.
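
The selection and delivery steps above can be sketched with a toy inverted index in Python. This is an illustration under simple assumptions, not how any particular engine works: the stop list is made up, the "relevance score" is just a count of matching words, and the names tokenize, build_index, and search are hypothetical.

```python
from collections import defaultdict

# Illustrative stop list; real engines use their own.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stop words."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(pages):
    """Map each word to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """AND query: URLs containing every term, ranked by total term count."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], {}))
    for term in terms[1:]:
        hits &= set(index.get(term, {}))
    scored = [(sum(index[t][url] for t in terms), url) for url in hits]
    # Highest score first; ties broken alphabetically by URL.
    return [url for score, url in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Because stop words never enter the index, a query made up only of stop words returns nothing, which is one reason a search for a common phrase can behave unexpectedly.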
5. Strategies.
- How much do you know about the topic?
- Looking for specific facts -- the big encyclopedia.
- Looking for breadth -- to find a good starting point.
- Looking for depth -- to find all the detail you can.
- Learn to use several search engines well.
- Become familiar with basic syntax (Boolean operators)
common to most search engines.
- RTFM -- Read the Friendly Manual. Look at the help pages
for the usage details and features special to that engine.
6. Some thoughts and observations.
- Search engines are important to the success of the Web.
Free, fast, easy to use.
Very large quantities of information.
Power of full text indexing.
- Some things are not normally indexed and searchable.
Information in specialized file formats, e.g., PDF.
Information served dynamically, e.g., from databases.
Information restricted by domain, e.g., robot exclusion.
- Some things cost money for access.
Wall Street Journal. Lexis-Nexis database. EBSCO magazines.
- You are your own researcher and editor.
How do you evaluate information found for accuracy?
How do you cite information found?
- A Web search engine is not a data warehouse or encyclopedia.
How do we know what a search engine should be able to
find?
Search engines do not purport to index everything.
- Your role as information consumer.
Focus is on finding information fast.
Expectations for content - Fast facts? Unpublished info?
Out of print info?
- Your role as information provider.
Focus is on becoming known, becoming indexed.
How to register yourself with search engines.
How to keep pages from being indexed:
the robot exclusion standard with the file robots.txt, or the tag
<meta name="robots" content="noindex,nofollow">
Every networked desktop system will someday be a server.
- Privacy.
To think of privacy by obscurity is a mistake.
Unpublicized information on Web sites is easily
indexed and found.
Personal posting patterns on Newsgroups are easily
summarized.
Logs at servers. Logs by your browser.
- Designing Web Sites for navigation.
Many visitors will jump to specific pages, bypassing the formal
home page.
This implies care in identifying yourself in page
design.
This implies a need to publicize yourself to search
engines and become indexed.
- Active agents.
Remote all-purpose search engines will be supplemented by active
agents.
Webcasting with channels.
Agents or bots (e.g., NetMind).
- You can't keep up!
The rate of change in the searching technology is overwhelming.
Try to know a few good tools in detail.
The quantity of information is potentially overwhelming.
To see agents/bot technology as a solution is an
illusion. It may make it worse.
- Trade-offs.
As the quantity of information indexed grows, developing a
narrow search is more difficult.
Analogy to the Heisenberg Uncertainty Principle!
Analogy to Type I and Type II errors in statistical
inference.
- Search sites are businesses with various business models.
How many can survive with the current business models?
Many started in Universities, obtained venture
capital, issued stock.
Growing trend toward acquisition by media and portal
companies.
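
The robot exclusion standard mentioned above under "Your role as information provider" can be tried out with Python's standard urllib.robotparser module. The robots.txt content, the robot name ExampleBot, and the www.example.edu URLs below are made up for illustration; may_fetch is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that asks all robots to stay out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if a well-behaved robot may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, may_fetch(ROBOTS_TXT, "ExampleBot", "http://www.example.edu/private/notes.html") is false. Note that the standard is advisory: it keeps polite robots out, but it does not protect the pages from anyone else.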
7. Other types of searches.
8. Examples.
A good way to familiarize yourself with search engines is
to select a topic (not too obscure, not too common)
and use three or more
search engines and directories to see how they work
(features) and how comprehensive they are.
Search first for breadth (quantity) and then refine the search
for relevance (quality). See the following table of examples
using the advanced search options of AltaVista.
AltaVista represents the high end of search
technology in terms of quantity of pages indexed (maybe 150 million)
and options for approaching a search. AltaVista allows complex
Boolean searches, supplies suggestions for refining (narrowing)
a search, with options for languages, application of a family
filter, natural language (Ask Jeeves technology), and more.
Some features are not obvious, such as AltaVista's automatic
phrase detection.
AltaVista represents a large investment in both computing power and network
connectivity to ensure very rapid responses. Search engines
comparable in size and features are Northern Light, FAST,
and HotBot. When you face an overwhelming quantity of matches,
relevance becomes very important, and Google brings a new
approach to that.
Sample Search Topics (as of September 30, 1999)                | AltaVista (Advanced)
---------------------------------------------------------------+---------------------
 1. "Virginia Woolf"                                           | 16,679
    "Virginia Woolf" near FAQ                                  | 9
    "Virginia Woolf" near "home page"                          | 112
    "Virginia Woolf" near quotations                           | 19
 2. "Hale-Bopp comet"                                          | 3,690
    "Hale-Bopp comet" near orbit                               | 14
    "Hale-Bopp comet" near orbit and url:nasa                  | 1
 3. Fargo and movie and review                                 | 3,435
    Fargo near "movie review"                                  | 20
 4. everest                                                    | 68,563
    "Mount Everest"                                            | 11,058
    "Mount Everest" and expedition                             | 2,695
    "Mount Everest" and expedition -- FROM -- "01/jan/98"      | 2,997
    "Mount Everest" and (expedition or trek)                   | 3,370
    "Mount Everest" and fatalities                             | 40
    "Mount Everest" and (fatal or death)                       | 1,858
    "Mount Everest" and paragliding                            | 34
 5. host:unh.edu                                               | 22,403
    url:www.unh.edu                                            | 4,656
    url:unhinfo.unh.edu                                        | 785
    url:www.iol.unh.edu [InterOperability Lab]                 | 1,113
    url:wwwlearn.unh.edu [Continuing Education]                | 0
 6. granite                                                    | 369,570
    granite and state                                          | 87,160
    "granite state"                                            | 7,464
    "granite state" and "state government"                     | 161
 7. ichthyosaur                                                | 410
    image:ichthyosaur*                                         | 31
    "charles r. knight"                                        | 88
 8. "hypervelocity impacts"                                    | 292
    "hypervelocity impacts" and satellites                     | 82
    "hypervelocity impacts" near satellites                    | 6
 9. "klaatu barada nikto"                                      | 292
    "The Day the Earth Stood Still"                            | 2,374
    klaatu [just English yields 3,482.]                        | 4,401
    barada [just English yields 1,982.]                        | 2,527
    nikto [just English yields 1,001.]                         | 8,969
10. "bride of frankenstein"                                    | 2,808
    "reanimated babe cadavers" [this page is the only match!]  | 1
11. "Wizard of Oz"                                             | 24,772
    "Wizard of Oz" and "frank baum"                            | 1,916
    "Wizard of Oz" and "william jennings bryan"                | 84
    "Wizard of Oz" and "pink floyd"                            | 781
12. "urban speleology"                                         | 22
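
Many of the sample searches above narrow the results with near. As a rough sketch of what such a proximity test might do, the Python below checks whether two words or phrases occur within a given number of words of each other. The 10-word window is an assumption (AltaVista's near is commonly described that way), distance is measured between the starting words, and the names tokenize, phrase_positions, and near are hypothetical.

```python
def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [w for w in words if w]


def phrase_positions(words, phrase):
    """Return the start indexes where phrase (a list of words) occurs."""
    n = len(phrase)
    if n == 0:
        return []
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase]


def near(text, left, right, window=10):
    """True if left and right start within `window` words of each other."""
    words = tokenize(text)
    left_hits = phrase_positions(words, tokenize(left))
    right_hits = phrase_positions(words, tokenize(right))
    return any(abs(i - j) <= window for i in left_hits for j in right_hits)
```

For the sentence "The Hale-Bopp comet has a long orbit around the sun.", near(text, "Hale-Bopp comet", "orbit") is true, but it becomes false with window=3, which shows why a near search returns far fewer matches than a plain and.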
9. Words to search by.
The following was posted on the AltaVista site at one time
and it applies to the whole complex of search engine
technology:
"AltaVista Search Service marks a turning point in the way we view
and use the Internet. [...] It also has implications for how
web sites are structured. For instance, the home page, which
has been the traditional point of entry to a site -- defining
it, setting the tone, and providing links to further detail --
may never be seen by the typical visitor."
jim.cerny@unh.edu
Stop me before I click again!