Web Admin's Guide to Site Search Tools
How to choose, implement and maintain your Web Site Search Tools.
There's a paradox: the more information your site has, the more useful it is -- and the harder it is to navigate! No matter how well you design your site navigation elements, visitors will need other ways to find what they're looking for. Site search tools offer a powerful and familiar way to provide that access. Visitors type their words into a form, press the Search button, and get a list of all the documents on your site that match those words.
Luckily, you don't have to write this yourself. There are many site search tools available, for almost every platform, web server and site you can imagine. They range from free to very expensive, from easy graphic interfaces to compile-it-yourself. The information here will give you a head start in choosing the right site search tool for your site.
Definitions: Parts of a Local Site Search Tool
- Search Engine
- The program (CGI, server module or separate server) that accepts the request from the form or URL, searches the index, and returns the result page to the server.
- Search Index File
- Created by the Search Index program, this file stores the data from your site in a special index or database, designed for very quick access. Depending on the indexing algorithm and size of your site, this file can become very large. The file must be updated often, or it will lose synchronization with the pages and provide obsolete results.
- Search Forms
- The HTML interface to the site search, where visitors enter their search terms and specify their preferences for the search. Some tools provide pre-built forms.
- Search Results Listing
- An HTML page that lists the pages containing text that matches the search term(s). These are sorted in some kind of relevance order, usually based on the number of times the search terms appear, and whether they're in a title or header. Most listings include the title of the page and a summary (the Meta Description data, the first few lines of the page, or the most important text). Some also include the date modified, file size, and URL. The format of this listing is often defined by the site search tool, but may be modified.
As you can imagine, setting all this up takes some time and effort. But before you can do so, you must choose the best site search tool, and that requires some design choices.
Preparing A Site for Searching
Physical Requirements
Site search tools will require additional disk space and processing power. Search indexes never get smaller, so be sure there is space to spare.
In addition, you must plan to update the search index soon after you've changed the files, so that searches will locate the correct data. Happily, most site search tools provide an automatic update scheduler.
Preparing the Pages
When someone searches your site, the listing is very different from the pages themselves. The list usually contains page titles and some kind of text, either the Meta Description data, the first few lines of the page, or a programmatically generated summary of the most important text. In addition, the listings are sorted by the search engine in order of relevance, according to its particular algorithm.
You can present your data well and help your visitors find what they're looking for by keeping search results in mind when you edit your pages. Note that these improvements work for both local and webwide search tools: the work you do will make your pages appear better in any search results.
Page Titles
The titles are the main element in a result listing, so always title your pages carefully. Give a little context as well as the specific topic of the page, and always make sure the spelling is correct. In addition, most search engines use the existence of a word in a title as a clue that the page is a good match for searches on that word, and will rank the page high up in the results list.
For example, if your site is about native plants, use "Native Plant Directory: California Live Oak" instead of just "Live Oaks" as your title. They're equally accurate, but the longer title tells your visitors what to expect on that page when they look at a listing: it's not about Southern Live Oaks, and it's not about growing or protecting them.
Meta Descriptions
You should also use the Meta Description tag to summarize the contents of each page. Many local and webwide search engines will display this as part of their results, so it provides you an opportunity to present the page in its best light. This is easier than it looks -- you'll find that many of your pages can use very similar descriptions with just the specific topic words changed.
An example of a good Meta Description for the Live Oak page might be:
<META NAME="description" CONTENT="Description of the California Live Oak with pictures, range map and growth patterns.">
As you can tell, creating a description of other pages, such as the Coast Redwood or the Douglas Iris, would be extremely easy. Use the same text and change the plant name, adding or removing the other parts depending on the contents of the page.
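The reuse described above can be sketched in a few lines of Python. This is only an illustration: the template wording follows the Live Oak example, and the names TEMPLATE, join_features and meta_description are made up for this sketch rather than taken from any site search tool.

```python
from html import escape

# Hypothetical template; the wording follows the Live Oak example.
TEMPLATE = "Description of the {plant} with {features}."


def join_features(features):
    """Join feature names as 'a, b and c' (expects at least one feature)."""
    if len(features) == 1:
        return features[0]
    return ", ".join(features[:-1]) + " and " + features[-1]


def meta_description(plant, features):
    """Return a META description tag for one plant page."""
    content = TEMPLATE.format(plant=plant, features=join_features(features))
    # Escape quotes and angle brackets so the attribute value stays valid HTML.
    return '<META NAME="description" CONTENT="{}">'.format(
        escape(content, quote=True)
    )


# Pages with fewer parts simply pass a shorter feature list.
print(meta_description("Coast Redwood", ["pictures", "range map"]))
print(meta_description("Douglas Iris", ["pictures"]))
```

Each page still deserves a quick read-through afterwards, since a description that does not match the page contents is worse than none.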
Meta Keywords
Keywords are also an important part of your pages. They allow search engines to identify the most important elements of the page and to rank the results so that the most relevant pages are at the top. You can also include common misspellings or other words that may not appear anywhere on the page. A good set of keywords encapsulates the specific topics the page covers.
An example of Meta Keywords for the Live Oak page would be:
<META NAME="keywords" CONTENT="California Live Oaks, Coast Live Oak, liveoak, Quercus Agrifolia, oak woodlands, range map, native plants, native trees">
This describes the topics on the page and means that it would be retrieved if someone does a search on any of these terms, even if they are not in the text.
Meta Keywords help search engines define the relevancy of a match. If the word "white" is anywhere in the text, the search engine will retrieve the Live Oak page on a search for "California White Oak". But because the word is not in the Meta Keywords, it can rank this page lower than others which have "white" as a keyword.
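To check what a search tool is likely to pick up from a page, you can extract the title and Meta tags yourself. The following is a minimal sketch using Python's standard html.parser module; the class and function names (PageSummaryParser, summarize) are invented for this example, and real indexers may read these fields differently.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collect the page title and the NAME/CONTENT pairs of META tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html_text):
    """Return the title, description and keyword list found in a page."""
    parser = PageSummaryParser()
    parser.feed(html_text)
    parser.close()
    keywords = [
        word.strip()
        for word in parser.meta.get("keywords", "").split(",")
        if word.strip()
    ]
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": keywords,
    }
```

Running summarize over every page on the site is a quick way to find pages with missing titles, descriptions or keywords before the search tool indexes them.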
Headings
Many search engines also use headings to rank a page in relevance for a particular search. They assume that words in headings are more important than words in the text, so the pages are more relevant to that search.
For example, if you search for "Oak" and "Range", search engines will rank a page with both those words marked with HTML header tags higher than pages with those words only in the text.
Consider vocabulary when you create pages, and think of your headers as small descriptions of those sections.
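The effect of titles, keywords and headings on ranking can be illustrated with a small scoring sketch. The weights below are invented for the example: each search engine uses its own algorithm, so treat this as a way to reason about relevance, not as how any particular tool works.

```python
import re

# Hypothetical weights; real search engines use their own formulas.
FIELD_WEIGHTS = {"title": 5.0, "keywords": 4.0, "headings": 3.0, "body": 1.0}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, query):
    """Add up weighted counts of each query term in each field of a page."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokenize(page.get(field, ""))
        for term in terms:
            total += weight * words.count(term)
    return total


def rank(pages, query):
    """Return titles of matching pages, best score first."""
    scored = [(score(page, query), page["title"]) for page in pages]
    scored.sort(key=lambda pair: -pair[0])
    return [title for points, title in scored if points > 0]
```

With these weights, a page that has "Oak" and "Range" in a heading outranks a page that only mentions both words in its body text.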
Choosing a Site Search Tool
- price
- platform
- capacity
- ease of installation
- maintenance
Example: University of Pennsylvania Site Search Needs Analysis and Selection Process
An excellent example of site search requirements, analysis, selection and installation process is available at the University of Pennsylvania's web team area.
They have kindly allowed others to view their information and notes, providing a model of the procedures they followed from late 1996 through the installation of AltaVista Search Intranet in 1997. They mention that products and features have changed since then, so the results might well be different if they were going through the same process today.
U Penn Procedures:
- Create a plan and schedule.
- Screen available search tools based on compatibility and requirements.
- Define preliminary end-user requirements
- Check existing listings of site search tools
- Choose the most appropriate options for additional research
- Develop technical requirements document for end-user needs (boolean searching, results listings, etc.), administration, cost of ownership, vendor reliability, hardware and OS compatibility.
- Evaluate options based on requirements in a table.
- Install test versions of the final candidate products.
- Perform automated and manual user tests and evaluate results.
- Define and develop required local customization.
- Install and publicize the new search tools.
Disclaimer: The University of Pennsylvania makes no commercial endorsement of any consulting service or specific product.
Another good example of the process of choosing and installing a site search tool covers several Education Department sites. The group set up a requirements document and tested Netscape Catalog (later replaced by Compass Server), InQuery, Verity Search '97 and Ultraseek, which they ultimately chose.
Testing Notes
Follow the instructions carefully and take copious notes as you go through the steps. You may not have to reinstall the software for months or even years, and it's often difficult to reconstruct your work. The notes will also help if you install an upgrade to the software.
Site Search Engine Issues
- The search engine is the application that searches the data and returns the results to the client. This usually means creating an HTML page in the specified format.
- Most search engines search within an index, created by an Index application. A number of search engines just search the files in real-time, but that can get very slow.
To send a search to the search engine, most systems include forms. The site visitor enters their search terms in a text field, and may select appropriate settings in the form. When they click the Submit button, the server passes that data to the search engine application.
Types of Site Search Engines
CGI Programs
- The Common Gateway Interface (CGI) standard allows a web server to communicate with external programs.
- Most site search CGIs are invoked by a site visitor filling in data and clicking a Search or Submit button on an HTML form. They take the data from a form as parameters, search for the terms, limit the results according to any other settings, and return the result list as an HTML page.
- CGI programs can be written in everything from C to Perl to AppleScript, depending on the web server and the platform. Many CGIs are portable from Unix to Windows and Macs, depending on the language and libraries they use. CGIs are compatible with many different web servers, but there is some overhead in sending the data back and forth, and some cases where the CGI programs can become overwhelmed. See also Plug-Ins.
- For more information on CGI concepts, see the CGI overview at NCSA.
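As a rough illustration of the CGI flow described above, here is a minimal Python sketch. The web server passes the form data in the QUERY_STRING environment variable, which is part of the CGI standard; the parameter names q and max, the PAGES data and the function names are assumptions made up for this example, and a real tool would search its index file instead of an in-memory dictionary.

```python
import os
from html import escape
from urllib.parse import parse_qs

# Hypothetical stand-in for a real search index: URL -> (title, text).
PAGES = {
    "/oak.html": ("California Live Oak", "live oak range map and growth patterns"),
    "/iris.html": ("Douglas Iris", "douglas iris range map"),
}


def parse_search_request(query_string):
    """Extract search terms and a result limit from the form data."""
    params = parse_qs(query_string)
    terms = params.get("q", [""])[0].lower().split()
    try:
        limit = int(params.get("max", ["10"])[0])
    except ValueError:
        limit = 10  # fall back to a default when the setting is not a number
    return terms, limit


def search(terms):
    """Return (url, title) pairs for pages containing every term."""
    if not terms:
        return []
    return [
        (url, title)
        for url, (title, text) in sorted(PAGES.items())
        if all(term in text.split() for term in terms)
    ]


def render_results(terms, matches):
    """Build the HTML result page, escaping all visitor-supplied text."""
    items = "".join(
        '<li><a href="{}">{}</a></li>'.format(escape(url, quote=True), escape(title))
        for url, title in matches
    )
    heading = escape(" ".join(terms))
    return "<html><body><h1>Results for {}</h1><ul>{}</ul></body></html>".format(
        heading, items
    )


def main():
    terms, limit = parse_search_request(os.environ.get("QUERY_STRING", ""))
    print("Content-Type: text/html")
    print()
    print(render_results(terms, search(terms)[:limit]))


if __name__ == "__main__":
    main()
```

Escaping the search terms before echoing them back matters: without it, a visitor could inject markup into the result page.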
Perl Scripts
- Perl is a scripting language, and is not compiled to object binary like C or Pascal. It has its own syntax and libraries, and communicates with web servers using the CGI standard. You can use Perl scripts on most platforms and with most web servers.
- Several web site search tools are written in Perl: see the Perl listing for details.
- For more information, see the Perl Institute.
Server Plug-Ins
- For better data interchange, less overhead and more flexibility, web server companies have defined APIs (Application Programmer Interfaces) to their servers. This allows third-party developers to create modules for the servers that run inside the server process.
- Several web site search tools are written to various server APIs. They are rarely portable and generally compiled to binary object code.
- For more information, see NSAPI (Netscape's API), ISAPI (Microsoft's API), and W*API (WebSTAR and other Mac servers API).
Java Applications
- Applications written in the Java language, which run in the Java Virtual Machine. Applets are small Java applications that run inside the browser program.
Java Servlets
- Many web servers now exchange data with applications written in Java through the Java Servlet API, much as they do with CGI programs. Because Java is designed to be cross-platform, many Java Servlets can run almost anywhere.
- For more information, see
Search Servers
- Some search engines run as separate servers. The form data is passed as part of the URL, just like a CGI request, but the search engine application runs as a separate HTTP server on a different machine. This reduces the load on the main web server substantially.
Compatibility
Search Options
- Natural Language Processing
- Boolean Operators
- Vectors
- Fuzzy Matching
- Phrase Searching
- Proximity Matching
- Concept Browsing & Automatic Matching
- Thesaurus
- Query By Example
- Stemming & Substitutions
- Non-English character matching
- Special features (price-range searching, for example)
- Spelling error tolerance.
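As one small illustration of these options, spelling error tolerance can be approximated by comparing a search term against the words already in the index. The sketch below uses difflib from the Python standard library; the vocabulary, the cutoff value and the function name suggest_terms are assumptions for this example, and commercial tools use their own matching methods.

```python
import difflib

# Hypothetical list of words taken from a site's index.
VOCABULARY = ["redwood", "oak", "iris", "range", "native"]


def suggest_terms(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return up to three indexed words that closely resemble the term."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


# A visitor who mistypes a word can still be offered the closest indexed term.
print(suggest_terms("redwod"))
```

A lower cutoff tolerates more errors but also suggests more unrelated words, so the value is worth testing against real queries from your own visitors.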
Site Search Indexing
- The Search Indexer is the application which reads the text of the documents to be searched and stores it in an efficient searchable form usually called the index (Microsoft calls it a "catalog").
- Web site indexers must be able to save files in a web server directory, so that the search engine can locate them when a site visitor wants to search the site. Remote search engines store the index files on their server, where they are used by the search engine when the user starts searching.
- Local File Indexes locate the files to index by following the directory structure of the hard drive, usually starting with the web server root directory. They will index files based on their location in the directory, rather than following links. Most local file indexes allow you to limit the indexing by file name, type, extension, and/or location.
Updating Local File Indexes
When updating, local file indexes can check the system update date for the file, and only index new or modified files. Some indexes which are tightly linked to their operating system can be notified about file changes and creation in the specified folders, and will only update index entries for those files.
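The date check described above can be sketched as follows. This is a simplified illustration, assuming the indexer remembers the time of its last run as a Unix timestamp; the extension list and the function name files_to_index are made up for this example.

```python
import os

# Assumption: only these file types should be indexed.
INDEXED_EXTENSIONS = {".html", ".htm", ".txt"}


def files_to_index(root, last_run, extensions=INDEXED_EXTENSIONS):
    """Walk the directory tree and return files modified after last_run.

    root is the web server root directory and last_run is the time of the
    previous indexing run, in seconds since the epoch.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories so they are never indexed by accident.
        dirnames[:] = [name for name in dirnames if not name.startswith(".")]
        for name in filenames:
            if os.path.splitext(name)[1].lower() not in extensions:
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_run:
                changed.append(path)
    return sorted(changed)
```

Note that this approach only finds new and modified files. Deleted files need a separate pass that compares the index entries against the files still on disk.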
Local Indexing and Dynamic Elements
Local indexes will get the page exactly as it appears on the local disk. They will not include dynamic data from CGIs, SSI (server-side includes), ASP (active server pages) and so on. This can be an advantage if the dynamic elements are repetitive, such as navigation bars, and should not be indexed. In addition, these pages will not be marked as modified unless the content of the page has changed, so they will only be re-indexed when necessary.
Local Indexing and Security
You must be very careful about which files are allowed to stay in the indexed site directory. It's easy to index private and obsolete files by accident, allowing site visitors access to these files via the search engine. Even if the pages themselves cannot be read because they are protected by a password, unauthorized people could deduce the contents of these files by searching.
Robot Spider Indexes
- Robot Spider Indexes locate files to index by following links, just like webwide search engine spiders. You specify the starting page, and these indexes will request it from the server and receive it just like a browser. The index will store every word on the page and then follow each link on that page, indexing the linked pages and following each link from those pages. Most robot spider indexes allow you to designate several starting points, so even pages which are not linked from your main page can be indexed.
- Because they use HTTP, robot spider indexes can be slower than local file indexes, and can put more pressure on your web server, as they ask for each page. They will miss pages which have been accidentally unlinked from any of your starting points. And spiders may have problems with framed sites, just like webwide search engine robots.
- Updating Robot Indexes
- To update the index, the robot spider will query the web server about the status of each linked page by asking for the HTTP header using a "HEAD" request (the usual request for an HTML page is a "GET"). The server may be able to fill the HEAD request from an internal cache, without opening and reading the entire file, and so the interaction may be much more efficient. Then the index compares the modified date from the header with its own date for the last time the index was updated. If the page has not been changed, it doesn't have to update the index. If it has been changed, or if it is new and has not yet been indexed, the robot spider will then send a GET request for the entire page, and store every word.
- Robot Indexing and Dynamic Elements
- Robot spider indexes will receive each page exactly as a browser will receive it, with all dynamic data from CGIs, SSI (server-side includes), ASP (active server pages) and so on. This is vital to some sites, but other sites may find that the presence of these dynamic elements triggers the re-indexing process, although none of the actual text of the page has been changed.
- Robot Indexing and Security: Robots.txt
- Robot spiders cannot index unlinked files, so they will ignore all the miscellaneous files you may have in your web server directory. They can be controlled by the robots.txt file in the root folder of the host directory structure, just like webwide search tools, and the Meta tag in the format <META NAME="ROBOTS" CONTENT="NOINDEX">. For more information on this, see Web Server Administrator's Guide to the Robots Exclusion Protocol, Robot Exclusion Standard Revisited, and the Web Developers Virtual Library notes on robots.txt.
- Warning: For true security on your web site, do not rely on lack of linking and the robots.txt file. Only a password or IP-based security system will protect your files.
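The link-following and robots.txt behavior described in this section can be sketched in Python. To keep the example self-contained, pages come from a fetch function that you supply (for instance a dictionary lookup) rather than from real HTTP requests; the agent name "site-indexer" and the names LinkParser and crawl are assumptions for this sketch. The robots.txt handling uses the standard urllib.robotparser module.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collect the HREF value of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_urls, fetch, robots_lines, agent="site-indexer"):
    """Follow links breadth-first from the starting points.

    fetch(url) returns the page HTML, or None if the page is missing.
    robots_lines is the content of robots.txt as a list of lines.
    Returns the URLs that would be indexed, in the order visited.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)
    host = urlparse(start_urls[0]).netloc
    seen, visited = set(), []
    queue = deque(start_urls)
    while queue:
        url = urldefrag(queue.popleft())[0]  # drop any #fragment
        if url in seen or urlparse(url).netloc != host:
            continue  # already handled, or a link to another site
        seen.add(url)
        if not robots.can_fetch(agent, url):
            continue  # excluded by robots.txt
        page = fetch(url)
        if page is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(page)
        queue.extend(urljoin(url, href) for href in parser.links)
    return visited
```

In this sketch a page that no starting point links to is never reached, and a page under a disallowed path is skipped even when it is linked. Neither case makes the page private, since anyone who knows the URL can still request it.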
Index Formats
- An inverted index stores a list of entries made up of the search term (all the words in a page) and a pointer to the page that contains that search term. There may be other information as well.
- For local file indexes, this pointer is a local file path; for robot spider indexes, it is a URL. Some indexes store additional structured data in a database, so the inverted index points to the database entry which points to the page.
- The application sorts the data on the term so that the search engine can locate the matching terms extremely quickly. This is called "inverted" because the term is used to find the record, rather than the other way around.
- An index may contain gaps to allow for new entries to be added in the correct sort order without always requiring the following entries to be shifted out of the way.
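A toy version of an inverted index makes the idea concrete. The sketch below keeps the terms in a sorted list and uses binary search (the standard bisect module) to find a term quickly. The in-memory layout and the function names are assumptions for illustration; real index files use more compact on-disk formats.

```python
import bisect
import re
from collections import defaultdict


def build_inverted_index(pages):
    """Build (terms, postings) from a mapping of pointer -> page text.

    terms is a sorted list of every word; postings[i] is the sorted list
    of pointers (file paths or URLs) for the pages containing terms[i].
    """
    found = defaultdict(set)
    for pointer, text in pages.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            found[term].add(pointer)
    terms = sorted(found)
    postings = [sorted(found[term]) for term in terms]
    return terms, postings


def lookup(index, term):
    """Find the pages for one term with a binary search on the sorted terms."""
    terms, postings = index
    term = term.lower()
    position = bisect.bisect_left(terms, term)
    if position < len(terms) and terms[position] == term:
        return postings[position]
    return []
```

Because the terms are sorted, a lookup only needs a handful of comparisons even for a large vocabulary, which is why searching an index is so much faster than scanning the files.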
Other Index Issues
- Fields
- Vectors
- Stop words
- Stemming & Substitutions
- Specification
- Updating
- Multiple Machines
- Virtual Hosts
- Capacity
- Speed
- Size of Index
- Server Load
- Compatibility (Text, HTML, PDF, etc.)
Updating Indexes
Indexers must also update the index periodically, to stay synchronized with web pages as they change. Indexing can be extremely CPU-intensive, so you should schedule indexing for low-usage times. For sites which change quickly, the indexer must be very fast and efficient. Most indexers perform incremental updates, indexing the changes rather than starting from scratch every time. Some will not accept searches while the index is updating, others display the old index until the update is available, and still others attempt to provide new information as soon as it has been indexed.
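One way to keep serving the old index until the update is available is to write the new index to a temporary file and then swap it into place in a single step. The sketch below assumes the index is small enough to store as one JSON file, which is only a stand-in for a real index format; publish_index and load_index are hypothetical names.

```python
import json
import os
import tempfile


def publish_index(index, path):
    """Write the new index beside the old one, then swap it in atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    handle, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(handle, "w", encoding="utf-8") as temp_file:
            json.dump(index, temp_file)
        # os.replace swaps the file in one step, so searches see either
        # the complete old index or the complete new one, never a mix.
        os.replace(temp_path, path)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


def load_index(path):
    """Read the currently published index."""
    with open(path, encoding="utf-8") as index_file:
        return json.load(index_file)
```

The temporary file is created in the same directory as the index so that the swap stays on one file system, where the replacement can be atomic.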
Legacy Publishing and Searching, Other File Formats
Many local indexes can read and index non-HTML files such as PDF, word processing, spreadsheet, presentation, accounting and even database files. They use filters or viewers to translate the data, then index it normally.
Most search tools just provide links to the non-HTML files in the results list. When a searcher clicks on the link, their browser will probably just start downloading the file. Browsers and servers can only display files in formats they know about, such as text and HTML, or for which they have a Plug-In, such as Shockwave. So there is nothing to do but download the file. In most cases, that's not what a user is expecting, and it breaks the flow of attention, especially on a slow modem.
Some search tools, including Verity, Fulcrum and Quadralay, will automatically convert the data from its native format into HTML and serve it that way. While the formatting will probably be a bit awkward, displaying the data directly (even though it requires time to convert) keeps the user's attention and lets them stay in their browser instead of opening another application.
Search Forms Issues
- Simple vs. Advanced
- Options
- Popups
- Never use an unlabeled icon for a search button!
- etc.
Search Results Issues
- recall and precision
- HTML templates
- source code
- results ranking
- visual cues
- abstract / summaries
- sorting
- no matches
- too many matches
- search on found set
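Recall and precision are the standard measures for judging a results list: precision is the share of retrieved pages that are actually relevant, and recall is the share of relevant pages that were retrieved. A minimal sketch, assuming you have judged by hand which pages are relevant for a test query (the function name is made up for this example):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one test query.

    retrieved is the list of pages the search engine returned and
    relevant is the list of pages a person judged to be good answers.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

A search that returns every page on the site has perfect recall and poor precision, so the two numbers are only useful when read together.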
Maintenance and Updating
- Regular updating of the index
- Regular backups along with other server files
- Test scripting to uncover any corruption or problems
- Occasional rebuilds to remove obsolete data and improve efficiency
Search Log Analysis
- Regular viewing of server or search engine log
- Use search queries to improve keywords and add data
- Check for errors to improve wording of search forms and results listings
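A short script can do much of this log review. Log formats differ from tool to tool, so the sketch below assumes a hypothetical format with one search per line and three tab-separated fields: timestamp, query and number of matches. The function name is also made up for this example.

```python
from collections import Counter


def summarize_search_log(lines):
    """Count how often each query was run and which queries found nothing.

    Assumes a hypothetical log format: timestamp<TAB>query<TAB>match count.
    Returns (all queries, queries with no matches), most frequent first.
    """
    queries = Counter()
    no_matches = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not fit the assumed format
        _timestamp, query, count = parts
        query = " ".join(query.lower().split())  # normalize case and spacing
        queries[query] += 1
        if count.strip() == "0":
            no_matches[query] += 1
    return queries.most_common(), no_matches.most_common()
```

Frequent queries with no matches are good candidates for new Meta Keywords, common misspellings, or content that visitors expect and the site does not yet have.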
Site Search Tools
Copyright © 1998-1999
iFetchIt.com