Ch. 6: Query Reformulation
As noted in Chapter 3, a common search strategy is for the user to first issue a general query, look at a few results, and, if the desired information is not found, modify the query in an attempt to improve the results. This cycle is repeated until the user is satisfied or gives up. The previous two chapters discussed interfaces for query specification and for the presentation of search results. This chapter discusses the query reformulation step.
6.1: The Need for Reformulation
Examination of search engine query logs suggests a high frequency of query reformulation. One study by Jansen et al., 2005 analyzed 3 million records from a 24-hour snapshot of Web logs taken in 2002 from the AltaVista search engine. (The search activity was partitioned into sessions separated by periods of inactivity, and no effort was made to determine if users searched for more than one topic during a session. 72% of the sessions were less than five minutes long, and so one-topic-per-session is a reasonable, if noisy, estimate.) The analysis found that 52% of users modified their queries, with 32% issuing three or more queries within the session. Other studies show similar proportions of refinements, thus supporting the assertion that query reformulation is a common part of the search process.
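The session-segmentation step that such log studies depend on is straightforward to approximate: a user's queries are grouped into one session whenever the gap between consecutive queries stays below an inactivity threshold. The sketch below is a minimal illustration of this idea, not the exact procedure used by Jansen et al.; the record format and the 30-minute threshold are assumptions made only for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Each record is (user_id, timestamp, query); the field layout and the
# 30-minute inactivity threshold are illustrative assumptions.
INACTIVITY_GAP = timedelta(minutes=30)

def split_into_sessions(records):
    """Group query records into sessions separated by periods of inactivity."""
    by_user = defaultdict(list)
    for user_id, ts, query in records:
        by_user[user_id].append((ts, query))

    sessions = []
    for user_id, events in by_user.items():
        events.sort()                      # order each user's queries by time
        current = [events[0]]
        for prev, nxt in zip(events, events[1:]):
            if nxt[0] - prev[0] > INACTIVITY_GAP:
                sessions.append((user_id, current))   # gap too long: close the session
                current = []
            current.append(nxt)
        sessions.append((user_id, current))
    return sessions

# Example: two queries five minutes apart form a single session.
records = [
    ("u1", datetime(2002, 9, 1, 10, 0), "harry potter sales"),
    ("u1", datetime(2002, 9, 1, 10, 5), "harry potter worldwide sales"),
]
print(len(split_into_sessions(records)))   # -> 1
```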
Good tools are needed to aid the query reformulation process. When a searcher's chosen way of expressing an information need fails to match relevant documents, the searcher may become reluctant to radically modify the original query and instead stays stuck on the original formulation. Hertzum and Frokjaer, 1996 note that at this point “the user is subject to what psychologists call anchoring, i.e., the tendency to make insufficient adjustments to initial values when judging under uncertainty”. This can lead to “thrashing” on small variations of the same query. Russell, 2006 remarks on this kind of behavior in Google query logs. For example, for the task “Find out how many people have bought the new Harry Potter book so far”, he observes the following sequence of queries in one user session:
- Harry Potter and the Half-Blood Prince sales
- Harry Potter and the Half-Blood Prince amount sales
- Harry Potter and the Half-Blood Prince quantity sales
- Harry Potter and the Half-Blood Prince actual quantity sales
- Harry Potter and the Half-Blood Prince sales actual quantity
- Harry Potter and the Half-Blood Prince all sales actual quantity
- all sales Harry Potter and the Half-Blood Prince
- worldwide sales Harry Potter and the Half-Blood Prince
To show users helpful alternatives, researchers have developed several techniques to aid the query reformulation process (although existing tools may not be sophisticated enough to help with the information need shown above). This chapter describes interface technologies that support query reformulation, and the ways in which users interact with them.
6.2: Spelling Suggestions and Corrections
Search logs suggest that 10-15% of queries contain spelling or typographical errors (Cucerzan and Brill, 2004). Fittingly, one important query reformulation tool is the spelling suggestion or correction. Web search engines have developed highly effective algorithms for detecting potential spelling errors (Cucerzan and Brill, 2004, Li et al., 2006).
Before the web, spelling correction software was seen mainly in word processing programs. Most spelling correction software compared the author's words to those found in a pre-defined dictionary (Kukich, 1992), and did not allow for word substitution. With the enormous usage of Web search engines, it became clear that query spelling correction was a harder problem than traditional spelling correction, because of the prevalence of proper names, company names, neologisms, multi-word phrases, and very short contexts (some spelling correction algorithms make use of the sentential structure of text). Most dictionaries do not contain words like blog, shrek, and nsync.
But with the greater difficulty also came the benefit of huge amounts of user behavior data. Web spelling suggestions are produced with the realization that queries should be compared to other queries, because queries tend to have special characteristics, and there is a lot of commonality in the kinds of spelling errors that searchers make. A key insight for improving spelling suggestions on the Web was that query logs often show not only the misspelling, but also the corrections that users make in subsequent queries. For example, if a searcher first types schwartzeneger and then corrects this to schwarzenegger, and the latter spelling is correct, an algorithm can use this pair to guess the intended word. Experiments on algorithms that derive spelling corrections from query logs achieve accuracy in the range of 88-90% for coverage of about 50% of misspellings (Cucerzan and Brill, 2004, Li et al., 2006). In Web search engine interfaces, one alternative spelling is typically shown beneath the original query but above the retrieval results. The suggestion is also repeated at the bottom of the results page in case the user does not notice the error until they have scrolled through all of the suggested hits. As noted in Chapter 1, in most cases the interface offers the choice to the user without forcing acceptance of an alternative spelling, in case the system's correction does not match the user's intent. But in the case of a blatant typographical error, a user may prefer the correction to be made automatically to avoid the need for an extra click. To balance this tradeoff, some search engines show some hits containing their guess of the correct spelling interwoven with others that contain the original, most likely incorrect, spelling.
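The flavor of this log-mining idea can be conveyed with a small sketch: look for consecutive in-session queries that are nearly identical as strings, treat each such pair as a (misspelling, correction) candidate, and keep the candidates whose second form is far more frequent in the log overall. The function name, similarity measure, and thresholds below are illustrative assumptions, greatly simplified from the cited algorithms, not the method of Cucerzan and Brill.

```python
from collections import Counter
from difflib import SequenceMatcher

def correction_candidates(sessions, query_counts, min_sim=0.8, min_ratio=10):
    """Mine (misspelling, correction) pairs from consecutive in-session queries.

    sessions      -- list of query lists, one list per session
    query_counts  -- Counter of how often each query string occurs in the whole log
    min_sim       -- minimum string similarity for a pair to count as a respelling
    min_ratio     -- the "correction" must be this many times more frequent
    """
    pairs = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first == second:
                continue
            sim = SequenceMatcher(None, first, second).ratio()
            frequent_enough = query_counts[second] >= min_ratio * max(query_counts[first], 1)
            if sim >= min_sim and frequent_enough:
                pairs[(first, second)] += 1
    return pairs.most_common()

# Example: one user retypes a name; the corrected form is far more common overall.
sessions = [["schwartzeneger movies", "schwarzenegger movies"]]
counts = Counter({"schwartzeneger movies": 3, "schwarzenegger movies": 500})
print(correction_candidates(sessions, counts))
```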
There are no published large-scale statistics on user uptake of spelling correction, but a presentation by Russell, 2006 shows that, for those queries that are reformulations, and for which the original query consisted of two words, 33% of the users making reformulations used the spelling correction facility. For three-word query reformulations, 5% of these users used the spelling suggestion.
In an in-person study conducted with a statistically representative subject pool of 100 people, Hargittai, 2006 studied the effects of typographical and spelling errors. (Here typographical means that the participant knows the correct spelling but made a typing mistake, whereas spelling error means the participant does not know the correct spelling.) Hargittai, 2006 found that 63% of the participants made a mistake of some kind; among these, 35% made only one mistake, but 17% made four or more errors during their entire session. As might be predicted, lower education predicted a higher number of spelling errors, but an interesting finding was that the higher the participant's income, the more likely they were to make a typographical error. Older participants were also more likely to make spelling errors. The most surprising result, however, was that of the 37 participants who made an error while using Google search, none clicked on the spelling corrections link. This would seem to contradict the statistics from Russell, 2006. It may be the case that in Hargittai's data, participants made errors on longer queries exclusively, or that those from a broader demographic do not regularly make use of this kind of search aid, or that the pool was too small to observe the full range of user behavior.
6.3: Automated Term Suggestions
The second important class of query reformulation aids is automatically suggested term refinements and expansions. Spelling correction suggestions are also query reformulation aids, but the phrase term expansion is usually applied to tools that suggest alternative words and phrases. In this usage, the suggested terms either replace or augment the current query. Term suggestions that require no user input can be generated from characteristics of the collection itself (Schütze and Pedersen, 1994), from terms derived from the top-ranked results (Anick, 2003, Bruza and Dennis, 1997), from a combination of both (Xu and Croft, 1996), from a hand-built thesaurus (Voorhees, 1994, Sihvonen and Vakkari, 2004), from query logs (Cui et al., 2003, Cucerzan and Brill, 2005, Jones et al., 2006), or by combining query logs with navigation or other online behavior (Parikh and Sundaresan, 2008).
Usability studies are generally positive as to the efficacy of term suggestions when users are not required to make relevance judgements and do not have to choose among too many terms. Some studies have produced negative results, but they seem to stem from problems with the presentation interface. Generally it seems users do not wish to reformulate their queries by selecting multiple terms, but many researchers have presented study participants with multiple-term selection interfaces.
For example, in one study by Bruza et al., 2000, 54 participants were exposed to a standard Web search engine, a directory browser, and an experimental interface with query suggestions. This interface showed upwards of 40 suggested terms and hid the results listing until after the participant selected terms. (The selected terms were conjoined to those in the original query.) The study found that automatically generated term suggestions resulted in higher average precision than using the Web search engine, but with a slower response time and the penalty of a higher cognitive load (as measured by performance on a distractor task). No subjective responses were recorded. Another study using a similar interface and technology found that users preferred not to use the refinements in favor of going straight to the search results (Dennis et al., 1998), underscoring the search interface design principle that search results should be shown immediately after the initial query, alongside additional search aids.
6.3.1: Prisma
Interfaces that allow users to reformulate their query by selecting a single term (usually via a hyperlink) seem to fare better. Anick, 2003 describes the results of a large-scale investigation of the effects of incorporating related term suggestions into a major Web search engine. The term suggestion tool, called Prisma, was placed within the AltaVista search engine's results page (see Figure 6.1). The number of feedback terms was limited to 12 to conserve space in the display and minimize cognitive load. Clicking on a hyperlink for a feedback term conjoined the term to the current query and immediately ran a new query. (The chevron (>>) to the right of the term replaced the query with the term, but its graphic design did not make it clearly clickable, and few searchers used it.) Term suggestions were derived dynamically from an analysis of the top-ranked search results.
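The general flavor of deriving suggestions from top-ranked results can be sketched simply: collect terms from the top-ranked documents, discard stopwords and the query's own terms, and surface the most frequent remainder as candidate refinements. This is a deliberately crude stand-in for Prisma's actual linguistic analysis; the stopword list and frequency scoring are assumptions, and only the limit of 12 displayed terms mirrors the design constraint mentioned above.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on"}

def suggest_refinements(query, top_documents, max_terms=12):
    """Return up to max_terms candidate refinement terms drawn from top results.

    A frequency-based simplification of feedback-term extraction: terms already
    in the query and common stopwords are excluded.
    """
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in top_documents:
        for term in re.findall(r"[a-z]+", doc.lower()):
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    return [term for term, _ in counts.most_common(max_terms)]

# Example usage on two toy "documents":
docs = [
    "Triassic period fossils and the Triassic extinction event",
    "Dinosaurs of the late Triassic period",
]
print(suggest_refinements("triassic", docs))   # e.g. ['period', 'fossils', ...]
```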
The study created two test groups by serving different Web pages to different IP addresses (using bucket testing, see Chapter 2). One randomly selected set of users was shown the Prisma terms, and a second randomly selected set of users was shown the standard interface, to act as a control group. Analysis was performed on anonymized search logs, and user sessions were estimated to be bursts of activity separated by 60 minutes of no recorded activity. The Prisma group was shown query term refinements over a period of five days, yielding 15,133 sessions representing 8,006 users. The control group included 7,857 users and 14,595 sessions. Effectiveness of the query suggestions was measured in terms of whether or not a search result was clicked after the use of the mechanism, as well as whether or not the session ended with a result click.
In the Prisma group, 56% of sessions involved some form of refinement (which includes manual changes to the query without using the Prisma suggestions), compared to 53% of the control group's sessions, which was a significant difference. In the Prisma condition, of those sessions containing refinements:
- 25% of the sessions made use of the Prisma suggestions,
- 16% of the users applied the Prisma feedback mechanism at least once on any given day,
- When followed over a further two weeks, 47% of those users used Prisma again within the two-week window, and
- over that period, the percentage of refinement sessions using the suggestions increased from 25% to 38%.
Despite the large degree of uptake, effectiveness as measured by the occurrence of search result clicks did not differ between the baseline group and the Prisma group. However, the percentage of clicks on Prisma suggestions that were followed immediately by result clicks was slightly higher than the percentage of manual query refinements followed immediately by result clicks.
This study also examined the frequency of different refinement types. The most common refinements were:
- Adding or changing a modifier (e.g., changing buckets wholesale to plastic buckets): 25%
- Elaborating with further information (e.g., jackson pollock replaced by museum of modern art): 24%
- Adding a linguistic head term (e.g., converting triassic to triassic period): 15%
- Expressing the same concept in a different way (e.g., converting job listings to job openings): 12%
- Other modifications (e.g., replacing with hyponyms, morphological variants, syntactic variants, and acronyms): 24%.
6.3.2: Other Studies of Term Suggestions
In a more recent study, White et al., 2007 compared a system that makes term suggestions against a standard search engine baseline and two other experimental systems (one of which is discussed in the subsection below on suggesting popular destinations). Query term suggestions were computed using a query log. For each query, queries from the log that contained the query terms were retrieved. These were divided into two sets: the 100 most frequent queries containing some of the original terms, and the 100 most frequent queries that followed the target query in query logs -- that is, user-generated refinements. These candidates were weighted by their frequency in each of the two sets, and the six top-scoring candidates were shown to the user after they issued the target query. Suggestions were shown in a box on the top right-hand side of the search results page.
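A minimal sketch of this log-based suggestion scheme follows. The additive frequency weighting and the helper names are assumptions for illustration; the exact scoring used by White et al. is not reproduced here.

```python
from collections import Counter

def query_suggestions(target, log_queries, followups, n_candidates=100, n_show=6):
    """Suggest reformulations for `target` from a query log.

    log_queries -- all queries in the log (used for term-overlap candidates)
    followups   -- queries observed immediately after `target` in sessions
    """
    target_terms = set(target.lower().split())

    # Set 1: most frequent log queries sharing at least one term with the target.
    overlapping = Counter(
        q for q in log_queries
        if q != target and target_terms & set(q.lower().split())
    )
    set1 = dict(overlapping.most_common(n_candidates))

    # Set 2: most frequent user-generated refinements of the target query.
    set2 = dict(Counter(followups).most_common(n_candidates))

    # Score each candidate by its frequency in both sets and keep the top few.
    candidates = set(set1) | set(set2)
    scored = {q: set1.get(q, 0) + set2.get(q, 0) for q in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:n_show]

# Example usage:
log = ["jets schedule", "jet li movies", "fighter jets", "jets schedule"]
after = ["fighter jets", "jet airways"]
print(query_suggestions("jets", log, after))
```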
White et al., 2007 conducted a usability study with 36 participants, each doing two known-item tasks and two exploratory tasks, and each using the baseline system, the query suggestions, and two other experimental interfaces. For the known-item tasks, the query suggestions scored better than the baseline on all measures (easy, restful, interesting, etc.). Participants were also faster using the query suggestions than the baseline on known-item tasks (although tied with one experimental system), and made use of the query suggestions 35.7% of the time. Those who preferred the query suggestion interface said it was useful for saving typing effort and for coming up with new suggestions. (The experimental system for suggesting destinations was more effective and preferred for exploratory tasks.)
In the BioText project, Divoli et al., 2008 experimented with alternative interfaces for term suggestions in the specialized technical domain of searching over genomics literature. They focused specifically on queries that include gene names, which are commonly used in bioscience searches, and which have many different synonyms and forms of expression. Divoli et al., 2008 first issued a questionnaire in which they asked 38 biologists what kind of information they would like to see in query term suggestions, finding strong support for gene synonyms and homologues. Participants were also interested in seeing information about genes associated with the target gene, and localization information for genes (where they occur in organisms). It should be noted that a minority of participants were strongly opposed to showing additional information, unless it was shown as an optional link, in order to retain an uncluttered look for the interface.
A followup survey was conducted in which 19 participants from biology professions were shown four different interface mock-ups (see Figure 6.2). The first had no term suggestions, while the other three showed term suggestions for gene names, organized into columns labeled by similarity type (synonyms, homologues, parents, and siblings of the gene). Because participants had expressed a desire for reduced clutter, at most three suggestions per column were shown, with a link to view all choices.
Design 2 required selecting choices via individual hyperlinks, with an option to add all terms. Design 3 allowed the user to select individual choices via checkboxes, and Design 4 allowed all terms within a column to be selected with a single hyperlink. Design 3 was most preferred, with one participant suggesting that the checkbox design also include a select-all link within each column. Designs 4 and 2 were rated close to one another, and all three were strongly preferred over no synonym suggestions. These results suggest that for specialized technical domains and users, term suggestions can be even more favored than in general Web search.
6.3.3: Query Refinement Suggestions in Web Search Interfaces
The results of the Anick, 2003 and the White et al., 2007 studies are generally positive, and currently many Web search engines offer term refinement. For example, the Dogpile.com metasearch engine shows suggested additional terms in a box on the right hand side under the heading “Are you looking for?” (see Figure 6.3). A search on apple yields term suggestions of Apple the Fruit (to distinguish it from the computer company and the recording company), Banana, Facts about Apples, Apple Computers, Red Apple and others. Selecting Apple the Fruit retrieves Web pages that are about that topic, and the refinements change to Apple Varieties, Apple Nutrition, History Fruit Apple, Research on Fruit, Facts about the Fruit Apple, and others. Clicking on Facts about the Fruit Apple retrieves web pages containing lists of facts.
The Microsoft search site also shows extensive term suggestions for some queries. For instance, a query on the ambiguous term jets yields related query suggestions including Jet Magazine, Jet Airways, JetBlue, Fighter Jets, Jet Li and Jet Stream (see Figure 5.8 in Chapter 5).
Jansen et al., 2007b studied 2.5 million interactions (1.5 million of which were queries) from a log taken in 2005 from the Dogpile.com search engine. Using their computed session boundaries (a mean of 2.31 queries per session), they found that more than 46% of users modified their queries, 37% of all queries were part of reformulations, and 29.4% of sessions contained three or more queries. Within the sessions that contained reformulated queries, they found the following percentages of query modification actions (omitting statistics for starting a new topic):
- Assistance (clicked on a term refinement link offered under the question Are you looking for?): 22.2%
- Reformulation (the current query is on the same topic as the searcher's previous query, and shares one or more common terms with it): 22.7%
- Generalization (same topic, but seeking more general information): 7.2%
- Specialization (same topic, but seeking more specific information): 16.3%
- Content Change (identical query, but run on a different collection): 11.8%
- Specialization with reformulation: 9.9%
- Generalization with reformulation: 9.8%
(Here, collection refers to searching over Web pages versus images, videos, or audio data.) Thus, they found that 8.4% of all queries were generated by the reformulation assistant provided by Dogpile (see Figure 6.3), although they do not report on what proportion of queries were offered refinements. This is additional evidence that query term refinement suggestions are a useful reformulation feature. A recent study of Yahoo's search assist feature (Anick and Kantamneni, 2008) found similar results; the feature was used about 6% of the time.
6.4: Suggesting Popular Destinations
White et al., 2007 suggested another kind of reformulation information: showing popular destination Web sites. They recorded search activity logs for hundreds of thousands of users over a period of five months in 2005--2006. These logs allowed them to reconstruct the series of actions users took: going to a search engine page, entering a query, seeing results, following links, and reading Web pages. They determined when such a session trail ended by looking for a stopping point, such as staying on a page for more than 30 minutes, or a change of activity, such as switching to email or going to a bookmarked page. They distinguished session trails from query trails; the latter had the same stopping conditions as the former, but could also be ended by a return to a search engine page. Thus they were able to “follow” users along as they performed their information seeking tasks.
White et al., 2007 found that users generally browsed far from the search results page (around 5 steps), and that on average, users visited 2 unique domains during the course of a query trail, and just over 4 domains during a session trail. They decided to use the page a user ended up at as a suggested shortcut for a given query. Given a new query, its statistical similarity to previously seen query-destination pairs was computed, and popular final destinations for that query were then shown as suggested choices (see Figure 6.4). They experimented with suggestions from both query trails and session trails.
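The core of the destination-suggestion approach can be sketched as a lookup table mapping queries to the pages their trails ended on, with frequency counts determining which destinations count as popular. The simple term-overlap similarity below is an assumption standing in for the statistical similarity measure White et al. actually used, and the data format is invented for the example.

```python
from collections import Counter, defaultdict

def build_destination_index(query_trails):
    """Map each observed query to a Counter of the domains its trails ended at."""
    index = defaultdict(Counter)
    for query, end_domain in query_trails:
        index[query.lower()][end_domain] += 1
    return index

def suggest_destinations(new_query, index, n_show=3):
    """Suggest popular final destinations for a new query.

    Similarity to previously seen queries is approximated by term overlap;
    each matching query votes for its destinations, weighted by popularity.
    """
    new_terms = set(new_query.lower().split())
    votes = Counter()
    for seen_query, destinations in index.items():
        overlap = len(new_terms & set(seen_query.split()))
        if overlap:
            for domain, count in destinations.items():
                votes[domain] += overlap * count
    return [d for d, _ in votes.most_common(n_show)]

# Example usage with toy query trails (query, final domain):
trails = [
    ("hotels las vegas", "tripadvisor.com"),
    ("cheap hotels las vegas", "expedia.com"),
    ("las vegas shows", "vegas.com"),
]
index = build_destination_index(trails)
print(suggest_destinations("las vegas hotels", index))
```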
In the same study of 36 participants, they compared these two experimental approaches against a standard search engine baseline and a query suggestions interface, testing on both known-item tasks and exploratory tasks. For exploratory tasks, the destination suggestions derived from query trails scored better than the other three systems on perceptions of the search process (easy, restful, interesting, etc.) and usefulness (perceived as producing more useful and relevant results). Task completion time on exploratory tasks was approximately the same for all four interfaces; on known-item tasks the destination suggestions were tied in speed with the query term suggestions. In exploratory tasks, query trail destination suggestions were used more often (35.2% of the time) than either query term suggestions or session trail destination suggestions.
Participants who preferred the destination suggestions commented that they provided potentially helpful new areas to look at and allowed them to bypass the need to navigate to pages. They said they selected destinations because the destinations “grabbed their attention,” “represented new ideas,” or because they “couldn't find what they were looking for.” Those who did not like the suggestions cited the vagueness of showing only a Web site; presumably augmenting the destination views with query-biased summaries would make them more useful. The destination suggestions produced from session trails were sometimes very good, but were inconsistent in their relevance, a characteristic that users usually perceive negatively. Participants did not find the graphical bars indicating site popularity useful, mirroring other results of this kind.
6.5: Relevance Feedback
Another major technique to support query reformulation is relevance feedback. In its original form, relevance feedback refers to an interaction cycle in which the user reads retrieved documents and marks those that appear to be relevant, and the system then uses features derived from these selected relevant documents to revise the original query (Ruthven and Lalmas, 2003). In one variation, the system uses information from the marked documents to recalculate the weights for the original query terms, and to introduce new terms. In another variation, the system suggests a list of new terms to the user, who then selects a subset of these to augment the query (Koenemann and Belkin, 1996). The revised query is then executed and a new set of documents is returned. Documents from the original set can appear in the new results list, although they are likely to appear in a different rank order. In some cases the relevance feedback interface displays an indicator such as a marked checkbox beside the documents that the user has already judged. For most relevance feedback techniques, a larger number of marked relevant documents yields a better result.
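One classic realization of the first variation is the Rocchio formulation, in which the revised query vector is a weighted combination of the original query and the centroids of the judged documents. The version below is the standard textbook form, not necessarily the exact method used in the studies cited in this chapter:

```latex
\vec{q}_{\mathrm{new}} = \alpha\,\vec{q}_{\mathrm{orig}}
  + \frac{\beta}{|D_r|}\sum_{\vec{d} \in D_r} \vec{d}
  - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d} \in D_{nr}} \vec{d}
```

Here D_r and D_nr are the sets of documents marked relevant and non-relevant, and the constants α, β, and γ control how strongly the original query and the feedback documents contribute. Terms whose weights become large in the revised vector are, in effect, the new terms introduced by the feedback step.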
In a method known as pseudo-relevance feedback (also known as blind relevance feedback), rather than relying on the user to choose the top k relevant documents, the system simply assumes that its top-ranked documents are relevant, and uses these documents to augment the query with a relevance feedback ranking algorithm. This procedure has been found to be highly effective in some settings (Thompson et al., 1995, Kwok et al., 1995, Allan, 1995). However, it does not perform reliably when the top-ranked documents are not relevant (Mitra et al., 1998a).
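A minimal sketch of pseudo-relevance feedback in the Rocchio style follows: the top k results are simply assumed to be relevant, their term centroid is added to the original query vector, and the collection is re-ranked. The toy bag-of-words scoring and the parameter values are assumptions made for illustration; real systems use considerably more careful term weighting.

```python
from collections import Counter
import math

def to_vector(text):
    """Toy bag-of-words vector: raw term counts (an illustrative simplification)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def pseudo_relevance_feedback(query, documents, k=2, alpha=1.0, beta=0.75):
    """Rocchio-style blind feedback: assume the top k ranked results are relevant."""
    q_vec = to_vector(query)
    doc_vecs = [to_vector(d) for d in documents]

    # Initial ranking with the original query.
    ranked = sorted(range(len(documents)), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)

    # Centroid of the (assumed-relevant) top k documents.
    centroid = Counter()
    for i in ranked[:k]:
        centroid.update(doc_vecs[i])

    # Expanded query: alpha * original + (beta / k) * centroid.
    new_q = Counter({t: alpha * w for t, w in q_vec.items()})
    for t, w in centroid.items():
        new_q[t] += beta * w / k

    # Re-rank the collection with the expanded query.
    return sorted(range(len(documents)), key=lambda i: cosine(new_q, doc_vecs[i]), reverse=True)

docs = ["triassic period fossils", "triassic extinction event", "jurassic park movie"]
print(pseudo_relevance_feedback("triassic", docs))   # re-ranked document indices
```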
Relevance feedback in its original form has been shown -- in artificial settings -- to be an effective mechanism for improving retrieval results (Salton and Buckley, 1990, Harman, 1992, Buckley et al., 1994, Mitra et al., 1998a). For instance, a study by Kelly et al., 2005 compared carefully elicited user-generated term expansion with relevance feedback based on documents that an expert had pre-determined to be the most relevant. Relevance feedback using these highly relevant documents far outstripped user-generated term expansion. Kelly et al., 2005 used the highly relevant documents as an upper bound on performance, since ordinary users could not be expected to identify such documents.
This finding is echoed by another study, from the TREC HARD track, in which an expert was shown the documents pre-determined to be most relevant and spent three minutes per query choosing documents for relevance feedback purposes. The resulting improvement over the baseline run was 60% on the metric used to assess improvements from clarification dialogues (Allan, 2005). With user-generated additional terms, queries that were already performing well improved more than queries that were performing poorly originally. This study also found that spending more time in the clarification dialogue did not correlate with improved final results.
Despite its strong showing in artificial or non-interactive search studies, the use of classic relevance feedback in search engine interfaces is still very rare (Croft et al., 2001, Ruthven and Lalmas, 2003), suggesting that in practice it is not a successful technique. There are several possible explanations for this. First, most of the earlier evaluations assumed that recall was important, and relevance feedback's strength mainly comes from its ability to improve recall. High recall is no longer the standard assumption when designing and assessing search results; in more recent studies, the ranking is often assessed on the first 10 search results. Second, relevance feedback results are not consistently beneficial; these techniques help in many cases but hurt results in other cases (Cronen-Townsend et al., 2004, Marchionini and Shneiderman, 1988, Mitra et al., 1998a). Users often respond negatively to techniques that do not produce results of consistent quality. Third, many of the early studies were conducted on small text collections. The enormous size of the Web makes it more likely that the user will find relevant results with fewer terms than is the case with small collections. And in fact there is evidence that relevance feedback results do not significantly improve over web search engine results (Teevan et al., 2005b).
But probably the most important reason for the lack of uptake of relevance feedback is that the method requires users to make relevance judgements, which is an effortful task (Croft et al., 2001, Ruthven and Lalmas, 2003). Evidence suggests that users often struggle to make relevance judgements (White et al., 2005), especially when they are unfamiliar with the domain (Vakkari, 2000b, Vakkari and Hakala, 2000, Spink et al., 1998). In addition, when many of the earlier studies were done, system response time was slow and the user was charged a fee for every query, so correct query formulation was much more important than it is in the rapid response cycle of today's search engines. (By contrast, a search engine designed for users in the developing world, where the round trip for retrieval results can take a day or more, has renewed interest in accurate query formulation (Thies et al., 2002).) The evidence suggests it is more cognitively taxing to mark a series of relevance judgements than to scan a results listing and type in a reformulated query.
6.6: Showing Related Articles (More Like This)
To circumvent the need to select multiple relevant documents, Aalbersberg, 1992 introduced incremental relevance feedback, which requires the user to judge only one document at a time. Similarly, some Web-based search engines have adopted a “one-click” interaction method. In the early days of the Web, the link was usually labeled “More like this”, but other labels have been used, such as “Similar pages” or, at the biomedical search engine PubMed, “Related articles”. (This is not to be confused with “Show more results at this site”, which typically re-issues the query within a subdomain.)
More recently in PubMed, after a user chooses to view an article, the titles of some related articles are shown along the right hand side (see Figure 6.5). Related articles are computed using a probabilistic model of how well their topics match (Lin and Wilbur, 2007). These related article links are relatively heavily used by searchers. Lin et al., 2008 studied a week's worth of query logs from PubMed in June 2007, observing about 2 million sessions that included at least one PubMed query and abstract view. Of these, 360,000 sessions (18.5%) included a click on a suggested related article, representing about one fifth of non-trivial search sessions. They also found that as session lengths increased, the likelihood of selecting a related article link grew, and that once users started selecting related articles, they continued doing so more than 40% of the time.
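The “Related articles” links themselves can be served cheaply because the neighbors are computed ahead of time. The sketch below precomputes, for each article, its most similar neighbors using Jaccard overlap of index terms; this is only a crude stand-in for the probabilistic model of Lin and Wilbur, and the article data format is an assumption.

```python
def jaccard(a, b):
    """Set-overlap similarity; a crude stand-in for PubMed's probabilistic model."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def precompute_related(articles, n_neighbors=5):
    """Build a related-articles table: article id -> list of most similar ids.

    `articles` maps an article id to its index terms (e.g., title/abstract words
    or MeSH-like descriptors); the format is an illustrative assumption.
    """
    related = {}
    for pmid, terms in articles.items():
        scored = [
            (other, jaccard(terms, other_terms))
            for other, other_terms in articles.items() if other != pmid
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        related[pmid] = [other for other, score in scored[:n_neighbors] if score > 0]
    return related

# Example usage: the links shown beside an abstract are then just a table lookup.
articles = {
    "a1": ["p53", "tumor", "suppressor"],
    "a2": ["p53", "apoptosis", "tumor"],
    "a3": ["influenza", "vaccine"],
}
related = precompute_related(articles)
print(related["a1"])   # -> ['a2']
```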
Thus, the evidence suggests that showing similar articles can be useful in literature search, but its utility for other kinds of search is unclear. Related article links act as a “black box” for users, meaning they cannot see why one set of articles is considered related and another is not. Furthermore, users have no control over the ways in which other articles are related. Interfaces that allow users to select a set of categories or dimensions along which documents are similar may be more effective for this purpose, as discussed in Chapter 8 on integrating navigation and search.
6.7: Conclusions
When an initial query is unsuccessful, a searcher can have trouble thinking of alternative ways to formulate it. Query reformulation tools can be a highly effective part of the search user interface. Query reformulation is in fact a common search strategy, as evidenced by the statistics presented throughout this chapter -- roughly 50% of search sessions involve some kind of query reformulation.
Both spelling suggestions and term expansions are effective reformulation tools -- term suggestion tools are used roughly 35% of the time that they are offered to users. Additionally, showing popular destinations for common queries, and showing related articles for research-style queries have both been shown to be effective. However, relevance feedback as traditionally construed has not been proven successful in an interactive context.