Ch. 9: Personalization in Search
Currently most Web search engines produce the same results independently of who the user is; search engines are in fact designed to do well at meeting the average goals of the group over the specific goals of the individual (Teevan et al., 2005a). Nonetheless, researchers, marketers, and search pundits alike often state that search results should reflect the fact that people differ in their needs, goals, and preferences (Pitkow et al., 2002). As a typical example, an avionics expert should see a different results ranking for a query on jets than a football fan from New York.
Tailoring search results to an individual's interests is just one form of information personalization. The term is currently applied to a diverse array of information applications. Given a profile of a user's interests, information content can be personalized by: re-ranking search results listings to better reflect the user's preferences, automatically notifying a user when new articles of interest appear or important changes are made to existing documents, recommending items that are related in some way to an item currently being viewed, and customizing the look and content of a web site or other information artifact.
Table 9.1:

Using Explicit Preferences
  Individual
    - User-created customization of Web sites (S. 9.1.1)
    - User-created alerts and interest profiles (S. 9.1.2)
    - User-specified relevance judgements for creating alerts (also called Routing, Filtering, and Standing Queries) (S. 9.1.3)
    - Combining user-created and machine-augmented alerts (S. 9.1.4)
  Collective
    - Content determined by explicitly selected user groups (S. 9.1.5)
    - Collaborative filtering, based on similarity to other users' explicit relevance judgements (S. 9.1.6)
    - Collaborative filtering, based on similarity of content to explicitly selected content (S. 9.1.6)
    - Item-based collective recommendations (S. 9.1.7)
Using Implicit Preferences
  Individual
    - Information alerts and profiles built from the user's search history (S. 9.2.1)
    - The same, combined with document viewing history (S. 9.2.1)
    - The same, combined with mapping viewed pages into categories (S. 9.2.2)
  Collective
    - Re-ranking of search results based on collective clickthrough (S. 9.2.3)
    - Collaborative filtering based on similarity to other users' implicit document viewing behavior (S. 9.2.4)
Explicit and Implicit
  Individual
    - Modifying implicit alerts with explicit relevance judgements (S. 9.3.1)
    - Personal information assistants or agents (S. 9.3.1)
Information personalization approaches vary along several dimensions. Some make use of information that users explicitly specify about themselves and their interests, while others attempt to infer users' preferences implicitly from their actions. Some approaches note that individuals tend to have both long-term, stable interests, and short-term temporary interests, and that these can come into conflict when proposing recommendations based on past actions. Some approaches make use of an individual's actions alone, while others attempt to improve results for an individual by harnessing the actions of the many users of the system, using what can be referred to as social or collective information. When determining how to make recommendations, some systems make content-based suggestions, based on similarity of keywords and text, whereas others make user-based recommendations, in which the searcher is grouped with other individuals who have expressed similar preferences in the past.
This chapter discusses information personalization applications along the dimensions described above. The combinations are summarized in Table 9.1. Section 9.1 discusses personalization based on information explicitly specified by users, including customization of Web sites, manual profile creation, creation of profiles via explicit relevance judgements, and collaborative filtering. Section 9.2 discusses personalization methods that make use of the information implicit in users' actions as they go about their information seeking tasks. Topics include: building profiles based on observing which documents a user looks at while doing their information seeking, organizing this implicit information into pre-defined category structures, using the information about the behavior of large numbers of people clicking on information items, and using implicit information as data for collaborative filtering. Section 9.3 describes personal information assistants (also known as agents), which engage the user more interactively, and combine explicit and implicit relevance information. Section 9.4 discusses the related area of search over users' personal information. Finally, Section 9.5 summarizes the current status of the use of personalization in information delivery systems.
9.1: Personalization Based on Explicit Preferences
9.1.1: User-Driven Customization
Web portals allow users to customize the kinds of information they see on their home page, including which news feeds to receive, which cities to show the current weather for, and which sports teams to track (Manber et al., 2000, Pierrakos et al., 2003). These portals provide users with an enormous breadth of choices over content. However, it is often reported that users tend to be reluctant to change default settings and use customization features; a study of MyYahoo users found that “a majority of active users” did not customize their pages (Manber et al., 2000). To encourage customization, some sites require users to customize their preferences when they register. At one time in the past, the sports Web site ESPN.com required users to indicate their favorite sports and teams directly on the new user registration page. (However, this information was not used to customize the appearance of the user's pages on the site.)
Semi-automated customization is in wide use on the Web; mapping sites and portals keep track of a registered user's zip code and use this information to geographically customize content such as weather and local shopping search results. Airline sites retain information about a user's favorite travel destinations, seating, and meal preferences. E-commerce sites retain payment information to accelerate the purchasing process. By contrast, customization for search results is not widely used.
An early version of Google search results personalization asked users to specify their interests (Teevan et al., 2005a). More recently, Google introduced a personalization interface that the user sees only when logged into their Google account. The interface allows the user to explicitly move search hits up in the rankings by clicking on a button labeled with an arrow, or hide hits in a menu at the bottom of the page by clicking on an “x” icon (see Figure 9.1). The changes in rank ordering are preserved for the next time the user issues the same query; it is unclear whether the user operations are used for reweighting hits for other queries. No usability results have yet been reported on this feature.
9.1.2: User-Created Alerts and Interest Profiles
A commonly used form of explicitly personalized information is customized alerting services, which are commonly offered by web-based content brokers. Examples are services that send a subscriber news of standing interest (e.g., “send me all articles that mention my company's name”), or alert a researcher to late-breaking journal articles on a particular topic. Information filtering via explicit user profiles has been studied for many years, and goes by many names, including alerts, routing, filtering, content-based recommendations, standing queries, and user modeling (Hanani et al., 2001, Croft et al., 2001, Stadnyk and Kass, 1992, Fischer, 2001, Adomavicius and Tuzhilin, 2005).
To set up an alert, the user typically builds a profile for a topic of interest. In some cases, construction of a user profile requires the user to specify a list of required, optional, and banned terms that characterize which documents to retain and which to exclude (Kamba et al., 1995, Yan and Garcia-Molina, 1995); this is the most common approach in commercial systems today (Wærn, 2004). After the profile is created, the system continually retrieves documents that match the profile as new information becomes available. Google Alerts and Yahoo Alerts allow users to specify a set of search terms for monitoring news and Web sites; as new articles arrive they are compared to the user-specified profile, and those articles that match the profile are presented to the user. Profiles can also be used to block out information; when used this way they are usually referred to as filters, as is seen with spam filters for email.
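The matching step of such a term-based profile can be sketched in a few lines. The required/optional/banned rule below is an illustrative assumption, not the behavior of any particular commercial alert service, and the whitespace tokenization is a simplification:

```python
def matches_profile(doc_text, required, optional, banned, min_optional=1):
    """Return True if the document satisfies the profile (illustrative rule)."""
    words = set(doc_text.lower().split())
    if any(term in words for term in banned):
        return False                      # any banned term excludes the document
    if not all(term in words for term in required):
        return False                      # every required term must appear
    # if optional terms are given, at least min_optional of them must appear
    if optional and sum(term in words for term in optional) < min_optional:
        return False
    return True
```

A standing query such as “send me all articles that mention my company's name” would put the company name in the required list and leave the other lists empty.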
The evidence suggests that manual creation of alert profiles does not work very well. Yang and Jeh, 2006 interviewed 18 Google engineers, finding that only 3 participants used the Google Web Alerts system. Two participants who had used the alert service in the past said that the quality of the recommendations was too low to justify creating more alerts. Those who did not use alerts said that setting up the profiles for a query of interest required too much effort. These results led Yang and Jeh, 2006 to assert that alert systems will only be useful if they can automatically recognize which queries represent information needs that users really care about seeing again, and if the alerts can be automatically configured. As a step towards automating alert creation, Yang and Jeh, 2006 identified several criteria for good recommendations from alert systems, including: the information should be new for that user, or should be newly modified information where the modifications make it “interesting;” information should not be too similar to already viewed information; and the retrieved information should be of high quality. Section 9.2.1 discusses their attempts to use implicit information to build alert profiles.
9.1.3: Creating Profiles from Explicit Relevance Judgements
As an alternative to specifying keywords manually to produce a query, a user can be asked to rate a set of documents as interesting or not, and from these ratings a profile can be created using machine learning techniques. Thus, this approach is a combination of relevance feedback and text categorization (Sebastiani, 2002).
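As a minimal sketch of this idea, the Rocchio-style procedure below builds a weighted term-vector profile from rated documents: terms from relevant examples are weighted up, terms from non-relevant examples weighted down. The tokenization, weighting constants, and function names are simplifications assumed for illustration, not a description of any particular TREC system:

```python
from collections import Counter

def build_profile(relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.5):
    """Profile = mean term counts of relevant docs, minus a fraction (beta)
    of the mean term counts of non-relevant docs (Rocchio-style sketch)."""
    profile = Counter()
    for doc in relevant_docs:
        for term, n in Counter(doc.lower().split()).items():
            profile[term] += alpha * n / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, n in Counter(doc.lower().split()).items():
            profile[term] -= beta * n / len(nonrelevant_docs)
    return profile

def score(profile, doc):
    """Dot product between the profile and the document's term counts;
    incoming documents with higher scores are retained by the filter."""
    return sum(profile[t] * n for t, n in Counter(doc.lower().split()).items())
```

New documents whose score exceeds a threshold would be delivered to the user; updating the profile as new judgements arrive corresponds to the adaptive filtering setting described next.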
In support of this kind of profile generation, from 1993 to 2002 the TREC competition included a routing and filtering track. In TREC-9 (Robertson and Hull, 2001), a three-way distinction was made among adaptive filtering, batch filtering, and routing. In adaptive filtering the system started with a user profile (represented in this case as a set of relevant documents) and was then fed a stream of documents, each of which had been manually assigned a relevance judgement. The system was expected to adaptively update the filtering profile. Batch filtering was identical to adaptive filtering, except that the system also started with a large sample of evaluated training documents, which made it similar to a text categorization problem. Routing was similar to batch filtering, but the system was expected to return a ranked list of documents, as opposed to a binary relevant-or-not decision. Most participating systems in the TREC task used machine learning techniques.
Despite the TREC research efforts, explicit relevance-judgement based approaches are not widely used. This may be because, as seen in the discussion of relevance feedback (see Chapter 6), making relevance judgements is an effortful and error-prone task (Croft et al., 2001, Ruthven and Lalmas, 2003, White et al., 2005). Another problem with this approach is that the profiles become out-of-date as the vocabulary of the relevant documents changes over time, so users must continually make relevance judgements in order to keep the profiles up to date.
9.1.4: Modifying Machine-Built Profiles
Several researchers have studied the effects of allowing users to modify their explicitly created profiles, after they are created by or augmented by machine learning algorithms. Unfortunately, the outcome of these studies tends to be negative. For example, a study by Wærn, 2004 showed that allowing users to set up the initial profile made the subsequent training algorithms perform better than those that used relevance judgements but no initial user setup. However, they also found that allowing users to explicitly modify the trained profile did not result in improvements.
Another more recent study by Ahn et al., 2007 examined whether allowing users to modify a machine-generated profile for news recommendations could improve the results. In this case they represented user profiles as weighted term vectors, and they showed this information in the form of lists of words whose varying font size reflected the weight of the term in the profile (see Figure 9.2). Participants could click on words to remove them from the profile; the words were subsequently shown in a strike-through font; clicking on the word a second time would add it back into the profile. Participants could also type in new words. Changes provided immediate feedback; after a change was made, the user would see which documents were promoted or demoted. Hovering the mouse over an article would also show the weighted keywords that caused it to be retrieved.
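The interaction Ahn et al., 2007 describe might be modeled roughly as follows. The class and method names are hypothetical, and the scoring is a simplified stand-in for their weighted term vectors; removed words are remembered so that a second click can restore their original weight:

```python
class EditableProfile:
    """Toy model of a user-editable weighted-term profile (names assumed)."""

    def __init__(self, weights):
        self.weights = dict(weights)     # term -> weight, as in Figure 9.2
        self.removed = {}                # struck-through terms and their weights

    def toggle(self, term):
        """Click once to remove a term from the profile, again to restore it."""
        if term in self.weights:
            self.removed[term] = self.weights.pop(term)
        elif term in self.removed:
            self.weights[term] = self.removed.pop(term)

    def add(self, term, weight=1.0):
        """User types in a new word; the default weight is an assumption."""
        self.weights[term] = weight

    def score(self, doc):
        """Sum of profile weights for terms appearing in the document."""
        return sum(self.weights.get(t, 0.0) for t in doc.lower().split())
```

Re-scoring all candidate articles after each toggle would provide the immediate promote/demote feedback described above.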
Ahn et al., 2007 conducted a study with 10 information science graduate students, and found that allowing them to adjust the profiles significantly worsened the results. Keyword removal caused about 4 times more harm than adding keywords. There were no significant differences in participants' subjective responses to the interactive versus the automated system. In exit interviews, participants noted that although they liked the idea of having more control, they were frustrated by the resulting system performance (in part because redundant articles were shown at the top of the results, a flaw in the study design). It may be the case that if a more transparent model had been used, the participants would have been more successful at adjusting their profiles; these findings suggest that the effects of vector space ranking and statistical weighting are unintuitive to users. This finding echoes results found for relevance feedback term suggestions by Koenemann and Belkin, 1996; participants wanted the system to suggest relevant terms, but also wanted to have control over which terms were used to augment the query.
9.1.5: Explicit Group Recommendations
As user participation on Web sites has expanded, many tools have arisen that make use of the “wisdom of crowds,” or massive user behavior, to recommend information. In its simplest form, collective behavior can be used to show information that is currently popular. News sites show the most viewed or most emailed articles, which is an implicit form of recommendation information. Web sites like Digg.com allow users to explicitly mark news articles, images, or Web pages as being of special interest, and those pages that have the highest number of recent marks are shown prominently on that Web site.
Recently, social bookmarking sites such as delicious.com and furl.com have become popular. In these sites, users mark Web pages as being of high quality or interest. A user can specify a network of people whose bookmarks and recommendations they want to see in their search results. For example, on Yahoo's MyWeb, an information retrieval researcher can indicate that their personal network consists of colleagues in their field. Then when that person issues a query, say visualization, the search results will show those web pages that contain the term and that have been marked by the members of their network as being interesting (see Figure 9.3). This kind of personalization uses explicitly indicated relevance information, based on annotations made by people other than the searcher, but personalized by the knowledge of that searcher's explicitly indicated social group.
9.1.6: Collaborative Filtering (Social Recommender Systems)
An important kind of socially-determined explicit recommendation is seen in the area of collaborative filtering, also known as social recommender systems. This is well-established technology that uses ratings from a large number of different users to assign scores to information objects, such as movies. An individual indicates their own preferences, and is matched with other people who indicated similar scores for the same information items (Resnick and Varian, 1997). The explicit recommendations discussed above are essentially “most popular” lists; those that are most often viewed, voted on, or bought. By contrast, collaborative filtering tries to match an individual with a subset of like-minded people, so that those people whose interests do not coincide with mainstream interests can still get useful recommendations. For example, a movie recommender system like MovieLens (Herlocker et al., 2004) asks users to explicitly rate (both positively and negatively) a set of movies they have seen in the past. It then tries to recommend movies the user hasn't seen based on ratings that other people have made who have seen those films. The special twist is that the system first attempts to match a given user to other users, based on which movies they have all rated similarly in the past. Then it shows the user additional items that have been highly rated by like-minded filmwatchers.
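A bare-bones sketch of this user-to-user matching follows, assuming cosine similarity over co-rated items as the matching measure (one common choice; deployed systems such as MovieLens use more refined measures and neighborhood selection). The function names and rating data are illustrative:

```python
from math import sqrt

def similarity(a, b):
    """Cosine similarity between two users, computed over the items
    both have rated; a and b map item -> rating."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm = (sqrt(sum(a[i] ** 2 for i in common))
            * sqrt(sum(b[i] ** 2 for i in common)))
    return dot / norm if norm else 0.0

def predict(user, others, item):
    """Predicted rating for an unseen item: similarity-weighted average
    of the ratings given by users who have rated that item."""
    num = den = 0.0
    for other in others:
        if item in other:
            sim = similarity(user, other)
            num += sim * other[item]
            den += sim
    return num / den if den else None
```

Items with the highest predicted ratings would then be shown as recommendations, so a user with minority tastes is steered by like-minded neighbors rather than by overall popularity.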
Collaborative filtering that matches a user with like-minded users seems to work well for information items that have an aspect of taste to them, for example, movies and music (Shardanand and Maes, 1995), and recommender tools have become important parts of many online systems. Collaborative filtering does not seem to be as successful when applied to information needs. This makes sense intuitively, as it is usually easier to predict if a friend will like an arbitrarily chosen piece of music than to predict whether or not they will find a particular Web page interesting.
Collaborative filtering usually refers to recommendations in which the user is shown items that people with similar preferences have seen and rated in the past. A related but different technology is content-based recommendations, in which the user is recommended items similar in content to those already seen by that user and judged relevant, similar to routing queries discussed above.
There have been a number of attempts to combine content-based and collaborative-filtering based methods (Basu et al., 1998, Claypool et al., 1999, Balabanovic and Shoham, 1997, Sarwar et al., 1998), but they are not clearly successful. Content-based similarity works better at, for example, recommending related research papers to a given paper, for citation purposes (McNee et al., 2006). It does not work so well at suggesting content of general interest, although it is sometimes better at suggesting novel or unexpected recommendations (McNee et al., 2002).
9.1.7: Item-Based Collective Recommendations
A different collective-based recommendation technique is well-known from the Amazon.com Web site. When looking at an item, a user is shown other items purchased by people who purchased the target item, or rated both the target and the suggested items highly (Linden et al., 2003). In other words, if both item A (say, the film Being John Malkovich) and item B (say, Adaptation) are judged to be high-quality by many different people, when a new user comes along and chooses for some reason to view item A, then that user will also be shown item B as a recommendation. This differs from collaborative filtering, in that people are not matched to other people based on shared judgements. Rather, an item is matched to another item based on the fact that individuals have indicated they like both items.
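A much-simplified sketch of this item-to-item technique follows, assuming raw co-purchase counts as the similarity signal (Amazon.com's production algorithm is more elaborate; see Linden et al., 2003). The data structure and function names are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(histories):
    """Count, for every pair of items, how many users' purchase histories
    contain both; histories is a list of per-user item lists."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories:
        for a, b in combinations(set(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(counts, item, k=3):
    """Return the k items most frequently co-purchased with the target item."""
    related = counts.get(item, {})
    return sorted(related, key=related.get, reverse=True)[:k]
```

Because the co-occurrence table is computed over items rather than users, a recommendation can be produced the moment a new visitor views any single item, with no profile of that visitor required.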
This technique can be implemented in either an explicit or implicit manner, depending on whether recommendation of the items comes from explicit ratings or implicit approval signalled by a purchasing or viewing action. (It should be noted that Amazon.com's recommendations are especially accurate because they are based on purchase history -- paying money for something is a strong endorsement -- and because their recommendations are conservative; they present only a very limited number of related books for any given book.) If an individual has unusual tastes, then this kind of recommendation may not work well for them, since it lacks the personalization that comes from being matched to other people with similar taste. But choosing to view a particular item gives a strong signal as to the user's current interests.
9.2: Personalization Based on Implicit Relevance Cues
As mentioned above, because most people find it difficult or undesirable to provide explicit relevance judgements (Wærn, 2004), today most personalization efforts make use of preference information that is implicit in user actions. The subsections below discuss profiles based on monitoring users' activity as they search for and view information items.
9.2.1: Alerts Based on a User's Search History
As discussed in Chapter 2, search engine logs can record queries issued by users over time. Often the same user's queries can be tracked via information from a cookie or a session variable, yielding what is sometimes referred to as a search history.
As discussed above, Yang and Jeh, 2006 attempted to build alerts for users without requiring the users to indicate explicitly what their interests were. Based on users' search histories, they first developed an algorithm to predict user's ideal standing queries. After predicting these topics, a second algorithm attempted to automatically classify new documents as to whether or not they fit the standing queries. The goal was to alert users to the appearance of new documents that might be of interest based on what they had viewed in the past.
Yang and Jeh, 2006 first evaluated implicit features derived from 159 users' search history logs to determine which would be useful for automatically identifying the topics a user is interested in. The best features were: a large number of clicks in the search session (>8), a large number of refinements in the search session (>3), and a strong history match score, meaning that terms from a query within a session matched queries from other sessions in the user's search history. The resulting precision/recall tradeoffs varied from 90/11 to 57/55, so the results were promising if not stellar.
Yang and Jeh, 2006 next assessed features for determining which documents should be recommended for the user's topics as standing queries. They found that a statistical ranking score combined with in-link information multiplied by the inverse of the search engine ranking produced good results -- precision/recall tradeoffs in the 70/88 range. Thus this approach, which uses searchers' prior behavior to implicitly identify articles to recommend, is a promising way to improve alert systems.
Going beyond query logs, another method of gathering implicit preference information that seems especially potent is recording which documents the user examines while trying to complete an extended search task. In this framework, users examine documents for their own understanding, and do not have to make explicit relevance judgements about those documents. They may in fact view irrelevant documents as they browse and navigate, but the idea is that these “noisy” views do not significantly hurt the overall results. This approach can thus be thought of as occupying a middle ground between pseudo-relevance feedback and traditional relevance feedback. The information “trails” that users leave behind them as a side-effect of doing their tasks have been used to suggest term expansions (White et al., 2005, White et al., 2007), automatically re-rank search results (Teevan et al., 2005b), predict next moves (Pitkow and Pirolli, 1999), make recommendations of related pages (Lieberman, 1995), and determine user satisfaction (Fox et al., 2005).
Several studies have found a correlation between reading time and relevance of documents, despite the fact that this data is often noisy (Kim et al., 2001, Morita and Shinoda, 1994, Claypool et al., 1999, Agichtein et al., 2006b). A study by Fox et al., 2005 found an association between explicit ratings of user satisfaction and implicit measures that are commonly used to infer this. The strongest indicator for individual Web page satisfaction was a combination of click history, time spent on search result page, and how the user ended the search session (e.g., killing the browser window, navigating to bookmarks, printing the document). These implicit measures also predict satisfaction across a user session, although less strongly.
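To make the combination of signals concrete, a toy satisfaction estimator might look like the following. The thresholds, weights, and exit-action labels are invented assumptions for illustration, not values reported by Fox et al., 2005:

```python
def satisfaction_score(dwell_seconds, clicked, exit_action):
    """Toy estimate of a user's satisfaction with a viewed result,
    combining clickthrough, dwell time, and session-ending behavior.
    All thresholds and weights are illustrative assumptions."""
    score = 0.0
    if clicked:
        score += 1.0                      # the user selected the result at all
    if dwell_seconds >= 30:
        score += 1.0                      # long reads correlate with relevance
    elif dwell_seconds < 5:
        score -= 1.0                      # quick bounces suggest dissatisfaction
    if exit_action in ("print", "bookmark"):
        score += 1.0                      # strong endorsement actions
    elif exit_action == "kill_window":
        score -= 0.5                      # abrupt endings weakly signal failure
    return score
```

In practice such a score would be learned from data rather than hand-set, which is essentially what the studies above did.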
Markov models of navigation path accesses (Deshpande and Karypis, 2004) and longest substring grouping computed over large numbers of user sessions (Pitkow and Pirolli, 1999) have been used for making predictions about which pages a user will view next. Heer and Chi, 2002 tracked the clicks of participants who were asked to complete certain tasks on a Web site. The resulting analysis found that the time spent reading each page in combination with information about similarity among the contents of the pages accessed could be used to categorize the clicks according to activity types.
Shen et al., 2005b asked participants to find documents relevant to particularly difficult TREC queries and kept track of which documents the participants clicked on while completing the task. They found that immediate query history did not improve a search results re-ranking algorithm. They also investigated incorporating terms from search results summaries that participants clicked on while performing the given search task, in effect doing implicit relevance feedback. Using the summary information from clicked-on documents from the first four queries in the session achieved significant improvements in rank ordering for what would have been the fifth round of querying, and these improvements were robust even though only one third of the clicked-on documents were relevant. However, since they used the 30 most difficult TREC queries, these results may not be representative of more typical queries.
In a followup study, Tan et al., 2006 gathered query history for four users for one month. After their time was up, the participants were asked to select 15 queries from their usage history (these queries had to match their general interests or belong to a long chain of queries). They were then asked to rank the top 20 search results produced by a standard search engine for each of these 15 queries. Tan et al., 2006 then tried to predict these relevance judgments, based on prior search activity. They found mild improvements in rank order for first-time queries and significant improvements for repeated queries.
Teevan et al., 2005b developed a user profile by making use of a large number of implicit features gathered over time. These features included all documents that the individual had looked at in the past, whether or not this information was accessed in the last month, and the participants' query history. The profile was then used to re-rank search results, using a relevance feedback algorithm (see Chapter 6). They found that the richer the previously visited information, the better the relevance feedback results. However, the personalized ranking performed significantly worse than a web-based ranking baseline. Combining the personalized results with the web-based ranking yielded a slight improvement over web-based ranking alone.
The implicit document-viewing information described in this section draws on a user's interaction history over many query sessions. An alternative is to automatically adjust the rankings in real time, during a single search session. Singh et al., 2008 use a version of the clickthrough inversion idea described in Chapter 5 to re-rank the hits returned for a single query as the user views the results. The terms associated with the links that have been clicked on so far are weighted positively, while those not clicked on are weighted negatively, to provide a real-time re-ranking. A similar idea is being promoted by the startup company SurfCanyon. This can be seen as an implicit version of the explicit re-rankings used in the Google search results personalization interface described above.
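A rough sketch of this kind of within-session re-ranking follows. It assumes terms are drawn from result titles and that the positive and negative weights are arbitrary illustrative values, not the parameters of Singh et al., 2008 or SurfCanyon:

```python
def rerank(results, clicked_titles, skipped_titles, pos_w=1.0, neg_w=0.5):
    """results: list of (title, base_score) pairs from the original ranking.
    Terms from clicked results boost a hit's score; terms from results the
    user saw but skipped penalize it. Returns titles in the new order."""
    pos = {t for title in clicked_titles for t in title.lower().split()}
    neg = {t for title in skipped_titles for t in title.lower().split()}

    def adjusted(item):
        title, base = item
        terms = title.lower().split()
        bonus = pos_w * sum(t in pos for t in terms)
        penalty = neg_w * sum(t in neg and t not in pos for t in terms)
        return base + bonus - penalty

    return [title for title, _ in sorted(results, key=adjusted, reverse=True)]
```

After each click the remaining results would be re-sorted, so a user who clicks on aviation-related hits for the query jets sees football-related hits sink in real time.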
9.2.2: Profiles Based on Implicit Behavior Assigned to Categories
Because it is difficult to differentiate one click from another, click history has sometimes been used in combination with an ontology or category structure to improve results.
Qiu and Cho, 2006 combined topic-based PageRank (Haveliwala, 2003) with user click history to reorder search results. (Topic-based PageRank is computed similarly to standard PageRank except that inlinks must originate from pages about a given topic.) The algorithm estimated which topics a user was most interested in by examining the user's click history and mapping pages clicked on to the top levels of the ODP (an open source category system in which volunteers classify Web pages manually). This algorithm obtained an improvement over both standard PageRank and topic-based PageRank, but they found that the improvement derived primarily from highly ambiguous queries.
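The combination Qiu and Cho describe can be sketched roughly as follows, assuming the user's topic distribution is estimated from past clicks mapped to top-level categories, and that a per-topic PageRank score is available for each candidate page. The function names are hypothetical and the mixing rule is a simplification of their model:

```python
from collections import Counter

def topic_distribution(clicked_categories):
    """Estimate the user's interest in each top-level category as the
    fraction of past clicked pages mapped to that category."""
    counts = Counter(clicked_categories)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

def personalized_score(topic_pagerank, user_topics):
    """Score a page by mixing its per-topic PageRank values, weighted by
    the user's estimated interest in each topic."""
    return sum(user_topics.get(topic, 0.0) * pr
               for topic, pr in topic_pagerank.items())
```

Under this scheme a user whose click history maps mostly to Sports will see the sports sense of an ambiguous query ranked higher, which matches the finding that the gains came primarily from highly ambiguous queries.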
Chen et al., 2002 used an individual's long-term access history on a hierarchical category structure to predict which categories would be accessed next, thus in effect creating a topic-based prediction model. (Here, access means provide a rating on an item in the Epinions Web site.) They found that an individual's category re-access patterns were better predictors than path-based models using combined history of many users without category classification. They also found improvements when taking recency into account, reflecting the fact that users' interests shift over time. Liu et al., 2002 and Gauch, 2003 also found evidence that mapping click history to category structure can successfully determine categories of interest for users. By contrast, Dou et al., 2007 implemented simplified versions of algorithms in which user profiles were based on category information, and found this performed less well than standard search results.
Topic-based personalization seems especially suited to topic-specific information collections. For example, a sports site could track which sports a user read about in the past and automatically increase the prominence of those sports when the user visits the site in future. However, automatic inference of preferences does not seem to have caught on for vertical sites other than news; instead, explicit customization is the norm for domain-specific Web sites.
9.2.3: Socially Determined Relevance using ClickThrough
An alternative to personalizing an individual's search results based on that person's actions is to use many different peoples' actions to determine which results are generally relevant for the population as a whole. Recent work has explored how to use implicit relevance judgements from multiple users to improve search results rankings. This section discusses the use of user clicks on search results document summaries, or clickthrough data.
One version of this idea used the clicks observed for a small group of users. Joachims, 2002 presented the merged results of a metasearch engine to a group of 20 AI researchers, and recorded which search result links they clicked on in response to 260 queries. He built a machine learning model from the attributes of the clicked-on search results and used this model to re-rank the output of the metasearch engine. When evaluated on 180 user-generated queries, he found that users clicked on more links from the improved ranking on average than on the individual and combined search engines' rankings.
Joachims, 2002 analyzed which features most influenced the successful re-ranking and found that the best features were those that can be incorporated into a standard (unpersonalized) ranking algorithm, such as cosine similarity between abstract and query. Only a few highly weighted features reflected the particular properties of the user group, such as whether the domain name was “nec” (because the user pool often obtained references from the citeseer citation system hosted at this domain). It is unclear though if the results of this limited study would generalize over longer use, since interests shift over time, or if it would work for large groups whose interests are more heterogeneous. However, similar results were found in another small study (Mislove et al., 2006).
In later work, Joachims et al., 2005 conducted experiments to assess the reliability of clickthrough data when compared against explicit relevance judgements. They found a significant bias based on the rank order of the results presented, showing that users select higher ranked links even if the original output from the search engine is presented in reverse order (see Chapter 5). They concluded that this bias makes it difficult to use clickthrough as absolute feedback, but introduced several new effective ways for generating relative signals from this implicit information. They showed, for example, that assigning negative weight to an unclicked link that immediately follows a clicked on search result link can improve estimation of implicit relevance.
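The relative-preference idea can be made concrete with a short sketch. Assuming a ranked list of result identifiers and the set of results the user clicked, the following illustrative function emits preference pairs in which a clicked result is preferred over the unclicked results ranked above it and over the unclicked result immediately below it; the heuristics in Joachims et al., 2005 differ in their details.

```python
def pairwise_preferences(ranking, clicked):
    """Derive relative relevance preferences from clicks on a ranked list.

    A clicked result is preferred over every unclicked result ranked
    above it, and over the unclicked result immediately following it
    (the latter captures the 'negative weight' idea described above).
    """
    clicked = set(clicked)
    prefs = []  # (preferred, less_preferred) pairs
    for i, url in enumerate(ranking):
        if url not in clicked:
            continue
        # The clicked result beats every unclicked result ranked above it.
        for above in ranking[:i]:
            if above not in clicked:
                prefs.append((url, above))
        # ...and the unclicked result immediately below it, if any.
        if i + 1 < len(ranking) and ranking[i + 1] not in clicked:
            prefs.append((url, ranking[i + 1]))
    return prefs

# A user clicks the 2nd and 4th results of a five-item ranking.
prefs = pairwise_preferences(["a", "b", "c", "d", "e"], ["b", "d"])
# Yields: b > a, b > c, d > a, d > c, d > e
```

Such pairs can then feed a ranking learner without ever treating a click as an absolute relevance judgement.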
Agichtein et al., 2006b built upon this work and showed even more convincingly that clickthrough and other forms of implicit feedback are useful when gathered across large numbers of users. They randomly chose 3,500 queries from a search log and recorded clickthrough data for more than 120,000 searches over instances of these queries. Their approach made use of information about which links had been clicked on for the identical query in the past by other users, thus essentially gathering human rankings on search results for a given query (the work of Joachims, 2002 learns features across queries). Not surprisingly, the more click data available for a given query, the more reliable the predictions, but even with just 10 prior clicks, results can be improved over a state-of-the-art ranking function.
Agichtein et al., 2006a found that an especially effective clickthrough model was one that compared the actual clicks on a given link to its expected clickthrough given its position in the results listing. The algorithm downweights unclicked links that precede and follow a clicked link in the original results order, on the assumption that the user looked at and rejected those other links. Agichtein et al., 2006a then discovered that a machine learning model trained on user browsing features worked even better than clickthrough data. These features included time spent viewing a page, time spent viewing pages from the search result's domain, and distance in clicks of the viewed page from the original search results. They showed that this community-generated implicit information can improve rank order over that of a state-of-the-art ranking algorithm that uses hundreds of other features -- the best results occur when combining the implicit information with those other features. They noted, however, that this approach does not work well on low-frequency queries and may perform poorly on ambiguous queries in which there is more than one common interpretation of relevance.
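The position-bias correction can be sketched as a deviation between a result's observed clickthrough rate and the background rate expected at its rank; the function and background table below are illustrative assumptions, not the authors' actual model.

```python
def relevance_signal(clicks, impressions, position, background_ctr):
    """Observed clickthrough rate of a result minus the background rate
    expected at its rank position. Positive: more clicks than position
    alone predicts; negative: users likely examined and passed it over.
    """
    observed = clicks / impressions if impressions else 0.0
    return observed - background_ctr[position]

# Hypothetical background clickthrough rates per rank, as could be
# estimated from a large query log (values made up for illustration).
background = {1: 0.40, 2: 0.20, 3: 0.12}

# A result at rank 3 clicked 30 times in 100 impressions stands out;
# the same clickthrough at rank 1 falls short of expectation.
assert relevance_signal(30, 100, 3, background) > 0
assert relevance_signal(30, 100, 1, background) < 0
```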
Work by Dou et al., 2007 sheds even more light on the advantages and disadvantages of different kinds of implicit personalization algorithms for re-ranking search results. They gathered a large collection of queries and clickthroughs, examining data for 10,000 users, 56,000 queries, and 94,000 clicks over 12 days. They used the first 11 days' worth of data to form user profiles and to gather information about which links were clicked for each query. They then simulated the application of five different personalization algorithms on the remaining 4,600 queries from the last day of the log. They retrieved the top 50 results for each query from the comparison search engine and assumed that clicking a link indicated a relevance judgement for the query, acknowledging but ignoring the known problem of search-result position bias.
Dou et al., 2007 used one individual-level reranking strategy that assumed for a query q submitted by user u, pages frequently clicked on by u in the past for q are more relevant than those seldom clicked on. (Thus, this measure will not work for a new query that the user has never run.) They also tested a very simple version of socially-determined relevance, computing pairwise similarity between users based on past clicks, and used historical clicks by similar users to re-rank search results. They also tried three profile-based methods, using 67 pre-defined topic categories. They contrasted profiles based on short-term history (the current session), long term history (the entire 11-day history) and a linear combination of the two.
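The individual-level strategy amounts to re-sorting a query's results by the user's own click history for that query. A minimal sketch, with invented data, might look like this:

```python
from collections import Counter

def personal_rerank(results, past_clicks):
    """Re-rank a query's results by this user's past click counts on
    them for the same query; ties keep the engine's original order
    (sorted() is stable). An empty history -- a query the user has
    never run -- leaves the ranking unchanged, mirroring the
    limitation noted in the text.
    """
    return sorted(results, key=lambda url: -past_clicks[url])

history = Counter({"c": 3, "a": 1})  # this user's clicks for the query
assert personal_rerank(["a", "b", "c", "d"], history) == ["c", "a", "b", "d"]
assert personal_rerank(["a", "b", "c"], Counter()) == ["a", "b", "c"]
```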
For the test queries, they set aside those queries that could not be improved upon by re-ranking, because the search engine's top result was the one chosen by users. They called the remaining set of queries non-optimal. They also defined an entropy measure, because it has been suggested that queries which show less variation among click choices are less likely to be improved by personalization (Teevan et al., 2005a). The entropy measure is higher for ambiguous queries, for which different users click on different results.
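Click entropy can be computed directly from the distribution of clicks a query receives; the following sketch uses the standard Shannon form, though the exact formulation in Dou et al., 2007 may differ.

```python
import math

def click_entropy(click_counts):
    """Shannon entropy of a query's click distribution:
    H(q) = -sum over results d of p(d|q) * log2 p(d|q),
    where p(d|q) is the fraction of the query's clicks landing on d.
    """
    total = sum(click_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in click_counts.values() if c)

# Navigational query: nearly all clicks go to one result (low entropy).
low = click_entropy({"homepage": 98, "other": 2})
# Ambiguous query: clicks spread over four interpretations (2.0 bits).
high = click_entropy({"a": 25, "b": 25, "c": 25, "d": 25})
assert low < 1.0 < high
```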
Dou et al., 2007 found that for queries with low click entropy, the personalization methods had similar or even worse performance than the comparison search engine's rankings. However, on higher-entropy queries, the click-based personalization strategies worked well. The individual-level strategy improved results 1.4% over the default rankings on all queries, and 3.6% on the non-optimal queries. On those queries whose click entropy was ≥ 2.5, individual-level clicks improved results by an impressive 23.7%.
These results suggest that a good search history interface should substantially improve the search experience for users. They also suggest that personalization should be used with caution for queries with low click variation. These results do not suggest that this kind of clickthrough information improves results for ambiguous queries without other context. The clickthrough algorithms were designed to work only for queries already clicked on in the past by the user, and in this study 33% of the queries in the log were repeated by the same user. The authors did not examine in isolation how well the clickthrough results worked on never-seen queries with high entropy, which include the truly ambiguous queries.
In the Dou et al., 2007 study, the profile-based methods that try to suggest hits based on estimated similarity to other queries did not perform as well. As Tan et al., 2006 note, search history contains noise that can be irrelevant to the current query, and this may be part of the reason for the degraded results. Dou et al., 2007 did find that the combined long-term and short-term history information was more stable than either alone.
In summary, search results re-ranking based on implicit clickthrough data by many people, for queries that have already been seen, can be more accurate than a search engine algorithm alone. But the clickthrough-based techniques do not work well for queries that have not been seen before, which means they are applicable for only about 50% of queries. Socially based clickthrough techniques also do not work well when there is more than one common interpretation for a given query, which is where an individual's general preferences and current context determine relevance. Individual-based personalized rankings seem to work best on highly ambiguous queries, so the two solutions appear to be complementary.
9.2.4: Implicit Social Recommendations
Das et al., 2007 made an ambitious attempt to improve Google's personalized News service. Their approach was to use collaborative filtering on content data using implicit relevance information. They report on experiments in which they compared several different algorithms for making recommendations based only on user similarity, not taking content into account. As their source of implicit relevance information, they used clicks on articles in the Google News service as a proxy for interest in those articles (as opposed to ratings or relevance judgements which are typically used for collaborative filtering). An unusual aspect of this problem is that the pool of news items is constantly changing, unlike most collaborative filtering applications which capitalize on having a large number of ratings over a stable, slowly growing set of items.
Das et al., 2007 tried two model-based algorithms and one memory-based algorithm, but the latter did not scale to the entire problem (much of the Das et al., 2007 work is concerned with scalability, as the system must handle millions of users and produce new results within milliseconds). The two model-based algorithms were indistinguishable in performance. Bearing in mind the difficulty of the problem due to the rapid turnover of news items, and reflecting other attempts to apply collaborative filtering to content rather than taste-based information (e.g., (Sarwar et al., 1998) ), the precision/recall graphs for the news data were not particularly strong. The highest precision achieved on a large sample of users was 17% at a recall of 10%. However, when evaluated on millions of users over a 5-6 month period, the clicks on the suggested items were 38% better on average than a baseline of showing the most popular news items.
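One simple click-based signal for this kind of recommender is item covisitation: counting how often two stories are clicked by the same user. The sketch below is a toy illustration, not the production system's implementation:

```python
from collections import defaultdict

def covisitation_counts(click_sessions):
    """For each item, count how often each other item was clicked by
    the same user in the same session. The table can be maintained
    incrementally as new stories appear and old ones expire, which
    suits a constantly changing item pool.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in click_sessions:
        for a in session:
            for b in session:
                if a != b:
                    counts[a][b] += 1
    return counts

def recommend(item, counts, k=3):
    """The k items most often co-clicked with `item`."""
    ranked = sorted(counts[item].items(), key=lambda kv: -kv[1])
    return [other for other, _ in ranked[:k]]

# Three users' click sessions over news stories s1..s3.
sessions = [["s1", "s2"], ["s1", "s2", "s3"], ["s1", "s3"]]
counts = covisitation_counts(sessions)
assert recommend("s2", counts, 2) == ["s1", "s3"]
```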
In another twist on this idea, Sugiyama et al., 2004 combined implicit information from user browsing history with collaborative filtering to achieve mild improvements in rank ordering.
9.3: Combining Implicit and Explicit Information
9.3.1: Modifying Implicit Recommendations with Explicit Information
Sakagami and Kamba, 1997 did a small study on recommending news articles, comparing an individual's explicit ratings to implicit information based on time spent reading articles. They found that explicit rating information on every item read produced better results than implicit alone. However, they also found that results were nearly as good when most information was gathered implicitly, with the caveat that users could indicate explicit information on occasion, when they were especially interested in an article.
In the field of artificial intelligence, researchers have attempted to develop systems that monitor users' progress and behavior over long interaction periods in an attempt to make intelligent recommendations of information the user may want to see in the future, as well as take actions on behalf of the user (Jameson, 2003). These systems are called semi-automated assistants or recommender agents, and often make use of machine learning techniques. Some of these systems required explicit user input in the form of a goal statement (Joachims et al., 1997) or relevance judgements (Pazzani et al., 1996), while others quietly recorded users' actions and tried to make inferences based on these actions (Horvitz et al., 1998). These systems are related to the recommender systems described earlier in this section, but were intended to act more explicitly as agents working on behalf of the user, and are more interactive in nature.
Many attempts have been made to automatically classify email (Sahami et al., 1998, Kushmerick and Lau, 2005). Early work by Kozierok and Maes, 1993 used both explicit and implicit information to make predictions about how users would handle e-mail messages (what order to read them in, where to file them) and how users would schedule meetings in a calendar manager application by “looking over the shoulder” of the users, recording every relevant action into a database. After enough data had been accumulated, the system used a nearest-neighbors method (Stanfill and Kahle, 1986) to predict a user's action based on the similarity of the current situation to situations already encountered. This system integrated learning from both implicit and explicit user feedback. If a user ignored the system's suggestion, the system treated this as negative feedback, and accordingly added the overriding action to the action database. After certain types of incorrect predictions, the system asked the user questions that allowed it to adjust the weight of the feature that caused the error. Finally, the user could explicitly train the system by presenting it with hypothetical examples of input-action pairs.
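The core prediction step of such a system can be sketched as a nearest-neighbor lookup over recorded (situation, action) pairs. The feature names and similarity function below are invented for illustration; Kozierok and Maes, 1993 used a weighted-feature scheme whose details differ.

```python
def predict_action(situation, memory, similarity):
    """Predict an action for a new situation via its single most
    similar recorded situation (1-nearest-neighbor over the
    situation-action database built by watching the user).
    """
    _, best_action = max(memory, key=lambda sa: similarity(situation, sa[0]))
    return best_action

def overlap(a, b):
    """Invented similarity: number of features on which two agree."""
    return sum(1 for k in a if b.get(k) == a[k])

# Toy situation-action memory for incoming email messages.
memory = [
    ({"sender": "boss", "has_attachment": True}, "read_first"),
    ({"sender": "mailing_list", "has_attachment": False}, "file_later"),
]
assert predict_action({"sender": "boss", "has_attachment": True},
                      memory, overlap) == "read_first"
assert predict_action({"sender": "mailing_list", "has_attachment": False},
                      memory, overlap) == "file_later"
```

A user override would simply append the corrected (situation, action) pair to `memory`, which is how the system folded negative feedback back into its database.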
The goal of another early system, Letizia (Lieberman, 1995), was to bring to the user's attention a percentage of the available next moves that are most likely to be of interest, given the user's earlier actions. Upon request, Letizia provided recommendations for further action on the user's part, usually in the form of suggestions of links to follow when the user was unsure what to do next. Using only implicit information, the system monitored the user's behavior while navigating and reading Web pages, and concurrently evaluated the links reachable from the current page. Saving a page as a bookmark was taken as strong positive evidence for the terms in the corresponding Web page, while skipped links were taken as negative support for the information reachable from those links. Selected links could indicate positive or negative evidence, depending on how much time the user spent on the resulting page and whether or not the decision to leave a page quickly was later reversed. Additionally, the evidence for user interest remained persistent across browsing sessions. Thus a user who often read kayaking pages and who was at another time reading the home page of a professional contact might be alerted to the fact that the colleague's personal interests page contains a link to a shared hobby. The system used a best-first search strategy and heuristics to determine which pages to recommend most strongly.
More recently, a large multi-team research project called CALO investigated how to build an intelligent personal agent, using information extracted by analyzing a user's email, contact information, calendar information and other documents (Chaudhri et al., 2006, Mitchell et al., 2006). These ideas are being incorporated into a product by the startup company Siri.com.
9.4: Searching over Personal Information
An important distinction exists between personalizing search engine behavior based on an individual's (or group's) past behavior and preferences, and searching over an individual's personal materials and information. This is sometimes referred to as desktop search, but since personal information is distributed across many machines and services, the term personal information management is more commonly used today (Teevan et al., 2006b) for this wider scope of application.
The definition of personal information is often stretched to mean not only those documents that an individual has produced or been given, but anything that person has viewed in the past. Research shows that people are highly likely to revisit information (Jones et al., 2002, Milic-Frayling et al., 2004), so allowing search over recently viewed information can improve a user's productivity (Dumais et al., 2003). Research also suggests that users' search strategies differ when searching over previously seen materials (Jones et al., 2002, Barreau and Nardi, 1995). Bergman et al., 2008 found that, when searching for files on their own computers, participants strongly preferred navigating their file system over using desktop search.
Certain organizational features are more salient for information that a person has already seen, or which pertains to an individual's life, so search over personal information can use memory cues that are not salient for information artifacts that the user has not seen in the past. The Stuff I've Seen project (Dumais et al., 2003, Cutrell et al., 2006a) built a search index over all materials that a user had stored in their personal directories along with all external Web pages they accessed, their email messages, their calendar appointments, and so on. They evaluated this system with a longitudinal study across several months and hundreds of users. Users reported that the system was especially helpful when they could remember contextual cues surrounding the previous use of the information, but not more detailed information. For example, a user looking for an address of a restaurant whose name he could not remember might search on restaurant, filter the results to see emails from his daughter, sort the narrowed results by date, and then scan for an email from the last week that contains the restaurant address. This study also found that organizing by time is often important, but only in certain contexts -- for when an email was sent, for example, but not for when a Web page was created. The study also found that sorting by time was preferred over ranked sorting, and that sorting by people's names was also quite important.
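The restaurant example describes a filter-then-sort interaction over a personal index, which can be sketched as follows; the field names and records are illustrative, not the Stuff I've Seen schema.

```python
from datetime import date

def filter_and_sort(hits, sender=None, newest_first=True):
    """Narrow full-text hits by a metadata facet, then sort by time,
    mimicking the restaurant scenario: search, filter to emails from
    the daughter, sort by date, scan the most recent messages.
    """
    kept = [h for h in hits if sender is None or h.get("sender") == sender]
    return sorted(kept, key=lambda h: h["date"], reverse=newest_first)

hits = [  # hypothetical results of a full-text search on "restaurant"
    {"kind": "email", "sender": "daughter", "date": date(2008, 3, 20)},
    {"kind": "web", "sender": None, "date": date(2008, 3, 1)},
    {"kind": "email", "sender": "daughter", "date": date(2008, 2, 10)},
]
narrowed = filter_and_sort(hits, sender="daughter")
assert [h["date"] for h in narrowed] == [date(2008, 3, 20), date(2008, 2, 10)]
```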
Although there is a great deal of research in this area, to date automated personalization has yet to have a major impact (Jameson, 2003, Fischer, 2001). The main reason seems to be that it is difficult to make predictions accurately. Additional problems relate to user privacy and the need for user control coupled with the desire by users for the system to be unobtrusive.
Some results indicated that re-ranking can reduce search efficiency. Teevan et al., 2007 found that searchers tended to look for information they had found in a previous search session 40% of the time, and that changing ranking order increased the time it took for users to find a link they'd clicked on for a similar search in the past. For automated re-ranking based on a user's actions, the results of most controlled experiments suggest that when it works at all, it works best for very ambiguous queries (Qiu and Cho, 2006, Croft and Wei, 2005, Teevan et al., 2005a). (This is also true for some other interface techniques like clustering, see Chapter 8.) However, disambiguation of general words is easily solved by simple query reformulation (e.g., adding the word engines to the ambiguous query jets), and represents only a small fraction of queries (see Chapter 3). Thus, if disambiguation of ambiguous queries is the primary benefit of personalized results rankings, it may not make a noticeable difference for most users. As noted above, the best application of personalized ranking seems to be for repeated queries that are naturally ambiguous. To address these issues, Teevan et al., 2008 report on an algorithm for predicting which queries can most benefit from personalization.
Several major search engines currently offer a personalization feature that attempts to rank search results based on what the user has searched for in the past, but no details beyond anecdote are available as to their efficacy. When asked to comment on the behavior of Google's personalization system, a Google VP stated that it is used cautiously, adjusting results only in about one out of five queries, and changing at most two hits in the first page of ten (Hotchkiss, 2007d).
The appropriateness of automatically taking “intelligent” action on behalf of users has been called into question (Shneiderman and Maes, 1997, Nielsen, 1998a, Norman, 2007a). Systems that use implicit information to try to infer what the user wants to do are often poorly received. The widespread and virulent dislike of animated agents such as “Clippy”, introduced in Microsoft Office 97, is a famous example of the failure of automated agents in a user interface (Whitworth, 2005). In the case of term suggestions for query reformulation, White et al., 2005 found that participants wanted the system to quietly track the information they accessed and use it for suggesting additional terms, but wanted to have control over which terms were automatically added.
Another example can be seen in interfaces for automated spelling correction. Google's initial spelling suggestion system did not work well and was not well-received. After the algorithm was improved and became highly accurate, the reception became positive when the interaction allowed just one spelling suggestion below the original query (Hurst, 2002). At one point Google automatically replaced the original spelling with a correction when few results were found, but this feature was discontinued, presumably because of negative user reaction. However, Google does automatically massage some queries behind the scenes, for example, by removing middle initials in the retrieval results when matching against queries consisting of people's names with no middle initial.