Ch. 2: The Evaluation of Search User Interfaces
Throughout this book, the merits of an interface are assessed from a usability perspective; this chapter discusses how these assessments are done. It is surprisingly difficult to design a usable new search interface, and perhaps even harder to convincingly assess its usability (Draper and Dunlop, 1997). An evaluator must take into account differences in designs, tasks, participant motivation, and knowledge, all of which can vary the outcome of a study. Furthermore, as mentioned in Chapter 1, small details in the design of the interface can have a strong effect on a participant's subjective reaction to or objective success with the interface. For instance, problematic placement of controls or unintuitive text on a hyperlink can prevent a participant from succeeding in a task. Differences in font contrast and spacing can unconsciously affect a participant's subjective response to a design.
What should be measured when assessing a search interface? Traditional information retrieval research focuses on evaluating the proportion of relevant documents retrieved in response to a query. In evaluating search user interfaces, this kind of measure can also be used, but is just one component within broader usability measures. Recall from Chapter 1 that usable interfaces are defined in terms of learnability, efficiency, memorability, error reduction, and user satisfaction (Nielsen, 2003b, Shneiderman and Plaisant, 2004). However, search interfaces are usually evaluated in terms of three main aspects of usability: effectiveness, efficiency, and satisfaction, which are defined by ISO 9241-11, 1998 as:
- Effectiveness: Accuracy and completeness with which users achieve specified goals.
- Efficiency: Resources expended in relation to the accuracy and completeness with which users achieve goals.
- Satisfaction: Freedom from discomfort, and positive attitudes towards the use of the product.
These are the criteria that ideally should be measured when evaluating a search user interface. They can be tailored to correspond directly to search tasks; for example, the efficiency criterion can measure which positions within the search results the relevant documents appear in. Some aspects of usability may be emphasized more than others. If one is testing, for example, a new technique for suggesting alternative query terms, the focus may be more on increasing efficiency and reducing errors, but not so much on memorability of the technique. Surprisingly, many usability studies omit assessing participants' subjective reaction to the interface. In practice, this can be the most important measure of all, because an interface that is not liked is not likely to be used. It is important to measure all three aspects of usability, as a meta-analysis showed that correlation among them in usability studies tends to be low (Hornbæk and Law, 2007).
This chapter summarizes some major methods for evaluating user interfaces. First, an overview is provided of traditional information retrieval evaluation. This is followed by sections discussing different major interface evaluation methods. Informal evaluation is especially useful for developing new ideas or for the early stages of development of a new design. Formal studies are useful for rigorously comparing different designs, either to help advance the field's understanding of search interfaces, or to help an organization decide which of several designs or features works best in a given context. Longitudinal studies, in which participants use the interface over time, reveal long-term usage patterns as participants become familiar with the interface and adapt it to their everyday working environment. Bucket tests, or large-scale comparison studies, allow an organization to test the effects of different designs by comparing how people use the different designs on a massive scale.
These sections are followed by a set of guidelines about special considerations to ensure successful search usability studies and avoid common pitfalls. The chapter concludes with general recommendations for search interface evaluation.
2.1: Standard Information Retrieval Evaluation
In the bulk of the information retrieval (IR) research literature, evaluation of search systems is equivalent to evaluation of ranking algorithms, and this evaluation is done in an automated fashion, without involving users. When evaluating in this paradigm, a document collection and a set of queries are defined, and then documents from the collection are identified as relevant for those queries (Saracevic, 2007). Ranking algorithms are judged according to how high they rank the relevant documents.
This kind of evaluation has been embodied most prominently in the Text REtrieval Conference (TREC), run by the U.S. National Institute of Standards and Technology (NIST) for more than 15 years (Voorhees and Harman, 2000). The goal of TREC is to advance the state of the art in IR research by coordinating tasks for different research and commercial groups to test their algorithms on. TREC tasks (also known as tracks) are designed by the research community in tandem with NIST, and have included question answering, video search, routing queries, gigabyte dataset search, and many other tasks. For many years, however, the marquee task of TREC was the ad hoc retrieval track, in which systems competed to rank documents according to relevance judgements. In the ad hoc track, the TREC coordinators supply the document collection, the queries, and the relevance judgements, which are assigned by human judges. The competing groups develop their ranking algorithms, freeze their systems, and then receive the queries. They are not allowed to change their systems based on those queries -- rather, they have to run their algorithms in batch mode, on the queries as given. Thus, there are no human participants interacting with the system in the TREC ad hoc task.
The most common evaluation measures used for assessing ranking algorithms are Precision, Recall, the F-measure, and Mean Average Precision (MAP). Precision is defined as the number of relevant documents retrieved divided by the total number of documents retrieved, and so is the percentage of retrieved documents that are relevant. Recall is the number of relevant documents retrieved divided by the number of documents known to be relevant, and so is the percentage of all relevant documents that are retrieved. These measures trade off against one another; the more documents an algorithm retrieves, the more likely it is to increase recall, but at the same time to reduce precision by bringing in additional nonrelevant documents. For this reason, the F-measure is often used to balance precision and recall. It is defined as the weighted harmonic mean of the two measures, usually with even weighting, computed as 2PR / (P + R). The problem with these measures is that they are computed over the entire set of k retrieved documents, without regard to ranking order. Therefore, an algorithm that ranks all of its relevant documents in the first few positions is given no more credit than an algorithm that ranks the same number of relevant documents in the last positions. The measure of average precision addresses this deficiency by computing the precision repeatedly, at each position in the ranking at which a relevant document appears, and taking the average of these precision scores. The MAP score is the average of the average precisions, taken over a set of test queries.
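To make these definitions concrete, the following is a minimal Python sketch of the four measures. The list- and set-based representations and the function names are illustrative assumptions, not drawn from any standard IR toolkit; following the usual TREC convention, average precision counts relevant documents that are never retrieved as contributing zero.

```python
def precision(retrieved, relevant):
    """Percentage of retrieved documents that are relevant."""
    return len(set(retrieved) & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Percentage of all relevant documents that are retrieved."""
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0

def f_measure(p, r):
    """Evenly weighted F-measure: the harmonic mean 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def average_precision(ranking, relevant):
    """Precision computed at each rank where a relevant document appears,
    averaged over all relevant documents (unretrieved ones count as zero)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: the mean of average precision over a set of test queries.
    `runs` is a list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```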
Number: 312
Title: Hydroponics
Description: Document will discuss the science of growing plants in water or some substance other than soil.
Narrative: A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydroponics, advantages over standard soil agricultural practices, or the approach of suspending roots in a humid enclosure and spraying them periodically with a nutrient solution to promote plant growth.
The TREC evaluation method has been enormously valuable for comparing competing ranking algorithms. There is, however, no shortage of criticisms of the approach. These include:
- In most cases “relevance” is treated as a binary “yes or no” assessment, and a system is not rewarded for returning results that are highly relevant versus those that are marginally relevant.
- A system is usually not penalized for returning many relevant documents that contain the same information, as opposed to rewarding diversity in the content of the relevant documents.
- The evaluation has a focus on retrieving as many relevant documents as possible. The TREC evaluations nearly always require systems to return 1,000 documents. This may be realistic for a legal researcher who must find every potentially relevant document, but does not reflect the goals of most searchers, especially on the web. (To address this last concern, a variation on measuring precision has become popular, called Precision@k, meaning the precision for the top k documents retrieved, where k is a small number such as 10.)
- The TREC queries can be seen as unrealistic because they contain very long descriptions of the information need (see Figure 2.1) rather than the 2-3 words of a standard web search engine query. It can be argued that creating a deep description of the information need is a large part of the problem that search systems should assist with.
More germane to the topic of this book, however, is that this evaluation does not require searchers to interact with the system, create the queries, judge the results, or reformulate their queries. The ad hoc track does not allow for any user interface whatsoever. More recently, the principles of HCI have influenced IR interface research, and the expectation has become that search user interfaces must be assessed with human participants. TREC competitions from 1997 to 2000 included an Interactive track, in which each group was asked to recruit about a dozen participants to tackle TREC queries, and the goal of the evaluation was to assess the process of search as well as the outcome in terms of precision and recall. This track also introduced an evaluation that judged how many different “aspects” of a topic a participant's results set contained. More recently, other tracks have allowed for manual adjustment of queries and integration of user interaction with system evaluation. Incorporating interfaces into evaluation of ranking algorithms can lead to richer views of evaluation methods.
It can be useful to adjust the measures of precision and recall when assessing interactive systems. For instance, Pickens et al., 2008 distinguish among documents returned by the search engine, documents actually seen by the user, and documents selected by the user as relevant. There have been a number of efforts to measure the effects of graded relevance judgements, the most popular of which is called discounted cumulative gain (DCG) (Järvelin and Kekäläinen, 2000, Kekäläinen, 2005). Käki and Aula, 2008 note that real users of Web search engines typically only look at one or two documents in their search results per query. Thus, they propose the measure of immediate accuracy to capture relevance according to this kind of behavior. It is measured as the proportion of queries for which the participant has found at least one relevant document by the time they have looked at k documents selected from the result set. For instance, immediate accuracy of 80% by second selection means that for 80% of the queries the participant has found at least one relevant document in the first two documents inspected. Käki and Aula, 2008 claim this measure successfully allowed them to find meaningful differences between search interfaces, is easy to understand, and seems to reflect Web search behavior well.
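As a concrete illustration, here is a small sketch of DCG in the original Järvelin and Kekäläinen style (gains at ranks below the base b are not discounted) and of immediate accuracy as described above. The data representations and the example gains are assumptions made for the illustration, not values from the cited studies.

```python
import math

def dcg(gains, b=2):
    """Discounted cumulative gain for graded relevance scores listed in rank
    order; gains at ranks below b are taken as-is, later gains are divided
    by log_b of their rank."""
    return sum(g if rank < b else g / math.log(rank, b)
               for rank, g in enumerate(gains, start=1))

def immediate_accuracy(selections_per_query, k):
    """Proportion of queries for which at least one of the first k documents
    the participant selected was relevant. Each element of
    `selections_per_query` is the ordered list of relevance judgements
    (True/False) for the documents one participant inspected for one query."""
    hits = sum(1 for sels in selections_per_query if any(sels[:k]))
    return hits / len(selections_per_query)

# Example: graded gains (0-3) for the top five results of a single query.
print(round(dcg([3, 2, 3, 0, 1]), 2))
```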
2.2: Informal Usability Testing
There is no exact formula for producing a good user interface, but interface design indisputably requires the involvement of representative users. As discussed in Chapter 1, before any design starts, prospective users should be interviewed or observed in field studies doing the tasks which the interface must support (Draper and Dunlop, 1997). This is followed by a repeated cycle of design, assessment with potential users, analysis of the results, and subsequent re-design and re-assessment. Involvement of members of the target user base is critical, and so this process is often referred to as user-centered design (Kuniavsky, 2003, Mayhew, 1999). Potential users who participate in the assessment of interfaces are usually referred to as participants.
Showing designs to participants and recording their responses -- to ferret out problems as well as identify positive aspects of the design -- are referred to as informal usability testing. Informal usability studies are typically used to test a particular instantiation of an interface design, or to compare candidate designs, for a particular domain and context. In the first rounds of evaluation, major problems can be identified quickly, often with just a few participants (Nielsen and Landauer, 1993). Although participants usually do not volunteer good design alternatives, they can often accurately indicate which of several design paths is best to follow. Quick informal usability tests with a small number of participants are an example of what has been dubbed discount usability testing (Nielsen, 1993), as opposed to full formal laboratory studies.
In the formative, early stages of design it is common to show participants rough or low-fidelity (low-fi) prototypes of several designs, often using paper mock-ups and sketches. Low-fi designs are also faster to produce than implemented systems, and studies suggest they can reveal similar types of usability problems as more finished designs (Virzi et al., 1996). Low-fi interfaces also allow assessors to focus on the interaction and major design elements, as opposed to more eye-catching aspects such as the graphic design, which can be easily changed. Because paper prototypes are fast to develop, designers can test a number of different options and discard the less promising ideas at little cost (Rettig, 1994, Beaudouin-Lafon and Mackay, 2003). Paper prototypes for information-rich interfaces (such as search interfaces) require certain special accommodations, such as pre-printed results listings and pre-determined, “canned” queries, which somewhat reduce their realism and validity, but which are nevertheless very useful in the early stages of design.
Evaluation of low-fi (paper) prototypes is not only possible, but often highly effective. The goal is that the participant should be able to move through the sequence of actions necessary for accomplishing some task and produce feedback about the interaction. Usually two or three evaluators are present while one participant examines the system. One evaluator acts as the host, and explains the goals of the design and the tasks that the user is requested to accomplish. The participant considers the tasks, and then looks at the prototype and points at buttons to click, entry forms to fill out, and so on. A second evaluator “plays computer” by moving the paper pieces around, and even improvising and devising new components to compensate for missing pieces of the design. The third evaluator sits farther away, stays silent, and records the participant's comments, impressions, confusions, and suggestions. Participants are encouraged to “think aloud” and voice what they find confusing as well as what they like (Boren and Ramey, 2000, Nielsen, 1993, Nielsen and Loranger, 2006). As mentioned above, participants tend to be less hesitant about suggesting fundamental changes to a paper prototype than to a design that looks finished and polished.
After the low-fi design is testing well with potential users, common practice is to build more detailed or high-fidelity versions, with some amount of interactivity built in. These partially implemented designs are again evaluated, first with just a few participants, to help refine the design, evaluate the effectiveness of different features, assess the interaction flow, and test other holistic properties of the design. After each round of assessment, changes usually need to be made to the design, and more testing must be done until the interface is found to be working well across tasks and participants. After the more high-fidelity design is working well, it can be more fully implemented and assessed with larger numbers of participants using more formal evaluation methods.
There is an oft-debated question about how many participants are needed in informal studies in order to find the major usability problems in a design. Nielsen, 2000 published results suggesting that only five participants are needed on average to find 85% of usability problems. Spool and Schroeder, 2001 disputed this result, claiming that the first five participants found only 35% of the major problems. More recently, Lindgaard and Chattratichart, 2007 compared the performance of several different evaluation teams who were assessing the same interface design, varying the number of participants from five to fifteen. They found no correlation between number of participants and number of problems found, but did find a correlation between the number of distinct tasks that participants were asked to complete and the number of problems found. This suggests that in order to assess a design thoroughly, participants must be asked to exercise its capabilities in many different ways.
Another form of discount usability testing is heuristic evaluation (Nielsen, 1992, Nielsen, 1993), which is based on the assumption that a usability expert can recognize design flaws and anticipate usability problems. In the heuristic evaluation process, several usability experts study a design and critique it according to a set of usability guidelines or heuristics, which are used to assign some structure to the critique. Studies have shown that different types of assessment often reveal different, non-overlapping problems with the design (Jeffries et al., 1991, Nielsen, 1994). Heuristic evaluation combined with informal usability testing works very well in the early stages of design.
Another popular form of informal usability assessment is the field study. In this approach, the experimenters travel to the participant and observe them using the interface in their own natural environment, be it at work or at home. The assumption is that people behave more realistically when using their own equipment and in their own settings, rather than working in someone else's unfamiliar office (Grimes et al., 2007). In a field study, experimenters can either simply observe how the interface is used, or can ask participants to do certain tasks, but it is more difficult to have participants experiment with different variations of an interface in this setting.
To help ensure a successful design process, it is important for the designers to refrain from becoming attached to any particular design. Instead, they should view a candidate design objectively, as one of many possibilities. This mental stance makes it easier for designers to accept and learn from negative responses from study participants. Often participant reactions include subtle hints about what the most promising directions for the design are, but the evaluator who is hoping for a different outcome can miss these hints. A common mistake of novice designers is to downplay or ignore the negatives and emphasize positive results.
2.3: Formal Studies and Controlled Experiments
Formal usability studies in the form of controlled experiments aim to advance the field's understanding of how people use interfaces, to determine which design concepts work well under what circumstances, and why. They can also be used to help decide if a new feature or a change in approach improves the performance of an existing interface, or to compare competing interfaces.
To shed light on the phenomena of interest, controlled experiments must be somewhat artificially constrained in order to identify which features, if any, make a difference in the usability of the design. Just as in a clinical trial of a pharmaceutical drug, it is important to isolate the right factors, have proper controls to compare against, and choose the study population to mirror the target population correctly. When properly designed, the results of a formal study on a particular feature or idea should be applicable to many different interface designs. This section describes some of the components of formal studies, with an emphasis on the issues specific to search interfaces, but the reader is encouraged to consult more extensive treatments to fully understand the details of experiment design (Keppel et al., 1992, Kohavi et al., 2007, Kohavi et al., 2008).
2.3.1: Techniques Employed in Formal Studies
The classic formal usability study (also known as a lab study) is conducted in a usability laboratory (Shneiderman and Plaisant, 2004), in which observers can be hidden behind a two-way window. However, a quiet room with a desk, chairs, and computer is oftentimes sufficient. Some studies are recorded with videotape or audiotape, and the content is later transcribed. User behavior can be recorded by people taking notes, but screen-capture programs allow for higher accuracy and greater completeness, especially if paired with click-capture and logging. Some studies use eye-tracking technology (Nielsen and Pernice, 2009) to determine exactly which parts of the screen people view, to better understand the mental process that participants are undergoing when assessing a search interface (see Figure 2.2).
Participants in a usability study should be asked to read and sign a consent form, indicating which modalities are being recorded (audio, video, screen capture, eye tracking) and how the participants' anonymity will be protected. Participants should be told that they are free to leave the study at any time and for any reason.
It is common to give participants warm-up or practice tasks in order to help them get familiar with the system and relax into the tasks. It is also important to pilot test usability studies before running the full study with recruited participants. Initially, pilot tests are often done using co-workers or otherwise easy to recruit individuals, before testing on participants from the desired user base.
Formal (and informal) studies do not necessarily have to be conducted with experimenters sitting in the same room as the participant, or directly observing the participant. Several studies of remote usability testing find that this method yields similar results in most cases to in-person testing (Christos et al., 2007, West and Lehman, 2006, McFadden et al., 2002). Remote testing has the advantage of making it easier to reach a more diverse user pool, and the study can be done in the participants' everyday work environment, and so is potentially more ecologically valid and more cost-effective than travel. Drawbacks to remote testing include difficulties in capturing facial expressions and non-verbal cues and less free-flowing communication between experimenter and participant. It might also be difficult to get experimental software working remotely, but special software has been developed to aid in remote testing that addresses this issue to some degree.
2.3.2: Balancing Condition Ordering Effects
A problem common to all usability studies is that the order of exposure to experimental conditions can bias the results. When working with search interfaces, participants commonly learn in several ways: about the tasks, about the collection being searched over, and about the search interaction itself. Thus participants might get faster on a set of tasks independent of the different properties of the interfaces being compared, or they might get tired, making them slower on the later conditions. If timing data is being collected as part of a study, it is important to compute statistics that compare the time taken at different phases of the study. It is also important to pilot test a study and, if participants tire towards the end, to shorten it.
Participants learn about a topic while seeking information about it. Thus a participant should never see the same query or topic in two different conditions. The effect of prior exposure is much too great and invalidates any results seen in the second viewing. An example of this can be found in a study in which participants were asked the same question five different times, for five different visual displays. Not surprisingly, participants' performance for a given display was stronger for conditions viewed later in the study. Because the questions were repeated, it is not possible to tell whether this was the cause or whether the task was simply one that got easier with practice, and any differences seen among the interfaces are called into question.
Participants can be influenced in other ways by order effects. If one interface is much better than another, then seeing the inferior interface second can cause it to have lower usability ratings than if it was viewed first. For example, in a study by the author and students comparing Flamenco, a faceted metadata interface, to a baseline standard search interface for image browsing, the order in which interfaces were viewed had a strong effect on the subjective ratings (such as “This interface is easy to use” or “This interface is flexible,” etc.). When the faceted interface was viewed first, the subjective ratings for the baseline were considerably lower than when baseline was the first interface shown (thus providing additional support for the superiority of the faceted interface) (Yee et al., 2003). Because the order of the interfaces was varied for different participants in this study, it was possible to detect this effect in the data analysis. Thus, to control for order effects, the experimenter should vary the order in which participants are exposed to the key conditions.
Formal experiments must be designed, just as interfaces are (Keppel et al., 1992, Kohavi et al., 2007, Kohavi et al., 2008). A description of a formal study design should state the independent (experimenter chosen) variables, such as the different designs under study or the queries assigned, as well as the dependent (response) variables, such as time elapsed, number of errors made, and the participants' subjective responses. The design must also describe the blocking of the experiment, which refers to how the different experimental conditions are combined, that is, which tasks are assigned to which participants using which interface designs, and in which order. The research questions or hypotheses being tested can then be stated in terms of the independent and dependent variables. The experiment blocking is devised in such a way as to allow statistical tests to determine significant differences and to confirm or refute the hypotheses. A description of the participants is also important, often including demographics of the participant pool, their familiarity with relevant technology, and sometimes the results of cognitive tests.
When comparing interfaces, it is common to use a within-participants design, meaning each participant views each key condition (e.g., each interface, each variation of the new feature). This allows for direct comparisons of the responses of a given individual, which can be useful for comparing subjective responses, timing, and other response variables, in order to control for individual variation. On the other hand, within-participants design has its drawbacks. In some cases participants can form a bias or develop a learning effect after interacting with an interface, or the experimenter may need to have a participant spend a lot of time on one condition, thus making a session last too long. In these cases, it can be better to instead do a between-participants design, in which each participant is shown only one key interface condition. This precludes responses from being compared directly, but it eliminates any artificial effects that occur from one participant seeing multiple designs. In most cases, more participants are required in the between-participants design in order to get statistically meaningful results, but each individual session can be shorter.
Because some tasks may be harder than others, it can also be important to control for which tasks are conducted on each interface condition. In order to control both for the order of exposure to the interfaces and for the combination of interface and task, the experimenter can employ a Latin Square blocking design. A Latin Square can be thought of as a matrix or table with the same number of rows as columns; the Latin Square condition holds if no value is repeated within any row or column (thus bearing some similarity to the game of Sudoku). An easy way to build an n x n Latin Square is to lay out the first row and then rotate its values by one position to form each subsequent row, repeating until all rows are constructed. So, if the first row of a 3 x 3 Latin Square is 1, 2, 3, the next would be 2, 3, 1, and the last would be 3, 1, 2. Note that this layout does not achieve all possible orderings (of which there are n factorial), but for experiment design it is usually assumed that allowing each condition to appear in each position once is sufficient to capture relevant ordering effects.
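The rotation construction described above is easy to express in code; the following sketch is a minimal illustration (the function name is an assumption), and printing the square for three conditions reproduces the 3 x 3 example just given.

```python
def latin_square(conditions):
    """n x n Latin Square built by rotating the first row one position to
    the left for each subsequent row, so every condition appears exactly
    once in every ordinal position."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)]
            for row in range(n)]

for row in latin_square([1, 2, 3]):
    print(row)
# [1, 2, 3]
# [2, 3, 1]
# [3, 1, 2]
```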
As an example, consider a usability study by Aula, 2004 which had the goal of assessing the relative benefits of three methods of showing Web search results summaries (see Figure 2.3). In the Bold style, the summary was left unchanged from what was returned by the Google search engine, with query terms bolded, while in the Plain style, the summary was the same as in Bold but with the boldface highlighting removed. In the List condition, every time an ellipsis appeared in the summary, a new list item was created, preceded by a small arrow. Incomplete sentences were marked with ellipses at the start and/or end of the list item.
This study was designed so that participants always saw 10 hits for each query, and in all cases, there were 9 distractors and 1 valid answer summary for the question. The 30 test queries were divided into task sets of 10 queries each (call these A, B, and C), and one-third of the participants saw the List view first, one-third saw the Bold view first, and one-third saw the Plain view first. The Latin Square design might look as shown in Table 2.1. Each row represents a participant group, containing 3 out of the 27 participants. The order in which the participants see the interfaces is shown from left to right. Note that the order in which the task sets were assigned was also varied to ensure that each query group is associated with a different interface type in each position in the test ordering.
Participant Group   First Task Set   Second Task Set   Third Task Set
1                   Plain (A)        Bold (B)          List (C)
2                   Plain (B)        Bold (C)          List (A)
3                   Plain (C)        Bold (A)          List (B)
4                   Bold (A)         List (B)          Plain (C)
5                   Bold (B)         List (C)          Plain (A)
6                   Bold (C)         List (A)          Plain (B)
7                   List (A)         Plain (B)         Bold (C)
8                   List (B)         Plain (C)         Bold (A)
9                   List (C)         Plain (A)         Bold (B)

Table 2.1 Latin Square design. Each participant group contains 3 participants; each Task Set (A, B, C) contains 10 distinct queries.
This kind of blocking structure allows for statistical tests such as Analysis of Variance (ANOVA) (Stockburger, 1998, Keppel et al., 1992) that can determine whether or not there are interaction effects among the different conditions. In the case of the Aula (Aula, 2004) study, ANOVAs showed a significant effect of interface type on task time, supporting the hypothesis that the variation in task completion time was caused by the differences in the interfaces. ANOVAs also showed that although participants made errors 21% of the time, there was no significant effect of interface type on error rate.
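For illustration, a one-way ANOVA of this general kind can be computed with scipy. The timing values below are hypothetical, and a full analysis of a within-participants Latin Square design would use a repeated-measures formulation rather than this simplified between-groups sketch.

```python
from scipy import stats

# Hypothetical task-completion times in seconds, one list per interface.
plain = [42.1, 55.3, 47.8, 60.2, 51.0, 49.4]
bold  = [38.4, 45.1, 41.7, 52.6, 44.9, 43.2]
lst   = [35.2, 40.8, 37.5, 48.3, 39.1, 36.7]

f_stat, p_value = stats.f_oneway(plain, bold, lst)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A p-value below the chosen threshold (commonly 0.05) supports the claim
# that interface type affects task time; it does not say which pairs differ.
```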
To account for another potential cause of variation, Aula also balanced the location in which the single relevant document would appear in each task set. She ordered the results listings in such a way as to guarantee that for each task set, a correct answer occurred in each of the 10 positions in the list one time. Thus in task set A, a relevant document would appear in position 1 for one query, in position 2 for another query, and so on. This artificial element of the design helps counteract search users' tendency to select the top-ranked hits, allowing each interface an equal chance of having the correct answer in each position in the list. On the other hand, it may introduce an artifact into the study if it is the case that, say, the bolding interface works best when the correct answer falls in the first or second hit.
2.3.3: Obtaining Participants
Ideally, usability study participants are drawn from a pool of potential users of the system under study; for instance, an interface for entering medical record information should be tested by nurse practitioners. Often the true users of such a system are too difficult to recruit for academic studies, and so surrogates are found, such as interns who are training for a given position or graduate students in a field.
For academic HCI research, participants are usually recruited via flyers on buildings, email solicitations, as an (optional) part of a course, or by a professional recruiting firm. Often the participants are computer science graduate students, who are not representative of most user populations in terms of their approach to and understanding of computer interfaces. Participants obtained in this manner are usually not a representative sample of the population of potential users, and so the results of such studies must be regarded in this light. More recently, the field of HCI has been moving towards recruiting more realistic participants in usability studies.
The number of participants required to observe statistically significant differences depends on the study design, the variability among the tasks and participants, and the number of factors being compared (Kohavi et al., 2008). Most academic usability studies use a very small number of participants, and often these are drawn from the researchers' environment. Again, these issues suggest that results of studies should be eyed with some skepticism. This is why, throughout this book, the number of participants, as well as the pool they are drawn from, are reported when a usability study is summarized. It is also why results of multiple studies are reported whenever available.
The massive scale of deployed search engines changes the participant-recruitment situation by allowing new ideas to be tested on thousands, or even tens of thousands of users, as discussed below in the section on large-scale log-based usability testing.
Another recent approach to recruiting participants is to use an outsourcing or “crowdsourcing” service, such as Amazon's MTurk service, in which tens of thousands of people sign up online to do quick tasks for small fees. This approach to recruiting participants was found to be a fast, effective way to run studies, particularly for relevance testing (Snow et al., 2008, Alonso et al., 2008, Kaisser et al., 2008).
2.3.4: Measuring Participants' Preferences
As mentioned above, it is important to measure participant preferences, or subjective responses, since interfaces that are not liked will not be used, given a choice. Participant preferences are usually obtained via questionnaires using Likert scales. In Likert scales, participants select a point within a qualitative range, such as “difficult to easy”, “strongly agree to strongly disagree”, “never to always”, and so on (Shneiderman and Plaisant, 2004). Most Likert scales have either 5, 7, or 9 degrees to choose among; odd numbers make it clear what the central or neutral choice is. If the participants can be expected to make fine-grained distinctions in their choices, it is preferable to use a wider scale; otherwise a narrower scale should be used. For example, if comparing a new interface against one that is known already to have strong positive reactions, a wider scale allows participants to clearly indicate a preference above and beyond what is already familiar.
The meta-analysis by Hornbæk and Law, 2007 suggests that ad hoc questionnaires are less reliable and have greater variability than standard questionnaires. One commonly used questionnaire for assessing subjective reactions to interactive interfaces is the Questionnaire for User Interaction Satisfaction (QUIS) tool (Chin et al., 1988, Shneiderman and Plaisant, 2004). It poses questions, rated on a 9-point Likert scale, pertaining to overall reactions to the system, characteristics of the graphical view, terminology, learning difficulty, and system capabilities. A more generally applicable questionnaire consisting of 5 sets of questions was analyzed by van Schaik and Ling, 2005, and was found to measure properties useful for assessing the quality of interaction with Web interfaces, both for monitoring and for improving the design of such sites. Sample questions from this questionnaire are:
- “Learning to use this site was easy.”
- “I felt lost.”
- “I judge the Web page to be: (very disordered 1 2 3 4 5 6 7 very ordered)”
Most of these are answered using a 7-point Likert scale ranging from “strongly agree” to “strongly disagree,” but the questions on aesthetics have their own scales, as shown in the third example.
When comparing two or more designs, it can be revealing to have participants assess each interface in isolation before comparing them directly. This provides a second check on the subjective ratings for the same questions (Hornbæk and Law, 2007). For example, if in the direct comparison, interface A is rated more flexible than B, the experimenter can check the individual ratings for flexibility from the independent assessment of each interface in order to infer how strongly that difference is felt. If the participant gave a very low score for flexibility for design A and a high score for design B, then the difference can be assumed to be large, but if the scores are similar on the isolated questionnaires, the perceived differences in the comparison can be assumed to be of lesser degree. This is also useful to see if there are effects based on order of presentation. Finally, if the participant for some reason does not complete the entire session, partial subjective information will have been obtained.
Aesthetic impressions are an important form of subjective response, as they have been found to play a role in user acceptance and have been found to correlate with perceptions of an interface's quality, user satisfaction, and overall impression of a site (Hassenzahl, 2004, Lindgaard and Dudek, 2003). van der Heijden, 2003 found that the visual appeal of a Web site affected participants' enjoyment and perception of ease of use, and to a small degree, the usability of the system. Nakarada-Kordic and Lobb, 2005 report that viewers persevere longer in a search task on Web sites whose design appeals to them.
Another reason to evaluate participants' subjective responses is that objective measures and subjective preferences are sometimes only moderately correlated (Hornbæk and Law, 2007). Ben-Bassat et al., 2006 hypothesized that subjective responses to a system may be only partly influenced by its usability. To test this, they conducted a study with 150 engineering undergraduate students in which they varied the aesthetic quality of a design as well as its usability (in terms of the number of keystrokes needed to get a task done). Participants were asked to complete tasks using the different interfaces and then to fill out questionnaires about the interfaces' usability and aesthetics. The experimenters found that the interfaces that required fewer keystrokes were indeed considered more usable, but they also found that the high-aesthetic interfaces were seen as slightly more usable than the low-aesthetic interfaces, even though the latter were slightly faster. Ben-Bassat et al., 2006 then told participants they would be paid according to how well they did the task in a subsequent experiment. Participants were asked to make bids in an auction format on the different interfaces, to determine which they would use. In this condition, people bid higher on the more efficient designs, independent of their aesthetic scores. Thus, questionnaire-based assessments were affected by aesthetics, but auction-based, financially incentivized assessments were not.
In some cases it is desirable to separate effects of usability from those of branding or of visual design. For instance, to compare the relative benefits of the ranking abilities of search engine A versus B, it is important to re-render the results in some neutral layout that uses the same visual elements for both rankings.
2.4: Longitudinal Studies
To obtain a more accurate understanding of the value and usage patterns of a search interface (in order to achieve what the social sciences literature calls ecological validity), it is important to conduct studies in which the participants use the interface in their daily environments and routines, and over a significant period of time (Shneiderman and Plaisant, 2006).
A longitudinal study tracks participant behavior while using a system over an extended period of time, as opposed to first-time usages which are what are typically assessed in formal and informal studies. This kind of study is especially useful for evaluating search user interfaces, since it allows the evaluator to observe how usage changes as the participant learns about the system and how usage varies over a wide range of information needs (Shneiderman and Plaisant, 2006). The longer time frame also allows potential users to get a more realistic subjective assessment of how valuable they find the system to be. This can be measured by questionnaires as well as by how often the participant chooses to use the system versus alternatives.
Additionally, in lab studies, participants are asked to stay “on task” and so will tend to behave in a more directed manner, accomplishing tasks efficiently but not necessarily taking the time to explore and get to know new features or try new strategies. Longitudinal studies can capture more variation in usage and behavior as they occur in a more relaxed setting.
A good example of a longitudinal search study is described by Dumais et al., 2003, who assessed the usage patterns of a personal information search tool by 234 participants over a six-week period, using questionnaires and log file analysis. They split the participants into two groups, giving each a different default setting for sorting results (by Date versus by Ranking). This allowed the experimenters to see whether participants chose, over time, to use an ordering different from the default. They indeed found an interesting pattern: those who started with Rank order as the default were much more likely to switch to Date ordering than vice versa, suggesting that ordering information chronologically is better than ordering by a ranking metric when searching over users' personal data collections. Such an effect might not be seen in a short laboratory study.
In another example of a longitudinal study, Käki, 2005b invited participants to use a grouping search interface over a period of two months. Subjective responses were obtained both after initial usage of the system and after the completion of the trial period. The responses became more positive on most measures as time went by. Even more interestingly, the query logs showed that the average query length became shorter over time. Some participants commented on this trend, volunteering that they became “lazier” for some queries, using more general terms than they otherwise would, because they anticipated that the system would organize the results for them, thus allowing them to select among refining terms. Observing this kind of change in user behavior over time is a very useful benefit of a longitudinal study.
Although longitudinal studies are relatively new to search interface evaluation, and to formal user interface evaluation generally, there is increasing momentum in their support.
2.5: Analyzing Search Engine Server Logs
Most Web search engines record information about searchers' queries in their server logs (also called query logs). This information includes the query itself, the date and time it was written, and the IP address that the request came from. Some systems also record which search results were clicked on for a given query. These logs, which characterize millions of users and hundreds of millions of queries, are a valuable resource for understanding the kinds of information needs that users have, for improving ranking scores, for showing search history, and for attempts to personalize information retrieval. They are also used to evaluate search interfaces and algorithms, as discussed in the next section. In preparation, the following subsections describe some details behind the use of query logs.
2.5.1: Identifying Session Boundaries
Information seeking is often a complex process consisting of many steps. For this reason, many studies of query logs attempt to infer searchers' intent by observing their behavior across a sequence of steps or activities. This sequence is usually referred to as a search session, which can be defined as a sequence of requests made by a single user for a single navigation purpose (Huang et al., 2004).
There are several approaches for automatically distinguishing the boundaries of search sessions. The simplest method is to set a time threshold; when a user has been inactive for some amount of time, a cutoff is imposed, and all actions that happened before the cutoff are grouped into a session. He and Goker, 2000 found a range of 10 to 15 minutes of inactivity to be optimal for creating session boundaries for web logs. Silverstein et al., 1999 used five minutes, Anick, 2003 used 60 minutes, and Catledge and Pitkow, 1995 recommended 25.5 minutes.
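A minimal sketch of the fixed-timeout approach is shown below, assuming each user's log events are available as time-sorted (timestamp, action) pairs; that representation, the function name, and the 15-minute default (taken from the range reported by He and Goker, 2000) are illustrative choices, not part of any particular system.

```python
from datetime import timedelta

def split_sessions(events, timeout_minutes=15):
    """Group one user's time-ordered (timestamp, action) events into
    sessions: a new session starts whenever the gap since the previous
    event exceeds the inactivity timeout."""
    timeout = timedelta(minutes=timeout_minutes)
    sessions, current = [], []
    for event in events:
        if current and event[0] - current[-1][0] > timeout:
            sessions.append(current)
            current = []
        current.append(event)
    if current:
        sessions.append(current)
    return sessions
```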
Chen et al., 2002 recognized that timing information varies with both user and task, and so proposed an adaptive timeout method. They defined the “age” of an action to be the interval between that action and the current time. A session boundary was assigned when the age of an action was more than twice the average age of the actions in the current session. Thus the time that is allowed to elapse before a session boundary is declared varies with the degree of activity the user is engaged in at different points in time.
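One plausible implementation of such an adaptive rule, under the reading that an action's age is the time elapsed until the next observed action, is sketched below; the details of the original method may differ, and the function name and plain-number timestamps are assumptions made for the example.

```python
def adaptive_sessions(timestamps):
    """Split a sorted, non-empty list of event times (in seconds) into
    sessions: a boundary is declared when the gap since the previous action
    is more than twice the average gap between actions in the current
    session."""
    sessions, current, gaps = [], [timestamps[0]], []
    for t in timestamps[1:]:
        gap = t - current[-1]
        if gaps and gap > 2 * (sum(gaps) / len(gaps)):
            sessions.append(current)   # close the session; start a new one
            current, gaps = [], []
        else:
            gaps.append(gap)
        current.append(t)
    sessions.append(current)
    return sessions
```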
Jansen et al., 2007a examined 2,465,145 interactions from 534,507 users of a search engine, comparing several methods for determining session duration. They found that the best method made use of IP address, a cookie (stored client-side information), and query reformulation patterns, the latter to distinguish when a given user is searching for several different information needs in rapid succession. Using this method, they found that 93% of sessions consisted of 3 or fewer queries, with a mean of 2.31 queries per session (SD 1.5). The mean session length using this method was 5 minutes and 15 seconds (SD 39 min).
2.5.2: Issues Surrounding Searcher Identity
In query log analysis, an individual person is usually associated with an IP address, although there are a number of problems with this approach: some people search using multiple different IP addresses, and the same IP address can be used by multiple searchers. Nonetheless, the IP address is a useful starting point for identifying individual searchers.
Note that for typical query log analysis, no attempt is made to determine the actual identity of the users; rather, the IP address is linked to an anonymized unique ID number. Nonetheless, there are important privacy issues associated with the retention and use of query log information. Major search engine companies keep their query logs in confidence, but academic researchers are at a great disadvantage if no such logs are available to them. For this reason, in some cases, query logs have been made available to academic researchers, usually without incident, but in one case to great notoriety (Bar-Ilan, 2007). Adar, 2007 and Xiong and Agichtein, 2007 discuss methods to anonymize the queries themselves, to balance the needs of certain types of research with user privacy, and Cooper, 2008 summarizes the issues from a policy perspective.
Despite these attempts, problems remain (Jones et al., 2007), and so query logs should be carefully handled. An alternative solution is to host a search engine and ask searchers to opt in to having their searches recorded for scientific purposes, with an option for them to remove any information they choose (Cooper, 2008).
Some major web-based companies support “toolbars” which the user must opt in to use and which provide value to the user. However, they monitor all of that user's web-based activity -- not just queries and clicks, but all navigation within the browser. This kind of information can not only lead to a much deeper understanding of peoples' information seeking processes, but can also record more potentially sensitive information. This data must be carefully anonymized and secured.
2.6: Large-Scale Log-Based Usability Testing (Bucket Testing)
An important form of usability testing that takes advantage of the huge numbers of visitors to some Web sites is large-scale log-based usability testing. In the days of shrink-wrapped software delivery, once an interface was coded, it was physically mailed to customers in the form of CDs or DVDs and could not be significantly changed until the next software version was released, usually multiple years later. The Web has changed this paradigm so that many companies release products in “beta,” or unfinished, status, with the tacit understanding that there may be problems with the system, and the system will change before being officially released. More recently, the assumptions have changed still further. With some Web sites, especially those related to social media, the assumption is that the system is a work-in-progress, and changes will continually be made with little advance warning. The dynamic nature of Web interfaces makes it acceptable for some organizations to experiment with showing different versions of an interface to different groups of currently active users.
Large-scale studies using log analysis are also known in the industry as bucket testing, A/B testing, split testing, and parallel flights (Kohavi et al., 2007, Kohavi et al., 2008, Wroblewski, 2006, Sinha, 2005). The main idea is to start with an existing interface that is heavily used and available over the web, and to create a variation on the design, or a new feature, to evaluate. User traffic coming into the site is randomly split, so that one segment of the user population sees the new version or feature. The behavior of users in the experimental condition, as well as in the control (which is usually the standard site interface), is recorded in log files. In some cases, for sites with very large customer bases, the study completes in a few days or even a few hours.
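The random split is often made deterministic by hashing an anonymized user identifier, so that a given user always lands in the same bucket for a given experiment. The sketch below illustrates one way this might be done; the function name, experiment label, and treatment percentage are illustrative assumptions rather than any particular system's API.

```python
import hashlib

def assign_bucket(user_id, experiment, treatment_percent=5):
    """Deterministically assign a user to 'treatment' or 'control' by
    hashing the anonymized user ID together with the experiment name,
    so assignment is stable across visits and independent across
    experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    slot = int(digest, 16) % 100          # pseudo-random slot in [0, 100)
    return "treatment" if slot < treatment_percent else "control"

print(assign_bucket("anon-7c2f91", "results-layout-test"))
```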
Changes in behavior (or lack thereof) between the tested design and the standard design are ascertained from the server logs. The metrics recorded include which components are clicked on, dwell time (time elapsed between viewing and clicking), if an item is placed in a shopping cart, and so on. Feedback in the form of customer emails or online discussions can also figure into the analysis. Based on the results of these tests, a decision is made about whether to retain the feature or not. In some cases, if a new feature or design looks promising, the log studies are followed up with more standard usability evaluations, in which people are interviewed or invited into laboratory studies.
Bucket testing can be used both to test small changes in the interface, or for very large-scale innovations in the design. It is recommended that the new version first be tested with a very small percentage of the user population, and the logs and email monitored, to ensure that no errors are introduced. If the results do not appear to be problematic, then more users are included into the test condition (Kohavi et al., 2007).
This kind of study is inherently between-participants, because users see only one version of the interface and are not asked to compare different versions. But this differs from a formal study in that participants are not asked to complete certain tasks, nor is explicit feedback elicited. Therefore, in order to ensure that any differences seen in the log files are indeed significant, the study must be conducted with a very large user base, and significant-looking results need to be verified both with careful statistical significance testing and with additional studies to ensure that unforeseen causes are not at work.
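As an example of the kind of significance testing involved, a difference in click-through counts between two buckets can be checked with a chi-square test on the contingency table of clicks versus non-clicks; the counts below are hypothetical and serve only to show the mechanics.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: [users who clicked, users who did not], per condition.
control   = [4120, 95880]
treatment = [4480, 95520]

chi2, p, dof, expected = chi2_contingency([control, treatment])
print(f"chi^2 = {chi2:.1f}, p = {p:.2g}")
# A very small p-value suggests the click-through difference is unlikely to
# be chance alone, but follow-up studies are still needed to rule out
# novelty effects and other unforeseen causes.
```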
Note that some of the terminology used in bucket testing is generally applicable to controlled experiments. However, bucket testing differs from standard controlled studies in several ways. As mentioned above, no tasks are assigned to the users, and no explicit feedback or ratings are obtained, so subjective responses can be difficult to ascertain. Bucket testing also differs from longitudinal studies, in that no explicit feedback is elicited and bucket tests last for at most a week or two. Bucket testing also differs from standard log analysis in that two or more variations of a design, presented to different sets of users, are compared.
The bucket testing approach to usability evaluation also differs from laboratory studies in that users are not asked to explicitly opt in to the study. It also differs from longitudinal studies in which participants volunteer to test out a software system over a period of time. Rather, in bucket testing, the people involved are rarely explicitly notified that they are using a non-standard version of the interface. However, the testing and logging falls within the terms of service that are listed on and standard for active Web sites.
Bucket testing has been found to be highly effective at resolving disputes about design decisions. As Kohavi et al., 2007 point out, often one's intuitions about what will work well on a large scale are simply incorrect, and numbers-based evaluation of features and designs can serve to resolve disputes when individuals from different parts of an organization are all trying to promote their own features (Kohavi et al., 2008). Kohavi et al., 2004 note that a “culture of experimentation” at Amazon.com that made running large-scale experiments easy allowed Amazon to innovate quickly; Kohavi et al., 2008 are pursuing a similar approach at Microsoft.
A major limitation of bucket testing is that the test can run effectively only over the short term because the user pool shifts over time and some users clear their cookies (which are used by the bucket tests to keep track of user IDs). Additionally, it is commonly observed that when comparing a new interface versus one that users are already familiar with, users nearly always prefer the original one at first. As (Kohavi et al., 2007) note:
“If you change the navigation on a Web site, experienced users may be less efficient until they get used to the new navigation, thus giving an inherent advantage to the Control. Conversely, when a new design or feature is introduced, some users will investigate it, click everywhere, and thus introduce a 'newness' bias. ... Both primacy and newness concerns imply that some experiments need to be run for multiple weeks.”
As a corollary to this, in some cases major Web sites roll out changes only gradually. In a famous example, eBay once took 30 days to gradually change the background color of its home page from gray to white, in order to avoid offending or startling users (Helft, 2008). Yahoo and Google also make changes to their most important properties gradually.
In a radical form of this approach, a Google VP described using large-scale usability testing to resolve a design dispute that the engineers could not settle among themselves (Sinha, 2005). She stated that when Google first launched its News service, the designers could not decide between sorting articles by time or by location. Rather than running a formative study in-house, they decided to launch with no sorting facility whatsoever. Within a few hours, they received hundreds of emails asking for sorting by date, but only a few asking for sorting by location, and so they had their answer.
2.7: Special Concerns with Evaluating Search Interfaces
For a number of reasons, evaluating information-intensive applications such as search is somewhat different from, and often more difficult than, evaluating other types of user interfaces. Some pertinent issues are discussed below, along with best practices for the evaluation of search interfaces.
2.7.1: Avoid Experimenter Bias
A problem common to all usability studies, not just search interface studies, is that designers who evaluate their own designs often unconsciously introduce biases into the evaluation process. It is best to approach interface evaluation from a neutral scientific perspective, but this can be difficult when evaluating one's own design. One solution for getting a more objective read on the usability of an approach is to have the evaluation done by a third party, that is, by people other than the designers of the new interface. It is important that the designers trust the evaluators, though, in order for the results to transfer effectively.
Unfortunately, it is often impractical to have outsiders evaluate a research design, so when evaluating one's own design, the experimenter should be aware of ways to avoid introducing bias. It is important that the experimenter does not “leak” information to the participants about which design is favored. A common mistake is to say something like “We've developed a new interface which we'd like you to evaluate.” This suggests to the participants that the experimenter may be disappointed if the interface does not perform well, and so may elicit less than fully forthcoming responses from some participants. Neutral language such as “we are evaluating several different designs” is both accurate and non-biasing. It is also good practice to assign neutral names to the different designs or conditions; for example, naming the designs after mountain ranges or colors.
2.7.2: Encourage Participant Motivation
Participant motivation or interest can have a significant impact on results (Buchanan et al., 2005, Spool, 2002, Sutcliffe and Ennis, 1998). A highly motivated person will be inventive and try alternative avenues in a way that a bored person will not. Thus motivated participants can demonstrate how an interface will be used at the intense end of the usage spectrum. Unfortunately, many search usability studies ask people to search on topics about which they are uninformed, uninterested, or both. There is evidence that participants do not try hard when paid to participate in search studies on topics they do not care about. Rose, 2004 notes that in one study conducted by AltaVista, 55% of survey results had to be discarded; in some cases participants failed a simple effort test in which they were required to distinguish actual search results from random URLs.
One way to stimulate participant motivation is to allow people to search for information that they are interested in, or perform tasks that they care about (Russell et al., 2006). However, in principle it is preferable to require all participants to do the same task (under varying conditions), in order to facilitate comparisons within those conditions. One way to improve motivation while retaining some control over the topics tested is to ask participants to select a subset from a pre-defined list of topics, allowing them to avoid subjects about which they have no knowledge or interest (Teevan et al., 2005a) while still permitting direct comparisons among participants who choose the same tasks.
Another way to motivate participants is to match participants to the task at hand at recruiting time. An early study by the author and students on the Cha-Cha search system (Chen et al., 1999) provides a telling lesson. Undergraduate students were recruited to assess an interface whose main benefit was its ability to familiarize users with the large, diverse campus intranet by placing search results within that context. The system ended up being popular among campus administrators, but most undergraduates care very little about campus administrative structure, and they found little value in seeing this information.
Having learned from this experience, the evaluations of the Flamenco system each matched the collection to the participant base (Hearst et al., 2002, Yee et al., 2003). For example, when evaluating on recipes, participants who like to cook were recruited; when evaluating on an architecture slide library collection, professional and student architects were recruited; and when evaluating on a fine arts collection, graduate and undergraduate art history majors were chosen as participants. In each case the tasks were designed to be of inherent interest to those participant groups. Nonetheless, it is difficult to do this accurately. For the recipes study, one participant disliked the browse interface intensely; upon further questioning it turned out that although he loved cooking, he hated recipes, and therefore had a negative reaction to any recipe interface.
Spool, 2002 describes an innovative way to create a highly motivated participant -- ask participants to envision a product they would like to buy, show them a Web site that should contain the product, and offer them the money required to buy the product, but not for any other use. To increase the motivation still more, he suggests asking the participant to think about how they want to spend the money, and then sending them away to come back a week later. When the participants return, Spool claims they are even more motivated than before, having had time to further mentally refine their desired purchase.
2.7.3: Account for Participants' Individual Differences
Nielsen, 1993 notes that “the two most important issues for usability are the users' task and their individual characteristics and differences” (p. 43). He analyzed 30 published evaluations of hypertext systems and found that four of the 10 largest effects were due to individual differences between participants, and two were due to task differences (Nielsen, 1989a). Nielsen contrasted users' experience with the system under study, with computers in general, and with the task domain. In some domains (such as programming skill), individual ability can vary by a factor of 20 (Nielsen, 1993).
The effects of several types of individual differences on search performance have been studied in the literature: participants' knowledge of the task domain, participants' experience as searchers, and participants' cognitive differences; each of these is discussed below. The existence of these differences underscores the need to use large participant pools when conducting formal studies, in order to balance out the differences.
Many search usability studies have investigated the role of domain knowledge in search outcome, but the results are mixed (Vakkari, 2000a). Hembrooke et al., 2005 found that performance on information-intensive tasks is influenced by the background knowledge and cognitive skills of the study participants. Jacobson and Fusani, 1992 examined the experiences of 59 novice users with a full-text IR system, and used a regression model to assess the relative contributions of computer, system, and subject knowledge to search success. Their results indicated that all three variables played a role in the outcome. Wildemuth, 2004 found that medical students' search tactics changed over time as their knowledge of the domain changed. Sihvonen and Vakkari, 2004 found strong effects of domain knowledge on the use of thesaurus expansion. Vakkari, 2000a studied a set of 11 students as they formulated master's thesis topics, and observed changes in the choice of search terms and tactics as their knowledge of their topics increased. Vakkari and Hakala, 2000 found that the more participants knew about the task, the fewer references they accepted as relevant. Similarly, Spink et al., 1998 found that the less knowledgeable their participants were, the more items they marked as partly relevant. Kelly and Cool, 2002 found, in a study of 36 participants, that efficacy of search (measured as number of relevant documents saved divided by the number of documents viewed) was significantly higher for those people who were familiar with the topic than for those who were very unfamiliar with it, meaning that the knowledgeable participants were more accurate at identifying relevant documents.
It can be stated with some assurance that, when evaluating with a fixed set of tasks, it is important to ask in advance whether participants already know the answer to the question or have knowledge of the topic. For example, Aula, 2004 found significant differences in the results of a summary browsing study after removing data points corresponding to questions for which participants knew the answers in advance.
A number of search usability studies have assessed the effects of knowledge of the search process itself, contrasting expert and novice searchers, although there is no consensus on the criteria for these classifications (Aula, 2005). Some studies have observed that experts use different strategies than novices (Hölscher and Strube, 2000, Lazonder et al., 2000, White et al., 2005), but perhaps more tellingly, other studies have found interaction effects between search knowledge and domain expertise (Hölscher and Strube, 2000, Jenkins et al., 2003).
Tabatabai and Shore, 2005, in a study of 10 novice, 9 intermediate, and 10 expert searchers, found that novices found less relevant information yet were more satisfied, while experts found more relevant results but were more nervous about having missed information. However, participants who used better strategies obtained better results, independent of their experience level. The two most important strategies were evaluative (assessing the relevance of sources, search engines, and results) and metacognitive (self-reflection and monitoring). Tabatabai and Shore, 2005 wrote:
“Novices used backtracking to see where it would take them, hoping to get out of a labyrinth and find a comfort zone. On the contrary, intermediates and experts used it to go where they wanted to. Novices were less patient and relied more on trial-and-error. Impatience led them to navigate more, to click more, and to execute before spending enough time exploring or planning. ... Experts were more aware of their feelings, were not surprised by them, and relied on them to modify their strategies. Their positive outlook on the search gave them patience to wait for a response from the system. Novices, on the other hand, felt lost, disoriented, and caught in a labyrinth more often than experts.”
On the other hand, Zhang et al., 2005 found that, in a study including engineering students, increasing domain knowledge changed search behavior, but not search success. Lazonder et al., 2000 tested 25 Dutch students and found that training did not help much after achieving initial familiarity, and saw no effect based on domain knowledge.
Numerous studies show that participants who are generally more cognitively skilled perform better than other participants in information-intensive tasks, independent of the interfaces being compared. Allen, 1992 assessed a range of cognitive factors using the Kit of Factor-Referenced Cognitive Tests (Ekstrom et al., 1976) and then assessed students searching a bibliographic database for pre-assigned topics. (It is common for evaluation of information visualization interfaces to include the paper folding test from the Kit to measure spatial ability (Chen, 2000).) Allen, 1992 found that perceptual speed had an effect on the quality of searches, and logical reasoning, verbal comprehension, and spatial scanning abilities influenced search tactics.
Tombros and Crestani, 2000 note that fast participants remain fast and slow participants remain slow, independent of the interface. Pirolli et al., 2003 suggest a similar phenomenon when comparing an experimental and a standard browsing interface. When participants were ranked by their performance on an advanced file browser and then ranked by their performance on a standard file browser, there was a high correlation between the two rankings (high performers do well regardless of browser). They did another analysis that partitioned the sums of squares (i.e., partitioned the variance), and found that most of the performance effect in their study was due to individual differences rather than differences in the interface designs. To repeat a point made above, it is important to balance the participants across the different conditions when using a small number of participants.
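The flavor of such an analysis can be conveyed with a short sketch. The timing data below are invented; the code computes the rank correlation between participants' performance under two interfaces and then partitions the total sum of squares into participant, interface, and residual components (a two-way layout with one observation per cell).

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical task-completion times (seconds): rows are participants,
    # columns are the two interfaces being compared.
    scores = np.array([
        [110.0,  98.0],
        [142.0, 130.0],
        [ 95.0,  91.0],
        [170.0, 166.0],
        [125.0, 118.0],
    ])

    # High rank correlation: fast participants stay fast on both interfaces.
    rho, p_value = spearmanr(scores[:, 0], scores[:, 1])

    # Partition the total sum of squares.
    grand = scores.mean()
    ss_total = ((scores - grand) ** 2).sum()
    ss_participants = scores.shape[1] * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_interfaces = scores.shape[0] * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_residual = ss_total - ss_participants - ss_interfaces

    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
    print(f"participants: {ss_participants / ss_total:.0%} of variance, "
          f"interfaces: {ss_interfaces / ss_total:.0%}")

If the participant component dominates, as Pirolli et al. found, then detecting an interface effect requires more participants or careful balancing of participants across conditions.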
2.7.4: Account for Differences in Tasks and Queries
Task selection in search interface evaluation can greatly affect the outcome of the study. Nielsen, 1993 (p. 185) notes that “the basic rule for test tasks is that they should be chosen to be as representative as possible of the uses to which the system will be eventually put in the field.” He also advises that tasks be small enough to be completed within the time frame but not so small as to be trivial. The following sections describe the effects of task variations, query descriptions, and bias in task selection when evaluating search interfaces.
In fully-automated IR system evaluations, such as those performed for the TREC ad hoc track, it is well-established that differences in the queries can overwhelm the outcome of a comparison between systems. It has been shown that in a typical TREC task, variation within a system across queries is usually higher than variation among top-scoring systems (Buckley and Walz, 2000). Usually no system is the best for all topics in the task, and in fact it is rare for any system to be above average for all topics (Buckley and Walz, 2000). The track administrators typically use 50 different queries in order to control for effects of the task, but even this is insufficient (Voorhees and Harman, 2000).
Adding human participants to the evaluation most likely increases the sensitivity to task variation. A person might respond emotionally to a query, or may have expert knowledge in a topic, or may find a topic excruciatingly boring. Individualized reactions such as these can influence the results of a search usability study. To compound the problem, human participants can tire, and so can only be expected to do a limited number of queries within a test session, thus reducing the number of queries that can be run and increasing the likelihood of irrelevant artifacts having an effect on the outcome.
As mentioned above, Lindgaard and Chattratichart, 2007 found no correlation between the number of participants and the number of problems found in usability tests of entire interfaces, but did find a correlation between the number of distinct tasks participants were asked to complete and the number of problems found. This suggests that differences in tasks uncover different problems with the interface. For these reasons, search usability studies increasingly differentiate query types, assuming that different interfaces will work better or worse for different query types. Examples of varying the information need in this way are given by Woodruff et al., 2001 and Baudisch et al., 2004.
One way to reduce the variability somewhat is to pre-test the queries to determine whether they are of similar difficulty, along either subjective or objective measures. It is common practice when conducting a formal study to show candidate queries to one set of participants, have them rate the queries according to difficulty, interestingness, need for external knowledge, and so on, and then select a subset with equivalent properties to show to a distinct set of participants in the usability study.
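A minimal sketch of that filtering step follows; the queries, the 1-to-5 difficulty ratings, and the tolerance band are all invented for illustration and would be chosen to fit the study at hand.

    from statistics import mean, median

    # Hypothetical pre-test data: query -> difficulty ratings (1 = easy, 5 = hard).
    ratings = {
        "train schedules paris": [2, 3, 2, 2],
        "symptoms of lyme disease": [3, 3, 4, 3],
        "history of the printing press": [2, 2, 3, 2],
        "quantum error correction": [5, 5, 4, 5],
        "cheap flights to tokyo": [1, 2, 1, 2],
    }

    means = {query: mean(scores) for query, scores in ratings.items()}
    center = median(means.values())
    tolerance = 0.75  # keep queries within +/- 0.75 of the median difficulty

    balanced_subset = [q for q, m in means.items() if abs(m - center) <= tolerance]
    print(balanced_subset)

The same idea extends to the other rated properties: a query survives only if it falls inside the agreed band on every dimension.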
Although post-hoc analyses have found a lack of correlation between human estimates of difficulty and system performance on the TREC ad hoc task (Voorhees and Harman, 2000), and between system estimates of difficulty and system performance (Voorhees, 2004), there is some work suggesting that the weakest-performing queries can be predicted with some accuracy (Kwok, 2005). A recent large-scale analysis of query-system failures suggests that systems often fail for the same reasons on the same queries: the systems emphasize some aspects of a topic while ignoring others, do not handle general concepts well, and lack sufficient natural language understanding (Buckley, 2004). This analysis may prove useful in future for manual identification of query difficulty.
In addition, it does seem that people are good at estimating difficulty for other people (as opposed to for automated ranking evaluations), especially if the pre-testers use the search system that the study participants will be using (Bell and Ruthven, 2004).
As noted above, people find many different ways to express the same concept. Different query terms submitted to the same system can produce radically different results (Buckley and Walz, 2000), thus leading to significant differences in outcome that have little to do with the interface.
To control for this variability, many usability studies pre-assign the query terms that the participant enters. Pre-writing the queries of course greatly reduces the realism of the assessment. Martzoukou, 2004 argues against using pre-chosen queries in information seeking studies because of their artificiality, and Bilal, 2002 found that children were more successful with self-generated queries than with assigned topics and terms.
However, depending on what is being evaluated, pre-assigned query terms can be acceptable. For example, in the search results summary display experiment described above, the query formulation process is far less important than the results assessment process, and so the artificiality of the pre-written queries is less of a concern. But for a study of relevance feedback term suggestions, pre-entering the query terms may be quite unrealistic.
A common way to introduce experimenter bias is in the selection of tasks or queries (Käki and Aula, 2008). For example, in numerous studies of information visualization of search results, the tasks chosen are along the lines of “Which interface allows the user to say how many documents have K instances of term A?” or “Which directory has the largest number of documents?” Although visualization can help with tasks of this sort, counting questions like these are not particularly common search tasks for most users, and so they violate Nielsen's (Nielsen, 1993) call to evaluate using realistic tasks. This kind of evaluation is quite common because it is often difficult to show that a visualization improves search results in a meaningful way. Separating out different components in order to isolate which makes a difference is good experimental practice, but the components tested must be meaningful or realistic in some way in order to truly inform the field.
As another example, it can be quite difficult to show significant timing differences in Web search evaluations. In one study, the queries were pre-determined and the terms used for the queries were pre-written. The evaluators succeeded in showing timing improvements for a category-based view. However, the queries were designed to be maximally ambiguous so that any kind of grouping information would be helpful. For example, a query for the home page of the alternative rock band They Might Be Giants was written as giants, which is unrealistic from a query generation point of view (people searching for music and movie titles use most of the title words in their queries (Rose and Levinson, 2004)) and is guaranteed to return a wide array of irrelevant results. A much better approach, which showed more realistically that participants used grouping to help disambiguate queries, was the longitudinal study by Käki, 2005b discussed above.
Bias can of course also result from other factors discussed above: not varying the order of presentation properly, asking participants to repeat tasks, and polluting timing data by having documents pre-loaded for some conditions but not for others.
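The order-of-presentation pitfall in particular is easy to guard against mechanically. The sketch below, offered only as a minimal illustration, assigns participants to rows of a cyclic Latin square so that each condition (given a neutral name, as recommended earlier) appears equally often in each serial position; fully balancing carryover effects would call for a Williams design, a straightforward extension.

    def latin_square(conditions):
        # Cyclic Latin square: across the n orderings, every condition appears
        # exactly once in every serial position.
        n = len(conditions)
        return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

    orders = latin_square(["Sierra", "Andes", "Alps"])  # neutral condition names
    for participant, order in enumerate(orders, start=1):
        print(f"Participant {participant}: {order}")

Recruiting participants in multiples of the number of conditions keeps the positions balanced across the whole study.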
2.7.5: Control Test Collection Characteristics
When doing a formal usability test, it is important to evaluate the system on a large underlying collection (Willett, 1988). In the early days of experimental IR, much of the testing was done on the tiny Cranfield collection, which consisted of just 1,398 abstracts. A revolutionary paper by Blair and Maron, 1985 investigated search over an important legal case consisting of 40,000 documents, showing that high recall remained distressingly elusive even with the systems of the day. The TREC experiments expanded test collections by another order of magnitude in size, starting with about 1 GB of text for training and another gigabyte for testing in 1992 (Harman, 1993).
Unfortunately, the problem of testing IR systems on small collections still occurs today. In one study comparing search information visualization techniques, a collection of only 651 documents was used; another used only 163 documents. Testing a search visualization on such a small collection can yield misleading results, since it is much easier to show a graphic display with 10 or 100 hits than with 1,000 or 10,000; results from such a small study are unlikely to scale to a realistic collection.
It is also important to compare two interfaces that index the same underlying collection. Experimenters who wish to compare a new design against a commercial system, or against a research system whose software is unavailable, are tempted to run a study in which the systems index different documents. Unfortunately, the results of a comparison over different collections are rarely valid, as the behavior of search systems is very sensitive both to the queries issued and to the underlying collection. If one collection contains a great deal of information that is of interest to a participant and the other does not, the interface is likely to be blamed for the difference. Similarly, if the format of the contents of the two collections is quite different -- for example, one containing full text and the other containing abstracts, or one containing poor content and the other containing high-quality content -- the participants are likely to conflate the differences in the collections with the differences in the interfaces.
2.7.6: Account for Differences in the Timing Response Variable
Some of the most important metrics for evaluating interfaces include the time required to complete tasks, the effort level required, and the number and severity of errors made. However, comparing search interfaces based on time required to complete tasks can be fraught with problems. As mentioned above, task difficulty, participant differences, and query knowledge can have effects that must be factored into the study design. In addition, any search system that allows participants to follow hyperlinks and move away from the search results page is likely to produce widely varying timing data. Thus, most timing-based Web search studies require the users to look at a page of results listings, or at most view documents one link away from the results page (Käki and Aula, 2008). Timing data can also be misleading, because sometimes the value of a new information access interface is that it allows users to browse and get to know a collection of information, and thus longer usage times can signal a more successful design. In some cases, a deeper search session, in which the user can comfortably explore, iterate, and refine their search, may take more time than an unsatisfying interface that makes the user want to get in and get out as quickly as possible (Rose, 2004).
Käki and Aula, 2008 point out that there is no standard timing measure for search speed, and suggest a proportional measure called qualified search speed that measures answers per minute, broken down by the quality of those answers. Using this measure, the experimenter can compare system A with system B in terms of relevant answers per minute and irrelevant answers per minute. They found in several studies that the differentiating measure between interfaces was an increase in relevant speed, with irrelevant speed remaining nearly constant.
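The measure is easy to compute from logged session data. The sketch below uses invented numbers and a hypothetical helper function purely to illustrate the relevant-versus-irrelevant split.

    def qualified_search_speed(relevant_found, irrelevant_found, minutes):
        # Answers retrieved per minute, reported separately for relevant
        # and irrelevant answers (following Käki and Aula's proposal).
        return relevant_found / minutes, irrelevant_found / minutes

    # Hypothetical session totals for two interfaces, six minutes each:
    rel_a, irr_a = qualified_search_speed(relevant_found=9, irrelevant_found=3, minutes=6.0)
    rel_b, irr_b = qualified_search_speed(relevant_found=12, irrelevant_found=3, minutes=6.0)
    print(f"A: {rel_a:.1f} relevant/min, {irr_a:.1f} irrelevant/min")
    print(f"B: {rel_b:.1f} relevant/min, {irr_b:.1f} irrelevant/min")

In this invented example, interface B retrieves relevant answers faster while the irrelevant rate stays flat, which is the pattern Käki and Aula report across their studies.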
On a more practical note, timing data can be affected by variable network delays, or by contrasts between pages that have already been visited (either by the participant or by a preceding participant) and are therefore cached and fast to access, and pages that have not. (A related mistake is to allow a participant to see a previous participant's query history in the web browser's drop-down menu.) Best practice suggests downloading all pages that will be seen in advance and having the participants access them locally, if the study allows for that, and clearing out the cache and all prior history when running multiple participants in the same environment.
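Where local copies can be served, the pre-fetch step can be as simple as the sketch below; the URLs and directory are placeholders, and a real study would also capture images, scripts, and style sheets so that the saved pages render faithfully.

    import pathlib
    import urllib.request

    # Placeholder list of every page participants may visit during the tasks.
    urls = [
        "https://example.org/results/page1.html",
        "https://example.org/results/page2.html",
    ]

    cache_dir = pathlib.Path("study_pages")
    cache_dir.mkdir(exist_ok=True)

    for i, url in enumerate(urls):
        # Fetch once before the study; participants are then served these local
        # copies so that network delays and browser caching cannot skew timings.
        with urllib.request.urlopen(url) as response:
            (cache_dir / f"page_{i}.html").write_bytes(response.read())

Resetting the browser profile (cache, history, and form autocomplete) between participants still has to be done separately, since it is the browser, not the page store, that leaks one participant's activity to the next.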
As another way around the timing quandary, longitudinal studies can show changes in use over time, and can be used to differentiate which types of queries induce different types of response times.
2.7.7: Compare Against a Strong Baseline
Yet another way to bias the outcome of a usability study is to compare a new design against an unrealistically weak baseline. For example, one study compared a visualization of search results against a Web search interface that hid the summaries, even though showing summaries is standard for Web search engines and is the design to beat.
All designs should be equally aesthetically pleasing; as discussed in Chapter 1, unattractive designs de-motivate participants, and vice versa. And as mentioned above, all designs being compared should index the same underlying collection.
2.8: Conclusions
This chapter has described a wide range of methods for evaluating search user interfaces, and interfaces more generally. First was a discussion of standard information retrieval evaluation, which focuses on the performance of the ranking algorithms but not on user interaction. Next was a description of informal and formal methods for evaluating search interfaces, longitudinal studies which allow the participant to experience the interface over long periods of time, and large-scale evaluation that makes use of an existing user population to compare different variants of an interface. The chapter then provided extensive advice on how to avoid the many potential pitfalls that await those who evaluate search interfaces.
Given all this information, how should one design an evaluation? One strategy is to begin thinking about the evaluation when the new search interface is first being designed, and then work backwards from the evaluation to the new design concept. Ensure that there is a way to measure the new design against a strong baseline that is representative of the state of the art.
It is also important to define realistic tasks that reflect what the user base will actually want to do with the system, and to test against a wide range of such tasks in order to tease out which aspects of the system succeed and which do not. The study participants should feel motivated to complete the tasks, and their characteristics should match those of the true user base to the degree possible. Informal evaluations should be used to quickly determine what works and what does not. More formal studies should be used after the new design is received well in informal studies, in order to pinpoint detailed problems and to gather preference data to determine whether the new design is better received than the state of the art. Longitudinal studies are a useful mechanism for determining the long-term likelihood of the design's success. Finally, a good source of ideas about how to evaluate an interface is research papers that have evaluated similar interfaces in the past.