Search Analytics for Your Site. Louis Rosenfeld
http://www.flickr.com/photos/rosenfeldmedia/5690980802/
Figure 1-1. In a relevancy test, queries ideally find most reasonable results at position #1 on the search results page. A large distance from the top position suggests a poorly performing query.
John’s relevancy test turned out to be very helpful. As Figure 1-1 shows, we can see which queries weren’t retrieving their ideal result at or near the top of the search engine results page.
Yet there are two major limitations with relevancy testing: First, it leaves out many queries that don't have a "right answer"—queries that might be common and important. Second, this method relies on guessing what would be "right" for searchers, so it is a highly subjective measure. But a simple test like this one is a good starting point. It is consistent, and though it involves some subjective evaluation, it does so within a consistent framework. In this case, it allowed John to generate some simple test results from a representative sample. And if a search engine fails this test—as Vanguard's did—then you have some serious problems.
Precision: Getting Beyond Relevance
That’s why John decided to also introduce another set of metrics: precision. Precision measures the number of relevant search results divided by the total number of search results. It tells you how many of the search engine’s results are good ones. John specifically looked at the precision of the top five results—the critical ones that a searcher would likely scan before giving up.
To test precision, John developed a scale for rating each result that a tested query retrieved, based on the information the searcher provided.
Relevant (r): The result is completely relevant, and its ranking is appropriate.
Near (n): The result is not a perfect match, but it’s clearly reasonable for it to be ranked highly.
Misplaced (m): It’s reasonable for the search engine to have retrieved the result, but it shouldn’t be ranked highly.
Irrelevant (i): The result has no apparent relationship to the query.
Rather than guessing at what the searcher’s intent was, John was simply looking to assess how reasonable it was for the search engine to return each result, and whether or not the search engine put it in the right place. He recorded an r, n, m, or i for each result in a spreadsheet, as shown in Figure 1-2.
http://www.flickr.com/photos/rosenfeldmedia/5690980818/
Figure 1-2. Each result for each query was rated as Relevant, Near, Misplaced, or Irrelevant.
John then used a few different ways to calculate precision for each query. He came up with three simple standards—strict, loose, and permissive—to reflect a range of tolerances for different levels of precision.
Strict: Only results ranked as relevant were acceptable (r).
Loose: Both relevant and near results were counted (r+n).
Permissive: Relevant, near, and misplaced results were counted (r+n+m).
You can see how each query scored differently for each of these three precision standards in Figure 1-3. For example, of the first five search results for the query “reserve room,” two were relevant (r), two were nearly relevant (n), and one was misplaced (m). In strict terms, precision was 40% (two of five results were relevant); in loose terms, 80% (four of five were relevant or nearly relevant); and in permissive terms, 100% (all five results counted).
http://www.flickr.com/photos/rosenfeldmedia/5690405259/
Figure 1-3. Each query’s precision scores were then calculated in three different ways: Strict, Loose, and Permissive.
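John’s three precision standards amount to a simple ratio: accepted ratings divided by total results scored. As a sketch only—the `precision` function and its inputs below are illustrative, not John’s actual spreadsheet—here is how the “reserve room” example works out in Python:

```python
# Rating letters follow the book's scale:
# r = relevant, n = near, m = misplaced, i = irrelevant.

def precision(ratings, accepted):
    """Fraction of rated results whose rating falls in the accepted set."""
    return sum(1 for rating in ratings if rating in accepted) / len(ratings)

# Top-five ratings for the query "reserve room" (two r, two n, one m).
reserve_room = ["r", "r", "n", "n", "m"]

strict = precision(reserve_room, {"r"})                 # 2/5 = 0.4
loose = precision(reserve_room, {"r", "n"})             # 4/5 = 0.8
permissive = precision(reserve_room, {"r", "n", "m"})   # 5/5 = 1.0
```

The same function covers all three standards; only the set of accepted ratings changes.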
The Brake Works—Thanks to Site Search Analytics
John’s two tests of the original search engine—relevancy and precision—yielded two sets of corresponding metrics that helped his team compare the new engine’s performance against the old one (shown in Figure 1-4). The five relevancy metrics above the line were all based on how close to the top position the “ideal search result” placed. So the smaller the number, the better. For the “Target”—the benchmark figures based on the old search engine—the top queries’ ideal results placed, on average, three places below #1, where they ideally would have been displayed. John also looked at the same data in other ways: a median position, and three percentages showing how often the ideal result fell below the #1, #5, and #10 positions, respectively.
John used different metrics for precision as well—the strict, loose, and permissive measures described previously. In this case, bigger numbers were better because they meant a higher percentage of the top five results were relevant. As mentioned, the “Target” scores were the benchmark; they showed how the old search engine was performing. And the “Oct 3” scores showed how the new search engine was performing. The verdict, as you can see in Figure 1-4, was not pretty.
http://www.flickr.com/photos/rosenfeldmedia/5690405181/
Figure 1-4. The new search engine (“Oct