Findwise sponsored the 4th International Conference on the Theory of Information Retrieval (ICTIR), which took place in Copenhagen from 29 September to 2 October 2013. The conference presents the latest research and promotes the exchange of ideas on the theory and foundations of Information Retrieval (IR). Findwise attended to pick up theoretical ideas and put them into practice for its customers.
Is There Space For Theory In Modern Commercial Search Engines?
Ricardo Baeza-Yates (Yahoo! Labs, Spain) gave a keynote at the conference titled Is There Space For Theory In Modern Commercial Search Engines? The answer to this interesting question was, as expected, yes, and he quoted Donald E. Knuth in support: “the best theory is inspired by practice and the best practice is inspired by theory” (“Theory and Practice”, Theoretical Computer Science, 1991).
His presentation focused on predictive algorithms and how they can be applied to the challenges of web search. He used two examples to make the case for predictive algorithms in information retrieval: (1) tier prediction and (2) query intent prediction.
In the first example, the task is to predict which corpus of documents to search for a given query, in order to reduce answer time. Document indexes are often partitioned to improve response times, especially in large international organizations. The task is then to predict which partition to search based on the query alone, without actually running it. A machine-learned corpus predictor picks a corpus, retrieves results from it, and then assesses whether the choice was right; if it was wrong, it tries to correct the action. The decrease in answer time, however, comes with an increase in infrastructure costs (read more about tier prediction and the cost-efficiency trade-off in this article).
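The predict, search, assess, and correct loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the keynote: the tier names, the toy index, and the keyword rule standing in for a learned predictor are all hypothetical, and "the choice was wrong" is approximated here by an empty result list.

```python
from typing import Callable, Dict, List, Tuple

# Toy partitioned index; tier names and contents are hypothetical.
TIERS: Dict[str, Dict[str, List[str]]] = {
    "tier1": {"vacation policy": ["doc-hr-1"], "expense report": ["doc-fin-3"]},
    "tier2": {"legacy build server": ["doc-it-9"]},
}


def search_tier(tier: str, query: str) -> List[str]:
    """Run the query against a single partition only."""
    return TIERS[tier].get(query, [])


def predict_tier(query: str) -> str:
    """Stand-in for a learned predictor: guesses a tier from the query text alone."""
    return "tier2" if "legacy" in query else "tier1"


def search_with_prediction(
    query: str, predictor: Callable[[str], str] = predict_tier
) -> Tuple[str, List[str]]:
    """Search the predicted tier first; fall back to the others if it returns nothing."""
    first = predictor(query)
    results = search_tier(first, query)
    if results:
        return first, results
    # Assumed correction step: an empty result is treated as a wrong prediction.
    for tier in TIERS:
        if tier != first:
            results = search_tier(tier, query)
            if results:
                return tier, results
    return first, []
```

Calling search_with_prediction("vacation policy") touches only tier1, while a wrong guess costs an extra lookup in another tier, which is the latency side of the trade-off mentioned above.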
In the second example, the task is to predict the user intent behind a query. Most work on query intent identification considers only one or a few facets of the query (its topic, for example, or its informational, navigational, or transactional nature), and Ricardo noted that the query is just “the tip of the iceberg” when it comes to understanding user intent. He proposes classifying each query along multiple facets, specifically the following nine dimensions: genre, topic, task, objective, specificity, scope, authority sensitivity, spatial sensitivity, and time sensitivity (see the photo below of his slide, which shows how these facets are defined and which values each takes). You can read more about query intent prediction in one of his invited talks.
Photo taken at ICTIR 2013
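To make the nine dimensions concrete, here is a minimal sketch of how a multi-facet classification of a query could be represented in code. The class name, the helper method, and the example values ("travel", "high") are hypothetical; the actual values each facet can take are the ones on the slide, not the ones used here.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class QueryIntent:
    """One slot per facet named in the talk; None means not yet classified."""
    genre: Optional[str] = None
    topic: Optional[str] = None
    task: Optional[str] = None
    objective: Optional[str] = None
    specificity: Optional[str] = None
    scope: Optional[str] = None
    authority_sensitivity: Optional[str] = None
    spatial_sensitivity: Optional[str] = None
    time_sensitivity: Optional[str] = None

    def unclassified(self) -> List[str]:
        """Names of the facets that no classifier has filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Placeholder values for illustration only.
intent = QueryIntent(topic="travel", time_sensitivity="high")
print(intent.unclassified())
```

Keeping every facet optional reflects the point that the query alone is only the tip of the iceberg: some dimensions may have to be filled in later from context.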
IR research – challenges and long-range opportunities
The conference also included a panel discussion on the challenges and future of IR. The panel consisted of Stephen Robertson (University College London, UK), Thomas Roelleke (Queen Mary, University of London, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Ricardo Baeza-Yates (Yahoo! Labs, Spain), and Peiling Wang (University of Tennessee, USA). Here are the main challenges that the panel members raised.
Small details matter. Thomas Roelleke compared the techniques used in golf with information retrieval models. Small details, such as how the player grips the club, make a big difference to the result of the shot. Similarly, small changes in the theoretical models used in IR produce significant differences in the way search results are ranked. For example, a small difference in how frequencies are written in the formulas behind the ranking algorithms can lead to different results in the implementation.
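A small worked example shows how much the definition of term frequency can matter. The sketch below is not taken from the panel; it assumes two toy documents and compares two common textbook weightings, raw counts and 1 + ln(tf), for the query "search theory".

```python
import math
from collections import Counter

# Two toy documents (hypothetical): A repeats one query term, B covers both.
docs = {
    "A": "search " * 8,
    "B": "search theory search theory",
}
query = ["search", "theory"]


def raw_tf(count: int) -> float:
    """Term frequency as the plain occurrence count."""
    return float(count)


def log_tf(count: int) -> float:
    """Dampened term frequency: 1 + ln(count), or 0 if the term is absent."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def rank(weight) -> list:
    """Score each document as the sum of weighted query-term frequencies."""
    scores = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        scores[name] = sum(weight(counts[term]) for term in query)
    return sorted(scores, key=scores.get, reverse=True)


print(rank(raw_tf))  # A scores 8, B scores 4
print(rank(log_tf))  # A scores about 3.08, B about 3.39
```

With raw counts, document A wins on repetition alone; with the logarithmic variant, document B wins because it matches both query terms. The formulas differ by one small detail, yet the ranking flips.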
There is a big dependency on text in IR. The user interacts with a retrieval system through text, and current theoretical models in IR rely heavily on this medium. Text, and the tools that let the user interact with the search engine (mostly the keyboard and the mouse), have determined how retrieval systems are built and evaluated, as opposed to a situation where tools and systems are built based on user needs (which is also related to the next challenge). Voice is one medium that could be used more often in the future, especially since there is a trend of the user's information need turning into (simply) a user need, with the context of the search becoming more important for identifying the user intent. Emerging examples could be Apple’s Siri or Google’s Glass project.
Adaptability of users versus adaptability of the system. Should we build systems that adapt to the user, or do users have the ability to adapt to the systems we build? For example, if the empty string “” is still among the most common search queries in your search log, what does this say about your users? From the user’s perspective, the interface provides a colorful button that asks to be pressed for an action to occur. Given this, one interesting question is how the system should react to an empty string, i.e. what assumptions can be made about the user intent given the empty string?
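Checking whether the empty string really is among your most common queries takes only a few lines. The sketch below assumes the search log is available as a plain list of query strings; the sample log and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def empty_query_share(queries: List[str]) -> float:
    """Fraction of queries that are empty once surrounding whitespace is stripped."""
    if not queries:
        return 0.0
    empty = sum(1 for q in queries if q.strip() == "")
    return empty / len(queries)


def top_queries(queries: List[str], n: int = 3) -> List[Tuple[str, int]]:
    """Most common queries after trimming whitespace and lowercasing."""
    return Counter(q.strip().lower() for q in queries).most_common(n)


# Hypothetical sample of a search log.
log = ["", "vacation policy", "  ", "expense report", "", "Vacation Policy"]
print(empty_query_share(log))
print(top_queries(log, 1))
```

Treating whitespace-only input as empty is a design choice made here; a real log analysis would have to decide that explicitly.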
Specialised search. Some kinds of search have received only a limited amount of research, desktop search and enterprise search being examples, because most published research is based on experiments done for web search. Theoretical models for these specialised searches are even sparser. Stephen Robertson actually suggests that enterprise search “needs even more theory than web search”. Enterprise search also differs in how the performance of the search can be evaluated. In intranet search, a user can often pinpoint exactly which document is relevant to their query, which is often not the case when evaluating web search.
An interesting consequence, if we connect this back to the foundational retrieval models discussed repeatedly throughout the conference, is that users are most of the time unfamiliar with how the search engine determines the order of the search results. This can create frustration among users. So there should be a trade-off in the level of detail presented to users: inform them about the assumptions made by the search engine, but hide the small details.
This post has covered only some of the highlights of the conference; upcoming blog posts will cover some of the topics in more detail.