Web crawling is the last resort

Data source analysis is one of the crucial parts of an enterprise search deployment project. Search engine results quality strongly depends on an indexed data quality. In case of web-based sources, there are two basic ways of reaching the data: internal and external. Internal method involves reading the data directly from its storage place, such as a database, filesystem files or API. Documents are read by some criteria or all documents are read, depending on requirements. External technique relies on reading a rendered HTML with content via HTTP, the same way as it is read by human users. Reaching further documents (so called content discovery) is achieved by following hyperlinks present in the content or with a sitemap. This method is called a web crawling.

The crawling, in contrary to a direct source reading, does not require particular preparations. In a minimal variant, just a starting URL is required and that’s it. Content encoding is detected automatically, off the shelf components extract text from the HTML. The web crawling may appear as a quick and easy way to collect a content to be indexed. But after deeper analysis, it turns out to have multiple serious drawbacks.

Continue reading

Under the hood of the search engine

While using a search application we rarely think about what happens inside it. We just type a query, sometime refine details with facets or additional filters and pick one of the returned results. Ideally, the most desired result is on the top of the list. The secret of returning appropriate results and figuring out which fits a query better than others is hidden in the scoring, ranking and similarity functions enclosed in relevancy models. These concepts are crucial for the search application user’s satisfaction.

In this post we will review basic components of the popular TF/IDF model with simple examples. Additionally, we will learn how to ask Elasticsearch for explanation of scoring for a specific document and query.

Document ranking is one of the fundamental problems in information retrieval, a discipline acting as a mathematical foundation of search. The ranking, which is literally assigning a rank to a document matching search query corresponds with a term of relevance. Document relevance is a function which determines how well given document meets the search query. A concept of similarity corresponds, in turn, to the relevance idea, since relevance is a metric of similarity between a candidate result document and a search query. Continue reading

Gamification in Information Retrieval

My last article was mainly about Collaborative Information Seeking – one of the trends in enterprise search. Another interesting topic is the use of games’ mechanics in CIS systems. I met up with this idea during previously mentioned ESE 2014 conference, but interest is so high, that this year in Amsterdam a GamifIR (workshops on Gamification for Information Retrieval) took place. IR community have debated about what kind of benefits can IR tasks bring from games’ techniques. Workshops cover gamified task in context of searching, natural language processing, analyzing user behavior or collaborating. The last one was discussed in article titled “Enhancing Collaborative Search Systems Engagement Through Gamification” and has been mentioned by Martin White in his great presentation about search trends on last ESE summit.

Gamification is a concept which provides and uses game elements in non-game environment. Its goal is to improve customers or employees motivation for using some services. In the case of Information Retrieval it is e.g. encouraging people to find information in more efficient way. It is quite instinctive because competition is  an inherent part of human nature. Long time ago, business sectors have noticed that higher engagement, activating new users and establishing interaction between them, rewarding the effort of doing something lead to measurable results. Even if quality of data given by users could be higher. Among those elements can be included: leaderboards, levels, badges, achievements, time or resources limitation, challenges and many others. There are even described design patterns and models connected with gameplay, components, game practices and processes. Such rules are essential because virtual badge has no value until being assigned by user.

Collaborative Information Seeking is an idea suited for people cooperating on complex task which leads to find specific information. Systems like this support team work, coordinate actions and improve communication in many different ways and with usage of various mechanisms. At first glance it seems that gamification is perfect adopted to CIS projects. Seekers become more social, feeling of competence foster actions which in turn are rewarded.

The most important thing is to know why do we need gamified system and what kind of benefits we will get. Next step is to understand fundamental elements of a game and find out how adopt them to IR case. In their article “Enhancing Collaborative Search Systems Engagement Through Gamification”, researchers of Granada and Holguin universities have listed propositions how to gamify CIS system.  Based on their suggestions I think essential points are to prepare highly sociable environment for seekers. Every player (seeker) needs to have own personal profile which stores previous achievements and can be customized. Constant feedback on progress, list of successful members, time limitations, keeping the spirit of competition by all kinds of widgets are important for motivating and building a loyalty. Worth to remember that points collected after achieving goals need to be converted into virtual values which can distinguish the most active players. Crucial thing is to construct clear and fair principles, because often information seeking with such elements is a fun and it can’t be ruined.

Researchers from Finnish universities, who published article “Does Gamification Work?”, have broken down a problem of gamifying into components and have thoroughly studied them. Their conclusion was that concept of gamification can work, but there are some weaknesses – context which is going to be gamified and the quality of the users. Probably, the main problem is lack of knowledge which elements really provide benefits.

Gamification can be treated as a new way to deal with complex data structures. Limitations of data analyzing can be replaced by mechanism which increase activity of users in Information Retrieval process. Even more – such concept may leads to more higher quality data, because of increased people motivation. I believe, Collaborative Information Seeking, Gamification and similar ideas are one of the solutions how to improve search experience by helping people to become better searchers than not by just tuning up algorithms.

Enterprise Search Europe 2014 – Short Review

ESE Summit

At the end of April  a third edition of Enterprise Search Europe conference took place.  The venue was Park Plaza Victoria Hotel in London. Two-day event was dedicated to widely understood search solutions. There were two tracks covering subjects relating to search management, big data, open source technologies, SharePoint and as always –  the future of search. According to the organizer’ information, there were 30 experts presenting their knowledge and experience in implementation search systems and making content findable. It was  opportunity to get familiar with lots of case studies focused on relevancy, text mining, systems architecture and even matching business requirements. There were also speeches on softer skills, like making  decisions or finding good  employees.

In a word, ESE 2014 summit was great chance to meet highly skilled professionals with competence in business-driven search solutions. Representatives from both specialized consulting companies and universities were present there. Even second day started from compelling plenary session about the direction of enterprise search. Presentation contained two points of view: Jeff Fried, CTO in BA-Insight and Elaine Toms, Professor of Information Science, University of Sheffield. From industrial aspect analyzing user behavior,  applying world knowledge or improving information structure is a  real success. On the other hand, although IR systems are currently in mainstream, there are many problems: integration is still a challenge, systems working rules are unclear, organizations neglect investments in search specialists. As Elaine Toms explained, the role of scientists is to restrain an uncertainty by prototyping and forming future researchers. According to her, major search problems are primitive user interfaces and too few systems services. What is more, data and information often become of secondary importance, even though it’s a core of every search engine.


Despite of many interesting presentations, particularly one caught my attention. It was “Collaborative Search” by Martin White, Conference Chair and Managing Director in Intranet Focus. The subject was current condition of enterprise search and  requirements which such systems will have to face in the future. Martin White is convinced that limited users satisfaction is mainly fault of poor content quality and insufficient information management. Presentation covered  absorbing results of various researches. One of them, described in “Characterizing and Supporting Cross-Device Search Tasks” document, was analysis of commercial search engine logs in order to find behavior patterns associated with cross device searching. Switching between devices can be a hindrance because of device multiplicity. That is why each user needs to remember both what he was searching and what has already been found. Findings show that there are lots of opportunities to handle information seeking more effectively in multi-device world. Saving and re-instating user session, using time between switching devices to get more results or making use of behavioral, geospatial data to predict task resumption are just a few examples of ideas.

Despite everything, the most interesting part of Martin White’s presentation was dedicated to Collaborative Information Seeking (CIS).

Collaborative Information Seeking

It is natural that difficult and complex tasks forced people to work together. Collaboration in information retrieval helps to use systems more effectively. This idea concentrate on situations when people should cooperate to seek information or sense-make. In fact, CIS covers on the one hand elements connected with organizational behavior or making decision, on the other – evolution of user interface and designing systems of immediate data processing. Furthermore, Martin White considers CIS context to be focused around the complex queries, “second phase” queries, results evaluation or ranking algorithms. This concept is able to bring the highest values in the domains like chemistry, medicine and law.

During the CIS exploration some definitions appeared:  collaborative information retrieval, social searching, co-browsing, collaborative navigation, collaborative information behavior, collaborative information synthesis.  My intention is to introduce some of them.

"Collaborative Information Seeking", Chirag Shah

1. “Collaborative Information Seeking”, Chirag Shah

Collaborative Information Retrieval (CIR) extends traditional IR for the purposes of many users. It supports scenarios when problem is complicated and when seeking common information is a need. To support groups’ actions, it is crucial to know how they work, what are their strengths and weaknesses. In general, it might be said that such system could be an overlay on search engine re-ranking results, based on users community knowledge. In agreement with Chirag Shah, the author of “Collaborative Information Seeking” book, there are some examples of systems where workgroup’s queries and related results are captured and used to filtering more relevant information for particular user. One of the most absorbing case is SearchTogether – interface designed for collaborative web search, described by Meredith R. Morris and Eric Horvitz. It allows to work both synchronously and asynchronously. History of queries, page metadata and annotations serve as information carrier for user. There had been implemented an automatic and manual division of labor. One of its feature was recommending pages to another information seeker. All sessions and past findings were persisted and stored for future collaborative searching.

Despite of many efforts made in developing such systems, probably none of them has been widely adopted. Perhaps it was caused partly by its non-trivial nature, partly by lack of concept how to integrate them with other parts of collaboration in organizations.

Another ideas associated with CIS are Social Search and Collaborative Filtering. First one is about how social interactions could help in searching together. What is interesting,  despite of rather weak ties between people in social networks, their enhancement may be already observed in collaborative networks. Second definition referred to provide more relevant search results based on user past behavior, but also community of users displaying similar interests. It is noteworthy that it is an example of asynchronous interaction, because its value is based on past actions – in contrast with CIS where emphasis is laid to active users communication. Collaborative Filtering has been applied in many domains: industry, financial, insurance or web. At present the last one is most common and it’s used in e-commerce business. CF methods make a base for recommender systems predicting users preferences. It is so broad topic, that certainly deserves a separate article.

CIS Barriers

Regardless of all these researches, CIS is facing many challenges nowadays. One of them is information security in the company. How to struggle out of situation when team members do not have the same security profile or when some person cannot even share with others what has been found? Discussed systems cannot be only created for information seeking, but also they need to  provide managing security, support situations when results were not found because of permissions or situations when it is necessary to view a new document created in cooperation process. If it is not enough, there are various organization’s barriers hindering CIS idea. They are divided into categories – organizational, technical, individual, and team. They consist of things such as organization culture and structure, multiple and un-integrated systems, individual person perception or varied conflicts appeared during team work. Barriers and their implications have been described in detail in document “Barriers to Collaborative Information Seeking in Organizations” by Arvind Karunakaran and Madhu Reddy.

Collaborative information seeking is exciting field of research and one of the search trend. Another absorbing topic is gamification adopting in IR systems. This is going to be a subject of my next article.

Uncover hidden insights using Information Retrieval and Social Media

Arjen de Vries’ talk at ESSIR 2013 (Granada, Spain) highlighted the opportunities and difficulties in using Information Retrieval (IR) and social media to both make sense of unstructured data that a computer cannot easily make sense of by itself and realise deeper hidden information.

Today social media has become more important as interactions on sites have developed to include user-generated content of all types from ratings, comments and experiences to uploaded images, blogs and videos. Users now not only consume content and products but also co-create, interact with other users and help categorise content through means of tagging or hash-tags. All of which may result in ‘data consumption’ traces.

Some of these social media platforms like Twitter provide access to a variety of data such as user profiles, their connections with other users, their shared or published content and even how they react to each other’s content through comments and ratings. Analysis of this data can provide new insights.

One example is a case from CIW research. They calculated a top music artist popularity chart for each of the following different music sites: EchoNest, Last.fm and Spotify. Further research was then done for the band The Black Keys, where it was found that their popularity did not vary over time for either Last.fm or Spotify. However when using bit.ly data from tweets about the band for the very same period, it was found that interest in them rocketed after their Grammy win in the States, information that was not apparent from the previous research.

Using social media data to enrich information

The challenge that remains for IR research however is that these social media platforms vary in functionality. What they let users do will often determine the usefulness of the resultant data. For example, YouTube and Flickr will only let the up-loader tag their own content, while the film site IMDb allows tagging but the tags are not registered personally, rather they go into a pool. Arjen cited ‘Red Hot Chilli Peppers’ as a simple example of the usefulness of such social media data in allowing disambiguation either through implicit metadata from a user comment about eating chillies and/or information derived from the organisational data from say Flickr where a picture of red hot chilli peppers is grouped with other pictures of fruit and vegetables.

The key point here is that researchers often bemoan the fact that they do not always have access to log server files. Social media data left by users about content or objects can at times provide a richer representation in matching an information need and the response to that need. The potential benefits here for Information Retrieval are several:

  • The expanded content representation
  • The reduction in the ‘vocabulary gap(s)’ between content creator, indexers and information seekers
  • The increase in diversity of view on the same content
  • And the opportunity to make better assumptions about a user’s context and the variety of contexts that may exist

Allowing all users to tag all available content improves retrieval tasks

Where information about users on media sites is ‘open,’ (sometimes it is not, sites like Facebook and Linkedin are notoriously difficult to retrieve data from) there is the ability to discover which user labels what item with what word, and in some cases, even what rating they give. In essence many new sources play the role of anchor texts, be they tags, ratings, tweets, comments or reviews. The standard triangle of user, tag and item allows a unifying research approach based on random walks and can answer many questions. The talk emphasised that the area clearly has ample opportunity for researchers to make their mark.

One example of the research potential was shown in the case where LibraryThing.com was used to detect synonyms. The website allows users to keep a record of books they have read, tag, rate and comment on them. From these connections a synonym detector was created, allowing the example query word of ‘humour,’ to have a proposed list of synonyms created that included: ‘humor (US English), funny, humorous and British humour.

Analysis of this type of research has shown that allowing all users to tag all available content improves retrieval tasks, and that combining tags and ratings may both improve search and recommendation tasks even though there are some cases where lost relations between user, tag and item may occur.

The takeaway from this talk was that there is no one means or approach in retrieving information from Social Media due to the many complexities involved – not least because the limitations of user interactivity in some platforms but also due to limitations in the usage of data accessed. As Arjen demonstrated though, it is possible in certain cases to be innovative in the collection of rewarding data – an approach that may be utilized more and more, particularly as users/customers move closer to product and service providers through increased online interactions.

Solving Diversity in Information Retrieval

How to solve diversity in information retrieval and techniques for handling ambiguous queries was a topic of interest at the SIGIR 2013 conference in Dublin, Ireland, which I attended recently.

The issue of Diversity in Information Retrieval was covered at a number of presentations at the conference. It is search engine independent, since it uses only the set of result documents as input. When applied to the world of search it basically means an aim to produce a search result that covers as many of the relevant topics as possible.

This is done by retrieving, say 100-500 documents, instead of the normal 10.
These documents are then clustered based on their contents to create a number
of topic clusters. The search result is then constructed by selecting
(the normal 10) documents from the clusters in a round-robin fashion. This will
hopefully create a diverse search result, with as broad coverage as possible.

The technique can not only be used to solve the problem of ambiguous queries,
but also queries with several sub-topics associated with it. By iteratively
running a clustering algorithm on the result documents with 2 to 5 (or so)
clusters and measuring the separation between them and choosing the outcome
with the greatest separation, a diverse result set of documents can be created.
The clusters can also be used to ask follow up questions to the user, where
he/she is allowed to click on one of several tag clouds, containing the most
central terms of each cluster.

A cluster set of size 2 with a good separation would indicate that the query
may be ambiguous, with two different semantics meanings, while a size of 3-5
likely means that the there are a number of sub topics identified in the
results. In a way these clusters can be seen as a dynamic facet, but it is
still shallow since it only operates on the returned documents. Yet, it does
not require any additional knowledge about the documents other than the
information that is returned. This could also be extended by using topic
labelling to present the user with a single term or phrase, instead of a tag

Regarding the conference itself I found it to be a nice and professional arrangement with lots of in depth topics and nice evening activities, including a historical tour of Dublin.

Tutorial: Optimising Your Content for Findability

This tutorial was done on the 6th of November at J. Boye 2012 conference in Aarhus Denmark. Tutorial was done by Kristian Norling.

Findability and Your Content

As the amount of content continues to increase, new approaches are required to provide good user experiences. Findability has been introduced as a new term among content strategists and information architects and is most easily explained as:

“A state where all information is findable and an approach to reaching that state.”

Search technology is readily used to make information findable, but as many have realized technology alone is unfortunately not enough. To achieve findability additional activities across several important dimensions such as business, user, information and organisation are needed.

Search engine optimisation is one aspect of findability and many of the principles from SEO works in a intranet or website search context. This is sometimes called Enterprise Search Engine Optimisation (ESEO). Getting findability to work well for your website or intranet is a difficult task, that needs continuos work. It requires stamina, persistence, endurance, patience and of course time and money (resources).

Tutorial Topics

In this tutorial you will take a deep dive into the many aspects of findability, with some good practices on how to improve findability:

  • Enterprise Search Engines vs Web Search
  • Governance
  • Organisation
  • User involvement
  • Optimise content for findability
  • Metadata
  • Search Analytics

Brief Outline

We will start some very brief theory and then use real examples and also talk about what organisations that are most satisfied with their findability do.

Experience level

Participants should have some intranet/website experience. A basic understanding of HTML, with some previous work with content management will make your tutorial experience even better. A bonus if you have done some Search Engine Optimisation (SEO) for public websites.

Findability day in Stockholm – search trends and customer insights

Last Thursday about 50 of Findwise customers, friends and people from the industry gathered in Stockholm for a Findability day (#findday12). The purpose was simply to share experiences from choosing, implementing and developing search and findability solutions for all types of business and use cases.

Martin White, who has been in the intranet business since 1996, held the keynote speech about “Why business success depends on search”.
Among other things he spoke about why the work starts once search is implemented, how a search team should be staffed and what the top priority areas are for larger companies.
Martin has also published an article about Enterprise Search Team Management  that gives valuable insight in how to staff a search initiative. The latest research note from Martin White on Enterprise search trends and developments.

Henrik Sunnefeldt, SKF, and Joakim Hallin, SEB, were next on stage and shared their experiences from working with larger search implementations.
Henrik, who is program manager for search at SKF, showed several examples of how search can be applied within an enterprise (intranet, internet, apps, Search-as-a-Service etc) to deliver value to both employees and customers.
As for SEB, Joakim described how SEB has worked actively with search for the past two years. The most popular and successful implementation is a Global People Search. The presentation showed how SEB have changed their way of working; from country specific phone books to a single interface that also contains skills, biographies, tags and more.

During the day we also had the opportunity to listen to three expert presentations about Big data (by Daniel Ling and Magnus Ebbeson), Hydra – a content processing framework – video and presentation (by Joel Westberg) and Better Business, Protection & Revenue (by David Kemp from Autonomy).
As for Big data, there is also a good introduction here on the Findability blog.

Niklas Olsson and Patric Jansson from KTH came on stage at 15:30 and described how they have been running their swift-footed search project during the last year. There are some great learnings from working early with requirements and putting effort into the data quality.

Least, but not last, the day ended with Kristian Norling from Findwise who gave a presentation on the results from the Enterprise Search and Findability Survey. 170 respondents from all over the world filled out the survey during the spring 2012 that showed quite some interesting patterns.
Did you for example know that in many organisations search is owned either by IT (58%) or Communication (29%), that 45% have no specified budget for search and 48% of the participants have less than 1 dedicated person working with search?  Furtermore, 44,4% have a search strategy in place or are planning to have one in 2012/13.
The survey results are also discussed in one of the latest UX-postcasts from James Royal-Lawson and Per Axbom.

Thank you to all presenters and participants who contributed to making Findability day 2012 inspiring!

We are currently looking into arranging Findability days in Copenhagen in September, Oslo in October and Stockholm early next spring. If you have ideas (speakers you would like to hear, case studies that you would like insight in etc), please let us know.

A look at European Conference on Information Retrieval (ECIR) 2012

European Conference on Information Retrieval

The 34th European Conference on Information Retrieval was held  1-5 April 2011, in the lovely but crowded city of Barcelona, Spain. The core conference attracted over 100 attendees, with a total of 35 accepted full papers, 28 posters, and 7 demos being presented. As opposed to the previous year, which had 2 parallel sessions, this year’s conference included a single running session. The accepted papers covered a diverse range of topics, and were divided into query representation, blog and online-community search, semi-structured retrieval, applications, evaluation, retrieval models, classification, categorisation and clustering, image and video retrieval, and systems efficiency.

The best paper award went to Guido Zuccon, Leif Azzopardi, Dell Zhang and Jun Wang for their work entitled “Top-k Retrieval using Facility Location Analysis” and presented by Leif Azzopardi during the retrieval models session. The authors propose using facility location analysis taken from the discipline of operations research to address the top-k retrieval problem of finding “the optimal set of k documents from a number of relevant documents given the user’s query”.

Meanwhile, “Predicting IMDB Movie Ratings using Social Media” by Andrei Oghina, Mathias Breuss, Manos Tsagkias and Maarten de Rijke won the best poster award. With a different goal from the best paper, the authors of the poster experiment with a prediction model for rating movies using a set of qualitative and quantitative features extracted from the stream of two social media channels, YouTube and Twitter. Their findings show that the highest predictive performance is obtained by combining features from both channels, and propose as future work to include other social media channels.

Workshop Days

The conference was preceded by a full day of workshops and tutorials running in parallel. I attended two workshops: Information Retrieval Over Query Sessions (SIR) during the morning and Task-Based and Aggregated Search (TBAS) in the afternoon. The second workshop ended with an interactive discussion. A third, full-day workshop was Searching 4 Fun!.

Industry Day

The last day was the Industry Day. Only 2 papers here, plus 5 oral contributions, and around 50 attendees. A strong focus of the talks given at the industry day was on opinion-mining: four of the six participating companies/institutions presented work on sentiment analysis and opinion mining from social media streams. Jussi Karlgren, from Gavagai, argued that sentiment analysis from social media can be used by companies for example in finding reviews or comments made about their product or service, analyse their market position, and predict price movements. Rianne Kaptein, from Oxyme, backed this up by adding that businesses are interested by what the consumers say about their brand, products or campaigns on social media streams. Furthermore, Hugo Zaragoza from Websays identified two basic needs inside a company: a need for help in reading so that someone can act, and a need for help in explaining so that it can convince. Very interesting topic indeed, and research in this direction will advance as companies become more aware of the business gains from opinion mining of social media.

Overall, ECIR 2012 was a very inspiring conference. It also seemed a very friendly conference, offering many opportunities to network with the fellow attendees. Despite that, several participants said that the number of attendees at this year’s conference has decreased in comparison with previous years. The workshops and the core conference gave me the impression that it has a strong focus on young researchers, as many of the accepted contributions had a student as a first author and presenter at the conference. The fact that there was only one session running at a time was a good decision in my opinion, as the attendees were not forced to miss presentations. Nevertheless, the workshops and tutorials were running in parallel, and although the proceedings of the workshops will be made freely available, I still feel that I missed something that day. The industry day was very exciting, offering the opportunity to share ideas between academia and industry. However, there were not so many presentations, and the topics were not as diverse. I propose that next year Findwise will be among the speakers at the Industry track!

ECIR 2013 will be held in Moscow, Russia, between 24-28 March. See you there!