Uncover hidden insights using Information Retrieval and Social Media

Arjen de Vries’ talk at ESSIR 2013 (Granada, Spain) highlighted the opportunities and difficulties in using Information Retrieval (IR) and social media both to make sense of unstructured data that a computer cannot easily interpret by itself and to uncover deeper, hidden information.

Today social media has become more important as interactions on sites have developed to include user-generated content of all types, from ratings, comments and experiences to uploaded images, blogs and videos. Users now not only consume content and products but also co-create, interact with other users and help categorise content through tagging or hashtags, all of which may leave ‘data consumption’ traces.

Some of these social media platforms, such as Twitter, provide access to a variety of data: user profiles, users’ connections with each other, their shared or published content and even how they react to each other’s content through comments and ratings. Analysis of this data can provide new insights.

One example is a case from CWI research. The researchers calculated a chart of the most popular music artists for each of three music sites: EchoNest, Last.fm and Spotify. Further research then focused on the band The Black Keys, whose popularity was found not to vary over time on either Last.fm or Spotify. However, bit.ly data from tweets about the band for the very same period showed that interest in them rocketed after their Grammy win in the States, information that was not apparent from the previous research.

Using social media data to enrich information

The challenge that remains for IR research, however, is that these social media platforms vary in functionality. What they let users do will often determine the usefulness of the resultant data. For example, YouTube and Flickr only let the uploader tag their own content, while the film site IMDb allows tagging but does not register tags against individual users; they go into a shared pool instead. Arjen cited ‘Red Hot Chilli Peppers’ as a simple example of how such social media data can help with disambiguation. The clue may come from implicit metadata, such as a user comment about eating chillies, and/or from organisational data on, say, Flickr, where a picture of red hot chilli peppers is grouped with other pictures of fruit and vegetables.

The key point here is that researchers often bemoan the fact that they do not always have access to server log files. Social media data left by users about content or objects can be an alternative, and at times provides a richer representation for matching an information need with the response to that need. The potential benefits here for Information Retrieval are several:

  • The expanded content representation
  • The reduction in the ‘vocabulary gap(s)’ between content creator, indexers and information seekers
  • The increase in diversity of view on the same content
  • And the opportunity to make better assumptions about a user’s context and the variety of contexts that may exist

Allowing all users to tag all available content improves retrieval tasks

Where information about users on media sites is ‘open’ (sometimes it is not; sites like Facebook and LinkedIn are notoriously difficult to retrieve data from), it is possible to discover which user labels which item with which word and, in some cases, even what rating they give. In essence, many new sources play the role of anchor texts, be they tags, ratings, tweets, comments or reviews. The standard triangle of user, tag and item allows a unifying research approach based on random walks and can answer many questions. The talk emphasised that the area clearly has ample opportunity for researchers to make their mark.
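To make the random-walk idea a little more concrete, here is a minimal sketch of my own (not code from the talk). It assumes the data arrives as (user, tag, item) triples, links the three corners of each triple in an undirected graph, and scores every node by how often a walker that keeps restarting at one user would visit it. The function names, the restart probability of 0.15 and the toy triples are all illustrative assumptions.

```python
from collections import defaultdict


def build_graph(triples):
    """Undirected graph linking the user, tag and item of every triple."""
    graph = defaultdict(set)
    for user, tag, item in triples:
        u, t, i = ("user", user), ("tag", tag), ("item", item)
        for a, b in ((u, t), (t, i), (u, i)):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Approximate visiting probabilities of a walker that begins at `start`
    and jumps back to it with probability `restart` at every step."""
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            # Spread the non-restarting share evenly over the neighbours.
            share = (1.0 - restart) * mass / len(graph[node])
            for neighbour in graph[node]:
                nxt[neighbour] += share
        nxt[start] += restart
        scores = nxt
    return scores


# Hypothetical toy data: who tagged which book with which word.
triples = [
    ("ann", "humour", "book_a"),
    ("ann", "satire", "book_b"),
    ("bob", "humour", "book_b"),
    ("bob", "history", "book_c"),
    ("cat", "history", "book_d"),
]
scores = random_walk_with_restart(build_graph(triples), ("user", "ann"))
# Items ordered by how strongly the walk associates them with ann.
items_for_ann = sorted(
    (node for node in scores if node[0] == "item"),
    key=scores.get,
    reverse=True,
)
```

The same scores can be read off for tags or for other users, which is what makes a single walk over the triangle attractive for both retrieval and recommendation questions.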

One example of the research potential was shown in the case where LibraryThing.com was used to detect synonyms. The website allows users to keep a record of books they have read and to tag, rate and comment on them. From these connections a synonym detector was created. For the example query word ‘humour’, it proposed a list of synonyms that included ‘humor’ (US English), ‘funny’, ‘humorous’ and ‘British humour’.
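I do not know how the detector shown in the talk was actually built, but one simple way to get a feel for it is to treat two tags as likely synonyms when users apply them to the same books. The sketch below is my own assumption-laden illustration: it takes (tag, item) pairs, builds an item-count profile per tag and compares profiles with cosine similarity. The data and function names are made up.

```python
from collections import Counter, defaultdict
from math import sqrt


def tag_profiles(assignments):
    """Map each tag to a Counter of how often it was applied to each item."""
    profiles = defaultdict(Counter)
    for tag, item in assignments:
        profiles[tag][item] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two item-count profiles."""
    dot = sum(count * b[item] for item, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_tags(profiles, query, top_n=3):
    """Tags whose item profiles overlap most with the query tag's profile."""
    query_profile = profiles.get(query, Counter())
    scored = [
        (cosine(query_profile, profile), tag)
        for tag, profile in profiles.items()
        if tag != query
    ]
    return [tag for score, tag in sorted(scored, reverse=True)[:top_n] if score > 0]


# Hypothetical tag assignments on three books.
assignments = [
    ("humour", "book_1"), ("humour", "book_2"),
    ("humor", "book_1"), ("humor", "book_2"),
    ("funny", "book_2"),
    ("history", "book_3"),
]
candidates = similar_tags(tag_profiles(assignments), "humour")
```

Co-occurrence of this kind finds related tags as well as true synonyms, so a real detector would need further evidence to tell the two apart.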

Analysis of this type of research has shown that allowing all users to tag all available content improves retrieval tasks, and that combining tags and ratings may improve both search and recommendation tasks, even though in some cases relations between user, tag and item may be lost.

The takeaway from this talk was that there is no single means or approach for retrieving information from social media, owing to the many complexities involved: not least the limits on user interactivity in some platforms, but also the restrictions on how accessed data may be used. As Arjen demonstrated though, it is possible in certain cases to be innovative in the collection of rewarding data, an approach that may be utilised more and more, particularly as users/customers move closer to product and service providers through increased online interactions.

Overview of the 9th European Summer School in Information Retrieval

This was an excellent week of knowledge sharing and exchanging of ideas, seen mostly (but not exclusively) from an Information Retrieval (IR) research perspective, given the high percentage of PhD students in the audience of over 100. Here are brief summaries of my favourite talks from the great crop given throughout the week.

Highlights for me started with the IR ‘guru’ Bruce Croft (University of Massachusetts; twitter: @wbc11), who stormed through over 150 slides in the morning session, giving a speedy and concise history of IR, its foundations and formal models, and the current IR research focus. Three main issues were seen to dominate: relevance, evaluation, and users along with their information needs.

He emphasised that the IR research focus was still very much on document and query representations and the various/mixed retrieval models attempting to marry them in order to produce the most relevant results (with relevance here incorporating topical and people relevance, task context and novelty). A key shift within retrieval models was noted, however: away from the processing of text into units of language and towards the use of word-count distributions, giving retrieval models and algorithms more statistical and predictive properties. The current research ‘gap’ was seen as answering the long(er) query with a specific passage-level answer.

Michail Salampasis (from the Institute of Software Technology and Interactive Systems, Vienna University of Technology) gave a challenging talk on Integrating IR Technologies for Professional Search, in which he highlighted the increased difficulties of dealing with a specialised domain and a smaller user base. Enterprise Search (ES) was very much seen as comparable to IR in terms of relevance, evaluation, user information needs and user interactions. It was rightly noted, however, that within ES other factors also play a more important part, such as system performance, incorporating new data, scalability, information freshness and the presence of multiple information sources, along with the need for tuning for different applications.

Arjen de Vries (leader of the Interactive Information Retrieval research group at Centrum Wiskunde & Informatica, Amsterdam, Netherlands; twitter: @arjenpdevries) spoke on IR and Social Media. This was an inspirational talk telling researchers of the vast and varied social media data out there, ready to be culled as a direct result of users’ interactions. He talked of the ideal information triangle of linked data between people (their connections and profiles), items and tags/ratings (endorsement and sharing). The take-home messages were that social media can at times give IR research a rich resource of context that is an alternative to click data, although finding one theory to address the various recommendation and retrieval tasks is going to be problematic. In one example, a band’s popularity appeared static on both Spotify and EchoNest after it had won a Grammy, while analysis using bit.ly clearly showed a sharp increase in interest.

Tony Russell-Rose (UX Labs, London; twitter: @tonygrr) gave an entertaining talk, very much from the user perspective, on Designing the User Experience. He noted how the earliest classic IR models either lacked a user perspective completely or were too linear. He proposed four Dimensions of Search User Experience: the users themselves (their level of expertise), their goal (its scope and complexity), the context (again its complexity) and the search mode to be employed (depending on whether the user was looking something up, learning or investigating). His talk went on to urge the adoption of a principled approach to design using these dimensions, with particular reference to differing contexts requiring differing designs. Finally there was a call to apply proven design patterns and principles but to look at search holistically, towards both the analysis and the sense-making of information.

Norbert Fuhr (from the Faculty of Engineering Sciences at the University of Duisburg-Essen, Germany) gave a measured talk on Interactive IR. First he looked at quantitative modelling and the Probability Ranking Principle (PRP). The PRP ranks documents by decreasing probability of relevance (with user judgements represented as binary choices), which in turn yields optimum retrieval quality. Some obvious shortcomings of this ranking were shown: its focus on user assessments; its assumption that the relevance judgements of documents are independent; the fact that users’ search paths are often non-linear; and the fact that their information need may alter during a search session.
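As a small aside of my own (not material from the talk), the core of the PRP fits in a few lines. The sketch below assumes we already have an estimated probability of relevance for each document, which is the hard part in practice, and checks by brute force that sorting by that probability gives the highest expected number of relevant documents in the top two. The probabilities and function names are invented for illustration.

```python
from itertools import permutations


def rank_by_prp(prob_relevant):
    """Order document ids by decreasing estimated probability of relevance."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)


def expected_relevant_at_k(prob_relevant, ranking, k):
    """Expected number of relevant documents among the first k results
    (the sum of their probabilities, by linearity of expectation)."""
    return sum(prob_relevant[doc] for doc in ranking[:k])


# Made-up relevance estimates for four documents.
probs = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}
prp_ranking = rank_by_prp(probs)

# Brute-force check: no other ordering does better in the top two.
best_alternative = max(
    expected_relevant_at_k(probs, ordering, 2) for ordering in permutations(probs)
)
prp_is_optimal = expected_relevant_at_k(probs, prp_ranking, 2) >= best_alternative - 1e-12
```

Note that the sketch scores each document on its own and fixes the ranking up front, which is exactly where the shortcomings listed above come from.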

The second, more impactful part of the talk dealt with various cognitive models of information seeking and searching, and showed how a better understanding of the user has influenced interface designs to go beyond the more traditional query-result list paradigm.

Paul Clough (University of Sheffield) gave an authoritative talk on Multilingual Retrieval, with much of the information coming from his new book on the subject. His talk covered the many reasons that multilingual search, in its various forms, is becoming more important, and highlighted the more common stumbling blocks. He mentioned the many models employed today, including the more sophisticated language models. He stated that lab test results can often reach up to 99% of monolingual IR performance, but that it is often the cost, and the uncertainty of results due to nuances in language, that have held back the wider adoption of multilingual IR systems. Again, the users themselves and their needs are now taken into account more with regard to system functionality.

There were of course other talks of note but some of the above topics will be covered in more depth in upcoming blog posts.

/Peter Voisey