Arjen de Vries’ talk at ESSIR 2013 (Granada, Spain) highlighted the opportunities and difficulties of using Information Retrieval (IR) and social media together, both to make sense of unstructured data that a computer cannot easily interpret by itself and to uncover deeper, hidden information.
Today social media has become more important as interactions on sites have developed to include user-generated content of all types, from ratings, comments and experiences to uploaded images, blogs and videos. Users now not only consume content and products but also co-create, interact with other users and help categorise content by tagging or using hashtags, all of which may leave ‘data consumption’ traces.
Some of these social media platforms, such as Twitter, provide access to a variety of data: user profiles, users’ connections with one another, their shared or published content and even how they react to each other’s content through comments and ratings. Analysis of this data can provide new insights.
One example is a case from CIW research. The researchers calculated a chart of the most popular music artists for each of three music sites: EchoNest, Last.fm and Spotify. Further research on the band The Black Keys found that their popularity did not vary over time on either Last.fm or Spotify. However, when bit.ly data from tweets about the band was used for the very same period, it showed that interest in them rocketed after their Grammy win in the States, information that was not apparent from the previous research.
Using social media data to enrich information
The challenge that remains for IR research, however, is that these social media platforms vary in functionality. What they let users do will often determine the usefulness of the resulting data. For example, YouTube and Flickr only let the uploader tag their own content, while the film site IMDb allows tagging but does not record tags against individual users; rather, they go into a pool. Arjen cited ‘Red Hot Chilli Peppers’ as a simple example of how such social media data helps with disambiguation. The intended meaning can be inferred either from implicit metadata, such as a user comment about eating chillies, or from organisational data, say on Flickr, where a picture of red hot chilli peppers is grouped with other pictures of fruit and vegetables.
The key point here is that researchers often bemoan the fact that they do not always have access to server log files. Social media data left by users about content or objects can at times provide a richer representation for matching an information need with the response to that need. The potential benefits here for Information Retrieval are several:
- The expanded content representation
- The reduction in the ‘vocabulary gap(s)’ between content creators, indexers and information seekers
- The increase in diversity of view on the same content
- The opportunity to make better assumptions about a user’s context and the variety of contexts that may exist
Allowing all users to tag all available content improves retrieval tasks
Where information about users on media sites is ‘open’ (sometimes it is not; sites like Facebook and LinkedIn are notoriously difficult to retrieve data from), it is possible to discover which user labels which item with which word and, in some cases, even what rating they give. In essence, many new sources play the role of anchor texts, be they tags, ratings, tweets, comments or reviews. The standard triangle of user, tag and item allows a unifying research approach based on random walks and can answer many questions. The talk emphasised that the area clearly has ample opportunity for researchers to make their mark.
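To make the user, tag and item triangle more concrete, here is a minimal sketch of a random walk with restart over a small tagging graph. It is an illustration only, not the method presented in the talk: the sample triples, the function names, the restart probability of 0.15 and the choice to link all three nodes of every tagging event are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical tagging events: (user, tag, item). Invented for illustration.
SAMPLE_TRIPLES = [
    ("ann", "humour", "book_a"),
    ("bob", "humour", "book_b"),
    ("bob", "funny", "book_a"),
    ("cat", "history", "book_c"),
]


def build_graph(triples):
    """Link the user, tag and item nodes that co-occur in a tagging event."""
    graph = defaultdict(set)
    for user, tag, item in triples:
        nodes = [("user", user), ("tag", tag), ("item", item)]
        for a in nodes:
            for b in nodes:
                if a != b:
                    graph[a].add(b)
    return graph


def random_walk_with_restart(graph, start, restart=0.15, steps=50):
    """Approximate how often a walker visits each node.

    At every step the walker follows a uniformly chosen edge with
    probability 1 - restart, or jumps back to `start` otherwise.
    The probabilities are propagated deterministically instead of
    being sampled, so the result is reproducible.
    """
    if start not in graph:
        raise KeyError(f"unknown start node: {start!r}")
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            share = (1.0 - restart) * mass / len(graph[node])
            for neighbour in graph[node]:
                nxt[neighbour] += share
        nxt[start] += restart
        scores = nxt
    return scores


def rank_items(triples, tag):
    """Rank the items reachable from `tag`, highest visit probability first."""
    scores = random_walk_with_restart(build_graph(triples), ("tag", tag))
    items = [(name, score) for (kind, name), score in scores.items()
             if kind == "item" and score > 0.0]
    return [name for name, _ in sorted(items, key=lambda pair: -pair[1])]


ranking = rank_items(SAMPLE_TRIPLES, "funny")
print(ranking)
```

In this toy data, book_a is tagged ‘funny’ directly, book_b is reachable only through the user who applied that tag, and book_c sits in a disconnected part of the graph and is therefore never reached. Starting the walk from a user or an item node instead of a tag would answer a different question with the same code, which is one way to read the ‘unifying’ appeal of the approach.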
One example of the research potential was the use of LibraryThing.com to detect synonyms. The website allows users to keep a record of books they have read and to tag, rate and comment on them. From these connections a synonym detector was created. For the example query word ‘humour’, it proposed a list of synonyms that included ‘humor’ (US English), ‘funny’, ‘humorous’ and ‘British humour’.
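The summary does not say how the LibraryThing synonym detector was built. As one plausible illustration of the idea, the sketch below treats each tag as a vector of the items it was applied to and ranks other tags by cosine similarity. The data, function names and the similarity measure are all assumptions for this example, not details from the talk.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagging events: (user, tag, item). Invented for illustration.
LIBRARY_TRIPLES = [
    ("u1", "humour", "book_a"),
    ("u2", "humor", "book_a"),
    ("u1", "humour", "book_b"),
    ("u3", "humor", "book_b"),
    ("u3", "funny", "book_b"),
    ("u4", "history", "book_c"),
]


def tag_profiles(triples):
    """Count, for every tag, how often it was applied to each item."""
    profiles = defaultdict(Counter)
    for _user, tag, item in triples:
        profiles[tag][item] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def synonym_candidates(triples, query, top_n=3):
    """Return tags whose item profiles overlap with the query tag's profile."""
    profiles = tag_profiles(triples)
    if query not in profiles:
        return []
    scored = [(other, cosine(profiles[query], vector))
              for other, vector in profiles.items() if other != query]
    scored = [(tag, score) for tag, score in scored if score > 0.0]
    # Highest similarity first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


candidates = synonym_candidates(LIBRARY_TRIPLES, "humour")
print([tag for tag, _ in candidates])
```

With this toy data ‘humor’ ranks first because it was applied to exactly the same books as ‘humour’, and ‘funny’ follows because it shares only one of them. Co-occurrence of this kind surfaces related tags as well as true synonyms, which is consistent with a proposed list that mixes a spelling variant with looser matches such as ‘British humour’.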
Research of this type has shown that allowing all users to tag all available content improves retrieval tasks, and that combining tags and ratings may improve both search and recommendation tasks, even though in some cases relations between user, tag and item may be lost.
The takeaway from this talk was that there is no single means or approach for retrieving information from social media, owing to the many complexities involved: not least the limits on user interactivity in some platforms, but also limits on how the data accessed may be used. As Arjen demonstrated, though, it is possible in certain cases to be innovative in collecting rewarding data, an approach that may be used more and more, particularly as users and customers move closer to product and service providers through increased online interactions.