The Findability blog

the enterprise search and findability blog by Tietoevry Findwise


Tag Archives: Metadata


Five things to consider when migrating to SharePoint 2013

Posted on May 2, 2014

Search in SharePoint 2013 – Part 2: Five things to consider when migrating to SharePoint

Planning to migrate to SharePoint 2013? In this post, we provide a few ideas on what to take into consideration when you are planning a migration to SharePoint 2013. Note that this is by no means a comprehensive list of what to consider when migrating, but can surely help you get started! If you want more information than is found in this blogpost, please visit our website or contact us.

This post is the second in a series of four articles providing several best practices on how to implement and customise search in SharePoint. In the first post, we provided a brief overview of the differences in terms of search between the on-premise and cloud versions. You’ll find the first article here.

#1 Understand the benefits of migrating to SharePoint 2013

Here are some of the benefits of migrating content to SharePoint:

Metadata: The range of metadata options in SharePoint is much broader than the system fields applied to files on a File Share, and you can adapt the metadata to fit your organisation.

Control: The workflows that are available in SharePoint can ensure that content follows a specific life-cycle. You can also make sure that metadata is applied in a controlled manner (having mandatory metadata fields for example). It is also easier to ensure correct spelling of titles and keywords if you use a controlled vocabulary in SharePoint’s Term Store.

Collaboration: When the content is migrated to SharePoint, every user can benefit from the collaborative functions built into SharePoint, such as commenting, hashtags, wiki libraries, and discussion boards. On a File Share the content is only stored in files, which can make it difficult for a new employee to work out the logical connection between the files in a folder.

Search: The search functionality in Explorer is very limited and cannot deliver the user experience that search in SharePoint can. Make sure that users actually see and use these benefits of search in SharePoint, and they will never go back to the primitive search in Explorer!

#2 Start with your content

You probably agree on the importance of content quality when it comes to findability. If you don't take care of your content, you cannot expect good search results. A lot of old content and duplicates will make it difficult for users to find the relevant results. Taking care of content quality, however, creates a solid foundation for search (see also this previous post or this one).

While you may agree with this now, as you approach the migration deadline and are pressed for time, you will likely become increasingly willing to make compromises. Having an information architecture and a migration strategy is very important in these situations, to make sure you don't make too many exceptions along the way and end up with a solution far from your initial expectations. Planning well in advance what content should be migrated and what should be archived is crucial. Think of the migration as an opportunity to clean up.

A migration project can also be a good opportunity to look at the organisation's complete information management strategy and to look into where content is stored in the organisation. As many of you might have experienced, users can be very keen on keeping their material on File Shares instead of sharing and publishing it on SharePoint. One decision could be to only keep personal content on File Shares and get rid of the department or project folders. Either way, it is important to account for this in the plan if you decide to include content from other sources than the existing SharePoint platform. Moreover, always make sure the decision is communicated to and followed by all employees.

You should always include the following questions in your content analysis and take the necessary measures to achieve better content quality:

  • Is the content still relevant, or should it be archived or removed?
  • Do you have a workflow for archiving old documents and periodically updating information?
  • Do you need to assign an owner to the content? Note that editor and owner are two different roles that should not be mixed up when it comes to ensuring content quality.
  • Is the content well organized? Have you been using (customized) content types, or are you planning to?
  • Is the content tagged with relevant keywords?
  • Are you using templates for documents? If so, are they well organized (headings, metadata, etc.)?
  • Do you have control over which permissions are applied to documents?
  • Is the content targeted appropriately? Should it reside in a document library or as a blog post?
  • Are the taxonomies up to date?
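Several of the questions above can be answered mechanically before the migration starts. As a minimal sketch (assuming the inventory has already been exported to a list of path/modified/content records; the field names and the three-year cut-off are illustrative, not part of any SharePoint API), a script can flag stale files and exact duplicates:

```python
import hashlib
from datetime import datetime, timedelta

def audit(files, max_age_days=3 * 365, now=None):
    """Flag stale files and exact duplicates in a content inventory.

    `files` is a list of dicts with 'path', 'modified' (datetime) and
    'content' (bytes); in a real audit these would be read from the
    File Share or the SharePoint libraries.
    """
    now = now or datetime.now()
    stale, seen, duplicates = [], {}, []
    for f in files:
        if now - f["modified"] > timedelta(days=max_age_days):
            stale.append(f["path"])
        # Identical bytes hash to the same digest, revealing duplicates.
        digest = hashlib.sha256(f["content"]).hexdigest()
        if digest in seen:
            duplicates.append((f["path"], seen[digest]))
        else:
            seen[digest] = f["path"]
    return stale, duplicates
```

A report like this gives content owners a concrete worklist instead of a vague instruction to "clean up before the move".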

#3 Understand the technology

The technical solution is often the starting point of a migration project. Understanding the technology, which often means understanding the possibilities within the new version, is crucial to the success of the project, but it should of course not be the only focus point.

Here are some aspects to include when performing an analysis of the technology that you are planning to use:

  • Determine the hardware and software requirements for your installation
  • Identify the customisations made on your existing search solution and create a plan on whether and how to move them to the new solution. Branding for example might not look so good after upgrading from SharePoint 2010 to 2013, since there was a big change in how the web parts are displayed and customised (more specifically, the XSLT has been replaced by HTML and JavaScript), so the design and branding should be thoroughly tested.
  • The way search scopes are defined has also changed between versions, so if you use them, be sure either to replicate them correctly or to make the transition as easy for the users as possible.
  • Check the Microsoft Office list of discontinued or modified features
  • Map your requirements to the technical capabilities
  • Create a strategy for moving existing content. Will you be using a migration tool or will representatives from the users be responsible for moving the content? How many steps will you plan for and what content will you start and end with?
  • When it comes to mapping your requirements to the technical features, it can be tricky to find the features that apply specifically to your SharePoint license, since there are more than 10 SharePoint licensing plans to choose from. In the previous post we summarised the search features that differ amongst the main licenses, and discussed in particular those that might be missing from SharePoint Online.

Don’t forget to document any customisations that you plan to do. If you don’t plan to do the customisations yourself, ask the provider for appropriate documentation. Such documentation will make it easier for you and providers such as Findwise in future implementations.

#4 Where is home?

One decision that has to be made at some point is how the users will access the new solution and what they will see. While this can be described as a technical requirement, it is strongly correlated with expected user behaviour and depends on the usage scenarios that the new solution is supposed to support. Remember that SharePoint is also a collaboration tool; through the choice of start page, you are also communicating to the users how to work together.

Some alternatives for the starting page in an intranet solution based on SharePoint 2013 would be the following:

  • The Newsfeed. This is where the latest news relevant to the current user is shown. It is the place where users can view posts and updates from followed people, sites and documents across the entire solution. SharePoint 2013 comes with many new social features, and the Newsfeed is a good start page choice if you plan to use them. A link to the user's Newsfeed is available by default in the global navigation bar, which makes the start page easily accessible.

  • The search page. Use the search center as the start page to keep users only one click away from the information they are looking for. A good start page choice if you have done your job in making the content findable. In addition to the search box, consider adding search-driven web parts delivering content relevant to the logged-in user: show the latest sales presentations if the user works in the Sales department, and financial spreadsheets if the user works in Finance.
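The department-targeted web part described above boils down to filtering and sorting results by a user profile property. A toy sketch (the `department` and `modified` fields are hypothetical stand-ins for managed properties; in SharePoint 2013 this would be configured with a result source or query rule, not application code):

```python
def targeted_results(documents, user):
    """Return the documents matching the user's department, newest first.

    A simplified model of a search-driven web part: filter on a profile
    property, then rank by freshness.
    """
    hits = [d for d in documents if d["department"] == user["department"]]
    return sorted(hits, key=lambda d: d["modified"], reverse=True)
```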

  • A specific site. Set as the starting point the root site of your intranet for example. Plan to always keep the information on that site’s homepage updated. You can give the users the latest news through a newsfeed web part, but also include a few useful links to other resources (such as external systems). Search-driven web parts can contribute to having the page updated with the latest content and potentially target the content depending on the logged-in user.

#5 Create a communication plan and get your users aboard!

After all the hard work on understanding the content, the technology and the requirements, there is one more step that should not be neglected when planning the migration: user adoption. A big factor in user adoption is how the information is communicated to the team and the users.

Here are a few things to take into consideration when creating a communication plan for the migration:

  • Define who is part of the migration team, what roles they will play, and when they will become active in the migration rollout (for example, site owners need to create the structure of their site and should be notified when the sites are ready to use)
  • Provide training for site owners
  • Plan for a test group (gather feedback early in the process and prioritize ideas)
  • Potentially find a new name for the solution, with a powerful message
  • Communicate clearly the new release, and make it fun (release cake, introductory videos)
  • Communicate how users should make use of the new social features (including tagging and Newsfeed)
  • Communicate how users and site owners can ask for support
  • Share the governance plan with all users (everyone is expected to be a contributor)

After all your careful planning of the migration, it’s time for implementation! In the next post, we will give you a few tips on how to customise search in SharePoint Online.

Do you want more information, have further questions or need help? Stop by our website or contact us!

Posted in Enterprise Search, Information Architecture, Microsoft, Sharepoint | Tagged Collaboration, Communication, Content analysis, Metadata, Microsoft, Microsoft Office, Microsoft SharePoint, Migration, Requirements, SharePoint, SharePoint 2013, SharePoint Online, SharePoint’s Term Store, Social, Tips

Enterprise search case study: Vårdaktörsportalen makes reliable information easy to find for health professionals

Posted on December 11, 2012

Vårdaktörsportalen (VAP) is a portal for health care providers created by Västra Götalandsregionen (VGR). The portal makes information from a number of reliable, authorised sources findable and accessible for the people who need it in their daily work, from doctors and nurses to medical secretaries, primarily in the region of Västra Götaland, Sweden. The site and most of its features and information are also accessible (in Swedish) to anyone through http://vap.vgregion.se. The first version of the site went live in November 2012, and Findwise played a big role in creating this search-centric site. The main source of information for VAP is the regional guidelines found in a document repository within VGR, but some external sources are also included. These include trustworthy authorities like

  • Socialstyrelsen
  • Läkemedelsverket
  • SBU
  • TLV
  • 1177 – the public health care information site for citizens

This search solution is built around the open source search engine Apache Solr and our common tools for processing and indexing. For this site we have also implemented a rather unique metadata enhancement service that automatically extracts keywords from each document to be indexed and attaches them as metadata. The keyword extraction is based on information from the medical term database SweMeSH. More information (in Swedish) can be found on Google Code. We also include synonyms for keywords to increase recall, making it easier to find documents regardless of which synonym is used.
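Conceptually, the keyword enhancement step amounts to matching document text against a term database and attaching both the matched terms and their synonyms. A simplified sketch, not the actual Findwise pipeline, using naive substring matching (a real indexing pipeline would tokenize, stem and disambiguate terms first):

```python
def extract_keywords(text, term_db):
    """Attach keywords (and their synonyms) found in a document's text.

    `term_db` maps a preferred term to its synonyms, standing in for a
    medical term database such as SweMeSH. If either the term or one of
    its synonyms appears in the text, the whole synonym group is added,
    so a search on any variant will find the document.
    """
    found = set()
    lowered = text.lower()
    for term, synonyms in term_db.items():
        if term.lower() in lowered or any(s.lower() in lowered for s in synonyms):
            found.add(term)
            found.update(synonyms)
    return sorted(found)
```

The extracted list would then be written into a keywords field in the Solr index alongside the document text.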

The metadata enhancement service was included because the quality of metadata on the external sites was not very good. VGR will work with the above-mentioned authorities to help them understand that they would benefit from improving their data. The source 1177 stands out, with very good metadata and overall good-quality texts.

We have conducted a user study to see how well this first version satisfies user demands. The study shows that some local sources are missing, but the general feedback on the idea and the graphical design was positive.

Findwise looks forward to continuing to work on VAP with VGR to make it an even better tool.

Related links:

  • http://vap.vgregion.se
  • http://webbfunktion.com/grafisk-form-pa-vardaktorsportalen/
  • http://vardaktorsportalen.se/
Posted in Content refinement, Data Processing, Enterprise Search, Findability, Information quality, Knowledge management, Open source, Solr | Tagged Apache Solr, Data management, Information retrieval, Metadata, open-source search engine, search centric site, search engine, search solution, Sweden, Västra Götaland, Web portal

People, Topics and Information Flow Key for Findability

Posted on August 20, 2012

Understanding and utilizing the context of both people and topic (subject) is the future of enterprise search and findability. As we have seen over the last few years, the amount of information created within organisations and elsewhere is growing exponentially. This makes it harder, day by day, to find the information that is relevant at any given moment. By organizing information by topic, using text analytics, better metadata, user tagging, sentiment analysis and so on, findability can be improved. A few examples are mentioned in this blog post series on information flow from 2010. The whole point of findability boils down to improving information flow and access at any given time. [Image: example of information flow from the intranet of Region Västra Götaland.]

To make sense of arbitrary information, we humans usually need the help of someone familiar with the topic. By addressing both the challenge of finding people with the right knowledge and that of finding the right information, we can use context to make information more relevant and easier to find.

For example, by doing search analytics and looking at usage patterns in general, or at how people with the same usage (search) patterns go about finding information, we can give better suggestions. Recommendations of information produced or liked by people who are like you also have a better chance of being relevant to you. By using Social Network Analysis, we should be able to find patterns in what information is in demand and how the information flows. The analysis can of course also be used to find the supernodes, the people through which information and connections flow. Email, for example, is an under-utilized source for information flow, knowledge, context and social network analysis.

On the 28th of August, at the World Café in Oslo, Kristian Norling will talk about findability and collaboration, with a focus on people and topic centric solutions. Examples from Region Västra Götaland and other projects will be made.

Posted in Business, Findability, Findwise, Future development, Presentation, Search, Uncategorized | Tagged Enterprise Search, findability, Information, Knowledge, Knowledge representation, Kristian Norling, Metadata, Oslo, topic centric solutions, World Café

Data and Search Going Big?

Posted on April 25, 2012

A few enterprise search specialists from Findwise recently attended the Scandinavian Developer Conference 2012. One of the tracks was Big Data, which is very much related to search. It had some interesting talks about how to handle large amounts of data in an efficient way. Special thanks to Theo Hultberg, Jim Webber and Tim Berglund!

The theme was that you should choose a storage system which is well suited for the task. This may seem like an obvious point, but for a long time this was simply ignored; I’m talking about the era of relational databases. Don’t get me wrong, sometimes a relational database is the very best for the job, but in many cases it isn’t.

Data is jagged by nature, i.e. not all objects have the same properties. This is why we shouldn't force them into a square table; instead, everything should be denormalized. The application accessing the data is aware of the information structure and handles it accordingly. This also avoids expensive assembly operations (such as joins) to get the data into the desired format on retrieval. Why split up your data if you are going to reassemble it over and over again? Also remember that disk space is cheap, so pre-compute as much as possible. The design of a Big Data system should be governed by how the data will be retrieved.
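The join-versus-denormalization trade-off can be shown with a small sketch (the order/customer data is invented for illustration):

```python
# Normalized: retrieving an order needs a join across two tables.
customers = {1: {"name": "Acme AB", "city": "Göteborg"}}
orders = [{"id": 17, "customer_id": 1, "total": 450}]

def order_view_with_join(order_id):
    # Assemble the view the application wants, at read time.
    order = next(o for o in orders if o["id"] == order_id)
    customer = customers[order["customer_id"]]
    return {"id": order["id"], "total": order["total"], "customer": customer}

# Denormalized: the order document already carries everything the
# application needs; retrieval is a single lookup, at the cost of
# repeating customer data in every order document.
orders_denormalized = {
    17: {"id": 17, "total": 450,
         "customer": {"name": "Acme AB", "city": "Göteborg"}}
}

assert order_view_with_join(17) == orders_denormalized[17]
```

Both paths yield the same view; the denormalized store simply pre-computes it at write time, which is exactly the "design for retrieval" advice above.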

Another step away from the relational databases is the relaxation of some of the ACID properties: Atomicity, Consistency, Isolation and Durability. Again, this is along the lines of choosing the components best suited for the system. Decide which properties are a must have and which are not so important.

Relaxing the ACID properties, such as consistency, can give great performance gains. The NoSQL database Cassandra is eventually consistent and its write performance scales linearly up to 288 nodes (and probably even higher) which gives a write performance of over 1 million writes per second!

However, relaxing these properties is not a new concept in the world of search engines. When a document is indexed, it will usually take a number of seconds before it becomes searchable. This is called eventual consistency, i.e. the state of the search engine is brought from one valid state to another within a sufficiently long period of time. Do we really need documents that were just submitted to the search engine to be searchable instantly? Most likely not. Isolation is another property that is not crucial to a search engine. Since a document in an index doesn't have any explicit relations to other documents in the same index, there isn't a great need for isolation. If two writes for the same document are submitted at the same time, there is probably something wrong in another part of the system.

So what does all this mean for search? There is an interesting challenge in storing large amounts of jagged data and then making good use of it. To search in vast amounts of jagged data, you need a lot of query-time field mappings (to make relevant data searchable) ... or do you? There is also the issue of retaining a good relevancy model, which is absolutely vital to a search engine. How do you measure the relevance of arbitrary metadata and then weigh it all together? Maybe we need to think about relevance in entirely new ways?

Whoever can solve these problems well, with a minimum of manual labor, is a name we'll be hearing a lot in the future.

Posted in Big Data, Conference, Search, Search Watch, Technology | Tagged ACID, Atomicity, big data, conference, data, Data management, Database, Database management systems, Database theory, Databases, Durability, Enterprise Search, enterprise search specialists, Isolation, Jim Webber, Linearizability, Metadata, relational database, scandev, search, search engine, search engines, Theo Hultberg, Tim Berglund, Transaction processing

Automated Testing of Enterprise Search Solutions

Posted on March 8, 2012

Quality assuring an enterprise search solution is challenging, yet important. The challenge is to be able to do continuous follow-up of the quality of the solution during implementation but also after release, when the solution is in production and operated by an operations team. Testing is important, but it is also costly – unless it can be automated.

So what kind of testing is specific for a search application? And what of that can be automated?

The whole idea of Enterprise Search is to provide the right information to the right people at the right time. The information made findable is normally stored in many different information systems and the information in these systems is constantly changing. In the end, every enterprise search solution operates in a context where the requirements of the end-users and the available content changes on a daily basis. In other words, assuring the quality of enterprise search is about assuring the quality of the information and the way that information is accessed by and delivered to the end-users.

During our engagements over the years, we have set routines and developed tools for automated testing of enterprise search. What we specifically want to track in an automated fashion is:

  • Completeness
  • Freshness
  • Access restrictions
  • Metadata quality
  • Performance
  • Relevance

Allow me to take a few moments and describe what this means.

Completeness testing

Completeness tests aim to make sure that the search index is complete – that all information objects (such as web pages and documents) that are supposed to be searchable are really searchable. In addition, completeness testing provides proof that the correct parts of the information objects are indexed for retrieval, e.g. all pages in a multi-page document, as well as titles and other searchable metadata. It is also important to monitor that information that should not be searchable is indeed not indexed, e.g. headers and footers of web pages.
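In practice, a completeness test can be reduced to a set comparison between the IDs the source systems report and the IDs an id-only query returns from the index. A hypothetical helper along those lines:

```python
def completeness_report(source_ids, indexed_ids, excluded_ids=()):
    """Compare source-system IDs with the IDs present in the search index.

    `source_ids` would come from the source system's API, `indexed_ids`
    from an id-only query against the engine, and `excluded_ids` lists
    objects deliberately kept out of the index (headers, footers, drafts).
    """
    should_be = set(source_ids) - set(excluded_ids)
    indexed = set(indexed_ids)
    return {
        "missing": sorted(should_be - indexed),      # should be searchable, isn't
        "unexpected": sorted(indexed - should_be),   # searchable, shouldn't be
    }
```

Run on a schedule, a non-empty report in either direction is an alert for the operations team.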

Freshness testing

Freshness tests aim to make sure that the search index is up to date, i.e. new content that has been added to a source (such as a document management system) becomes searchable, deleted content is removed automatically from the search index and updated content is updated in the search index – all in due time.
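A freshness test typically adds or updates a probe document in the source and then polls the engine until the change becomes visible, recording the lag. A sketch, with the actual engine query injected as a function so the timing logic stays generic:

```python
import time

def measure_indexing_lag(is_searchable, doc_id, timeout_s=300, poll_s=5):
    """Poll until `doc_id` becomes searchable; return the lag in seconds,
    or None if the timeout is reached first.

    `is_searchable(doc_id)` would wrap a real query against the engine
    (e.g. an id lookup); it is injected here so the timing logic can be
    tested without a live engine.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if is_searchable(doc_id):
            return time.monotonic() - start
        time.sleep(poll_s)
    return None
```

The same loop, run against a deleted probe document with the condition inverted, measures how quickly removals propagate.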

Testing access restrictions

If an enterprise search solution provides access to access-controlled information, it is of the utmost importance to be able to prove that security is never compromised. Testing access restrictions aims to do precisely that. What one needs to monitor is that document-level security works, i.e. that people who should have access to an information object really have access, and that people who shouldn't, don't. The tricky part is monitoring that a change in access privileges in, for instance, Active Directory, or in the access restrictions (the ACL) for a particular document, is reflected in the search index in due time.
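Such tests can be expressed as a table of (user, query, document, expected visibility) cases run against the engine's security-trimmed search. A sketch, with the security-trimmed search injected as a function (this is the test harness shape, not any particular vendor's API):

```python
def check_access(search_as, cases):
    """Verify document visibility per user against expectations.

    `search_as(user, query)` returns the set of document IDs visible to
    that user through security-trimmed search; `cases` is a list of
    (user, query, doc_id, should_see) tuples. Returns the failing cases,
    each tagged with what the engine actually did.
    """
    failures = []
    for user, query, doc_id, should_see in cases:
        visible = doc_id in search_as(user, query)
        if visible != should_see:
            failures.append((user, doc_id, "visible" if visible else "hidden"))
    return failures
```

The crucial cases are the negative ones: a restricted document turning up for the wrong user must fail the run loudly.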

Testing metadata quality

Each information object in the search index contains a set of fields containing metadata and text, e.g. a title, the text body, an author, a timestamp containing last modification date, information on file format, a keywords field and many more.

In an enterprise search setting, many different information models implemented in the source systems need to be harmonized into one common domain model (schema/index profile/information model) in the search index. This means that information regarding the creator of an information object in one system and the publisher of an information object in another can be stored in a common author metadata field in the search index, in a defined format such as Firstname Lastname, regardless of the formatting in the source system. Unless you have a common model in the index, you can't provide features like cross-system filtering with facets.

So how do you track that the metadata in the search index stays in good shape? This is the aim of metadata testing. The test cases provided for metadata testing need to check that the metadata in the search index conforms to the defined domain model and formatting even when the underlying content changes in the source systems.
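A metadata test can, for example, verify that the common author field really follows the Firstname Lastname convention and that mandatory fields are present. A sketch with illustrative field names and rules, not a fixed schema:

```python
import re

# 'Firstname Lastname', allowing Swedish initials and hyphenated names.
AUTHOR_PATTERN = re.compile(r"^[A-ZÅÄÖ][\w-]+ [A-ZÅÄÖ][\w-]+$")

def metadata_violations(docs):
    """Check indexed documents against the domain model.

    Each doc is a dict of indexed fields. Two example rules: the author
    field must be formatted 'Firstname Lastname', and a last-modified
    field must be present. Returns (doc id, offending field) pairs.
    """
    violations = []
    for doc in docs:
        if not AUTHOR_PATTERN.match(doc.get("author", "")):
            violations.append((doc["id"], "author"))
        if "modified" not in doc:
            violations.append((doc["id"], "modified"))
    return violations
```

Scheduled against a sample of the index, this catches source systems that start emitting `jdoe`-style usernames into the harmonized author field.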

Performance testing

Performance tests are probably the easiest tests to create and run. In the end you will have a threshold, or pain limit, in milliseconds under which a query in the enterprise search solution is required to return an answer, even at peak times with high query loads. Normally you will also monitor the RAM and processor usage of the software components of your solution, so that automatic alerts can be sent to the maintenance team if the hardware is under too much pressure.
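A typical check compares an observed latency percentile against the agreed pain limit. A sketch of a nearest-rank percentile over measured query latencies (the load generation and timing themselves are out of scope here):

```python
def latency_percentile(latencies_ms, percentile=95):
    """Nearest-rank percentile of observed query latencies (ms).

    In a real test the latencies come from timed queries fired at the
    search engine under load; the chosen percentile is then compared
    against the pain limit agreed with the business.
    """
    ordered = sorted(latencies_ms)
    # Nearest-rank: the value below which `percentile` percent of
    # observations fall (1-based rank converted to a 0-based index).
    rank = max(0, round(percentile / 100 * len(ordered)) - 1)
    return ordered[rank]
```

Using a high percentile rather than the average matters: a fast mean can hide a slow tail that users experience every tenth query.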

Relevance testing

Quality assuring the relevance model of an enterprise search solution is tricky. Largely because relevance in a result set is to some extent subjective. However, when implementing search, one does need to set a relevance model that presupposes a set of business rules for what type of content is to be deemed more important than other. For example, when making documents in a document management system searchable, a typical business rule would be that documents tagged with Status=Approved must always be deemed more important than documents with any other status (such as Preliminary or Deprecated). Another typical rule is that a document for which a query term can be found in the title or in the keywords metadata field is most likely more important than documents where the query term is found elsewhere in the text body.

What it all boils down to is the definition of the business rules for relevance. Once you have defined the rules that govern how the results are to be ranked, you can also create test cases, i.e. associate query terms with information objects that must be returned as top results given these terms.
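Once the business rules are set, each relevance test case pairs a query with a document that must rank among the top results. A sketch, with the ranked search injected as a function:

```python
def relevance_failures(search, test_cases, top_n=3):
    """Check that each query returns its expected document near the top.

    `search(query)` returns a ranked list of document IDs; `test_cases`
    maps query terms to the document ID that must appear within the
    first `top_n` results. Returns the (query, expected) pairs that fail.
    """
    return [
        (query, expected)
        for query, expected in test_cases.items()
        if expected not in search(query)[:top_n]
    ]
```

The Approved-beats-Preliminary rule above, for instance, becomes a test case asserting that the approved document outranks its older statuses for the shared query terms.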

Automating it all

Once you have defined your test cases for all the types of tests mentioned above in a test plan, you are ready to automate, i.e. enter the test plan into a test automation framework. The beauty of it all is that you can automate regression testing during the implementation phase of an enterprise search solution, i.e. continuously verify that new development does not break parts of the solution that worked as intended before. This is particularly important when you add new information sources to your enterprise search solution, since there is a high risk that the relevance model that worked fine yesterday all of a sudden gets out of order. In addition, after the release of the enterprise search solution, the test automation framework will assist the operations team in monitoring that the solution behaves as expected, even after the implementation team has left the building. All in all, this leads to continuously good quality while lowering the cost of monitoring.

Posted in Development, Enterprise Search, Governance, Search, Testing | Tagged author, common author, Concept Search, content management systems, Document Management System, Enterprise content management, Enterprise Search, enterprise search setting, enterprise search solution, Index, Information retrieval, information systems, Information technology management, Metadata, RAM, Relevance, search application, search index, search index stays, Searching, software components

Content Choreography?

Posted on October 27, 2011

Is getting the right content to the right users and customers a priority for you and your organisation? Do you drown in too much information? With some insight into how to manage content your answer is probably “Yes!”.

Today we have loads of channels to choose from: e-mail, internet/intranets, Yammer feeds, blogs, and various collaboration platforms and social media services. Some content is more beneficial in one channel, other content in another. But how do you make sure the right information reaches the right users, in the right channels?

Content Choreography aims to handle all of that: content, strategy, format and delivery.

We need to tailor the user/customer experience in order to achieve good Findability. How? Taxonomy, Metadata and Search!
Taxonomy to ensure that we speak the same language, metadata to classify the content to fulfill a certain task or objective and search to deliver it to the right channel.

Need more information about Content Choreography?
Join us in our joint seminar with KnowIT on Nov 22nd, Future Choreography of Content Management, where Seth Earley, CEO at Early Associates, will speak about Content Choreography: The Art of Dynamic Web Content. Seth Earley has more than 20 years of experience in the field and is a very eloquent and interesting speaker. He will share his thoughts and ideas gathered from a number of large customers worldwide.

More information and registration can be found here.

Posted in Information Architecture, Information management, Strategy, Uncategorized | Tagged CEO, Content, content management, content management systems, Data management, Early Associates, eloquent and interesting speaker, findability, Information science, internet/intranets, Intranet, Knowledge representation, Metadata, Seth Earley, social media services, Web Content, web design, Yammer

Collaborative, Social and Adaptive Relevance in Enterprise Search

Posted on June 30, 2011

Providing spot-on results with good relevance is one of the hardest tasks when working with findability in Enterprise Search solutions. Sure, it is doable to work out a generic model for ranking results, based on the organization's most common findability requirements in conjunction with the available metadata of the information made findable. But is it enough?

The burning question is: How can you ensure that the generic relevance model does not get outdated once the Findability solution has been in use for a month, half a year, a year and the implementation crew is long gone?

Findwise recently released a large Enterprise Findability solution at a customer in the electrical power industry in Sweden. In the project we identified personalized and adaptive relevance as two key requirements for the findability solution to provide real, future-proof value-in-use to a large set of people with fundamentally different roles within the company. This blog post will focus on the latter requirement, adaptiveness: How can we make sure that an Enterprise Findability solution returns search results that become better and better as the solution is used?

Let user behavior improve the behavior of the search tool

The Enterprise Findability solution rolled out at the power company contains two features that, put together, build the foundation of a continuously improving relevance model:

  1. A feature that promotes popular content given a query term – “social relevance”
  2. A feature that continuously changes the relevance model by boosting the relevance of popular documents – “adaptive relevance”

Social relevance

Inspired by e-commerce actors on the web, the delivered Enterprise Findability solution uses the logged behavior of its users to promote popular content. When an end-user searches for, e.g. “terawatt hours”, the solution by default offers search results ranked and sorted according to the generic relevance model. This is what any search tool would do. But this solution also uses search logs to promote popular content just as e-commerce sites have been doing for years – “Other people searching for ‘terawatt hours’ viewed ‘Current power production’ (intranet page), ‘Definition of terms in the electrical power industry’ (PDF document)” etc.

By combining the intel of the search logs (where the end-user behavior of an Enterprise Findability solution is constantly collected) and the best bets (editorially provided “sponsored links”) with the regular search result, end-users are presented with a rich set of information answering their original question from different angles. And the best part of it is that the social relevance feature constantly improves as the tool is used. People get better results as time goes by.
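As a rough illustration, promoting popular content from search logs can be as simple as counting click-throughs per query. The log records and document titles below are hypothetical, and the sketch is of course not the actual implementation delivered at the customer:

```python
from collections import Counter

# Hypothetical search-log records: (query, clicked document) pairs.
search_log = [
    ("terawatt hours", "Current power production"),
    ("terawatt hours", "Definition of terms in the electrical power industry"),
    ("terawatt hours", "Current power production"),
    ("outage report", "Grid maintenance schedule"),
]

def popular_content(log, query, top_n=2):
    """Return the most-clicked documents for a query, most popular first."""
    clicks = Counter(doc for q, doc in log if q == query)
    return [doc for doc, _ in clicks.most_common(top_n)]

print(popular_content(search_log, "terawatt hours"))
# ['Current power production', 'Definition of terms in the electrical power industry']
```

The more the tool is used, the more click data accumulates, which is exactly why the feature improves over time.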

Adaptive relevance

In addition to the social relevance feature, the vast amount of real search behavior compiled in the search logs is used for improving the generic relevance model as well. The solution tracks changes in popularity of content and adapts the document-level scores of documents and web pages in the search index accordingly. If a document is accessed often through the search tool, the document will be deemed “more important” and start climbing towards top positions in the search result. And if a previously popular document becomes less popular as time goes by, the document’s impact on the relevance model is decreased. In the end, content that has great importance for a limited amount of time (such as news items and weekly lunch menus) will first peak and then dip in the search index. The search index and the generic relevance model attached to it will stay fresh.
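One way to sketch such an adaptive boost (hypothetical, not the delivered implementation) is a time-decayed popularity score, where each click contributes less as it ages and a half-life controls how quickly a document's boost fades:

```python
def document_boost(click_timestamps, now, half_life_days=30.0):
    """Time-decayed popularity score: each click's contribution halves
    every half_life_days, so stale popularity fades away."""
    day = 86400.0  # seconds per day
    return sum(0.5 ** ((now - t) / (day * half_life_days))
               for t in click_timestamps)

now = 100 * 86400.0                 # "today", as a timestamp in seconds
fresh = [now - 1 * 86400.0] * 10    # ten clicks yesterday
stale = [now - 300 * 86400.0] * 10  # ten clicks ten months ago
print(document_boost(fresh, now) > document_boost(stale, now))  # True
```

With this kind of scoring, a lunch menu that was clicked heavily last week peaks and then dips on its own, with no editorial intervention.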

From generic to personalized search experience

This blog post has pinpointed a couple of solutions for a continuously improving, generic relevance model in an Enterprise Findability solution. Obviously, generic models are generic, i.e. good enough for the many, not perfect for the few. There are great ways to address personalization that solve many of the role-based challenges of Enterprise Findability, but let’s leave that for another, future blog post. Stay tuned!

Posted in Business, Findability | Tagged e-commerce actors, e-commerce sites, Enterprise Search, findability, findability solution, Index, Information retrieval, Metadata, PDF, Personalization, personalized search experience, real search behavior, regular search result, Relevance, search index, search logs, search result, search tool, Sweden, Web search engine | Leave a reply

European Conference on Information Retrieval (ECIR) 2011 in retrospect

Posted on April 27, 2011

The European Conference on Information Retrieval (ECIR) 2011 took place in Dublin last week, 18-21 April. In this blog post I will highlight some of the papers and talks from the conference that caught my attention, and back them up with what other attendees said.

First, I was intrigued by the session on evaluation for IR, and especially the topic of Crowdsourcing. In my opinion, the paper A Methodology for Evaluating Aggregated Search Results, which also won the prize for best student paper, was among the most pedagogically presented ones. It deals with the task of incorporating search results from a number of different sources, called verticals, into Web search results. Using a small number of human judgements for a given query, the authors present a way to evaluate any possible permutation of verticals in the result presentation. I think this methodology should be adopted in the world of Enterprise Search, since it is exactly there that we crawl, index and present information from a number of different sources – Web, databases, file shares, etc. The prerequisites are minimal and low cost, but the return value – the user experience – seems quite high.

Amazon Mechanical Turk, or the Artificial Artificial Intelligence, which is the marketplace for Crowdsourcing, provides a way to perform evaluation, relevance assessment, or any other task for which you need human judgements, for a ridiculously small sum of money. Leaving aside ethical issues, two papers at the conference presented ways to utilize this service for IR tasks.

Evgeniy Gabrilovich from Yahoo! Research, who won the Karen Sparck Jones award for 2010, gave a very interesting keynote talk on Computational Advertising. Until now, it had never struck me how hard advertising in Information Retrieval systems actually is. I liked one of his points on the future of ads: by using product feeds, one can automatically create product descriptions via Text Summarization and Natural Language Generation and index them, thus avoiding bid words.

Another interesting and very pedagogically presented paper was about the gensim package by Radim Řehůřek; I definitely think we can use it in some of our projects. In general, text categorization and IR for social networks were the dominant tracks. In one of the social network tracks, Oscar Täckström presented a neat way of discovering fine-grained sentiment when only coarse-grained supervision is available. It got me hooked on trying it for any of our customers where sentiment analysis is required.
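For a flavor of the vector-space techniques that gensim packages up, here is a plain bag-of-words cosine similarity between two short texts (a stdlib-only sketch for illustration, not gensim's actual API):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("power grid maintenance", "grid maintenance schedule"))
# ≈ 0.67 (two of three terms shared)
```

gensim's value is doing this kind of thing at scale, with streaming corpora and models such as TF-IDF and LSI, rather than the toy vectors above.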

Thorsten Joachims, the last of the keynote speakers, gave a very inspiring talk on The Value of User Feedback. He put forward the idea of designing retrieval systems for feedback. Instead of just looking at the click logs post factum, one can think of a system that uses click feedback to learn, thus creating a better ranker for a given query and a given user need. Within a single session, click feedback can be used to disambiguate the query and deliver results on the fly that are of immediate benefit to the user.

With two parallel sessions and several workshops there was a limit to what I could devour, so I have surely missed other interesting presentations. What surprised me, though, was that there were very few papers from industry. We try to solve exactly the same problems and tackle the same issues as academia. At Findwise, we have constantly flagged the huge benefit of good, relevant metadata for achieving better search performance, which was also touched upon in the paper “Topic Classification in Social Media using Metadata from Hyperlinked Objects”.

It was really great to visit Dublin and attend ECIR 2011. It was an inspiring conference, and I do believe that at the next ECIR we at Findwise can be on the podium, sharing our knowledge and hands-on experience in Enterprise Search and IR.

Sláinte!

Posted in Enterprise Search, Findwise, Internet search, Relevancy, Research, Search, Technology, User Experience | Tagged Amazon, Artificial Artificial Intelligence, Document classification, Dublin, European Conference on Information Retrieval, Evgeniy Gabrilovich, hard advertising, Information retrieval, Information science, Metadata, Oscar Täckström, retrieval systems, Science, search performace, search results, social media, social network, Storage, Thorsten Joachims, Web search results, Yahoo | 1 Reply

Open Source Tools for Text Analytics

Posted on March 21, 2011

Recently, both clients of Findwise and the Enterprise Search community in general have shown increasing interest in text analytics as a way to get higher business value out of their (often large) volumes of unstructured information.

Text Analytics merges techniques from linguistics, computer science, machine learning and statistics, and many of the central algorithms in this field are publicly available as open source tools and packages with easily accessible APIs. While many customers of commercial Enterprise Search solutions, such as Autonomy, IBM Omnifind and Microsoft FAST ESP, have long benefitted from some form of Text Analytics (e.g. entity extraction, keyword extraction and document summarization), the open source components have now come a long way in providing alternative, free-of-charge solutions with a similar performance and feature set.

As every modern enterprise search architecture has some kind of document processing that is extensible through additional stages or APIs (for example Open Pipeline with Solr, or the pipeline that comes with Microsoft FAST), the opportunity to plug new text analytics stages into existing search implementations is open and ready for innovation.
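Conceptually, such a pipeline is just a chain of stages, each a function from document to document, so a new text-analytics stage is one more function in the list. The stage names and fields below are invented for illustration:

```python
def lowercase_title(doc):
    """One processing stage: normalize the title."""
    doc["title"] = doc["title"].lower()
    return doc

def tag_language(doc):
    """Another stage: add a (here hard-coded) language field."""
    doc.setdefault("language", "en")
    return doc

def run_pipeline(doc, stages):
    """Pass the document through each stage in order; plugging in a new
    text-analytics stage means appending a function to the list."""
    for stage in stages:
        doc = stage(doc)
    return doc

print(run_pipeline({"title": "Annual Report"}, [lowercase_title, tag_language]))
# {'title': 'annual report', 'language': 'en'}
```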

Among the most popular applications of text analytics that have emerged lately are customized entity extraction, sentiment analysis and document classification – each with a set of open source alternatives (such as Balie, OpenNLP and GATE) readily available for customization and integration into your document processing.
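The simplest form of entity extraction, and the basis of GATE-style gazetteer lookup, is matching text against a dictionary of known entities. The gazetteer entries below are invented for illustration; real toolkits add tokenization, disambiguation and statistical models on top:

```python
import re

# Invented gazetteer: surface form -> entity type.
GAZETTEER = {
    "findwise": "COMPANY",
    "stockholm": "LOCATION",
    "solr": "PRODUCT",
}

def extract_entities(text):
    """Tag each gazetteer term found in the text with its entity type."""
    lowered = text.lower()
    return sorted(
        (term, etype)
        for term, etype in GAZETTEER.items()
        if re.search(r"\b%s\b" % re.escape(term), lowered)
    )

print(extract_entities("Findwise integrates Solr deployments in Stockholm."))
# [('findwise', 'COMPANY'), ('solr', 'PRODUCT'), ('stockholm', 'LOCATION')]
```

In a document-processing pipeline, the extracted entities would typically be written to metadata fields and used for faceting and navigation.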

Regardless of your industry domain, these techniques open up a wide variety of new ways to interpret your content and discover trends in your unstructured textual data – be it sentiment analysis to support decision making, trend analysis, relevance modelling for search, entity extraction so you can navigate your content by entities (such as company or person names), metadata tagging to enrich your texts, or finding similar and related content.

How are you taking advantage of modern text analytics?

Posted in Data Processing, Open Pipeline, Open source, Search | Tagged Analytics, Apache Solr, Artificial intelligence, central algorithms, charge solutions, Computational linguistics, Data analysis, Data management, document processing, enterprise search architecture, Findwise, IBM, machine learning, Metadata, Microsoft, Named entity recognition, Natural Language Processing, Open Pipeline, Open source tools, Science, search implementations, Text analytics, Text mining | Leave a reply

If a Piece of Content is Never Read, Does it Really Exist?

Posted on December 10, 2010

Since ancient times, information technology has developed from carvings in rock and wood to cell phones and Facebook. Still, the basic purpose remains the same: to facilitate communication between people separated by space and time. One can therefore measure the success of any information tool along two axes: how easy it is to create information, and how easy it is to consume it. Being a Findability expert, I spend a large part of my life focusing on the latter. It therefore troubles me that so many organizations, when introducing new content management systems, wait so long before looking at search. If I had a nickel for every time I heard “we are currently busy building our new intranet/web page/collaboration tool and will look at search when the project is finished”, I would definitely have a few quarters by now.

I like to say that I am in the information marketing business. What I mean by that is that Findability is all about marketing information so that the consumers, your employees, can find the piece of information they need. And just as an industrialist would not construct a factory before doing a marketing plan, you should not build a new information repository without thinking about how the content created in that repository will reach its target audience. When marketing information, search is one of your most important channels.

While an enterprise search solution can definitely smooth out imperfections in information structure and quality using intelligent algorithms, spending a little time thinking about how you can make it easier for a search engine to deliver relevant results, presented in a user-friendly way, can really make it shine. Some questions you can ask yourself are:

  • How can we make tagging so convenient that we have good metadata for presenting and filtering results using facets? Many search solutions have automated tagging functionality to take load off users.
  • How can we use search as an integration platform to pull in content from other sources instead of making costly one-time integrations?
  • How will the new information repository fit into an existing search solution, for example are we changing the metadata model and how should the documents be ranked compared to other sources?
  • Should we migrate content from an old system to the new one, or just freeze information creation in the old one and provide a search box that lets users find information from both?
  • Can we use search to avoid creating duplicate information by encouraging users to make searches before typing new content or even doing implicit searches while the user is typing?
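The last question in the list above, implicit searches while the user is typing, can be sketched as a naive title-overlap check. The helper and titles below are hypothetical; a real solution would query the search engine itself as the user types:

```python
def implicit_search(draft_title, existing_titles):
    """Suggest possibly duplicate documents while the user is still
    typing, by word overlap between the draft and existing titles."""
    draft_words = set(draft_title.lower().split())
    return [title for title in existing_titles
            if draft_words & set(title.lower().split())]

existing = ["Travel expense policy", "Office opening hours"]
print(implicit_search("expense report travel", existing))
# ['Travel expense policy']
```

Surfacing such hits in the authoring interface nudges users towards linking to existing content instead of duplicating it.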

So does a piece of content that no one ever reads exist? Well, in terms of bits on a disk in a data center, yes; but in terms of business value, definitely no. Designing your information repository for Findability will yield great returns in improved efficiency and user satisfaction.

Posted in Findability, Governance, Search, Strategy | Tagged cell phones, content management, content management systems, Facebook, findability, Information retrieval, Information science, Information technology, intranet/web page/collaboration tool, Knowledge representation, Metadata, quality using intelligent algorithms, search box, search engine, search solution, search solutions, Tag, web design, Web search engine | 3 Replies
