3 easy ways to integrate external data sources with SharePoint Online

Introduction

SharePoint Online provides powerful tools for searching through various types of data. At Findwise we have worked with Microsoft search applications since the beginning of the FAST era. If you have questions about, or need help with, the integration of external sources, feel free to write me a couple of lines: lukasz.wojcik@findwise.com

 

Let's get started! First you must provide some content to SharePoint.

Here are some solutions you can choose from to feed SharePoint Online with your data:

Pushing data to SharePoint list using RESTful service

Using Business Connectivity Service

Using custom connector in hybrid infrastructure

 

Pushing data to SharePoint list using RESTful service

The simplest method of putting data into SharePoint is to write it directly to SharePoint lists.

SharePoint Online exposes a REST API which can be used to manipulate lists.

The following steps will guide you through pushing data to SharePoint lists.

1. No token, no ride

First things first. In order to perform any manipulation in SharePoint, you must obtain an access token.

To do so, you must follow these steps:

  1. Handle page load event
  2. In the load event handler, read either of the following request parameters:
    • AppContext
    • AppContextToken
    • AccessToken
    • SPAppToken
  3. Create a SharePointContextToken from the previously retrieved token using JsonWebSecurityTokenHandler
  4. Get the access token string using OAuth2S2SClient
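If you prefer to acquire an app-only token outside of the page-load flow described above, the sketch below shows the legacy ACS client-credentials exchange instead. The tenant ID, client ID, client secret and host name are placeholders, and this is an illustration of the token request rather than a replacement for the TokenHelper classes mentioned above.

# Minimal sketch: acquiring an app-only access token for SharePoint Online
# via the legacy ACS endpoint. All angle-bracket values are placeholders.
import requests

TENANT_ID = "<tenant-guid>"
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"
SITE_HOST = "<yourtenant>.sharepoint.com"
SHAREPOINT_PRINCIPAL = "00000003-0000-0ff1-ce00-000000000000"  # SharePoint Online principal

token_url = f"https://accounts.accesscontrol.windows.net/{TENANT_ID}/tokens/OAuth/2"
payload = {
    "grant_type": "client_credentials",
    "client_id": f"{CLIENT_ID}@{TENANT_ID}",
    "client_secret": CLIENT_SECRET,
    "resource": f"{SHAREPOINT_PRINCIPAL}/{SITE_HOST}@{TENANT_ID}",
}

response = requests.post(token_url, data=payload)
response.raise_for_status()
access_token = response.json()["access_token"]  # used as "Authorization: Bearer <token>"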

2. Know your list

By the time you want to manipulate a list you probably know its name, but you may not know its ID.

So, if you want to retrieve lists, you should call a GET method:

/_api/Web/lists

with header:

Authorization=Bearer <access token>

with content type:

application/atom+xml;type=entry

and accept header:

application/atom+xml
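A minimal Python sketch of this request, assuming the requests library, the access token from the previous step, and a placeholder site URL:

# Sketch: list all SharePoint lists on a site so you can look up a list ID by title.
# site_url is a placeholder; access_token comes from the previous step.
import requests

site_url = "https://yourtenant.sharepoint.com/sites/yoursite"  # placeholder
headers = {
    "Authorization": f"Bearer {access_token}",
    "Accept": "application/atom+xml",
    "Content-Type": "application/atom+xml;type=entry",
}

response = requests.get(f"{site_url}/_api/Web/lists", headers=headers)
response.raise_for_status()
print(response.text)  # Atom XML containing each list's Title and Id (GUID)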

3. Add entries to the list

Once you have retrieved your list, you are ready to actually push your data.

There are a few additional steps that need to be taken in order to execute the POST request that adds items to the list (a combined sketch follows the list):

  1. Get the context info by calling the POST method:
    /_api/contextinfo
  2. Get the form digest from the received context info XML
  3. Get the list item entity type full name from the list data by calling the GET method:
    /_api/Web/lists(guid'<list ID>')
  4. Form the query string used to add a new item to the list:
    {'__metadata':{'type':'<list item entity type full name>'}, 'Title':'<new item name>'}
  5. Add the new item to the list by calling the POST method:
/_api/Web/lists(guid'<list ID>')/Items

with headers:

Authorization=Bearer <access token>
X-RequestDigest=<form digest>

with content type:

application/json;odata=verbose

and accept header:

application/json;odata=verbose

  6. Write the byte array created from the query string to the request stream.
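Putting the steps together, a minimal end-to-end sketch in Python could look as follows. JSON is requested instead of Atom for easier parsing; the site URL, list GUID and item title are placeholders, and the requests library is assumed.

# Sketch: add an item to a SharePoint Online list via REST.
# Assumes site_url, access_token and list_id (GUID string) from the previous steps.
import requests

json_headers = {
    "Authorization": f"Bearer {access_token}",
    "Accept": "application/json;odata=verbose",
    "Content-Type": "application/json;odata=verbose",
}

# Steps 1-2: get the form digest from the context info.
ctx = requests.post(f"{site_url}/_api/contextinfo", headers=json_headers)
ctx.raise_for_status()
form_digest = ctx.json()["d"]["GetContextWebInformation"]["FormDigestValue"]

# Step 3: get the list item entity type full name.
list_info = requests.get(f"{site_url}/_api/Web/lists(guid'{list_id}')", headers=json_headers)
list_info.raise_for_status()
entity_type = list_info.json()["d"]["ListItemEntityTypeFullName"]

# Steps 4-6: build the item body and POST it to the list's Items collection.
item = {"__metadata": {"type": entity_type}, "Title": "New item name"}
response = requests.post(
    f"{site_url}/_api/Web/lists(guid'{list_id}')/Items",
    headers={**json_headers, "X-RequestDigest": form_digest},
    json=item,
)
response.raise_for_status()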

That’s all, you’ve just added an entry to your list.

Full example code can be found here:

https://github.com/OfficeDev/SharePoint-Add-in-REST-OData-BasicDataOperations

 

 

Using Business Connectivity Service

SharePoint can gather searchable data by itself in a process called crawling. Crawling retrieves all data from a specified content source and indexes its metadata.

There are various possible content sources that SharePoint can crawl using its built-in mechanisms, such as:

  • SharePoint Sites
  • Web Sites
  • File Shares
  • Exchange Public Folders
  • Line of Business Data
  • Custom Repository

For the first four content source types you can choose multiple start addresses, which are the base paths where the crawling process starts looking for data to index.

SharePoint Sites include all SharePoint Server and SharePoint Foundation sites available at the addresses specified as start addresses.

Web Sites include all sites over the Internet.

File Shares include files available via FTP or SMB protocols.

Exchange Public Folders include messages, discussions and collaborative content in Exchange servers.

Line of Business Data and Custom Repository include custom made connectors that provide any type of data. These are described in another method of connecting external data below.

To use the first four content source types, all you have to do is specify the addresses where the crawling process should start. You can also define a crawl schedule, which will automatically start indexing data at the specified times.

There are two types of crawling:

  • Full – slower; indexes all encountered data, replacing any existing data with the new version
  • Incremental – faster; compares the dates of encountered and existing data and indexes an item only if the existing data is outdated

Though these methods are simple and easy to use, they provide very limited flexibility. If you need a more customized way of storing your data in SharePoint so that it is searchable later, you should use a more advanced technique involving the creation of a Business Data Connectivity model, which is described below.

 

 

Using custom connector in hybrid infrastructure

Business Connectivity Service is a powerful extension, but to get the most out of it you must put some effort into preparing the Business Data Connectivity model, which defines the structure of the data you want to be able to search through.

1. Create Business Data Connectivity Model

There are two simple ways to create the Business Data Connectivity model:

  • Using Microsoft SharePoint Designer
  • Using Microsoft Visual Studio

The Business Data Connectivity model is in fact stored in an XML file, so there is a third way of creating the model – the hard way – editing the file manually.
Although editing the Business Data Connectivity model file is not as easy as using the visual designers, in many cases it is the only way to add some advanced functionality, so it is advisable to get familiar with the model file structure.

Both the Microsoft SharePoint Designer and Microsoft Visual Studio methods involve connecting to a SharePoint On-Premises installation where the model is deployed. After deployment, the model needs to be exported to a package which can be installed on the destination site.

1.1. Create Business Data Connectivity Model using Microsoft SharePoint Designer

The simplest way to get started with a Business Data Connectivity model is to:

  • Run Microsoft SharePoint Designer
  • Connect to the destination site
  • Select External Content Types from the Navigation pane
  • Select External Content Type button from the External Content Types ribbon menu

SharePoint Designer allows you to choose the external data source from:

  • .NET assembly
  • Database connection
  • WCF web-service

The advantage of this method is that the model is automatically created from data discovered in the data source.
For example, if you choose a database connection as the source of your data, the designer allows you to pick database entities (such as tables, views, etc.) as the source and guides you through adding the operations you want to be performed during the search process.
The saved model is automatically deployed to the connected site and ready to use.

The disadvantage of this method is that only simple data types are supported, and you won't be able to add operations for downloading attachments or checking user permissions on searched items;
adding the parts responsible for these functionalities to the model file manually may therefore be required.

1.2. Create Business Data Connectivity Model using Microsoft Visual Studio

In order to use Visual Studio to create the Business Data Connectivity model, you must run the environment on a system with SharePoint installed.
To create the Business Data Connectivity model you must take a few steps:

  • Run Visual Studio with administrative privileges
  • Create a new project and select SharePoint Project from the SharePoint section of either the Visual C# or Visual Basic templates
  • Select Farm Solution and connect to your SharePoint site
  • Add new item to your newly created SharePoint project and select the Business Data Connectivity Model

Your new BDC model can now be designed either in the built-in SharePoint BDC Designer or in the built-in XML editor, but only in one of them at a time.

The advantage of designing the model in the visual designer is that all defined methods are automatically generated in the corresponding service source code.
Once the project is built, it can be deployed directly to the connected destination site with a single click.

The disadvantage, however, is that you must define all the fields of your business data yourself and also create the corresponding business model class.
You must also implement the connection to the external system, such as a database.

While this method is very convenient when deploying the solution to SharePoint On-Premises, bear in mind that SharePoint Online doesn't allow the additional .NET assemblies that often come along with the model when you create a SharePoint project containing a Business Data Connectivity model.

2. Export Business Data Connectivity Model

Once the model is created it needs to be exported to a package that can be installed on a destination site.

If you created your model in SharePoint Designer, you can simply right-click on the model and select Export BDC model.

If you created your model in Visual Studio, you can export it by selecting the Publish command from the Build menu. The simplest way to save the package is to select the file system as the destination and then point to where the package file should be saved.

3. Import Business Data Connectivity Model into destination system

Once you have an installation package, you can import it as a solution in your SharePoint site settings.

To do so, navigate to the site settings and then to Solutions, where you can click the Upload Solution button and select the installation package.

 

Since SharePoint Online doesn't allow you to use your own code as a data connector, you can use a hybrid infrastructure, which involves using the Business Data Connectivity model on the SharePoint Online side and a .NET assembly containing all the logic on the connected SharePoint On-Premises side. The assembly provides all the necessary connections to data sources, data formatting and any other processing the customer requires.

 

 

 

Conclusion

As you can see, integrating external data seems pretty simple and straightforward, but it still takes some effort to do it properly.

In future posts I'll cover the methods described above in more detail, with examples.

Time of intelligence: from Lucene/Solr Revolution 2017

Lucene/Solr Revolution 2017 has ended, with me, Daniel Gómez Villanueva, and Tomasz Sobczak from Findwise on the spot.

First of all, I would like to thank LucidWorks for such a great conference, gathering this talented community and engaging companies all together. I would also like to thank all the companies reaching out to us. We will see you all very soon.

Some takeaways from Lucene/Solr revolution 2017

The conference met practically all of my expectations, especially when it comes to the session talks. They gave ideas, inspired, and reflected the capabilities of Solr and what a competent platform it is when it comes to search and analytics.

So, what is the key takeaway from this year's conference? As usual, the talks about relevance attracted the largest audiences, showing that it is still a concern of search experts and companies out there. What is different in this year's relevance talks compared with previous years is the message that, if you want to achieve better results, you need to add intelligent layers above/into your platform. It is no longer lucrative nor profitable to spend time tuning field weights and boosts to satisfy the end users. The talk from The Home Depot, "User Behaviour Driven Intelligent Query Re-ranking and Suggestion", "Learning to rank with Apache Solr and bees" from Bloomberg, and "An Intelligent, Personalized Information Retrieval Environment" from Sandia National Laboratories are just a few examples of the many talks showing how intelligence comes to the rescue and lets us achieve what is desired.

Get smarter with Solr

Even if we want to use what is provided out of the box by Solr, we need to be smarter. "A Multifaceted Look at Faceting – Using Facets 'Under the Hood' to Facilitate Relevant Search" by LucidWorks shows how they use faceting techniques to extract keywords, understand query language and rescore documents. "Art and Science Come Together When Mastering Relevance Ranking" by Wolters Kluwer is another example, where they change/tune Solr's default similarity model and apply advanced index-time boosting techniques to achieve better results. All of this shows that we need to be smarter when it comes to relevance engineering. The time of tuning and tweaking is over. It is the time of intelligence – human intelligence, if I may call it that.

Thanks again to LucidWorks and the amazing Solr Community. See you all next year. If not sooner.

Writer: Mohammad Shadab, search consultant / head of Solr and fusion solutions at Findwise

Web crawling is the last resort

Data source analysis is one of the crucial parts of an enterprise search deployment project. Search engine result quality depends strongly on the quality of the indexed data. In the case of web-based sources, there are two basic ways of reaching the data: internal and external. The internal method involves reading the data directly from its storage place, such as a database, filesystem files or an API. Documents are read by some criteria, or all documents are read, depending on requirements. The external technique relies on reading the rendered HTML with content via HTTP, the same way as it is read by human users. Reaching further documents (so-called content discovery) is achieved by following hyperlinks present in the content or with a sitemap. This method is called web crawling.

Crawling, in contrast to direct source reading, does not require any particular preparation. In a minimal variant, just a starting URL is required and that's it. Content encoding is detected automatically, and off-the-shelf components extract text from the HTML. Web crawling may appear to be a quick and easy way to collect content to be indexed. But after deeper analysis, it turns out to have multiple serious drawbacks.


Digital recycling & knowledge growth

How do we prevent the digital debris of human clutter and mess? And to what extent will future digital platforms guide us in knowledge creation and use?

Start making sense, and the art of making sense!


People and the Post, Postal History from the Smithsonian’s National Postal Museum

Mankind's preoccupation for much of this century has been to become fully digitalized. Utilities, software, services and platforms are all becoming an 'intertwingled' reality for all of us. Mobility, the blurring of the borders between the workplace and recreational life, plus the ease of digital creation are creating information overload and (out-of-sight) digital landfills. While digital content is cheaper to create and store, its volume and its uncared-for status make it harder for everyone else to find and consume the bits they really need (and to have some provenance for peace of mind).

Fear not. A collection of emerging digital technologies exists that can both support and maintain future sustainable digital recycling – things like Cognitive Computing, Artificial Intelligence, Natural Language Processing and Machine Learning, Semantics adding meaning to shared concepts, and Graphs linking our content and information resources. With good information management practice and the appropriate supporting tools to tinker with, there is a great opportunity not only to automate knowledge digitization but to augment it.

Automation

In the content continuum (from its creation to its disposal) there is a great need for automating processes as much as possible, in order to reduce the amount of obsolete or hidden (currently value-less) digital content. Digital knowledge recycling is difficult because nearly every document or content creator is, by nature, reluctant to add further digital tags (a.k.a. metadata) describing their content or documents once they have been created. What's more, experience shows this approach is inefficient on a number of accounts, one of which is inconsistency.

Most digital documents (and most digital content, unless intended to sell something publicly) therefore lack the proper recycling resource descriptors that can help with e.g. classification, topic description or annotation with domain specific (shared, consistent) concepts. Such descriptions add appropriate meaning or context to content, aiding its further digital reuse (consumption). Without them, the problem of findability is likely to remain omnipresent across many intranets and searched resources.

Smartphones generate content automatically, often without the user thinking about it or even realizing it. All kinds of resource descriptors (time, place etc.) are created automatically through movement and mobile usage. With the addition of further machine learning and algorithms, online services such as Google Photos use these descriptors (and some automatic annotation of their own) to add more contextual data before classifying pictures into collections. This improved data quality (read: metadata addition and improved findability) allows us to find the pictures or timeline we want more easily.

In the very same manner, workplace content or documents can now have this same type of supporting technical platform that automatically adds additional business specific context and meaning. This could include data from users: their profiles, departments or their system user behaviour patterns.

For real organizational agility, though, a further layer of automatic annotation (tagging) and classification is needed – achieved using shared models of the business. These models can be expressed through a combination of various controlled vocabularies (taxonomies) that can be further joined through relationships (ontologies) and finally published (publicly or privately) as domain models in linked data (in graphs). Within this layer exist not just synonyms, but alternative and preferred labels, and more importantly relationships can be expressed between concepts – hence the graph: concepts being the dots (nodes) with relationships the joining lines (edges). Using certain tools, the relationships between concepts can also be given a weighting.

This added layer generates a higher quality of automated context, meaning and consistency for the annotation (tagging) of content and documents alike. The very same layer feeds information architecture in the navigation of resources (e.g. websites). In search, it helps to disambiguate queries (e.g. apple the fruit, or apple the organization?).

This digital helper application layer works very much in the same smooth manner as e.g. Google Photos, i.e. in the background, without troubling the user.

This automation, however, will not work without sustainable organizing principles applied in information management practices and tools. We still need a bit of the human touch! (Just as Google Photos added theirs behind the scenes earlier, as a work in progress.)

Augmentation

This codification or digitalization of knowledge allows content to be annotated, classified and navigated more efficiently. We are all becoming more aware of the Google Knowledge Graph or the Microsoft Graph that can connect content and people. The analogy of connecting the dots in a graph is like linking digital concepts and their known relationships or values.

Augmentation can take shape in a number of forms. A user searching for a particular query can be presented not only with the most appropriate search results (via the sense-making connections and relationships) but also can be presented with related ideas they had not thought of or were unaware of – new knowledge and serendipity!

Search, semantic, and cognitive platforms have now reached a much more useful level than in earlier days of AI. Through further techniques new knowledge can also be discovered by inference, using the known relationships within the graph to fill in missing knowledge.

Key to all of this though is the building of a supporting back-end platform for continuous improvement in the content continuum. Technically, something that is easier to start than one may first suspect.

Sustainable Organising Principles to the Digital Workplace

 


Written by: Fredric Landqvist and Peter Voisey

Choosing an Open Source search engine: Solr or Elasticsearch?

There has never been a better time to be a search and open source enthusiast than 2017. Far behind are the old days of information retrieval being a field available only to academia and experts.

Now we have plenty of search engines that allow us not only to search, but also to navigate and discover our information. We are going to focus on two of the leading search engines that happen to be open source projects: Elasticsearch and Solr.

Comparison (Solr and Elasticsearch)

Organisations and Support

Solr is an Apache sub-project developed in parallel with Lucene. Thanks to this, it has synchronized releases and benefits directly from any new Lucene feature.

Lucidworks (previously Lucid Imagination) is the main company supporting Solr. They provide development resources for Solr, commercial support, consulting services, technical training and commercial software around Solr. Lucidworks is based in San Francisco and offers its services in the USA and the rest of the world through strategic partners. Lucidworks has historically employed around a third of the most active Solr and Lucene committers, contributing most of the Solr code base and organizing the Lucene/Solr Revolution conference every year.

Elasticsearch is an open-source product driven by the company Elastic (formerly known as Elasticsearch). This approach creates a good balance between the open-source community contributing to the product and the company making long term plans for future functionality as well as ensuring transparency and quality.

Elastic encompasses not only Elasticsearch but a set of open-source products called the Elastic Stack: Elasticsearch, Kibana, Logstash, and Beats. The company offers support for the whole Elastic Stack and a set of commercial products called X-Pack, all included in different tiers of subscriptions. They offer training every second week around the world and organize the ElasticON user conferences.

Ecosystem

Solr is an Apache project and as such it benefits from the large variety of Apache projects that can be used along with it. The first and foremost example is its Lucene core (http://lucene.apache.org/core/), which is released on the same schedule and from which Solr receives all its main functionality. The other main project is ZooKeeper, which handles SolrCloud cluster configuration and distribution.

On the information gathering side there is Apache Nutch, a web crawler, and Flume, a distributed log collector.

When it comes to processing information, there is no end of Apache projects; the most commonly used alongside Solr are Mahout for machine learning, Tika for document text and metadata extraction, and Spark for data processing.

The big advantage lies in big data management and storage, with the highly popular Hadoop library as well as the Hive, HBase and Cassandra databases. Solr has support for storing the index in the Hadoop Distributed File System (HDFS) for high resilience.

Elasticsearch is owned by the Elastic company, which drives and develops all the products in its ecosystem; this makes them very easy to use together.

The main open-source products of the Elastic Stack alongside Elasticsearch are Beats, Logstash and Kibana. Beats is a modular platform for building different lightweight data collectors. Logstash is a data processing pipeline. Kibana is a visualization platform where you can build your own data visualizations, and it already has many built-in tools for creating dashboards over your Elasticsearch data.

Elastic also develops a set of products that are available under subscription: X-Pack. Right now, X-Pack includes five products: Security, Alerting, Monitoring, Reporting, and Graph. Each delivers a layer of functionality over the Elastic Stack that is described by its name. Most of them are included as part of Elasticsearch and Kibana.

Strengths

Solr

  • Many interfaces, many clients, many languages.
  • A query is as simple as solr/select?q=query.
  • Easy to preconfigure.
  • The base product will always be complete in functionality; commercial offerings are add-ons.

Elasticsearch

  • Everything can be done with a JSON HTTP request.
  • Optimized for time-based information.
  • Tightly coupled ecosystem.
  • The base product contains the core and is expandable; commercial products add extra features.


Conclusion – Solr or Elasticsearch?

If you are already using one of them and do not explicitly need a feature exclusive to the other, there is no big incentive to migrate.

In any case, the answer here is the same as the common answer to hardware sizing recommendations for either of them: "It depends." It depends on the amount of data, the expected growth, the type of data, the available software ecosystem around each, and mostly the features that your requirements and ambitions demand, just to name a few.

 

At Findwise we can help you make a Platform evaluation study to find the perfect match for your organization and your information.

 

Written by: Daniel Gómez Villanueva – Findability and Search Expert

Using search technologies to create apps that even leave Apple impressed

At Findwise we love to see how we can use the power of search technologies in ways that go beyond the typical search box application.

One thing that has exploded in the last few years is of course apps for smartphones and tablets. It's no longer enough to store your knowledge in databases kept behind locked doors. Professionals today want instant access to knowledge and information right where they are, whether they're working on the factory floor or showcasing new products to customers.

When you think of enterprise search today, you should consider it a central hub of knowledge rather than just a classic search page on the intranet. Because when an enterprise search solution is in place, and information from different places has been normalized and indexed in one place, there really are no limits to what you can do with the information.

By building this central hub of knowledge it's simple to make that knowledge available to other applications and services within or outside the organization. Smartphone and tablet applications are one great example.

Integrating mobile apps with search engine technologies works really well for four reasons:

  • It’s fast. Search engines can find the right information using advanced queries or filtering options in a very short time, almost regardless of how big the index is.
  • It’s lightweight. The information handled by the device should only be what is needed by the device, no more, no less.
  • It's easy to work with. Most search engine technologies provide a simple REST interface that's easy to integrate with.
  • A unified interface for any content. If the content is already indexed by the enterprise search solution, you use the same interface to access any kind of information.

 

We are working together with SKF, a company that has transformed itself from a traditional industrial company into a knowledge engineering company over the last few years. I think it's safe to say that Findwise has been a big part of that journey by helping SKF create their enterprise search solution.

And of course, since we love new challenges, we have also helped them create a few mobile apps. In particular there are two different apps that we have helped out with:

  • SKF Shelf – a portable product brochure archive. The main use case is quick and easy access to product information for sales reps when visiting customers.
  • MPLP (mobile product landing page) – a mobile web app that you reach by scanning QR codes printed on the packaging.

 

And even more recently, the tech giant Apple has noticed how the apps make the day-to-day work of employees easier.



Read more about Enterprise Search at Findwise

What will happen in the information sector in 2017?

As we look back at 2016, we can say that it has been an exciting and groundbreaking year that has changed how we handle information. Let’s look back at the major developments from 2016 and list key focus areas that will play an important role in 2017.

3 trends from 2016 that will lay basis for shaping the year 2017

Cloud

There has been a massive shift towards the cloud – not only using the cloud for hosting services but building on top of cloud-based services. This has affected all IT projects, especially the enterprise search market, since Google decided to discontinue GSA and replace it with the cloud-based Springboard. More official information on Springboard is still to be published in writing, but reach out to us if you are keen to hear about the latest developments.

There are clear reasons why search is moving towards the cloud, some of the main ones being machine learning and the amount of data. We have an astonishing amount of information available, and the cloud is simply the best way to handle this overflow. Development in the cloud is faster, the cloud gives practically unlimited processing power, and the latest developments are available in the cloud at an affordable price.

Machine learning

One area that has taken huge steps forward has been machine learning. It is nowadays being used in everyday applications. Google wrote a very informative blog post about how they use Cloud machine learning in various scenarios. But Google is not alone in this space – today, everyone is doing machine learning. A very welcome development was the formation of the Partnership on AI by Amazon, Google, Facebook, IBM and Microsoft.

We have seen how machine learning helps us in many areas. One good example is health care, with IBM Watson managing to find a rare type of leukemia in 10 minutes. This type of expert assistance is becoming more common. While we know there is still a long way to go before AI becomes smarter than human beings, we are taking leaps forward, as can be seen by DeepMind beating a human at the complex board game Go.

Internet of Things

Another important area is IoT. In 2016 most IoT projects have, in addition to consumer solutions, touched industry: smart cities, energy utilization or connected cars. Companies have realized that they can now track any physical object, with the benefits of being able to service machines before they break, streamline or build better services, or even create completely new business based on data knowledge. On the consumer side, 2016 saw IoT become mainstream, with the unfortunate effect of poorly secured devices being used for massive attacks.

 

3 predictions for key developments happening in 2017

As we move towards the year 2017, we see that these trends from 2016 have positive effects on how information will be handled. We will have even more data and even more effective ways to use it. Here are three predictions for how we will see the information space evolve in 2017.

Insight engine

Our collaboration with computers is changing. For decades, we have been giving tasks to computers and waiting for their answers. This is slowly changing: we are starting to collaborate with computers and even expect them to take the initiative. The developments behind this are in machine learning and human language understanding. We no longer only index information and search it with free text. Nowadays, we can build systems that understand information. This information includes everything from IoT data points to human-created documents and data from other AI systems. It enables building an insight engine that can help us formulate the right question, or even give us insights, based on the information, into questions we never asked. This will revolutionize how we handle our information and how we interact with our user interfaces.

We will see virtual private assistants that users will be happy to use and train so that they can help us use information like never before in our daily lives. Google Now, in its current form, is merely the first step towards something like this, being proactive in bringing information to the user.

Search-driven analytics

The way we use and interact with data is changing. With information collected about pretty much anything, we have almost any information right at our fingertips and need effective ways to learn from it – in real time. In 2017, we will see a shift away from classic BI systems towards search-driven evolutions of them. We already have Kibana dashboards with Timelion and ThoughtSpot, but these are only the first examples of how search is revolutionizing how we interact with data. Advanced analytics available to anyone within the organization, with answers and predictions delivered directly in graphs and diagrams, is what 2017 insights will be all about.

Conversational UIs

We saw the rise of chatbots in 2016. In 2017, this trend will also change how we interact with enterprise systems. A smart conversational user interface builds on the same foundations as an enterprise search platform. It is highly personalized, contextually smart and builds its answers from information in various systems and in many forms.

Imagine discussing future business focus areas with a machine that challenges our ideas and backs everything with data-based facts. Imagine your enterprise search responding to your query with a question asking you to clarify what you actually want to achieve.

 

What are your thoughts on future developments?

How do you see 2017 changing the way we interact with our information? Comment or reach out in other ways to discuss this further, and have a Happy Year 2017!

 

Written by: Ivar Ekman

Elastic Stack 5.0 is released

At first glance, the major Elasticsearch version bump might seem frightening. Going from version 2.4.x to 5.0 is a big jump, but there's no need to worry. The main reason is to align versions between the different products in the stack. Having all products on the same version will make it a lot easier to handle future upgrades and simplify the overall experience for both new and existing users.

All products in the stack have been updated, some more than others. Here are a few highlights regarding Elasticsearch 5.0 that we recommend you read before upgrading. Or schedule an appointment with us and we'll help you out!

New relevance model

Elasticsearch prior to version 5 used TF/IDF as its default scoring algorithm. From now on, the default algorithm is BM25.

Depending on the nature of your indexed information, a re-index operation might give you slightly different, and most likely more relevant, results.
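If you need to preserve the old behaviour for a particular field while you evaluate the change, a hedged sketch follows: the index name, type name and field name are made up, and the requests library is assumed.

# Sketch: pin a field to the classic (TF/IDF) similarity in Elasticsearch 5.x.
# Index name "my-index", type "my_type" and field "title" are placeholders.
import requests

mapping = {
    "mappings": {
        "my_type": {
            "properties": {
                "title": {"type": "text", "similarity": "classic"}  # "classic" = TF/IDF
            }
        }
    }
}
requests.put("http://localhost:9200/my-index", json=mapping).raise_for_status()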

Re-index from remote

This new feature of the Elasticsearch API is really useful when, for example, upgrading from old clusters. By specifying a remote cluster in the API call, you can easily transfer old documents to your newly created 5.0 cluster without going through a rolling node upgrade procedure.
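A hedged sketch of such a call is shown below; the host and index names are placeholders, and the remote host must be whitelisted via reindex.remote.whitelist in elasticsearch.yml on the new cluster.

# Sketch: reindex documents from a remote (old) cluster into a new 5.0 cluster.
# Host and index names are placeholders.
import requests

body = {
    "source": {
        "remote": {"host": "http://old-cluster:9200"},
        "index": "old-index",
    },
    "dest": {"index": "new-index"},
}
requests.post("http://localhost:9200/_reindex", json=body).raise_for_status()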

Ingest Node

There's a new node type in town. Starting from version 5.0, Elasticsearch gives you the possibility to do simple data manipulation within a running cluster prior to indexing. This is useful if you prefer a more simplistic architecture without Logstash instances but still need to make some alterations to your data.

Most core processors found in Logstash are available. Often used ones include:

  • Date Processor
  • Convert Processor
  • Grok Processor
  • Rename Processor
  • JSON Processor
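As an illustration, a small pipeline combining a few of these processors can be registered and then referenced at index time via the pipeline parameter. The pipeline name, index name, field names and grok pattern below are assumptions, not part of any existing setup.

# Sketch: register an ingest pipeline that parses an Apache-style log line and its timestamp.
# Pipeline name, index name and grok pattern are made-up examples.
import requests

pipeline = {
    "description": "Parse apache-style logs before indexing",
    "processors": [
        {"grok": {"field": "message", "patterns": ["%{COMMONAPACHELOG}"]}},
        {"date": {"field": "timestamp", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]}},
        {"rename": {"field": "clientip", "target_field": "client_ip"}},
    ],
}
requests.put("http://localhost:9200/_ingest/pipeline/apache-logs", json=pipeline).raise_for_status()

# Index a document through the pipeline:
doc = {"message": '127.0.0.1 - - [05/Nov/2016:13:55:36 +0100] "GET / HTTP/1.1" 200 2326'}
requests.post("http://localhost:9200/logs/log?pipeline=apache-logs", json=doc).raise_for_status()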

Search and Aggregations

The search API has been refactored to be cleverer about which indices are hit, and also about whether aggregations need to be recalculated when issuing range queries. By looking at when indices were last modified, range aggregations can be cached and only recalculated if really needed. This improvement is really useful for the typical log analytics case with time series data. You will notice speed improvements in your Kibana dashboards.

New data structures

Lucene 6.0 introduces a new feature called dimensional points, which uses the k-d tree geo-spatial data structure to enable fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering. Elasticsearch 5.0 implements a variant called the block k-d tree, specifically designed for efficient IO, which gives significant performance boosts when indexing as well as filtering.
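For instance, the kind of query that benefits from the new structure is a plain numeric range or geo-distance filter. In this hedged sketch the index name, field names and coordinates are made up, and the location field is assumed to be mapped as geo_point.

# Sketch: numeric range and geo-distance filters of the kind sped up by block k-d trees.
# Index name, field names and coordinates are made-up examples.
import requests

query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"price": {"gte": 100, "lte": 200}}},
                {"geo_distance": {"distance": "10km", "location": {"lat": 57.7, "lon": 11.97}}},
            ]
        }
    }
}
requests.get("http://localhost:9200/shops/_search", json=query).raise_for_status()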

Should I upgrade?

If your typical use case involves geo-spatial queries and filtering, we definitely recommend that you upgrade your cluster and re-index your documents to gain the performance boost. Due to the simplicity in upgrading or even migrating data to a completely new cluster, it will be worth the time getting your Elastic Stack up to date and ready for features to come.

In case you need help, don’t hesitate to contact us and we will guide you through the process.

Written by: Joar Svensson, Consultant Findwise

How to improve search relevance using machine learning and statistics – Apache Solr Learning to Rank

In search, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Natural languages, synonymy, homonymy, term frequency, norms, relevance models, boosting, performance, subjectivity… the reasons why search relevance remains a hard problem are many. This article deals with how machine learning and search statistics can improve relevance using the Learning to Rank plugin, which will be included in a newer version of Solr. If you want more information than is provided in this blog post, be sure to visit our website or contact us!

Background

Consider an intranet search solution where users can be divided into two groups: developers and sales.
Search and click statistics are collected, and the following picture illustrates a specific search performed 569 times by the users, with the click statistics for each document.


Example of a search with click statistics

As can be seen, the top search hit, whose score is computed from term frequency, inverse document frequency and field-length norm, is less relevant (got fewer clicks) than documents with lower scores. Instead of trying to manually tweak the relevancy model and the different field boosts – which, by tweaking for a specific query, would probably decrease the global relevancy for other queries – we will try to learn from the users' click statistics to automatically generate a relevancy model.

Architecture, features and model

Architecture

Using search and click statistics as training data, the search engine can learn, from input features and with a ranking model, the probability that a document is relevant for a user.


Search architecture with a ranking training model

Features

With the Solr Learning to Rank plugin, features can be defined using standard Solr queries.
For my specific example, I will choose the following features:
– originalScore: the original score computed by Solr
– queryMatchTitle: a boolean stating whether the user query text matches the document title
– queryMatchDescription: a boolean stating whether the user query text matches the document description
– isPowerPoint: a boolean stating whether the document type is PowerPoint

[
   { "name": "isPowerPoint",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params":{ "fq": ["{!terms f=filetype }.pptx"] }
   },
   {
    "name":"originalScore",
    "class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params":{}
   },
   {
    "name" : " queryMatchTitle",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=title}${user_query}" }
   },
   {
    "name" : " queryMatchDescription",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!field f=description}${user_query}" }
   }
]

Training a ranking model

From the click statistics data, a training set (X, y) – where X is the feature input vector and y a boolean stating whether the user clicked a document or not – can be generated and used to compute a linear model using regression, which will output the weight of each feature.

Example of statistics:

{
    q: "i3",
    docId: "91075403",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.43
},
{
    q: "i3",
    docId: "82034458",
    userId: "507f1f77bcf86cd799439011",
    clicked: 1,
    score: 3.43
},
{
    q: "coucou",
    docId: "1246732",
    userId: "507f1f77bcf86cd799439011",
    clicked: 0,
    score: 3.42
}

Training data generated:
X= (originalScore, queryMatchTitle, queryMatchDescription, isPowerPoint)
y= (clicked)

{
    originalScore: 3.43,
    queryMatchTitle: 1,
    queryMatchDescription: 1,
    isPowerPoint: 1,
    clicked: 0
},
{
    originalScore: 3.43,
    queryMatchTitle: 0,
    queryMatchDescription: 1,
    isPowerPoint: 0,
    clicked: 1
},
{
    originalScore: 3.42,
    queryMatchTitle: 1,
    queryMatchDescription: 0,
    isPowerPoint: 1,
    clicked: 0
}
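A minimal sketch of such a training step is shown below, here using scikit-learn's logistic regression as the regression model; the library choice, variable names and the tiny hard-coded dataset are my own illustration, not part of the Solr plugin.

# Sketch: learn per-feature weights from click data with scikit-learn.
# Feature vectors correspond to (originalScore, queryMatchTitle,
# queryMatchDescription, isPowerPoint); library choice is an assumption.
from sklearn.linear_model import LogisticRegression

X = [
    [3.43, 1, 1, 1],
    [3.43, 0, 1, 0],
    [3.42, 1, 0, 1],
    # ... one row per (query, document, user) observation
]
y = [0, 1, 0]  # 1 = clicked, 0 = not clicked

model = LogisticRegression()
model.fit(X, y)

feature_names = ["originalScore", "queryMatchTitle", "queryMatchDescription", "isPowerPoint"]
weights = dict(zip(feature_names, model.coef_[0]))
print(weights)  # plug these values into the "weights" section of the Solr model below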

Once the training is completed and the weight of each feature computed, the model can be sent to Solr using the following format:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
    ],
    "params":{
        "weights": {
            "originalScore": 0.6783,
            "queryMatchTitle": 0.4833,
            "queryMatchDescription": 0.7844,
            "isPowerPoint": 0.321
      }
    }
}
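Feature and model definitions like the ones above are uploaded to the LTR feature store and model store endpoints. A hedged sketch follows; the collection name and the local file names are placeholders, and the LTR plugin is assumed to be enabled in solrconfig.xml.

# Sketch: upload the feature definitions and the trained model to Solr LTR.
# Collection name and file names are placeholders; the JSON documents are the ones shown above.
import requests

solr = "http://localhost:8983/solr/mycollection"
headers = {"Content-Type": "application/json"}

with open("features.json") as f:  # the feature definitions shown earlier
    requests.put(f"{solr}/schema/feature-store", headers=headers, data=f.read()).raise_for_status()

with open("model.json") as f:     # the model JSON shown above
    requests.put(f"{solr}/schema/model-store", headers=headers, data=f.read()).raise_for_status()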

Run a Rerank Query

The Solr LTR plugin allows you to easily apply the re-rank model to the query results by adding rq={!ltr model=myModelName reRankDocs=25} to the query.
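For example, a full re-rank request could look like the sketch below; the collection name and query text are placeholders, and the user query is passed as external feature information because the features above reference ${user_query}.

# Sketch: issue a query and re-rank the top 25 results with the uploaded model.
# Collection name and query text are placeholders.
import requests

params = {
    "q": "i3",
    "rq": "{!ltr model=myModelName reRankDocs=25 efi.user_query=i3}",
    "fl": "id,title,score",
}
requests.get("http://localhost:8983/solr/mycollection/select", params=params).raise_for_status()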

Personalization of the search result

If your statistics data includes information about users, specific re-rank models can be trained for different user groups.
In my current example, I trained one model for the developer group and one for the sales representatives.

Dev model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"devModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
    ],
    "params":{
        "weights": {
            "originalScore": 0.6421,
            "queryMatchTitle": 0.4561,
            "queryMatchDescription": 0.5124,
            "isPowerPoint": 0.017
      }
    }
}

Sales model:

{
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"salesModel",
    "features":[
        { "name": "originalScore"},
        { "name": "queryMatchTitle"},
        { "name": "queryMatchDescription"},
        { "name": "isPowerPoint"},
    ],
    "params":{
        "weights": {
            "originalScore": 0.712,
            "queryMatchTitle": 0.582,
            "queryMatchDescription": 0.243,
            "isPowerPoint": 0.623
      }
    }
}

From the statistics data, the system learnt that a PowerPoint document is more relevant for a sales representative than for a developer.

Developer search

Developer search with re-ranking

Sales representative search

Sales representative search with re-ranking

To conclude, with a search system continuously trained on a flow of statistics, not only will the search relevance be more customized and personalized to your users, but it will also automatically adapt to changes in user behavior.

If you want more information, have further questions or need help please visit our website or contact us!

The Solr LTR plugin is to be released soon: https://github.com/bloomberg/lucene-solr/tree/master-ltr-plugin-release/solr/contrib/ltr

Involuntarily digital footprints violate personal integrity (learn about GDPR)

The aim of this blog post is to make “average Joe” understand how the new upcoming General Data Protection Regulation (GDPR) affects his everyday life.

To start with, let’s sort some expressions out.

Digital footprint

According to Wikipedia, there are two main classifications for digital footprints:
• Passive digital footprint – data collected without the owner's knowledge.
• Active digital footprint – data released deliberately by the user himself (e.g. sharing an image on Facebook).

Personal integrity

Integrity could be described as the quality of being honest and having strong moral principles. In general, it's a personal choice where you stand on the question of integrity. Gossiping about secrets told in confidence is one example to illustrate this. Publishing images of others without their knowledge is another (this might even be illegal).

This illustrative case could be you

To understand what GDPR is about and how it affects your everyday life, I will illustrate with an example that I hope you can recognize yourself in.

Imagine: you live in an apartment in a mid-size building with other people (we can choose to call them neighbours). In front of the building there is a space dedicated to parking cars. One day a neighbour of yours chooses to move and therefore hires a real-estate agent to help sell the apartment.

As you are somewhat curious about what the apartments in your neighbourhood are worth, you look up the advertisement for the apartment on the internet. When you find the apartment, you see your own car in the picture of the parking space. On top of this, you discover that the car's registration number is fully visible.

Should you care?

According to Datainspektionen, registration numbers are considered "personal data". So the first mistake made by the broker here is creating a passive digital footprint for you. The second mistake made by the broker is breaking the law: in Sweden it is not allowed to publish personal data without the owner's consent.

The moral compass of the broker should be questioned here. A passive digital footprint in your name is created, your personal integrity has been violated and the law has been broken.

On top of that, GDPR takes effect in May 2018. You will have the right to be forgotten whenever you want (you can push companies to remove your personal data from their systems).

Is there a business case?

A lawyer could probably build a business case around suing real-estate brokers for publishing pictures of cars' registration numbers without the owners' consent.

As a regular citizen you should probably not get too agitated about a picture of your car's registration number. Or maybe you should – it depends on your level of personal integrity. As modern society evolves, the amount of different types of information being digitalized grows by the day.

With this example, I hope "average Joe" now understands what digital footprints, personal integrity and GDPR are. Maybe this got you thinking and you want to know more about GDPR.

There are probably two sober ways of looking at this: live with your personal data being spread (and get used to the fact that soon you won't have anything personal anymore), or maybe it's time to stick your neck out and say "hey, stop publishing my personal data without asking me".

Whether you want it or not, you are affected by GDPR.

 

 

Written by: Markus Edström