A few enterprise search specialists from Findwise recently attended the Scandinavian Developer Conference 2012. One of the tracks was Big Data, which is very much related to search. It had some interesting talks about how to handle large amounts of data in an efficient way. Special thanks to Theo Hultberg, Jim Webber and Tim Berglund!
The theme was that you should choose a storage system which is well suited for the task. This may seem like an obvious point, but for a long time this was simply ignored; I’m talking about the era of relational databases. Don’t get me wrong, sometimes a relational database is the very best for the job, but in many cases it isn’t.
Data is jagged by nature, i.e. not all objects have the same properties. This is why we shouldn’t force them into a rigid, rectangular table; instead, everything should be denormalized! The application accessing the data will be aware of the information structure and handle it accordingly. This also avoids expensive assembly operations (such as joins) to get the data into the format we want when retrieving it. Why split up your data if you are going to reassemble it over and over again? Also remember that disk space is cheap, so pre-compute as much as possible. The design of a Big Data system should be governed by how the data will be retrieved.
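A minimal sketch of the difference, using hypothetical order data: in the normalized model, every read re-assembles the order from three "tables", while the denormalized document is stored (with its total pre-computed) in exactly the shape it will be read.

```python
# Hypothetical example: the same order modeled relationally vs. denormalized.

# Relational-style: normalized rows keyed by ids; joins needed at read time.
customers = {1: {"name": "Alice"}}
orders = {100: {"customer_id": 1}}
order_items = [
    {"order_id": 100, "product": "book", "price": 12.0},
    {"order_id": 100, "product": "pen", "price": 2.0},
]

def fetch_order_relational(order_id):
    """Assemble the order by 'joining' the three tables on every read."""
    order = orders[order_id]
    items = [i for i in order_items if i["order_id"] == order_id]
    return {
        "customer": customers[order["customer_id"]]["name"],
        "items": items,
        "total": sum(i["price"] for i in items),  # recomputed on every read
    }

# Denormalized: one document, pre-assembled and pre-computed at write time.
order_doc = {
    "customer": "Alice",
    "items": [
        {"product": "book", "price": 12.0},
        {"product": "pen", "price": 2.0},
    ],
    "total": 14.0,  # pre-computed; disk is cheap, reads are frequent
}

# Both shapes hold the same information; only the read path differs.
assert fetch_order_relational(100)["total"] == order_doc["total"]
```

The denormalized document trades storage and write-time work for cheap reads, which is exactly the "design for retrieval" principle above.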
Another step away from relational databases is the relaxation of some of the ACID properties: Atomicity, Consistency, Isolation and Durability. Again, this is along the lines of choosing the components best suited for the system. Decide which properties are must-haves and which are less important.
Relaxing the ACID properties, such as consistency, can give great performance gains. The NoSQL database Cassandra is eventually consistent, and its write performance scales linearly up to 288 nodes (and probably beyond); at that scale it sustains over one million writes per second!
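A toy model of the idea (not Cassandra’s actual implementation): a write is acknowledged as soon as one replica has it, reads from other replicas may return stale data, and a background replication pass eventually brings all replicas into agreement.

```python
# Toy sketch of eventual consistency with asynchronous replication.

class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated to all replicas

    def write(self, key, value):
        # Acknowledge as soon as one replica has the write: fast and available,
        # at the cost of temporary inconsistency between replicas.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        # A read may observe stale data until replication catches up.
        return self.replicas[replica_index].data.get(key)

    def replicate(self):
        # Background anti-entropy pass: push pending writes everywhere.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("user:1", "Alice")
print(store.read(2, "user:1"))  # None: replica 2 hasn't seen the write yet
store.replicate()
print(store.read(2, "user:1"))  # 'Alice': the replicas have converged
```

This is why writes can be so cheap: the client never waits for all replicas, and the system only promises to converge eventually.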
However, relaxing these properties is not a new concept in the world of search engines. When a document is indexed, it usually takes a number of seconds before it becomes searchable. This is eventual consistency: the search engine moves from one valid state to another, given a sufficiently long period of time. Do we really need documents that were just submitted to the search engine to be searchable instantly? Most likely not. Isolation is another property that is not crucial to a search engine. Since a document in an index has no explicit relations to other documents in the same index, there is little need for isolation. If two writes for the same document are submitted at the same time, something is probably wrong in another part of the system.
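The indexing delay can be sketched with a toy index, much simplified from how a real engine works: new documents land in an in-memory buffer and only become visible to queries after a periodic commit.

```python
# Toy sketch of why newly indexed documents aren't instantly searchable:
# writes go to a buffer and only become queryable after a commit.

class ToyIndex:
    def __init__(self):
        self.searchable = {}  # committed documents: visible to queries
        self.buffer = {}      # uncommitted documents: not yet visible

    def index(self, doc_id, text):
        # Fast append; no cost on the query side until commit.
        self.buffer[doc_id] = text

    def commit(self):
        # Make buffered documents searchable: the 'eventual' in
        # eventual consistency for a search engine.
        self.searchable.update(self.buffer)
        self.buffer.clear()

    def search(self, term):
        return [doc_id for doc_id, text in self.searchable.items()
                if term in text.split()]

idx = ToyIndex()
idx.index("doc1", "big data and search")
print(idx.search("search"))  # []: submitted but not yet committed
idx.commit()
print(idx.search("search"))  # ['doc1']: visible after the commit
```

Batching writes into commits like this is what lets the engine keep query-side structures optimized instead of rebuilding them on every single write.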
So what does all this mean for search? There is an interesting challenge in storing jagged data in large amounts and then making good use of it. To search vast amounts of jagged data, you need a lot of query-time field mappings (to make relevant data searchable) … or do you? There is also the issue of retaining a good relevance model, which is absolutely vital to a search engine. How do you measure the relevance of arbitrary metadata and then weigh it all together? Maybe we need to think in new ways about relevance altogether?
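One hypothetical way to make jagged metadata searchable without a fixed schema is to map arbitrary keys to typed field names by a suffix convention (similar in spirit to Solr’s dynamic fields). The field names and suffixes below are invented for illustration.

```python
# Hypothetical sketch: map jagged metadata to typed, searchable field names
# using a suffix convention, so heterogeneous documents need no fixed schema.

def map_fields(doc):
    """Derive a typed field name for each value from its Python type."""
    suffix_for = {str: "_s", int: "_i", float: "_f", bool: "_b"}
    mapped = {}
    for key, value in doc.items():
        suffix = suffix_for.get(type(value))
        if suffix is None:
            continue  # unknown types are skipped rather than guessed at
        mapped[key + suffix] = value
    return mapped

# Two 'jagged' documents with different properties and no shared schema:
print(map_fields({"title": "report", "pages": 42}))
# {'title_s': 'report', 'pages_i': 42}
print(map_fields({"author": "Lee", "rating": 4.5, "public": True}))
# {'author_s': 'Lee', 'rating_f': 4.5, 'public_b': True}
```

The open question the paragraph above raises remains: even once every field is searchable, how each of these arbitrary fields should contribute to relevance is not answered by the mapping itself.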
Whoever can solve these problems well, with a minimum of manual labor, is a name we’ll be hearing a lot in the future.