I had been hearing the term “big data” all through 2013, but until recent weeks I had not had the opportunity to put it into practice. The requirements were:
In the context of a web application whose server receives a high load of multi-field search requests, we need to implement a low-latency, high-availability system that queries a relational database as efficiently as possible.
Although we knew NoSQL databases, distributed file systems and indexing servers such as Solr or Elasticsearch in some detail, we still lacked a picture of the “whole”. So the first thing we did was buy a very interesting book, which I recommend to anyone who wants to go deeper into these concepts: Big Data (Manning).
The book can be read in a couple of days. With the concepts clearer, we devoted three days to architectural spikes and proofs of concept, all backed by JMeter to test performance and validate the solution.
The solution that we found optimal was as follows:
- Implementation of an automatic transfer service between the relational database and Elasticsearch
- Implementation of a queue
- Implementation of a consumer
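
As a rough illustration of how the queue and the consumer fit together, here is a minimal sketch in Python using only the standard library. It is not the code we deployed: the dict `search_index` is a hypothetical stand-in for Elasticsearch, and the names `publish`, `consumer` and `events` are invented for the example.

```python
import queue
import threading

# Hypothetical stand-in for the search index; a real deployment would
# call the Elasticsearch client here instead of writing to a dict.
search_index = {}

# Bounded queue: producers block when it is full, which gives simple
# back-pressure under high load.
events = queue.Queue(maxsize=1000)

STOP = object()  # sentinel telling the consumer to shut down


def consumer():
    """Drain the queue and apply each change to the index."""
    while True:
        item = events.get()
        try:
            if item is STOP:
                return
            doc_id, document = item
            search_index[doc_id] = document
        finally:
            # Mark the item as handled, including the STOP sentinel.
            events.task_done()


def publish(doc_id, document):
    """Application endpoint: enqueue a change instead of indexing inline."""
    events.put((doc_id, document))


worker = threading.Thread(target=consumer, daemon=True)
worker.start()

publish(1, {"name": "Ada", "city": "London"})
publish(2, {"name": "Alan", "city": "Manchester"})

events.put(STOP)
events.join()   # wait until everything queued so far has been processed
worker.join()
```

The point of the design is that the request path only pays for `publish`, while the slower indexing work happens in the consumer.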
In terms of components, some fairly novel concepts are introduced, such as rivers, the batch layer, the speed layer and the query layer. The new components of this big data architecture include the following:
A river allows you to deploy a data stream in order to re-create the read dataset. We will most likely choose Elasticsearch's own river implementation, for simplicity. After the performance tests, however, considerations such as fault tolerance, availability, maintainability, complexity and performance may well lead us to a different technology.
The power of the river lies in offering a view on the data, encapsulating its representation and optimizing resources.
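
To make the river idea more concrete, the following sketch copies rows that changed since a checkpoint from a relational table into an index. It is an assumption-laden toy: `sqlite3` stands in for the relational database, a dict stands in for Elasticsearch, and the `customers` table, its `version` column and the `run_river` function are hypothetical names chosen for the example, not the Elasticsearch river API.

```python
import sqlite3

# sqlite3 stands in for the relational database; the dict stands in
# for the Elasticsearch index. Both are assumptions for illustration.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY, name TEXT, city TEXT, version INTEGER)"
)
db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Ada", "London", 1), (2, "Alan", "Manchester", 2)],
)

index = {}


def run_river(conn, target, checkpoint):
    """Copy rows changed since `checkpoint` into `target`.

    Returns the new checkpoint so the next run only reads newer rows.
    """
    rows = conn.execute(
        "SELECT id, name, city, version FROM customers"
        " WHERE version > ? ORDER BY version",
        (checkpoint,),
    ).fetchall()
    for row_id, name, city, version in rows:
        # The index holds its own representation of the row.
        target[row_id] = {"name": name, "city": city}
        checkpoint = version
    return checkpoint


# Starting from checkpoint 0 rebuilds the whole read dataset.
checkpoint = run_river(db, index, checkpoint=0)

# A later change in the database is picked up by the next run only.
db.execute("UPDATE customers SET city = 'Cambridge', version = 3 WHERE id = 1")
checkpoint = run_river(db, index, checkpoint)
```

Running the river again from checkpoint 0 against an empty index re-creates the read dataset from scratch, which is the property we care about.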
The considerations written above for the river apply here as well. The idea is to feed the river from a queuing system, which represents the application endpoint. The structure and sizing of the queuing system, along with the choice of technology, will be decided after high-load performance tests that simulate real-time traffic.
The most appropriate technologies were chosen to provide low latency, which is formalized as a requirement; based on our results we predict a latency of 10 ms. Given the low-latency requirement, we ruled out languages outside the ecosystem we already use, in our case Java.
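
A latency requirement like this is easier to check when it is written down as a test. The sketch below times an operation repeatedly and compares the 99th percentile against a 10 ms budget. The choice of the 99th percentile, and the names `measure_ms`, `p99` and `within_budget`, are assumptions made for the example; in our tests JMeter did the actual measuring.

```python
import statistics
import time

LATENCY_BUDGET_MS = 10.0  # the 10 ms figure predicted in the text


def measure_ms(operation, runs=200):
    """Time `operation` repeatedly and return the samples in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p99(samples_ms):
    """Estimate the 99th percentile of the samples."""
    # quantiles(n=100) returns 99 cut points; the last one is the p99.
    return statistics.quantiles(samples_ms, n=100)[98]


def within_budget(samples_ms, budget_ms=LATENCY_BUDGET_MS):
    """True when the estimated p99 latency fits inside the budget."""
    return p99(samples_ms) <= budget_ms
```

For instance, `within_budget(measure_ms(lambda: my_query()))` would check a hypothetical `my_query` function against the budget. Looking at a high percentile rather than the average keeps a few slow requests from hiding behind many fast ones.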
Now we can go into the details of the technology mapping, i.e. which tools the solution is built with. This time we used Elasticsearch, Redis, RxJava, Play Framework, MySQL, MongoDB, Akka-Camel, Scala and Hadoop, among others.