Data has become a key asset for businesses. Every company wants to improve its efficiency by analysing the data it generates, but the sheer volume of that information means it must be processed with non-traditional tools and techniques.
One solution for handling this volume of information is the Lambda architecture, an approach that combines real-time and deferred (batch) processing of data in a scalable and fault-tolerant way.
A Lambda architecture comprises three principal layers:
- The streaming layer (Spark Streaming + Kafka) lets us ingest data in real time and extract information and knowledge from it (see the sketch after this list).
- The batch processing layer analyses data coming from different sources quickly and at scale.
- The serving layer, where both real-time and batch-processed data are stored for later exploitation, using whichever store best fits the intended use (Cassandra, Neo4j, MongoDB...).
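As a minimal sketch of the streaming layer, the snippet below reads events from a Kafka topic with Spark Structured Streaming. The broker address, topic name ("events") and console sink are illustrative placeholders, not part of the original architecture description.

```python
# Minimal sketch of the streaming layer: Spark reading a Kafka topic.
# Requires Spark's Kafka connector package to be available on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("lambda-streaming-layer")
         .getOrCreate())

# Subscribe to the Kafka topic that receives raw events in real time
# (broker address and topic name are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers key/value as binary; cast the payload to a string.
parsed = events.select(col("value").cast("string").alias("payload"))

# Write the stream out (here to the console; in a Lambda architecture this
# would typically feed the serving layer, e.g. Cassandra or MongoDB).
query = (parsed.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()
```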
These technologies make it possible to use not only relational databases but also non-relational (NoSQL) databases, among which the following types stand out (see the example after this list):
- Document-oriented
- Graph
- Key-value
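As an illustration of the serving layer backed by a document-oriented store, the hypothetical sketch below persists a pre-computed batch view in MongoDB via pymongo; the database, collection and field names are assumptions made for the example.

```python
# Hypothetical sketch: persisting an aggregated batch view in a
# document-oriented store (MongoDB) so the serving layer can query it later.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["analytics"]

# Store one pre-computed daily aggregate (illustrative schema).
db.daily_views.insert_one({
    "date": "2024-01-01",
    "metric": "page_views",
    "total": 48231,
})

# The serving layer can then answer queries with a simple lookup.
doc = db.daily_views.find_one({"date": "2024-01-01", "metric": "page_views"})
print(doc["total"])
```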
This architecture also allows us to build prediction models (with tools such as Apache Mahout or Spark MLlib). Batch processing is used to analyse the historical data stored in the serving layer, while streaming data feeds real-time predictions. The goal of these models is to add value to the business.
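The sketch below shows the batch side of that workflow with Spark MLlib: training a simple classifier on historical data so the fitted model can later score records arriving through the streaming layer. The tiny in-memory dataset and column names (f1, f2, label) are stand-ins invented for the example.

```python
# Minimal sketch of a batch-trained prediction model with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("lambda-batch-model").getOrCreate()

# Stand-in for the historical data persisted by the batch layer.
history = spark.createDataFrame(
    [(1.0, 0.3, 1.0), (0.2, 1.5, 0.0), (0.9, 0.1, 1.0), (0.1, 1.1, 0.0)],
    ["f1", "f2", "label"],
)

# Assemble the numeric columns into the single feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(history)

# Fit a simple classifier on the historical data.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# The fitted model can later score records coming from the streaming layer;
# here we just score the training data to show the output columns.
model.transform(train).select("label", "prediction").show()
```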
The benefits of Big Data are:
- Improved and faster processing of information.
- Real-time data management.
- Management of unstructured data from diverse sources.
- A cloud architecture that is easy to maintain and scales horizontally.