10 Big Data Software Requirements
What are the core software components in a big data solution that delivers analytics? Although requirements certainly vary from project to project, here are ten software building blocks found in many big data rollouts. This presentation originated at Information Management magazine.

Image: iStock
1. Hadoop and MapReduce
Hadoop is an open source software framework for storing and processing big data across large clusters of commodity hardware. MapReduce is the programming paradigm that lets that processing scale across hundreds or thousands of servers in a Hadoop cluster. Popular Hadoop distributions come from Cloudera, Hortonworks and MapR, among others.
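
To make the map and reduce roles concrete, here is a minimal word-count job written against Hadoop's Java MapReduce API, closely following the standard tutorial example; the input and output paths are supplied on the command line and are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on each node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework splits the input, shuffles the intermediate (word, count) pairs to reducers and reruns failed tasks, which is where the scalability comes from.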

Image: Hadoop
2. Database/File System
The Hadoop Distributed File System (HDFS) manages the storage and retrieval of the data and metadata required for computation. Other popular file system and database approaches include HBase and Cassandra, two NoSQL databases designed to manage extremely large data sets.
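
As a small illustration of how an application talks to HDFS, the sketch below writes and then lists a file through Hadoop's Java FileSystem API; the paths are hypothetical, and the cluster address is assumed to come from a core-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits files into blocks and replicates them across DataNodes.
            Path file = new Path("/tmp/demo/hello.txt");  // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello, big data");
            }

            // List the directory, including metadata the NameNode tracks for each file.
            for (FileStatus status : fs.listStatus(new Path("/tmp/demo"))) {
                System.out.printf("%s size=%d replication=%d%n",
                        status.getPath(), status.getLen(), status.getReplication());
            }
            fs.close();
        }
    }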

Image: iStock
3. Pig High-Level Programming
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. It abstracts away the Java MapReduce idiom, raising MapReduce programming to a higher level in much the way SQL does for relational database management systems. Pig was originally developed at Yahoo Research around 2006 and moved into the Apache Software Foundation in 2007.
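
To show how much the language abstracts away, here is the classic word-count dataflow expressed in a few lines of Pig Latin and submitted through Pig's embedded Java API (PigServer); the file names are illustrative, and local mode is used only to keep the sketch self-contained.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // ExecType.LOCAL runs against the local file system; MAPREDUCE submits to a Hadoop cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Each registered statement is a line of Pig Latin; Pig compiles the
            // whole dataflow into one or more MapReduce jobs.
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

            // STORE triggers execution and writes the results.
            pig.store("counts", "wordcount_out");
            pig.shutdown();
        }
    }

Compare the four Pig Latin statements with the Java mapper and reducer classes in the Hadoop example above; the logic is the same, but Pig plans the jobs for you.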

Image: Pig/Hadoop
4. Hive Data Warehousing
Apache Hive is a data warehouse platform built on top of Hadoop. It supports querying and managing large datasets across distributed storage. It leverages a SQL-like language called HiveQL. The language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
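
A hedged sketch of what querying Hive looks like from an application, using the standard JDBC route to HiveServer2 (the hive-jdbc driver is assumed to be on the classpath); the connection details and table are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlExample {
        public static void main(String[] args) throws Exception {
            // The HiveServer2 JDBC driver; newer drivers register themselves automatically.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, database and credentials are placeholders.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL looks like SQL, but Hive compiles it into jobs over files in distributed storage.
                stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
                        + "(user_id STRING, url STRING, ts TIMESTAMP)");

                try (ResultSet rs = stmt.executeQuery(
                        "SELECT url, COUNT(*) AS hits FROM page_views "
                        + "GROUP BY url ORDER BY hits DESC LIMIT 10")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                    }
                }
            }
        }
    }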

Image: Edureka.co
5. Cascading
Cascading is a Java application development framework for building rich data analytics and data management applications that run across “a variety of computing environments,” with an emphasis on Hadoop and API-compatible distributions, according to Concurrent, the company behind Cascading.
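
As a rough illustration of Cascading's source-pipe-sink model, the sketch below (written against Cascading 2.x conventions) defines the simplest possible flow, copying text records from one HDFS location to another; paths are illustrative, and a real application would add operations to the pipe assembly.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class CopyFlow {
        public static void main(String[] args) {
            String inPath = args[0];   // e.g. an HDFS input directory
            String outPath = args[1];  // e.g. an HDFS output directory

            Properties properties = new Properties();
            AppProps.setApplicationJarClass(properties, CopyFlow.class);
            HadoopFlowConnector connector = new HadoopFlowConnector(properties);

            // Taps bind the pipe assembly to concrete storage (here, text files on HDFS).
            Tap inTap = new Hfs(new TextLine(), inPath);
            Tap outTap = new Hfs(new TextLine(), outPath);

            // The simplest possible assembly: a single pipe that passes records through.
            Pipe copy = new Pipe("copy");

            FlowDef flowDef = FlowDef.flowDef()
                    .setName("copy-flow")
                    .addSource(copy, inTap)
                    .addTailSink(copy, outTap);

            // connect() plans the flow into MapReduce jobs; complete() runs them.
            Flow flow = connector.connect(flowDef);
            flow.complete();
        }
    }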

Image: iStock
6. Big Data Integration Tools
Semi-automated modeling tools such as CR-X allow analytic models to be developed interactively and rapidly, and they can help set up the database that will run the analytics. CR-X is a real-time ETL (Extract, Transform, Load) big data integration tool and transformation engine.
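
CR-X itself is a commercial product, so the sketch below is not its API; it is only a generic, hand-coded illustration of the extract-transform-load steps such tools automate, with a hypothetical CSV layout, target table and connection details.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class SimpleEtl {
        public static void main(String[] args) throws Exception {
            // Target database connection; URL and credentials are placeholders.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/analytics", "etl", "secret");
                 PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO events (user_id, event_type, amount) VALUES (?, ?, ?)");
                 BufferedReader in = new BufferedReader(new FileReader("events.csv"))) {

                String line;
                while ((line = in.readLine()) != null) {
                    // Extract: one CSV record per line, e.g. "42,purchase,19.99"
                    String[] fields = line.split(",");

                    // Transform: normalize the event type and parse the amount.
                    String eventType = fields[1].trim().toLowerCase();
                    double amount = Double.parseDouble(fields[2]);

                    // Load: batch rows into the analytic target.
                    insert.setLong(1, Long.parseLong(fields[0]));
                    insert.setString(2, eventType);
                    insert.setDouble(3, amount);
                    insert.addBatch();
                }
                insert.executeBatch();
            }
        }
    }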

Image: iStock
7. Analytic Databases
Specialized scale-out analytic databases such as Pivotal Greenplum or IBM Netezza offer very fast loading and reloading of data for the analytic models.
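
Loading one of these databases typically happens in bulk rather than row by row. Since Greenplum is PostgreSQL-based, one hedged example is to stream a file through the PostgreSQL JDBC driver's COPY support, as sketched below; the server, credentials and table are placeholders, and exact COPY options vary by product and version.

    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class BulkLoad {
        public static void main(String[] args) throws Exception {
            // Greenplum speaks the PostgreSQL wire protocol, so the standard
            // PostgreSQL JDBC driver can connect; details are placeholders.
            String url = "jdbc:postgresql://gp-master:5432/analytics";
            try (Connection conn = DriverManager.getConnection(url, "loader", "secret")) {
                CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();

                // COPY streams rows in bulk, far faster than row-at-a-time INSERTs.
                long rows = copy.copyIn(
                        "COPY sales (sale_id, region, amount) FROM STDIN WITH CSV",
                        new FileReader("sales.csv"));
                System.out.println("Loaded " + rows + " rows");
            }
        }
    }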

Image: iStock
8. Customer Satisfaction Considerations
Big data analytical packages from ISVs (such as ClickFox) run against the database to address business issues such as customer satisfaction.

Image: Pixabay
9. Transactional Approaches
Transactional big data projects can’t rely on Hadoop, since its batch-oriented processing is not real time. For transactional systems that do not require a database with ACID (Atomicity, Consistency, Isolation, Durability) guarantees, NoSQL databases can be used, though their consistency guarantees can be weak. Scale-out SQL databases, a new breed of offering, are also worth watching in this area, and new entrants are emerging all the time.
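
As a hedged illustration of that trade-off, the sketch below writes a record to Apache Cassandra through the DataStax Java driver (3.x API assumed) and sets the consistency level per statement; the keyspace and table are hypothetical.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class NoSqlWrite {
        public static void main(String[] args) {
            // Contact point, keyspace and table are placeholders.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            try {
                Session session = cluster.connect("shop");

                // Consistency is tunable per statement: ONE is fast but weak, QUORUM
                // trades latency for a stronger guarantee. Neither gives multi-row
                // ACID transactions the way a relational database would.
                SimpleStatement insert = new SimpleStatement(
                        "INSERT INTO orders (order_id, customer_id, total) "
                        + "VALUES (uuid(), 42, 19.99)");
                insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
                session.execute(insert);
            } finally {
                cluster.close();
            }
        }
    }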

Image: Pixabay
10. Piecing It All Together
The image above shows the major components pieced together into a complete big data solution.

Image: Wikibon
Special thanks to Wikibon for many of the perspectives shared in this slideshow.