10 Big Data Software Requirements
What are the core software components in a big data solution that delivers analytics? Although requirements certainly vary from project to project, here are ten software building blocks found in many big data rollouts. This presentation originated at Information Management magazine.

Image: iStock
1. Hadoop and MapReduce
Hadoop is an open source software framework for storing and processing big data across large clusters of commodity hardware. MapReduce is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. Popular commercial Hadoop distributions come from Cloudera, Hortonworks and MapR, among others.
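
To make the paradigm concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using word counting as the example. It is plain Python rather than the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions; on a real cluster the framework would run many map and reduce tasks in parallel across servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    On a cluster the framework does this between the map and reduce phases.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, both steps can be spread across machines without coordination, which is where the scalability comes from.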

Image: Hadoop
2. Database/File System
The Hadoop Distributed File System (HDFS) manages the storage and retrieval of the data and metadata required for computation. Other popular approaches include HBase and Cassandra – two NoSQL databases that are designed to manage extremely large data sets.

Image: iStock
3. Pig High-Level Programming
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. It abstracts the programming away from the Java MapReduce idiom, raising MapReduce programming to a higher level, much as SQL does for relational database management systems. Pig was originally developed at Yahoo Research around 2006. In 2007, it was moved into the Apache Software Foundation.

Image: Pig/Hadoop
4. Hive Data Warehousing
Apache Hive is a data warehouse platform built on top of Hadoop. It supports querying and managing large datasets across distributed storage. It leverages a SQL-like language called HiveQL. The language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
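
As a rough illustration of the custom-mapper idea, the sketch below is a Python script of the kind Hive can call through its TRANSFORM clause, which streams rows to an external program as tab-separated lines and reads tab-separated lines back. The column layout (a user ID and a URL), the function names and the script name are assumptions made up for this example. A query along the lines of SELECT TRANSFORM (user_id, url) USING 'python extract_host.py' AS (user_id, host) FROM page_views would invoke it, where page_views is a hypothetical table.

```python
import sys
from urllib.parse import urlparse


def transform(rows):
    """Turn 'user_id<TAB>url' input lines into 'user_id<TAB>host' output lines.

    Rows that do not have exactly two tab-separated fields are skipped.
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue
        user_id, url = fields
        host = urlparse(url).netloc.lower()
        yield f"{user_id}\t{host}"


def run(in_stream, out_stream):
    """Stream rows from in_stream to out_stream, one output line per valid row."""
    for line in transform(in_stream):
        out_stream.write(line + "\n")


# As a standalone script, finish the file with: run(sys.stdin, sys.stdout)
```

Logic such as URL parsing is awkward to express in a SQL-like language, which is the kind of case where plugging in a script can be simpler than writing the query by hand.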

Image: Edureka.co
5. Cascading
Cascading is a Java application development framework for rich data analytics and data management apps running across “a variety of computing environments,” with an emphasis on Hadoop and API-compatible distributions, according to Concurrent – the company behind Cascading.

Image: iStock
6. Big Data Integration Tools
Semi-automated modeling tools such as CR-X allow models to be developed interactively and quickly, and the tools can help set up the database that will run the analytics. CR-X is a real-time ETL (Extract, Transform, Load) big data integration tool and transformation engine.
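
For readers new to the term, the three ETL stages can be sketched in a few lines of Python. This is a generic illustration using only the standard library and does not reflect how CR-X or any other product works; the sample data, the sales table and the function names are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one clean row, one padded row, one unusable row.
RAW = "customer,amount\nalice, 10.50\nbob,n/a\ncarol,7\n"


def extract(text):
    """Extract: parse the raw CSV text into dictionaries, one per row."""
    return csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: normalize names, convert amounts, drop unparseable rows."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows such as 'n/a'
        yield row["customer"].strip().lower(), amount


def load(records, conn):
    """Load: write the cleaned records into the analytics table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Chain the three stages and return (row count, total amount)."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
```

A real-time tool would apply the same stages continuously to incoming records rather than to a fixed batch.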

Image: iStock
7. Analytic Databases
Specialized scale-out analytic databases such as Pivotal Greenplum or IBM Netezza offer very fast loading and reloading of data for the analytic models.

Image: iStock
8. Customer Satisfaction Considerations
Big data analytical packages from ISVs (such as ClickFox) run against the database to address business issues such as customer satisfaction.

Image: Pixabay
9. Transactional Approaches
Transactional big data projects generally can’t rely on Hadoop, since it is batch-oriented rather than real-time. For transactional systems that do not require a database with ACID (Atomicity, Consistency, Isolation, Durability) guarantees, NoSQL databases can be used – though their consistency guarantees can be weak. Scale-out SQL databases, a new breed of offering, are also worth watching in this area. New entrants are emerging all the time.
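
To show what the ACID guarantees buy in practice, here is a small sketch of atomicity using SQLite from the Python standard library. SQLite is only a convenient stand-in here, not a big data system, and the accounts table and function names are assumptions for the example. A transfer that would overdraw an account fails partway through, and the database rolls back the half-finished work.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two accounts that cannot go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as a single all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A store without atomic multi-row updates would leave the application to detect and repair the case where the credit succeeded but the debit did not, which is the trade-off to weigh when choosing a NoSQL database for transactional work.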

Image: Pixabay
10. Piecing It All Together
The image above shows the major components pieced together into a complete big data solution.

Image: Wikibon
Special thanks to Wikibon for many of the perspectives shared in this slideshow.