10 ways to protect data in the cloud
Big data is generated by a variety of different gadgets and sensors, including security devices. The new report from the Cloud Security Alliance -- “100 Best Practices in Big Data Security and Privacy” -- looks at the best practices that should be implemented for real-time security/compliance monitoring.
1. Apply big data analytics to detect anomalous connections to the cluster
Why? To ensure only authorized connections are allowed on a cluster, as this makes up part of the trusted big data environment.

How? Use solutions like TLS/SSL, Kerberos, Secure European System for Applications in a Multi-Vendor Environment (SESAME), Internet Protocol Security (IPsec), or Secure Shell (SSH) to establish trusted connections to and, if needed, within a cluster, so that unauthorized connections are prevented. Use monitoring tools, such as a security information and event management (SIEM) solution, to watch for anomalous connections. Detection could be based, for instance, on connection behavior (e.g., a connection from a ‘bad Internet neighborhood’) or on alerts in the logs of the cluster systems that indicate an attempt to establish an unauthorized connection.
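As a rough illustration of the connection-behavior check described above, the following minimal sketch classifies source addresses against an allow-list and a block-list. The network ranges and the function names (`classify_connection`, `anomalous_connections`) are hypothetical and not taken from the CSA report; a real SIEM rule would draw on threat-intelligence feeds and far richer context.

```python
import ipaddress

# Hypothetical allow-list of networks trusted to reach the cluster.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.10.0/24"),
]

# Hypothetical block-list standing in for a "bad Internet neighborhood" feed.
BAD_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def classify_connection(source_ip: str) -> str:
    """Return 'trusted', 'bad_neighborhood', or 'unknown' for a source address."""
    address = ipaddress.ip_address(source_ip)
    if any(address in network for network in BAD_NETWORKS):
        return "bad_neighborhood"
    if any(address in network for network in TRUSTED_NETWORKS):
        return "trusted"
    return "unknown"


def anomalous_connections(source_ips):
    """Return the addresses that are not on the trusted list, in input order."""
    return [ip for ip in source_ips if classify_connection(ip) != "trusted"]
```

Anything that is not explicitly trusted is surfaced for review, which mirrors the "only authorized connections" goal rather than relying on the block-list alone.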
2. Mine logging events
Why? To ensure that the big data infrastructure remains compliant with the assigned risk acceptance profile of the infrastructure.

How?
• Mine the events in log files to monitor security, as a SIEM tool does.
• Apply other algorithms or principles (such as machine learning) to mine events for potential new security insights.
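A very small example of mining log events is counting authentication failures per source and flagging the sources that exceed a threshold. The sketch below assumes a made-up log format (`AUTH_FAILURE user=<name> src=<ip>`) and a hypothetical threshold; actual cluster logs and SIEM rules will look different.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> AUTH_FAILURE user=<name> src=<ip>".
FAILURE_PATTERN = re.compile(r"AUTH_FAILURE user=(?P<user>\S+) src=(?P<src>\S+)")


def failed_logins_by_source(log_lines):
    """Count authentication failures per source address."""
    counts = Counter()
    for line in log_lines:
        match = FAILURE_PATTERN.search(line)
        if match:
            counts[match.group("src")] += 1
    return counts


def sources_over_threshold(log_lines, threshold=3):
    """Return the sorted sources with at least `threshold` failures."""
    counts = failed_logins_by_source(log_lines)
    return sorted(src for src, count in counts.items() if count >= threshold)
```

A fixed threshold like this is the rule-based end of the spectrum; the machine-learning approaches mentioned above would replace it with a learned baseline.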
3. Implement front-end systems
Why? To parse requests and stop bad ones. Front-end systems are not new to security. Examples are routers, application-level firewalls and database-access firewalls. These systems typically parse the request (based on, for instance, syntax signatures or behavior profiles) and stop bad requests. The same principle can be used to focus on application or data requests in a big data infrastructure environment (e.g., MapReduce messages).

How? Deploy front-end systems in multiple stages. For example, use a router for the network; an application-level firewall to allow or block applications; and a dedicated big data front-end system to analyze typical big data inquiries (like Hadoop requests). Additional technology, such as software-defined networking (SDN), may be helpful for implementation and deployment.
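The multi-stage idea can be sketched as a chain of checks in which the first stage that objects stops the request. Everything here is illustrative: the request shape (a dict with `application` and `payload` keys), the allowed application names and the blocked signatures are assumptions, not part of any real firewall product or of the CSA report.

```python
import re

# Hypothetical set of applications permitted to reach the cluster.
ALLOWED_APPLICATIONS = {"mapreduce", "hive"}

# Hypothetical syntax signatures for requests that should be stopped.
SUSPICIOUS_PATTERNS = [
    re.compile(r";\s*drop\s+table", re.IGNORECASE),
    re.compile(r"\.\./"),
]


def application_stage(request):
    """Stage 1: allow or block by application name."""
    if request["application"] not in ALLOWED_APPLICATIONS:
        return "application not allowed"
    return None


def signature_stage(request):
    """Stage 2: block payloads that match a known-bad signature."""
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(request["payload"]):
            return "payload matches a blocked signature"
    return None


STAGES = [application_stage, signature_stage]


def screen_request(request):
    """Return (allowed, reason); the first stage that objects stops the request."""
    for stage in STAGES:
        reason = stage(request)
        if reason is not None:
            return False, reason
    return True, "ok"
```

Keeping each stage as a separate function makes it straightforward to add a behavior-profile stage later without touching the existing checks.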
4. Consider cloud-level security
Why? To keep the cloud from becoming the “Achilles heel” of the big data infrastructure stack. Big data deployments are moving to the cloud. If such a deployment lives on a public cloud, that cloud becomes part of the big data infrastructure stack.

How?
• Download “CSA Guidance for Critical Areas of Focus in Cloud Computing V3.0.”
• Implement other CSA best practices.
• Encourage Cloud Service Providers to become CSA STAR-certified compliant.
5. Utilize cluster-level security
Why? To ensure that security for the big data infrastructure is approached at multiple levels. Different components make up this infrastructure, and the cluster is one of them.

How? Apply, where applicable, best security practices for the cluster. These include:
• Use Kerberos or SESAME in a Hadoop cluster for authentication.
• Secure the Hadoop distributed file system (HDFS) using file and directory permissions.
• Utilize access control lists for access (e.g., role-based, attribute-based).
• Apply information flow control using mandatory access control.
The implementation of security controls also depends heavily on the cluster distribution being used. In case of strict security requirements (e.g., high confidentiality of the data being used), consider looking at solutions like Sqrrl, which provide fine-grained access control at the cell level.
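HDFS file and directory permissions follow a POSIX-like owner/group/other model, so the basic decision can be sketched as below. This is a simplified illustration only: the `Entry` class and `is_allowed` function are hypothetical names, and the sketch ignores the superuser, extended ACL entries and the sticky bit that a real cluster also takes into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A file or directory with POSIX-style ownership and mode bits."""
    owner: str
    group: str
    mode: int  # octal mode, e.g. 0o750


def is_allowed(entry: Entry, user: str, groups: set, action: str) -> bool:
    """Check `action` ('read', 'write' or 'execute') against the mode bits."""
    bit = {"read": 4, "write": 2, "execute": 1}[action]
    if user == entry.owner:
        perms = (entry.mode >> 6) & 7   # owner bits
    elif entry.group in groups:
        perms = (entry.mode >> 3) & 7   # group bits
    else:
        perms = entry.mode & 7          # other bits
    return bool(perms & bit)
```

With a mode of `0o750`, for instance, the owner can read and write, members of the group can read but not write, and everyone else is denied.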
6. Apply application-level security
Why? To secure applications in the infrastructure stack. Over the last few years, attackers have shifted their focus from operating systems to databases to applications.

How?
• Apply secure software development best practices, like OWASP (owasp.org) for Web-based applications.
• Execute vulnerability assessments and application penetration tests on the application on an ongoing and scheduled basis.
7. Adhere to laws and regulations
Why? To avoid legal issues when collecting and managing data. Because of laws and regulations that exist worldwide, specifically those that relate to privacy rights, individuals who gather data cannot monitor or use every data item collected. While many regulations are in place to protect consumers, they also create a variety of challenges for big data collection that will hopefully be resolved over time.

How? Follow the laws and regulations (e.g., privacy laws) for each step in the data lifecycle. These include:
• Collection of data
• Storage of data
• Transmission of data
• Use of data
• Destruction of data
Physical and virtual locations for each step in the data lifecycle may not be the same.
8. Reflect on ethical considerations
Why? To address both technical and ethical questions that may arise. The fact that one has big data doesn’t necessarily mean that one can just use that data. There is always a fine line between (1) what is technically possible and (2) what is ethically correct. The latter is also shaped by legal regulations and the organization’s culture, among other factors.

How? There are no clear guidelines concerning ethical considerations related to big data usage. At a minimum, big data users must take into account all applicable privacy and legal regulations. Additionally, users should consider ethical discussions related to their organizations, regions, businesses, and so forth.
9. Monitor evasion attacks
Why? To avoid potential system attacks and/or unauthorized access. Evasion attacks are meant to circumvent big data infrastructure security measures and avoid detection. It is important to minimize these occurrences as much as possible.

How? As evasion attacks evolve constantly, it is not always easy to stop them. As part of a defense-in-depth approach, consider applying different monitoring algorithms (like machine learning) to mine the data. Look for insights related to potential evasion of monitoring, in addition to signature-based, rule-based, anomaly-based and specification-based detection schemes.
10. Track data-poisoning attacks
Why? To prevent monitoring systems from being misled, crashing, misbehaving or providing misinterpreted data due to malformed data. These types of attacks are aimed at falsifying data, letting the monitoring system believe nothing is wrong.

How?
• Consider applying front-end systems and behavioral methods to perform input validation, process the data, and determine right from wrong as much as possible.
• Authenticate sources of data and maintain logs, not only to prevent unauthorized data injection but also to establish accountability.
• Use the monitoring system to watch for strange behavior, like a spike in central processing unit (CPU) and memory load for prolonged periods of time, or disk space filling up quickly.
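The "prolonged spike" check in the last point can be approximated by looking for a run of consecutive readings above a threshold. The sketch below is a minimal illustration; the function name, the 0.9 threshold and the run length of five samples are assumptions chosen for the example, not values recommended by the report.

```python
def sustained_spike(samples, threshold=0.9, min_consecutive=5):
    """Return True if `samples` contains at least `min_consecutive`
    readings in a row that are above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a high reading, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

Requiring consecutive readings avoids alerting on a single short burst, at the cost of missing load that oscillates around the threshold; the same function works for memory or disk-usage fractions.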