DATA INTEGRITY IN CLOUD: ISSUES AND CURRENT SOLUTIONS
Keywords:
Big data security, log analysis, information flow control, anomaly detection
Abstract
Cloud computing is enabling new innovations in big data, and cloud analytic applications are at the heart of this shift. These applications make big data processing easy, fast, scalable, and cost-effective, but they also pose many security risks. Security breaches caused by malicious, vulnerable, or misconfigured analytic applications are considered the top security risks to big data, and coupling data analytics with the cloud expands that risk further. Effective security measures, delivered by cloud analytic providers, for detecting such malicious and anomalous activities are still missing. This paper presents real-time security monitoring as a service (SMaaS), a novel framework that detects security anomalies in cloud analytic applications running on Hadoop clusters. It targets vulnerable, malicious, and misconfigured applications that violate data integrity and confidentiality. To achieve this goal, we leverage a big data pipeline that combines advanced software technologies (Apache NiFi, Hive, and Zeppelin) to automate the collection, management, analysis, and visualization of log data from multiple sources, making the data cohesive and comprehensive for security inspection. SMaaS monitors a candidate application by collecting its log data in real time. It then analyzes the log data to model the application's execution in terms of information flow. The information flow model is crucial for profiling the processing activities conducted throughout the application's execution, and this profile in turn supports the detection of various types of security anomalies. We evaluate the detection effectiveness and performance efficiency of our framework through experiments on benchmark applications. The evaluation results demonstrate that our system is both viable and efficient.
Our system requires no modification to the monitored cluster and imposes no overhead on its performance.
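The idea of modeling an application's execution as information flow derived from log data can be illustrated with a minimal sketch. This is not the SMaaS implementation: the record fields (`app`, `op`, `path`), the function names, and the rule that every path read so far flows to each later write are all assumptions made for illustration. Real Hadoop logs would first need to be parsed into this shape.

```python
from collections import defaultdict

def build_flows(records):
    """Derive {app_id: set of (source_path, sink_path)} from ordered log records.

    Assumed (hypothetical) record shape: {"app": str, "op": "read"|"write", "path": str}.
    Assumed flow rule: data read earlier by an app may flow to any later write.
    """
    seen_reads = defaultdict(list)  # app -> paths read so far, in first-seen order
    flows = defaultdict(set)        # app -> observed (source, sink) edges
    for rec in records:
        app, op, path = rec["app"], rec["op"], rec["path"]
        if op == "read":
            if path not in seen_reads[app]:
                seen_reads[app].append(path)
        elif op == "write":
            for src in seen_reads[app]:
                flows[app].add((src, path))
    return dict(flows)

def unexpected_flows(observed, allowed):
    """Return observed flow edges that are absent from the allowed profile, sorted."""
    return sorted(observed - allowed)

# Hypothetical log records for one application.
records = [
    {"app": "app_1", "op": "read", "path": "/data/in"},
    {"app": "app_1", "op": "write", "path": "/data/out"},
    {"app": "app_1", "op": "write", "path": "/tmp/leak"},
]
flows = build_flows(records)
profile = {("/data/in", "/data/out")}  # flows considered normal for app_1
anomalies = unexpected_flows(flows["app_1"], profile)
```

Here the write to `/tmp/leak` yields a flow edge that is absent from the assumed profile, so it is reported as an anomaly. A set of edges keeps the comparison against a profile simple, at the cost of ignoring ordering and data volume.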