Information Security in Big Data: Data Integrity

ABSTRACT– This paper focuses on data integrity, which aims at assuring and maintaining the completeness and accuracy of data throughout its entire lifecycle. In other words, the study looks into ways of ensuring that data is secure and of the intended quality from the installation of data-related systems to their disposal. The research also emphasizes the whole process of handling information while ensuring that it is free from external interference.

On that account, the paper highlights and develops various strategies and ideas that, upon implementation, would to some extent address the current state of big data security. The paper is structured around a number of problems. First, it raises the alarm on the threats associated with maintaining the integrity of big data, such as its complexity and security risks. The research also highlights the need to avoid installing software hastily while overlooking system quality. Further, the study presents two case studies from which lessons are drawn.

The first case is that of the Hadoop big data infrastructure system, while the second is the immense technical glitch of 8th July 2015. From these examples, the paper identifies four concepts that can be advanced to build a more comprehensive big data security system. These strategies are aimed at continually tracking and monitoring origins, encrypting and decrypting data to manage threat intelligence, and debugging or overhauling security systems in real time. Lastly, this study recommends investing adequate time, resources, and technical skills when installing big data security infrastructure.

Keywords: Big data, Data integrity, Threat Intelligence, Lifecycle


With current technological advancement, the development of data mining poses a series of threats to the security and privacy of sensitive information [1]. Hybrid or public cloud computing, for instance, is today a popular big data technique, especially where the storage, transfer, and sharing of information are involved [2]. Further, with the increased use of the internet through modern technologies such as smartphones and broadband, cases of hacking, tracking, leaking, and interfering with confidential information are on the rise [3]. These gadgets also contribute to the global increase in big data. The complexity of information handling techniques such as capture, analysis, visualization, and updating is also on the rise. On that note, businesses should exercise caution when dealing with big data. They should also be aware of the potential threats facing the integrity of the company’s private information [4].


Regardless of how much businesses would like to dispute it, big data security is today in a glaringly sorry state [5]. Unfortunately, the same companies that are unwilling to face up to this problem end up suffering severe data losses, which affect their operations immensely. This paper intends to address some of the challenges presented by the old systems. The aim is to improve certain aspects of this software, develop more efficient systems, or come up with alternative ideas that solve the problems associated with previous systems.

After installing new big data software, most companies sit back and relax. They assume that by installing these systems, all is set for carrying out business [6]. After all, they have invested a lot of resources and met the stipulated deadlines, aims, and objectives. Further, these companies take pride in enabling all the security settings in their database. Ironically, such firms end up being caught off guard by security threats, hackers, and other data-loss hazards [7]. What they forget is that the integrity of information depends on a continuous big data assessment process. Business records should be evaluated on a regular basis throughout the lifecycle of the particular documentation [8].

One of the imminent dangers facing big data is its time-sensitive nature [9]. Its software is presently being produced at a swift rate, despite the fact that it manages massive volumes of information. When developing applications, most organizations consider mainly the need to meet deadlines and the extent to which software upgrades affect their daily operations.

Businesses hence rush to commission the data provider who can develop a big data system in the shortest time possible. In order to continue running their transactions, they overlook the quality and security of the system. In turn, the resulting data management system is characterized by a series of loopholes which hackers and other threats use to manipulate or corrupt the entire software. Businesses are also lured into focusing on the cost implications of the applications. On that note, companies tend to paste new code and ideas into the old systems rather than overhauling the whole software. Therefore, the big data market is likely to suffer significant security breaches. According to a survey by Gartner, most organizations have neither taken security measures seriously nor developed the necessary infrastructure to protect their data.


Data integrity is the maintenance of data accuracy in any system that is responsible for the storage and retrieval of large amounts of information. Designing a system that is capable of securing and guaranteeing information and data storage is paramount for any organization. One of the newer approaches that has found much appreciation in recent times is cloud computing [32]. In this system, data storage does not depend on the space available on local computers. The data and information are stored in the cloud, with virtually limitless space. Some of the characteristics of the cloud system are:

  • Excess Data storage

All cloud systems are managed by a high-performance storage area network. This means that if any autonomous server tries to crack the stored data, the data is protected from both internal and external attacks [31].

  • Client and Support Stability

The user can increase or decrease the data capacity in the cloud server at will. The stability of the cloud system guarantees that every bit of stored data can be regenerated in future.

  • Accurate and Efficient Computing Environment.

Data should be operated on in a dynamic computing environment. Cloud services offer a platform where the user can delete information or manipulate it at the time of request. This makes the system more efficient to use.

Figure 1: Cloud architecture and data processing way


Many experts argue that the cloud system is protected from malicious intrusion whenever an adversary tries to append or delete data. Flexible distributed storage integrity auditing systems allow any intentional deletion or alteration of the data to be traced. Cloud computing thus has tremendous advantages for the user. The current third-party auditor (TPA) method, however, does not allow auditing for any form of intrusion. Moreover, it does not offer an effective way of handling server failure, which exposes the user to the risk of losing data [31]. In addition, it does not provide a route for verifying integrity whenever the user wants to access previous data. Cloud service providers (CSPs) are offering solutions to all these challenges, since they guarantee the security of the data.
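The tracing of deleted, altered, or appended data described above can be illustrated with a minimal sketch. It assumes that the data owner records a SHA-256 digest for every block at upload time and later compares those digests with what the cloud server returns. The function names (digest, build_manifest, audit) and the block identifiers are hypothetical and are not taken from any particular auditing scheme or from the cited works.

```python
import hashlib
from typing import Dict, List


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of one stored block."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(blocks: Dict[str, bytes]) -> Dict[str, str]:
    """Record one digest per block at upload time (kept by the data owner)."""
    return {block_id: digest(content) for block_id, content in blocks.items()}


def audit(manifest: Dict[str, str], stored: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Compare what the server holds now against the recorded manifest."""
    report: Dict[str, List[str]] = {"missing": [], "altered": [], "appended": []}
    for block_id, expected in sorted(manifest.items()):
        if block_id not in stored:
            report["missing"].append(block_id)      # block was deleted
        elif digest(stored[block_id]) != expected:
            report["altered"].append(block_id)      # block content changed
    for block_id in sorted(stored):
        if block_id not in manifest:
            report["appended"].append(block_id)     # block was never uploaded by the owner
    return report


if __name__ == "__main__":
    uploaded = {"a": b"alpha", "b": b"beta", "c": b"gamma"}
    manifest = build_manifest(uploaded)
    on_server = {"a": b"alpha", "b": b"BETA", "d": b"delta"}
    print(audit(manifest, on_server))
```

Such a check only detects interference; it does not recover lost data, which is why the concern about server failure raised above still applies.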


Further, this paper takes the Hadoop system of big data management as an example of a vulnerable system [15]. The aim is to point out some of the problems associated with such earlier programs and suggest possible improvements. Hadoop is open-source software. It is based on a Java programming framework which supports the storage and processing of extremely large data sets [16]. To achieve this, the system distributes these vast volumes of information across a computing environment. It runs applications on clusters of commodity hardware. It is also a project of the Apache Software Foundation.

Some of the reasons why Hadoop is gaining popularity at a fast rate are that clients are enticed by its ability to store any kind of information quickly [17]. Likewise, it has tremendous processing power and can handle an almost unlimited number of concurrent tasks. The computing power is directly proportional to the number of nodes used [12]. With the current increase in the variety of data, such as that from the internet and social media, Hadoop proves to be flexible as it allows for the storage of unprocessed data. Unstructured information such as videos, images, and texts can thus be stored without necessarily being sorted [13]. This aspect is in contrast to traditional database systems. Businesses are also attracted by the low cost of the software. It is cheap because open-source frameworks are usually free and use commodity hardware to store massive volumes of data [18]. By simply adding nodes, companies can scale their systems to store and process additional records. The only security feature in Hadoop is its fault tolerance. Applications and data are protected against hardware failure. Whenever one of the nodes goes down, its tasks are automatically redirected to the remaining nodes, ensuring that the distributed computation does not fail. By design, the software stores multiple copies of all the available records.
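The fault tolerance described above, where multiple copies of each record keep data available when a node fails, can be sketched with a toy model. This is not the HDFS placement policy or API. The round-robin placement, the function names, and the node and block labels are assumptions made purely for illustration.

```python
from typing import Dict, List, Set


def place_replicas(block_ids: List[str], nodes: List[str], replication: int = 3) -> Dict[str, List[str]]:
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement: Dict[str, List[str]] = {}
    for i, block_id in enumerate(block_ids):
        placement[block_id] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement: Dict[str, List[str]], failed: Set[str]) -> Set[str]:
    """A block stays readable while at least one node holding a copy is alive."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    blocks = ["b0", "b1", "b2", "b3", "b4", "b5"]
    placement = place_replicas(blocks, nodes)
    # With three copies per block, losing a single node loses no data.
    print(sorted(readable_blocks(placement, failed={"n2"})))
    # Losing three of four nodes leaves some blocks without a live copy.
    print(sorted(readable_blocks(placement, failed={"n1", "n2", "n3"})))
```

The sketch also shows the limit of this feature: replication protects availability against hardware failure, but it does nothing to stop an unauthorized party from reading or altering any of the copies.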

On the other hand, Hadoop presents a number of challenges that this paper seeks to address.

The first issue concerns data security [19]. Since the system stores information in a fragmented manner, the data is vulnerable to unauthorized access and prone to security threats such as malware attacks. Despite numerous efforts to develop technologies and tools, such as the Kerberos authentication protocol, to secure the system, Hadoop environments remain insecure [20]. Second, the software lacks a comprehensive data governance and management structure. One of the key concerns is that the application does not have the necessary tools for standardizing and maintaining the quality of records. Another significant shortfall is that it lacks easy-to-use, all-encompassing techniques for cleansing data, managing it, and governing metadata such as code lists and structures [21]. Third, Hadoop presents a wide talent gap in the sense that it relies on MapReduce skills, with which few entry-level programmers are fully conversant. By contrast, most programmers use SQL technology, which has not been put to full use by Hadoop. Also, since administering the software is both an art and a science, it requires low-level knowledge of hardware, operating systems, and kernel settings.

Lastly, MapReduce programming is poorly suited to most advanced tasks such as interactive and iterative analysis. The file-intensive nature of MapReduce means that its nodes cannot communicate except through shuffles and sorts. Iterative algorithms, on the other hand, require multiple map-shuffle and sort-reduce phases in order to complete complex analytic computing [22]. The Hadoop system hence only suits simple data requests and problems that can be split into independent units. Sadly, most companies are fascinated by these elementary features because they cost less in development and in hiring IT technicians. They forget that the more basic a system is, the more insecure and vulnerable it is.
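The map, shuffle/sort, and reduce phases mentioned above can be illustrated with a single-process word count. This is a sketch of the programming model only, written in plain Python. It does not use Hadoop or its APIs, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(record: str) -> List[Tuple[str, int]]:
    """Map: each record is processed independently and emits (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle and sort: the only point at which data from different mappers meets."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: each key is aggregated independently of every other key."""
    return key, sum(values)


def run_job(records: List[str]) -> Dict[str, int]:
    """Run one complete map -> shuffle -> reduce pass over the records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big", "data integrity"]))
```

A job of this shape works because every record and every key can be handled independently. An iterative algorithm would have to call run_job repeatedly, feeding each pass the output of the previous one, which is the overhead the paragraph above refers to.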

In light of these challenges, this paper advances a number of strategies to ensure the integrity of big data [23]. If businesses are to succeed in maintaining the accuracy, wholeness, and security of information, they should consider the following factors. These approaches aim at ensuring that information is not breached through leakage and tracking of confidential records, and is free from security risks such as intrusion by cybercriminals [24]. Importantly, these measures should be put into practice throughout the lifecycle of a given software system.

  • Consistently track and monitor origins
  • Encrypting and decrypting/serializing data to enhance threat intelligence [24]
  • Real-time debugging/overhauling of security systems
  • Invest adequate time, technical skills, and resources while installing security structures


Managing big data requires that original sources are closely monitored and tracked. This research bases its argument on the fact that open-source software is vulnerable to security interference [25]. The cited Hadoop is a system whose layered structure, large volume of data, and origin are considered a security threat. Ray Burgemeestre, a renowned security engineer, wondered how users could tell that a particular system is secure even after enabling the recommended security settings [26]. Such inconsistencies raise the alarm that most open-source applications need to improve their security measures. The greatest threat to such open-source software is the fragmentation of information into different clusters [27]. It becomes difficult to maintain not only the confidentiality of the stored records but also their integrity throughout the system’s lifespan. Bolke de Bruin, an ING bank researcher, notes that ignoring security concerns because of their presumed complexity presents a severe threat to big data management [26].


The big data infrastructure is such that it experiences an inflow of petabytes of highly sensitive information into its numerous interconnected clusters [27]. One of the methods of securing open-source big data systems is the Kerberos protocol. It is founded on authentication techniques which employ shared secrets. In order to validate information, particular individuals within an organization, say X and Y, are entrusted with a common confidential password [28]. Communication between X and Y can thus be certified on the assumption that no other party (Z) is aware of the password. However, as the saying goes, no secret exists between two individuals. Passwords must be entered through some medium or network. Suppose Z monitors the system constantly. In the long run, Z may end up discovering the password. The Kerberos protocol operates on the principle of secret-key cryptography to solve this problem. X and Y share cryptographic keys rather than passwords [29]. X becomes the encryptor whereas Y is the decryptor of the agreed symmetric secret code. Both parties must, therefore, possess the knowledge to encode or decode the access key in order to operate the big data system.

Figure 2: Kerberos Protocol (TechNet, 2).
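The principle that X can prove knowledge of a shared secret key to Y without ever sending the key across a network that Z may be monitoring can be sketched as a challenge-response exchange. This is a deliberately simplified illustration and not the Kerberos protocol itself, which additionally involves a key distribution centre, tickets, and timestamps. The function names are hypothetical. The calls to hmac, hashlib, and secrets come from the Python standard library.

```python
import hashlib
import hmac
import secrets


def new_shared_key() -> bytes:
    """Generate a random 256-bit key that X and Y agree on out of band."""
    return secrets.token_bytes(32)


def make_challenge() -> bytes:
    """Y sends X a fresh random challenge, so old responses cannot be replayed."""
    return secrets.token_bytes(16)


def respond(key: bytes, challenge: bytes) -> bytes:
    """X answers with a keyed digest of the challenge; the key itself is never sent."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(key: bytes, challenge: bytes, response: bytes) -> bool:
    """Y recomputes the digest and compares it in constant time."""
    return hmac.compare_digest(respond(key, challenge), response)


if __name__ == "__main__":
    key_xy = new_shared_key()      # known only to X and Y
    challenge = make_challenge()   # visible to Z on the network
    answer = respond(key_xy, challenge)
    print("X accepted:", verify(key_xy, challenge, answer))
    # Z does not hold the key, so Z's answer does not verify.
    forged = respond(new_shared_key(), challenge)
    print("Z accepted:", verify(key_xy, challenge, forged))
```

Z sees only the challenge and the response, neither of which reveals the key, which is the advantage of shared keys over passwords typed across the network.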

This research also presents a new idea of secret key combination. In such a case, there exists a shared access key (K). Further, X and Y each know just a part of the login code. For this study, X knows the first part of the key (X1) whereas Y is only aware of the second part (Y2). Therefore, the complete secret password is K = X1 + Y2. This approach is aimed at combining some aspects of the old systems with those of modern-day security structures. It takes into account the fact that not all big data managers are highly skilled in technical matters. The K = X1 + Y2 model, however, acknowledges that an average level of cryptographic detail is required to ensure the security of big data origins [29].
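A minimal sketch of the K = X1 + Y2 model follows. It assumes that "+" denotes concatenation of the two parts and that the system stores only a SHA-256 fingerprint of K rather than K itself. Both points are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import Tuple


def split_key(key: bytes) -> Tuple[bytes, bytes]:
    """Split K into a first part for X (X1) and a second part for Y (Y2)."""
    middle = len(key) // 2
    return key[:middle], key[middle:]


def combine_parts(x1: bytes, y2: bytes) -> bytes:
    """Rebuild K = X1 + Y2, reading '+' as concatenation (an assumption)."""
    return x1 + y2


def fingerprint(key: bytes) -> bytes:
    """The system keeps only this digest, not the key itself."""
    return hashlib.sha256(key).digest()


def unlock(stored_fingerprint: bytes, x1: bytes, y2: bytes) -> bool:
    """Access is granted only when both parts are present and in the right order."""
    candidate = fingerprint(combine_parts(x1, y2))
    return hmac.compare_digest(stored_fingerprint, candidate)


if __name__ == "__main__":
    k = bytes(range(32))                    # illustrative key; use a random key in practice
    x1, y2 = split_key(k)
    stored = fingerprint(k)
    print(unlock(stored, x1, y2))           # both holders cooperate
    print(unlock(stored, x1, b"\x00" * 16))  # X alone guessing Y's part
```

Under this reading, each holder already knows half of the key, so the protection against a single holder rests on the length of the other half.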


The hitch of July 8th should be an eye-opener to big data enterprises. A computer fault affected all the mainland flights in the United States. Moreover, the incident had serious economic implications. For a whole day, the NYSE was not in operation. Similarly, the Wall Street Journal’s website was down. All these incidents took place a day after Chinese stocks suffered a major drop [30]. Given the volatile nature of the Chinese stocks, this suggests a coordinated attack carried out by cyber terrorists [28]. It also makes clear to information security teams that no single software system is free from intrusion.


This paper proposes that there is a dire need to keep updating big data systems continuously throughout their lifecycle. It would also be prudent to overhaul the whole system under certain conditions. The longer a particular piece of software is available in the market, the higher the chances that its vulnerabilities will be identified [29]. Open-source big data structures are particularly at risk because attackers can dig deep into the original code both before and after vulnerabilities emerge. Studies show that it does not take more than one year before shortfalls are identified in the big data infrastructure market [27]. Hackers are devising new methods of breaking into existing software every day. Companies should, therefore, service their security systems continually to identify any attempted intrusion and seal possible loopholes.

In addition, there are a number of emerging risks, especially in mobile security. Businesses should stay up to date on these developing threats in order to blacklist the related consumer-based applications [28]. Some of these attacks reuse old tactics to target mobile-specific services. An example is the use of man-in-the-middle attacks by SideSteppers. They employ earlier methods rather than exploring new techniques and vulnerabilities. Once they launch a successful attack on a single user, the hackers can gain access to the whole system, leading to the loss of both individual and company data [29].

Once a big data system has undergone a series of modifications and upgrades, there comes a time when it finally becomes obsolete. It could also be phased out by more advanced systems that are more robust and have improved security features [28]. Businesses should take note of when the time is ripe for either replacing the whole system or updating it. From the cited case of the 8th July technical glitch, we can learn that many of these big data systems need a total overhaul because of how long they have been in existence [30]. They have been layered repeatedly and stretched to accommodate particular functions until they cannot be scaled further. In such a case, it would be right for the business to acquire a new system, even if it means finding alternative channels or shutting down business for some time. This is because these old systems remain exposed to previously used tactics, making them vulnerable to hacking and to the tracking of data by cybercriminals [29].

Unfortunately, there are instances when companies overlook the idea of a fresh system and instead hire college students to make these out-of-date programs work on newer machines. In this case, the challenge is that despite the accessibility of the database, the source code is usually unavailable [24]. The cheapest solution is thus to merge new ideas into the old database rather than designing a new information system and source code. Realistically, each additional layer of code pasted onto the big data infrastructure makes it more prone to failure or to seizure by security threats [27].


Whereas the rapid pace at which innovations in big data management systems are emerging may sound impressive, it poses a serious security problem. These fast-moving big data infrastructures include Hadoop, Kafka, and Spark, among others [30]. Adrian speculates that enterprises built on such systems are likely to crumble badly when they go mainstream. He suggests that installing such systems is like constructing skyscraper favelas of code in earthquake zones [24]. This paper proposes that companies should instead focus on building their big data systems on a firm foundation of adequate resources and time. As a result, the maintenance and running costs of the software will always be lower than the initial expense throughout its lifecycle.

Most new code is today written speedily to meet the so-called “current wave of software development” in Silicon Valley. It is also designed to meet the modern-day needs of the “angel investor” or venture-capital funding model. Financiers greedily rush to scale up businesses so that they can monopolize the market in the shortest time possible.

This is usually done using network effects, which increase the vulnerability of the software as the number of platform users increases [25]. The system could eventually work, but software engineers use shortcuts such as “duct tape” in the process of coding to hold various parts together [28]. In this case, the programmer gives explanations that illustrate the mess. The cycle begins with the first crash of the software. What follows is a continued series of temptations to fix the mess using additional duct tape. With the accumulation of more information from different sources, the big data structure will finally become extremely complex in the sense that particular pieces of code interact with programs, data, and people [25]. Such an intricate system requires constant debugging, which is a tedious and expensive process throughout its lifecycle. In the end, the complexity and maintenance costs become so high that they cause the whole system to crash.


In a nutshell, this study looks at ways to maintain the integrity of big data systems before the actual installation of the software and during its lifespan. It acknowledges that information systems should be under constant maintenance and upgrading, and on the lookout for possible threats [29]. This is with regard to the ever-growing volume and complexity of data from different sources such as social media, especially in the current age of technological advancement.

Further, businesses must admit that big data technology faces numerous security challenges such as hacking, tracking, and the open-source nature of information. The research draws its lessons from the 8th July 2015 Great Technical Glitch case and the Hadoop big data infrastructure system [13]. Businesses should, therefore, not relax after successfully installing data management systems. Instead, companies should come up with ways to ensure that the integrity of their database is maintained with regard to the accuracy, security, quality, and completeness of information. First, they should engage in continuous tracking and monitoring of data origins. Secondly, they should encrypt and decrypt data to enhance threat intelligence [23]. Thirdly, they should debug or overhaul their software at the appropriate time. Finally, organizations should invest in technical skills, time, and other resources at the outset to avoid recurrent and unnecessary maintenance and running costs.