Differences between Hadoop V1, Hadoop V2, and Hadoop V3

Storing big data and running applications over it are core demands on modern computing, especially in the current fast-paced information environment. Beyond the architecture of a single computer, processing data at this scale requires distributed storage and computing power. This study explores the differences among Hadoop V1, V2, and V3 as platforms for distributed data storage and application execution. The discussion compares the three versions of Hadoop along three lines: usage, the functions of their daemons, and security.

Background

Hadoop is open-source software that uses a cluster of commodity hardware to store data and run applications over it. Specifically, Hadoop is an Apache, Java-based platform whose primary role is data storage and distribution within a clustered system (Gangumalla, 2012). For example, in the energy industry, where reliability is critical, Hadoop supports the analysis of plant operating data to predict maintenance needs, with Internet of Things (IoT) devices feeding big data into such programs. Hadoop can therefore be described as a platform for improving the effectiveness of software and hardware within an extensive system.

File storage and distribution are closely connected to the role of Hadoop. According to Chakraborty et al. (2017), the ability to find and use data within a larger computer system is a core attribute of an effective platform. Hadoop therefore manages and distributes file data so that stored data can be retrieved without delay and is not misdirected during transfer. In the energy example above, Hadoop is comparable to the distribution system of a power grid: without a sound distribution plan, power cannot be transferred accurately and effectively, and the same holds for data.

A primary advantage of Hadoop is its capacity to be used across operating systems, primarily Windows, Linux, and Unix. For example, an office running Windows can still store and share files reliably with a recipient on another operating system, as long as both sides use Hadoop. In the current interconnected world, where multiple operating systems coexist, such open-source portability benefits both computer manufacturers and users. Hadoop is therefore well suited to fostering globalized file storage and distribution. The following analysis sets out the functions of the various daemons in three versions of Hadoop: V1, V2, and V3.

Functions of Various Daemons in Hadoop V1

MapReduce Daemons in Hadoop V1

The MapReduce software framework processes massive amounts of data in parallel. The fault-tolerant and reliable nature of MapReduce in V1 allows it to work on large data sets across massive clusters concurrently (Ghazi & Gangodkar, 2015). MapReduce V1 stores its data in the Hadoop Distributed File System (HDFS), while processing occurs in map and reduce phases. A defining principle of MapReduce, which V2 and V3 retain, is that it moves the computation to the nodes holding the data instead of transporting the data to the computation. The following are the main modules in MapReduce V1.

  • MapReduce API – Lets programmers define the jobs to be run on data in HDFS.
  • MapReduce Framework – Implements the phases of a MapReduce job, such as sort, shuffle, and merge.
  • MapReduce System – Constitutes the infrastructure that supports the MapReduce V1 framework and API.
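
The map, shuffle, and reduce phases named above can be illustrated with a small word count. The following is only a minimal in-memory sketch in plain Python, not Hadoop's Java API; the function names map_phase, shuffle_phase, reduce_phase, and run_job are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle/sort: group the emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in order over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two short "input splits"; on a real cluster each would be mapped on a different node.
counts = run_job(["Hadoop stores data", "Hadoop processes data"])
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves intermediate pairs across the network; the sketch keeps everything in one process so that the order of the phases is easy to follow.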

NameNode Daemon in Hadoop V1

The NameNode’s primary role is to store the file system metadata, including which DataNodes hold each block, and it runs on the master node. According to Chakraborty et al. (2017), the NameNode in V1 is a single point of failure, unlike in V2 and V3, since each cluster has only one NameNode. The structure is therefore fragile: when that one machine is unavailable, the entire system collapses. In the power grid example above, interconnectivity would be hard to guarantee with V1 because a V1 NameNode has no standby to take over. A failure of communication between the NameNode and the DataNodes likewise leaves the entire system inactive.

Secondary NameNode Daemon in Hadoop V1

The primary role of the Secondary NameNode is to checkpoint the NameNode’s metadata periodically, by default about once an hour, rather than to back up all data. In Hadoop V1, the Secondary NameNode merges the NameNode’s edit log into the file system image at each checkpoint (Vasuja et al., 2019). The resulting checkpoint records the current location of every file, so if a file is moved, the metadata reflects its new position. If the NameNode’s own metadata is lost, the latest checkpoint held by the Secondary NameNode can be used to recover it, although the Secondary NameNode is not a hot standby. Hadoop V1 thus protects its metadata by keeping this additional copy. An alternative name for the Secondary NameNode is the checkpoint node.
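
The checkpointing performed by the Secondary NameNode can be pictured as replaying a log of changes onto a snapshot of the metadata. The sketch below assumes a toy representation that is not Hadoop's on-disk format: the image is a dictionary from file path to block IDs, and each log entry is a tuple. The names apply_edit and checkpoint are hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]                 # edit[2] is the list of block IDs
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)  # edit[2] is the new path
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return the new image and an empty log."""
    new_image = dict(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


# Assumed toy metadata: one file before the checkpoint, three logged changes since.
fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
    ("delete", "/logs/b.txt"),
]
new_image, remaining_log = checkpoint(fsimage, edit_log)
```

After the merge the image reflects the renamed file and the log is empty, which is why a periodic checkpoint keeps the log short and shortens NameNode restarts.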

DataNode in Hadoop V1

While daemons such as the NameNode, Secondary NameNode, and JobTracker run on the master side of the cluster, others are known as slave daemons because they carry out the work directed by the master daemons. In Hadoop V1, the DataNode is the most active slave daemon: it stores the blocks that make up each file and reports their location (Borthakur, 2008). The DataNode also serves read and write requests from clients, which is why DataNode machines are provisioned with large storage capacity in Hadoop V1. These slave functions make Hadoop V1 effective for open-source data sharing.

JobTracker Daemon in Hadoop V1

JobTracker’s key role in MapReduce for Hadoop V1 (MRv1) is to assign tasks to specific nodes within the cluster. According to Hadoop in Real World (n.d.), the JobTracker starts by receiving MRv1 jobs from the client, so it is aware of all work submitted to the system. It then “talks” to the previously mentioned NameNode to determine where the data is located. With information from both the clients and the NameNode, the JobTracker assigns each task to a node that has an available slot and is as close to the data as possible. Such an approach to data processing reduces the congestion and queuing that would delay the entire system.
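
The proximity-based placement described above can be sketched as a preference order: a node that holds the data first, then a node in the same rack, then any node with a free slot. This is a simplified illustration rather than the JobTracker's actual scheduler; the function assign_task, the node names, and the rack layout are all assumptions made for the example.

```python
def assign_task(block_locations, free_slots, rack_of):
    """Pick a node for one task: data-local first, then rack-local, then any free node."""
    candidates = [node for node, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None  # no free slot anywhere; the task has to wait

    data_racks = {rack_of[node] for node in block_locations}

    def distance(node):
        if node in block_locations:
            return 0  # the node already holds the block
        if rack_of[node] in data_racks:
            return 1  # same rack as a copy of the block
        return 2      # the block must cross racks

    # Ties are broken by node name so that the choice is deterministic.
    chosen = min(candidates, key=lambda node: (distance(node), node))
    free_slots[chosen] -= 1
    return chosen


# Hypothetical three-node cluster; node1 holds the block but has no free slot.
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}
free_slots = {"node1": 0, "node2": 1, "node3": 2}
first = assign_task(["node1"], free_slots, rack_of)
second = assign_task(["node1"], free_slots, rack_of)
```

Here the first task lands on node2, which shares a rack with the data, and the second falls back to node3 once node2 has no slots left.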

TaskTracker Daemon in Hadoop V1

In Hadoop V1, a TaskTracker runs on almost every DataNode. It executes the two main kinds of task, mapper and reducer, which the JobTracker assigns to it depending on prevailing needs. For example, when a job written against the MapReduce API is scheduled to run on data in HDFS, the TaskTracker carries out the tasks on the node where the data and files reside (Srijeyanthan, 2015). While the failure of the NameNode in Hadoop V1 is fatal to the cluster, the failure of a TaskTracker is not, because its tasks can be rescheduled on other nodes. TaskTrackers thus play a supporting role alongside master daemons such as the Secondary NameNode and slave daemons such as the DataNode.

Comparing Daemon Functionalities in Hadoop V1 and V2

MapReduce Daemons in Hadoop V2

MapReduce in Hadoop V2 (MRv2) runs on YARN, which separates resource management from data processing. YARN replaces the JobTracker and TaskTracker of V1 with the ResourceManager and NodeManager, as discussed in the sections below. Unlike in V1, MRv2 schedules and executes tasks on the slave nodes through the ApplicationMaster and containers, both key parts of V2 (Perwej et al., 2017).

NameNode Daemon in Hadoop V2

One of the core differences between Hadoop V1 and V2 is the handling of the single point of failure. According to Ghazi and Gangodkar (2015), Hadoop V2 introduces HDFS High Availability. Thus, while the NameNode in Hadoop V1 has no failover mechanism, V2 keeps one NameNode on standby while another is active. The active NameNode handles all client operations in the cluster, while the standby maintains enough state to provide a fast failover.
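
The active and standby arrangement can be sketched as a small state machine. The example below is a deliberately simplified model: the class NameNode and the function elect_active are hypothetical, and a real deployment also relies on a coordination service and on fencing of the failed node, both of which are omitted here.

```python
class NameNode:
    """Toy model of a NameNode with a role and a health flag."""

    def __init__(self, name, state="standby"):
        self.name = name
        self.state = state  # "active", "standby", or "failed"
        self.healthy = True


def elect_active(namenodes):
    """Return the active NameNode, promoting a healthy standby if the active one failed."""
    for nn in namenodes:
        if nn.state == "active" and not nn.healthy:
            nn.state = "failed"  # a real system would also fence this node
    for nn in namenodes:
        if nn.state == "active":
            return nn
    for nn in namenodes:
        if nn.state == "standby" and nn.healthy:
            nn.state = "active"
            return nn
    return None  # no healthy NameNode left: the cluster is unavailable


cluster = [NameNode("nn1", "active"), NameNode("nn2")]
before = elect_active(cluster).name  # nn1 serves clients
cluster[0].healthy = False           # simulate a crash of the active NameNode
after = elect_active(cluster).name   # the standby takes over
```

With a single NameNode, as in V1, the same crash would leave elect_active with nothing to promote, which is the single point of failure that V2 removes.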

Secondary NameNode Daemon in Hadoop V2

While the Secondary NameNode in Hadoop V1 is often described as a backup, in V2 it is better understood as a helper. When HDFS High Availability is enabled, the standby NameNode takes over the checkpointing work, so a separate Secondary NameNode is no longer needed. In V2 clusters without High Availability, however, the Secondary NameNode is still relevant and necessary for periodically checkpointing the NameNode’s metadata.

DataNode in Hadoop V2

The service of the DataNode as a slave daemon in both V1 and V2 is arguably the clearest similarity between the two versions. As in V1, large files in V2 are divided into smaller, fixed-size blocks stored on DataNodes, although the default block size grew from 64 MB in V1 to 128 MB in V2. Each block is replicated three times by default, so the raw storage consumed is three times the size of the file. These mechanisms underpin the HDFS and YARN features in Hadoop V2.
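
The effect of block division and threefold replication can be checked with simple arithmetic. The sketch below assumes a block size of 128 MB and a replication factor of 3, which are common defaults but are configurable per cluster; the function name block_layout is hypothetical.

```python
import math


def block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, size of the last block in MB, raw storage used in MB)."""
    if file_size_mb <= 0:
        return 0, 0, 0
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The final block holds only the remainder, not a full block.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times on different DataNodes.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_storage_mb


layout_128 = block_layout(1000)                   # assumed 128 MB blocks
layout_64 = block_layout(1000, block_size_mb=64)  # assumed 64 MB blocks
```

For a 1000 MB file, 128 MB blocks give 8 blocks with a 104 MB final block, while 64 MB blocks give 16 blocks; in both cases the cluster stores 3000 MB in total. Fewer, larger blocks mean less metadata for the NameNode to track.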

JobTracker (ResourceManager) Daemon in Hadoop V2

While the JobTracker handled both resource management and job execution in MRv1, it has been replaced with the ResourceManager and ApplicationMaster in MapReduce for Hadoop Version 2 (MRv2). The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor tasks. Thus, the JobTracker’s per-job duties in V2 fall to the ApplicationMaster, a framework-specific library that understands the containers and how to consume the requested resources on the nodes.

TaskTracker (Node Manager) Daemon in Hadoop V2

While the TaskTracker ran on the slave nodes in V1, it was replaced by the NodeManager in V2. When the ApplicationMaster specifies tasks to be executed, the NodeManager launches and manages containers on its node. Ghazi and Gangodkar (2015) elaborate that the NodeManager in V2 is the slave daemon of YARN. This illustrates the value of upgrading open-source software to versions in which slave daemons support the master functions more effectively.
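
The division of labor among the ResourceManager, the ApplicationMaster, and the NodeManagers can be sketched as follows. This is a toy model rather than the YARN API: the classes, the application_master function, and the rule of granting containers on the node with the most free memory are all assumptions made for illustration.

```python
class NodeManager:
    """Toy slave daemon: tracks free memory and the containers it runs."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.running = []

    def launch(self, app_id, memory_mb):
        """Start one container on this node for the given application."""
        self.running.append((app_id, memory_mb))


class ResourceManager:
    """Toy master daemon: decides where containers may be placed."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb, count):
        """Grant up to `count` containers, each on the node with the most free memory."""
        grants = []
        for _ in range(count):
            node = max(self.node_managers, key=lambda nm: nm.free_mb)
            if node.free_mb < memory_mb:
                break  # the cluster is full; grant fewer containers than requested
            node.free_mb -= memory_mb
            grants.append(node)
        return grants


def application_master(rm, app_id, memory_mb, tasks):
    """Negotiate containers from the ResourceManager, then ask NodeManagers to launch them."""
    grants = rm.allocate(memory_mb, tasks)
    for node in grants:
        node.launch(app_id, memory_mb)
    return [node.name for node in grants]


nodes = [NodeManager("nm1", 4096), NodeManager("nm2", 2048)]
rm = ResourceManager(nodes)
placed = application_master(rm, "app_1", 1024, 5)
```

The point of the sketch is the separation of concerns: the ResourceManager only tracks capacity, while the per-application master decides what to run in the containers it has been granted.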

Comparing Daemon Functionalities in Hadoop V1, V2 and V3

MapReduce Daemons in Hadoop V3

In Hadoop V3, the most recent version, the management of daemon and task heap sizes was reworked. There are new ways of configuring heap sizes so that they can be auto-tuned according to the host’s available memory. In addition, HDFS supports erasure coding, which provides fault tolerance with less storage overhead than replication. MapReduce in V3 is therefore significantly more advanced than in V1 and V2 (Ghazi & Gangodkar, 2015).
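
Erasure coding, the fault-tolerance technique mentioned above, stores parity information instead of full copies. The sketch below compares the storage overhead of threefold replication with a layout of 6 data units and 3 parity units, and then demonstrates recovery with a single XOR parity block. The XOR scheme is a deliberate simplification that survives only one lost block and stands in for the stronger codes used in practice; all function names are hypothetical.

```python
def replication_overhead(replicas):
    """Extra storage, in percent, when every block is stored `replicas` times."""
    return (replicas - 1) * 100.0


def erasure_overhead(data_units, parity_units):
    """Extra storage, in percent, when parity units protect a group of data units."""
    return parity_units / data_units * 100.0


def xor_parity(blocks):
    """Compute one parity block as the byte-wise XOR of equal-length blocks."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return parity


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_parity(surviving_blocks + [parity])


data = [b"HDFS", b"YARN", b"MRv2"]
parity = xor_parity(data)
recovered = recover_missing([data[0], data[2]], parity)  # data[1] was "lost"
```

Threefold replication costs 200 percent extra storage, whereas the 6 plus 3 layout costs 50 percent, which is the saving that motivates erasure coding for rarely accessed data.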

NameNode Daemon in Hadoop V3

Before V3, Hadoop supported only a single active NameNode and a single standby NameNode, which limited the system’s fault tolerance relative to the growing demands of fast data exchange and processing. However, DataFlair (n.d.) writes that V3 supports more than two NameNodes, making the architecture less prone to failure. For example, a V3 cluster can combine three JournalNodes with three NameNodes, which adds to the entire system’s overall resilience.
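
How many failures such a configuration survives follows from two rules: edits must reach a majority of the JournalNodes, and at least one NameNode must remain. The sketch below encodes only those two rules and ignores every other component of the cluster; the function names are hypothetical.

```python
def journal_failures_tolerated(journal_nodes):
    """A majority of JournalNodes must stay up, so fewer than half may fail."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(namenodes):
    """The cluster keeps serving as long as one NameNode can be active."""
    return max(namenodes - 1, 0)


def failures_tolerated(namenodes, journal_nodes):
    """The weaker of the two groups limits the configuration as a whole."""
    return min(namenode_failures_tolerated(namenodes),
               journal_failures_tolerated(journal_nodes))


v2_style = failures_tolerated(namenodes=2, journal_nodes=3)
v3_three_journals = failures_tolerated(namenodes=3, journal_nodes=3)
v3_five_journals = failures_tolerated(namenodes=3, journal_nodes=5)
```

Under these rules, adding a third NameNode raises the tolerance from one failure to two only if the JournalNode quorum grows as well, for example to five nodes.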

Secondary NameNode Daemon in Hadoop V3

In V1 and V2, the default ports of many Hadoop services fell within the Linux ephemeral port range. V3 moved these defaults, including that of the Secondary NameNode, out of that range (Ghazi & Gangodkar, 2015). Thus, while services in V1 and V2 could conflict with other applications and fail to bind at startup, V3 services listen on ports outside the ephemeral range. This change distinguishes how the Secondary NameNode is deployed in V3 from V1 and V2, even though its supporting role is unchanged.
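
The port conflict can be made concrete by checking a few default ports against the ephemeral range. The range 32768 to 60999 is a common Linux default and is an assumption here, since it can be changed per host. The port numbers in the table are the HDFS defaults as generally documented for the 2.x and 3.x lines and should be verified against the configuration reference of the release in use.

```python
# Assumed Linux default; the actual range is configurable on each host.
EPHEMERAL_RANGE = (32768, 60999)

# (old default, new default) per service; verify against the release in use.
DEFAULT_PORTS = {
    "NameNode HTTP": (50070, 9870),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode HTTP": (50075, 9864),
}


def in_ephemeral_range(port, port_range=EPHEMERAL_RANGE):
    """True if the operating system may hand this port to any outgoing connection."""
    low, high = port_range
    return low <= port <= high


# Map each service to (old port at risk?, new port at risk?).
at_risk = {
    service: (in_ephemeral_range(old), in_ephemeral_range(new))
    for service, (old, new) in DEFAULT_PORTS.items()
}
```

Every old default falls inside the range, so another process could already hold the port when the daemon starts; none of the new defaults does.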

DataNode in Hadoop V3

As in earlier versions, each HDFS block in V3 is stored as a separate file in the DataNode’s local file system. The DataNode does not create all of these files in the same directory; instead, it spreads them across subdirectories so that no single directory holds too many files. It can therefore be argued that the DataNode in V3 benefits open-source data processing by building on a proven storage infrastructure rather than replacing it.

JobTracker and TaskTracker in Hadoop V3

The two daemons were already removed in V2, so they are not found in V3. However, their replacements, the ResourceManager and NodeManager, remain critical to assigning duties, processes, and functions to the various nodes. In Hadoop V3, the designers introduced YARN Timeline Service v.2, whose primary role is to store and retrieve current and historical information about applications (“What is new in Hadoop 3? Explore the unique Hadoop 3 features”, n.d.). Thus, there is still a close connection between YARN and the ApplicationMaster in their capacity to schedule and cater to the various processes and tasks within the system. The Apache Hadoop YARN architecture shown in Figure 1 below exemplifies the additions to V1 and V2. Notably, the ResourceManager and NodeManager have become some of the essential elements of Hadoop V3 due to their centrality in fast data execution.

Figure 1: Apache Hadoop YARN-Architecture

Source: Data Flair Training (2021).

Security of Hadoop V1, V2, and V3

V1 is a relatively insecure version of Hadoop because its single NameNode is a single point of failure. Furthermore, because the system has a low capacity to handle massive volumes of information, data can back up, leaving Hadoop V1 even more vulnerable to attacks and internal failures within the file system (Perwej, 2019). For example, a long queue of pending work can clog the entire file system, which exposes the rest of the connected system to insecurity and attacks. Put differently, the vulnerability of the single NameNode denies the entire system the level of security needed in the current environment.

The active and standby NameNodes in Hadoop V2 enhanced the system’s security by giving file management a degree of autonomy. With HDFS High Availability, the system is no longer rendered inactive by an uncontrollable failure at a single point (Perwej et al., 2017). The addition of active and standby nodes in V2 reflects the view that more tasks attract more attacks and require stronger safeguards. Hadoop V2 is more secure than V1, but not as secure as V3, as discussed in the following paragraph.

In Hadoop V3, support for more than two NameNodes enhanced the system’s security, alongside improvements to the slave daemons. The more nodes in a network, the higher its capacity for processing tasks under heavy data transfer. Hadoop V3 is the most secure version yet, on account of multiple levels of upgrade to its daemons. For instance, with the JobTracker and TaskTracker replaced by the ResourceManager and NodeManager, and with default service ports moved out of the ephemeral range, it is easier for users to schedule processes and tasks reliably (“JobTracker and TaskTracker,” n.d.). Heightened security expands the scale and scope of Hadoop operations.

Conclusion

In the current fast-paced world, one of the critical elements of success for any organization or entity is the capacity to process large amounts of data concurrently and according to client and stakeholder expectations. Hadoop fulfills such a function as an Apache, Java-based platform whose primary role is data storage and distribution within a clustered system. This study has shown the distinctions among the three Hadoop versions, indicating the differences in the range of functionalities of their daemons. A critical comparison of the three versions also shows that security improved as the designers added new functionalities to Hadoop. Overall, the Java-based platform is likely to grow as the demands of information transfer and data processing increase in the 21st century.