Radhika K, worked at Mindtree
Answered 78w ago · Author has 558 answers and 491.7k answer views
Below is a feature-wise comparison of Hadoop 2 and Hadoop 3.
Under each feature, the first point is for Hadoop 2.x and the second is for Hadoop 3.x.
License
Apache 2.0, Open Source
Apache 2.0, Open Source
Minimum supported version of Java
The minimum supported version of Java is Java 7.
The minimum supported version of Java is Java 8.
Fault tolerance
Fault tolerance is handled by replication, which wastes storage space.
Fault tolerance can be handled by erasure coding (follow this tutorial for more info about erasure coding).
Data balancing
Uses the HDFS balancer for data balancing.
Also provides an intra-DataNode balancer, which is invoked via the hdfs diskbalancer CLI.
Storage scheme
Uses a 3x replication scheme.
Supports erasure coding in HDFS.
Storage overhead
3x replication has a 200% overhead in storage space.
With erasure coding, the storage overhead is only 50%.
Storage overhead example
With 6 data blocks, 18 blocks of disk space are occupied because of the 3x replication scheme.
With 6 data blocks, 9 blocks of disk space are occupied: 6 data blocks and 3 parity blocks.
YARN timeline service
Uses an older timeline service that has scalability issues.
Introduces timeline service v2, which improves the scalability and reliability of the timeline service.
Default port range
In Hadoop 2.x, some default ports fall in the Linux ephemeral port range, so services can fail to bind to them at startup.
In Hadoop 3.x, these ports have been moved out of the ephemeral range.
Tools
Hive, Pig, Tez, Hama, Giraph and other Hadoop tools are available.
Hive, Pig, Tez, Hama, Giraph and other Hadoop tools are available.
Compatible file systems
HDFS (default FS); FTP file system, which stores all its data on remotely accessible FTP servers; Amazon S3 (Simple Storage Service) file system; Windows Azure Storage Blobs (WASB) file system.
Supports all of the above as well as the Microsoft Azure Data Lake filesystem.
Shubham Sinha, Big Data and Hadoop Enthusiast, Passionate about digging into Hadoop...
Answered 74w ago · Author has 326 answers and 580.5k answer views
Some of the major changes they are trying to incorporate are:
Support for Erasure Coding in HDFS
RAID implements EC through striping, in which logically sequential data (such as a file) is divided into smaller units (such as bits, bytes, or blocks) and consecutive units are stored on different disks. Then, for each stripe of original data cells, a certain number of parity cells are calculated and stored. This process is called encoding. An error on any striping cell can be recovered through a decoding calculation based on the surviving data cells and parity cells.
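A minimal Python sketch of the striping, encoding and decoding steps described above. It assumes a single XOR parity cell for simplicity, which is a swapped-in technique: HDFS erasure coding uses Reed-Solomon codes by default, which can rebuild more than one lost cell. The function names are hypothetical.

```python
from functools import reduce

def xor_cells(cells):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))

def stripe(data, num_cells):
    """Split data into num_cells equal-sized cells, zero-padding the tail."""
    cell_size = -(-len(data) // num_cells)  # ceiling division
    padded = data.ljust(cell_size * num_cells, b"\0")
    return [padded[i * cell_size:(i + 1) * cell_size] for i in range(num_cells)]

def recover(cells, parity, lost_index):
    """Rebuild one lost data cell from the surviving cells and the parity cell."""
    survivors = [c for i, c in enumerate(cells) if i != lost_index]
    return xor_cells(survivors + [parity])

# Encoding: three data cells plus one parity cell (assumed layout).
data_cells = stripe(b"hello erasure coding", 3)
parity_cell = xor_cells(data_cells)

# Decoding: pretend cell 1 was lost and rebuild it from the rest.
rebuilt = recover(data_cells, parity_cell, lost_index=1)
```

With XOR parity only one lost cell per stripe can be rebuilt; tolerating more failures needs more parity cells and a stronger code.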
Integrating EC with HDFS can maintain the same fault tolerance with improved storage efficiency. As an example, a 3x replicated file with 6 blocks will consume 6*3 = 18 blocks of disk space. But with an EC (6 data, 3 parity) deployment, it will only consume 9 blocks (6 data blocks + 3 parity blocks) of disk space. This is a storage overhead of only 50%, compared with 200% for 3x replication.
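The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that every group of up to 6 data blocks gets a full set of 3 parity blocks.

```python
def replication_usage(data_blocks, replication_factor=3):
    """Return (blocks on disk, overhead ratio) under plain replication."""
    total = data_blocks * replication_factor
    return total, (total - data_blocks) / data_blocks

def erasure_coding_usage(data_blocks, data_units=6, parity_units=3):
    """Return (blocks on disk, overhead ratio) under a (data, parity) EC layout.

    Assumption: each group of up to `data_units` data blocks carries
    `parity_units` parity blocks, even if the last group is partial.
    """
    groups = -(-data_blocks // data_units)  # ceiling division
    total = data_blocks + groups * parity_units
    return total, (total - data_blocks) / data_blocks

print(replication_usage(6))      # (18, 2.0)
print(erasure_coding_usage(6))   # (9, 0.5)
```

An overhead ratio of 2.0 corresponds to the 200% figure for 3x replication, and 0.5 to the 50% figure for EC.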
YARN Timeline Service v.2
Hadoop is introducing a major revision of the YARN Timeline Service: v.2. It was developed to address two major challenges:
- Improving scalability and reliability of Timeline Service
- Enhancing usability by introducing flows and aggregation
Timeline Service v.1 is limited to a single instance of writer/reader and does not scale well beyond small clusters. Version 2 uses a more scalable distributed writer architecture and a scalable backend storage. It separates the collection (writes) of data from the serving (reads) of data. It uses distributed collectors, essentially one collector for each YARN application. The readers are separate instances that are dedicated to serving queries via a REST API.
Timeline Service v.2 supports the notion of flows explicitly. In addition, it supports aggregating metrics at the flow level.
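To illustrate what aggregating metrics at the flow level means, here is a small Python sketch. The record layout, flow names and metric names are made up for illustration; this is not the Timeline Service schema or API.

```python
from collections import defaultdict

def aggregate_by_flow(app_metrics):
    """Sum per-application metrics into per-flow totals.

    app_metrics: iterable of (flow_name, {metric_name: value}) pairs,
    one pair per YARN application (hypothetical layout).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for flow_name, metrics in app_metrics:
        for metric, value in metrics.items():
            totals[flow_name][metric] += value
    return {flow: dict(metrics) for flow, metrics in totals.items()}

apps = [
    ("daily_etl", {"cpu_ms": 1200, "bytes_read": 500}),
    ("daily_etl", {"cpu_ms": 800, "bytes_read": 300}),
    ("ad_hoc_report", {"cpu_ms": 100, "bytes_read": 50}),
]
flow_totals = aggregate_by_flow(apps)
```

Here the two applications that belong to the same flow are rolled up into one set of totals, which is the kind of view a flow-level query would serve.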
MapReduce Task-Level Native Optimization
In Hadoop 3, a native implementation of the map output collector has been added to MapReduce. For shuffle-intensive jobs, this may provide speed-ups of 30% or more.
The native optimization for MapTask is based on JNI. The basic idea is to add a NativeMapOutputCollector to handle the key-value pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code. They are still working on the merge code.
Intra-DataNode Balancer
A single DataNode manages multiple disks. During normal write operations, data is spread evenly, so the disks fill up evenly. But adding or replacing disks leads to skew within a DataNode. This intra-DataNode skew was not handled by the existing HDFS balancer, which balances data across DataNodes. Hadoop 3 handles this situation with the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI.
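A rough Python sketch of how skew inside one DataNode could be measured. It is loosely modelled on the volume data density idea from the HDFS Disk Balancer documentation (ideal used ratio minus each disk's used ratio), but the function and the sample numbers are assumptions, not the balancer's actual code.

```python
def volume_data_density(volumes):
    """Return ideal-minus-actual used ratio per disk.

    volumes: {disk_name: (used_bytes, capacity_bytes)} (hypothetical layout).
    Positive values mean the disk is under-used, negative means over-used.
    """
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(capacity for _, capacity in volumes.values())
    ideal_ratio = total_used / total_capacity
    return {
        name: ideal_ratio - used / capacity
        for name, (used, capacity) in volumes.items()
    }

# Two nearly full disks plus one freshly added empty disk.
densities = volume_data_density({
    "disk1": (800, 1000),
    "disk2": (800, 1000),
    "disk3": (0, 1000),
})
```

In this example the new disk3 gets a large positive value and the older disks get negative values, which is the skew that intra-DataNode balancing is meant to even out.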
There are multiple other changes that the Apache community is trying to incorporate in Hadoop 3:
Minimum Required Java Version in Hadoop 3 is Increased from 7 to 8
In Hadoop 3, all Hadoop JARs are compiled targeting a runtime version of Java 8.
Shell Script Rewrite
The Hadoop shell scripts have been rewritten to fix many bugs and resolve compatibility issues; some of the changes may affect existing installations. The rewrite also incorporates some new features.
Shaded Client Jars
The hadoop-client artifact available in Hadoop 2.x releases pulls Hadoop’s transitive dependencies onto a Hadoop application’s classpath. This can create a problem if the versions of these transitive dependencies conflict with the versions used by the application.
So in Hadoop 3, there are new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop’s dependencies into a single jar. hadoop-client-api is compile scope and hadoop-client-runtime is runtime scope; the latter contains relocated third-party dependencies from hadoop-client. You can therefore bundle the dependencies into a jar and test the whole jar for version conflicts. This avoids leaking Hadoop’s dependencies onto the application’s classpath. For example, HBase can use these artifacts to talk to a Hadoop cluster without seeing any of the implementation dependencies.
Support for Opportunistic Containers and Distributed Scheduling
A new ExecutionType has been introduced, i.e. Opportunistic containers, which can be dispatched for execution at a NodeManager even if there are no resources available at the moment of scheduling. In such a case, these containers will be queued at the NM, waiting for resources to be available for it to start. Opportunistic containers are of lower priority than the default Guaranteed containers and are therefore preempted, if needed, to make room for Guaranteed containers. This should improve cluster utilization.
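The queueing and preemption behaviour described above can be pictured with a toy Python model of one NodeManager. The class, its methods and the resource units are all hypothetical and greatly simplified; this is not how YARN implements it.

```python
from collections import deque

class NodeManagerSketch:
    """Toy model of one NodeManager; sizes are abstract resource units."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.running = []        # (name, execution_type, size) tuples
        self.queued = deque()    # opportunistic containers waiting for room
        self.preempted = []      # names of opportunistic containers evicted

    def free(self):
        return self.capacity - sum(size for _, _, size in self.running)

    def submit(self, name, execution_type, size):
        container = (name, execution_type, size)
        if execution_type == "OPPORTUNISTIC":
            # Accepted even when the node is full: it waits in the queue.
            if self.free() >= size:
                self.running.append(container)
            else:
                self.queued.append(container)
            return
        # GUARANTEED: preempt opportunistic containers until there is room.
        while self.free() < size:
            victim = next(
                (c for c in self.running if c[1] == "OPPORTUNISTIC"), None
            )
            if victim is None:
                raise RuntimeError("no room even after preempting opportunistic work")
            self.running.remove(victim)
            self.preempted.append(victim[0])
        self.running.append(container)

nm = NodeManagerSketch(capacity=4)
nm.submit("g1", "GUARANTEED", 2)
nm.submit("o1", "OPPORTUNISTIC", 2)   # fits, so it starts running
nm.submit("o2", "OPPORTUNISTIC", 2)   # node is full, so it is queued
nm.submit("g2", "GUARANTEED", 2)      # preempts o1 to make room
```

The point of the model is only the ordering: opportunistic work fills spare capacity or waits, and guaranteed work always wins when the two compete.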
Support for More than 2 NameNodes
Business-critical deployments require higher degrees of fault tolerance, so Hadoop 3 allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes (1 active and 2 standby) and five JournalNodes, the cluster can tolerate the failure of two nodes.
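The "three NameNodes and five JournalNodes tolerate two failures" figure follows from simple majority arithmetic, sketched here in Python. The function name is hypothetical, and the sketch assumes that edits need a majority of JournalNodes and that at least one NameNode must survive.

```python
def tolerated_failures(namenodes, journalnodes):
    """Failures of each node type the HA setup can absorb (simplified).

    Assumptions: a majority of JournalNodes must stay up to accept edits,
    and at least one NameNode must stay up to serve clients.
    """
    journalnode_tolerance = (journalnodes - 1) // 2
    namenode_tolerance = namenodes - 1
    return min(namenode_tolerance, journalnode_tolerance)

print(tolerated_failures(namenodes=2, journalnodes=3))  # 1
print(tolerated_failures(namenodes=3, journalnodes=5))  # 2
```

The first call matches the original two-NameNode, three-JournalNode HA design, and the second matches the Hadoop 3 example in the text.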
Default Ports of Multiple Services have been Changed
Earlier, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). Unless a client program explicitly requests a specific port number, the port number it is given is an ephemeral port number. So at startup, services would sometimes fail to bind to their port due to a conflict with another application.
Thus the conflicting ports with ephemeral range have been moved out of that range, affecting port numbers of multiple services, i.e. the NameNode, Secondary NameNode, DataNode, etc. Some of the important ones are
Namenode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820
Secondary NN ports: 50091 --> 9869, 50090 --> 9868
Datanode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864
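As a quick check of the port list above, this Python sketch tests each old and new port against the ephemeral range quoted in the answer (32768-61000). The dictionary simply restates the ports listed; the names are hypothetical.

```python
EPHEMERAL_RANGE = range(32768, 61001)  # 32768-61000 inclusive, as quoted above

# old port -> new port, copied from the list in the answer
PORT_CHANGES = {
    50470: 9871, 50070: 9870, 8020: 9820,                # NameNode
    50091: 9869, 50090: 9868,                            # Secondary NameNode
    50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864,  # DataNode
}

def in_ephemeral_range(port):
    return port in EPHEMERAL_RANGE

old_outside = [old for old in PORT_CHANGES if not in_ephemeral_range(old)]
new_inside = [new for new in PORT_CHANGES.values() if in_ephemeral_range(new)]

print(old_outside)  # [8020]
print(new_inside)   # []
```

None of the new ports fall in the range. Of the old ports, only 8020 was already outside it, although the answer lists it among the changed ports.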
Support for Filesystem Connector
Hadoop now supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System. Either can be used as an alternative Hadoop-compatible filesystem.
To know more you can go through this Expected Enhancements in Hadoop 3 blog.
Abhijeet Kumar, Software Development Engineer at Sentienz (2018-present)
Answered 4w ago
For me, the major change is in the fault tolerance part. Previously, Hadoop used replication (the default factor is 3) to make the system fault tolerant. So, if the replication factor is 3 and you have two blocks, a total of 6 blocks will be saved on the cluster, and the cluster can handle a maximum of two hardware failures (assuming the cluster has 3 nodes).
From Hadoop 3 they started using Erasure Coding.
I’ll explain Erasure Coding diagrammatically, so you can understand it better. (No high-level words.)
Here, as you can see, they use linear algebra to calculate A+B and A+2B, and they store all four blocks (A, B, A+B and A+2B) on 4 different machines. You can still handle two failures among these four. A total of 4 blocks is used to handle 2 hardware failures. Before Hadoop 3, the total number of blocks created was 3*2 = 6 (for the above example).
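To see why any two of the four blocks are enough, here is a small Python sketch that treats A and B as plain integers and solves the resulting 2x2 linear system. This is a simplification made for illustration: real erasure codes such as Reed-Solomon work on bytes in a finite field, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Each stored block is a linear combination ca*A + cb*B.
COEFFS = {"A": (1, 0), "B": (0, 1), "A+B": (1, 1), "A+2B": (1, 2)}

def encode(a, b):
    """Compute the four blocks that would be stored on four machines."""
    return {name: ca * a + cb * b for name, (ca, cb) in COEFFS.items()}

def decode(survivors):
    """Recover (A, B) from any two surviving blocks using Cramer's rule."""
    (name1, v1), (name2, v2) = list(survivors.items())[:2]
    (a1, b1), (a2, b2) = COEFFS[name1], COEFFS[name2]
    det = a1 * b2 - a2 * b1  # non-zero for every pair in COEFFS
    a = Fraction(v1 * b2 - v2 * b1, det)
    b = Fraction(a1 * v2 - a2 * v1, det)
    return int(a), int(b)

blocks = encode(7, 5)

# Lose any two machines: the remaining two blocks still give back A and B.
recovered = [
    decode({n1: blocks[n1], n2: blocks[n2]})
    for n1, n2 in combinations(blocks, 2)
]
```

Every one of the six possible pairs of survivors yields the original A and B, which is what lets 4 stored blocks cover 2 failures.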
But there are drawbacks too. For example, with 2 blocks, if you want to handle the failure of two machines, you need at least 4 nodes in your cluster, whereas in Hadoop 2 and earlier versions this can be done with 3 nodes. It is not yet recommended for production use. Once things are in their right place, this will be a big revolution.
Rahul Anand, Hadoop Developer, BigData Trainer & BigData Analytics
Answered 100w ago · Author has 302 answers and 115.9k answer views
Let’s Play with BigData more Smartly!!
Hadoop 3.x is on the way, with the following features:
- Java 8 Minimum Runtime Version
Another major motivation for a new major release was bumping the minimum supported Java version to Java 8.
- Intra-DataNode Balancer
Intra-DataNode balancing functionality addresses the intra-node skew that can occur when disks are added or replaced.
- HDFS Erasure Coding
HDFS Erasure Coding is a major new feature, and one of the driving features for releasing Hadoop 3.0.0.
- Shell Script Rewrite
The Hadoop shell scripts have been rewritten with an eye toward unifying behavior, addressing numerous long-standing bugs, improving documentation, as well as adding new functionality.
- Support for more than 2 NameNodes.
The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.
However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.
- Default ports of multiple services have been changed.
- Support for Multiple Standby NameNodes
Neeraj Sabharwal, 4+ years in Hadoop, 10+ years in Database, Cloud
Answered 124w ago
See this: Apache Hadoop 3.0.0-alpha1-SNAPSHOT
Hadoop 3.x Releases
- HADOOP
  - Move to JDK8+
  - Classpath isolation on by default (HADOOP-11656)
  - Shell script rewrite (HADOOP-9902)
  - Move default ports out of ephemeral range (HDFS-9427)
- HDFS
  - Removal of hftp in favor of webhdfs (HDFS-5570)
  - Support for more than two standby NameNodes (HDFS-6440)
  - Support for Erasure Codes in HDFS (HDFS-7285)
- MAPREDUCE
  - Derive heap size or mapreduce.*.memory.mb automatically (MAPREDUCE-5785)
Satyam Kumar, Working At AcadGild
Answered 78w ago · Author has 128 answers and 360k answer views
I would recommend that users go through the blogs below, which describe the new features of Hadoop 3.x in detail and also explain how to install it.
Satyam Kumar| Hadoop Developer at Acadgild