What is Apache Hadoop?

Arpita Bhojane

Arpita Bhojane, studied at Sinhgad College of Engineering, Pune

Updated 121w ago

Original answer: What is Apache Hadoop?

The successful implementation of the concepts described in Google's published papers on the Google File System and MapReduce resulted in the development of Hadoop, a first-class Apache project.
Hadoop proper is a distributed master-slave architecture, as shown in the figure below, that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities. Traits intrinsic to Hadoop are data partitioning and parallel computation of large datasets. Its storage and computational capabilities scale with the addition of hosts to a Hadoop cluster, and can reach volume sizes in the petabytes on clusters with thousands of hosts.
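
To make the storage half of that architecture concrete, here is a minimal sketch of a client talking to HDFS through the Java FileSystem API. It is only an illustration, not code from the answer above; the NameNode address and the file path are hypothetical placeholders.

// Minimal HDFS client sketch (illustrative only; address and path are hypothetical).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; on a real cluster this usually
        // comes from core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks and
        // replicates them across DataNodes behind the scenes.
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the client streams blocks from whichever DataNodes hold them.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}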

[Figure: Hadoop's master-slave architecture, with HDFS providing distributed storage and MapReduce providing distributed computation]

Who's using Hadoop?
Many high-tech companies use Hadoop, and it is penetrating a broad range of sectors such as enterprise, government and healthcare.
Facebook, Twitter and Yahoo use Hadoop for data analytics, machine learning, search ranking, etc.
Hadoop limitations
High availability and security are identified as common areas of weakness across HDFS and MapReduce, though they are receiving attention and are addressed to some extent in the 2.x versions.

Shwati Kumar

Shwati Kumar, Big Data Lead at Mindtree (2018-present)

Answered 32w ago · The author has 216 answers and 444.9k answer views

There are already 100+ answers, but I will try to make this one the easiest to understand:

Apache Hadoop:

  1. The backbone of all Big Data applications
  2. Still the poster boy for Big Data tools
  3. An open source tool for handling huge volumes of data
  4. Written in Java, so development is relatively easy
  5. Consists of 3 components (a minimal MapReduce sketch follows this list):
    1. HDFS - Most reliable storage system on the planet
    2. MapReduce - Distributed processing layer
    3. YARN - Resource management layer
  6. Efficiently processes large volumes of data on a cluster of commodity servers
  7. Inspired by Google's papers, but has captured much of the Big Data market
  8. Flexibility to store and mine any type of data, whether structured, semi-structured or unstructured
  9. Excels at processing data of a complex nature
  10. Scale-out architecture divides workloads across many nodes
  11. Works in a master-slave fashion
  12. As data is stored in a distributed manner in HDFS across the cluster, it is processed in parallel on a cluster of nodes
  13. Failures of nodes or tasks are recovered automatically by the framework
  14. Even if a machine goes down, your data remains stored reliably
  15. Highly scalable - new nodes can be added to the cluster
  16. Easy to use - the client does not need to deal with the details of distributed computing

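As promised in point 5, here is a minimal WordCount sketch of the MapReduce layer, assuming the standard Hadoop MapReduce Java API: the mapper runs in parallel on the nodes holding the HDFS blocks and emits (word, 1) pairs, and the reducer sums them per word. It is an illustration only, not part of the original answer.

// Illustrative WordCount mapper and reducer using the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts produced by all mappers for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
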
I think this is sufficient for a newbie to understand what Hadoop is.

If you like my answer, please upvote..!!

Andrew Comstock

Andrew Comstock, works at Informatica Corporation

Answered 286w ago · The author has 86 answers and 110.3k answer views

Original answer: What is Hadoop? What does it do? How can I relate it to Big Data and Cloud?

Many others will answer the question with likely more correct (and much more technical) answers; however, based on your question, it doesn't seem like a technical answer is what you are looking for. I hope this helps! So before we state what Hadoop is, let's get some background.

Background
Hadoop is a very clever method for distributed computing and storage. For a long time, people stored information on servers in big databases, and this mainly worked for the types and amounts of data that people had. Basically, if you have row- or column-based data, this works really well for you. However, in the last few years, certain companies wanted more data.

For example (and this is a simplified example), imagine a small Amazon; early on in their life, the only thing they are storing is user information and purchase / order history. If you have 1 million users and each makes 5 purchases a year, your user data is only 1 million records and your order database is 5 million records (and grows by only 5 million records a year).

Sometime later, you decide to store every product a user adds to the shopping cart but never purchases, so you start storing aborted-purchase information, which is 10 million records a year (2 aborted purchases for every real purchase). Then you decide you want to store every product every user looks at; every user might look at 10 products before each completed or aborted purchase. Now you've got 150 million records per year. Then you decide you want to store every single click that a user makes... it isn't very long before you want to store billions or trillions of records every year.

And this is where traditional storage structures typically don't work. A single database with billions of records is going to be so large that, if it works, it's going to be slow. And this is where Hadoop comes in...

What is Hadoop?
Hadoop is a very clever piece of software designed to help the above situation. As we stated, we might have 5 billion records of what users clicked on. We want to analyze this data to create recommendations for people; so they can find products they want more quickly and buy more! How do we do this? We already stated above that it was too large (and faster servers are very expensive).


So Hadoop comes in and basically says: buy a lot of commodity hardware (basically, cheap desktops you can get from anyone) and network them together. One of these desktops (or nodes) is going to be the brain of this thing. When you give it data, it'll automatically store it across the nodes. And when you need to query it, you send the query to the brain, and the brain will ask each desktop to return whatever results it happens to be storing.

This is amazing. Instead of using expensive hardware, you use commodity hardware. Better yet, the queries are distributed to the hardware that holds the data. It's equivalent to this question: would you rather hire the fastest reader in the world, who can read 1,000 pages a minute, to read a billion pages, or hire 75 average people that can read 15 pages a minute to each read about 13 million pages for less money? The latter; it's faster and cheaper - but only if you've got a lot of data.
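
For intuition only, here is a tiny scatter-gather sketch in plain Java of the idea just described: a coordinator hands each cheap worker its own slice and then combines the partial results. Hadoop automates this (plus the storage, scheduling and failure handling); this toy is not Hadoop code, and the data and the per-slice work are made up.

// Toy scatter-gather example: split work across workers, then combine partial results.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScatterGather {
    public static void main(String[] args) throws Exception {
        List<String> pages = List.of("page-1", "page-2", "page-3", "page-4");
        ExecutorService workers = Executors.newFixedThreadPool(4);

        // Scatter: each worker processes its own slice of the data.
        List<Future<Integer>> partials = new ArrayList<>();
        for (String page : pages) {
            partials.add(workers.submit((Callable<Integer>) () -> page.length()));
        }

        // Gather: the "brain" combines the partial results into one answer.
        int total = 0;
        for (Future<Integer> partial : partials) {
            total += partial.get();
        }
        System.out.println("combined result: " + total);
        workers.shutdown();
    }
}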

What is Big Data?
Big Data is simply the idea that the amount of data that we generate (and more importantly, collect) is increasing extremely quickly. More importantly, companies are recognizing that this data can be used to make more accurate predictions and, therefore, make them more money. Facebook, for example, knows how often you visit many websites (due to the pervasive 'Like on Facebook' buttons) and wants to use that information to show you ads you are more likely to click on. Palantir wants to help governments fight crime and terrorism by using massive data to predict where issues may be (Collaborating with Google Ideas and Global Impact Award winners to Fight Human Trafficking | Palantir).

What is the Cloud?
The Cloud, as an idea, is tangentially related to the above trends. It is a slightly different method of distributed computing; instead of going from servers to desktops, you are going from your own servers to someone else's computers. In years past, if a company wanted to have a website or an online store, they would buy servers, hook those servers up to the Internet and then build their store. Cloud is a movement to eliminate the 'buy the servers' part; instead, they rent Amazon's, or Microsoft's, or anyone else's servers. The benefits are that they don't have to maintain them, worry about uptime, etc. On the consumer side, Cloud has made it easier for consumers to back up important documents (Dropbox, Box), access purchased games (Steam) or music (Amazon Music Cloud, Google Music Cloud), etc. Basically, instead of using your computer and computing power, you use someone else's.

Of note, the Cloud and Big Data are somewhat at odds for the moment; Big Data is typically measured in terabytes and petabytes, which can be cost-prohibitive to put in the Cloud (again, think about one of Hadoop's benefits: you can use cheaper hardware), but this will change over time as bandwidth, technology and security improve (not mentioned above, but many online Cloud services won't work for many financial or healthcare companies... yet). You can run Hadoop in the Cloud, but I believe most groups don't; that will change in the coming years. It is not 'if'... it is not 'when'... it's more a 'how soon'.

Parul Sharma

Parul Sharma, Big Data Developer at DataFlair (2017-present)

Answered 41w ago · The author has 207 answers and 153.9k answer views

Apache Hadoop is the most popular and powerful big data tool. Hadoop provides the world's most reliable storage layer – HDFS, a batch processing engine – MapReduce, and a resource management layer – YARN. Let's learn the Hadoop introduction in detail.

Introduction to Hadoop

Hadoop is an open-source tool from the ASF – the Apache Software Foundation. Open source means it is freely available and we can even change its source code as per our requirements: if certain functionality does not fulfill your need, you can change it accordingly. Much of Hadoop's code has been contributed by companies such as Yahoo, IBM, Facebook and Cloudera.

It provides an efficient framework for running jobs on multiple nodes of a cluster. A cluster means a group of systems connected via LAN. Apache Hadoop provides parallel processing of data, as it works on multiple machines simultaneously.


Hadoop took its inspiration from Google, which had published papers on the technologies it was using, namely the MapReduce programming model and its file system (GFS). Hadoop was originally written for the Nutch search engine project, on which Doug Cutting and his team were working, but it very soon became a top-level project due to its huge popularity.

Hadoop is an open source framework written in Java. But this does not mean you can code only in Java: you can also write your processing logic in C, C++, Perl, Python, Ruby, etc. (for example via Hadoop Streaming). You can code in any of these languages, but it is usually better to code in Java, as you get lower-level control of the code.


Hadoop consists of three key parts – HDFS (the storage layer), MapReduce (the processing layer) and YARN (the resource management layer).

There are many more insights to learn about Apache Hadoop; follow the link: Hadoop Tutorial – Complete Apache Hadoop Guide For Beginners

Hadoop Features and Characteristics


2.1. Open source

Apache Hadoop is an open source project. It means its code can be modified according to business requirements.

2.2. Distributed Processing

As data is stored in a distributed manner in HDFS across the cluster, data is processed in parallel on a cluster of nodes.

2.3. Fault Tolerance

By default, 3 replicas of each block are stored across the cluster in Hadoop, and this can also be changed as per requirement. So if any node goes down, the data on that node can easily be recovered from other nodes. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
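
As a small illustration of the point above, here is a hedged sketch, assuming a configured HDFS client on the classpath, of how the replication factor can be changed: cluster-wide through the dfs.replication property, or per file with FileSystem.setReplication. The file path is a hypothetical placeholder.

// Illustrative sketch of adjusting HDFS replication (path is hypothetical).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // default number of replicas per block

        FileSystem fs = FileSystem.get(conf);
        // Keep 5 replicas of this particular file, e.g. for a frequently read dataset.
        fs.setReplication(new Path("/data/important.csv"), (short) 5);
        fs.close();
    }
}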

2.4. Reliability

Due to the replication of data in the cluster, data is reliably stored on the cluster of machines despite machine failures. Even if your machine goes down, your data remains stored reliably.

2.5. High Availability

Data is highly available and accessible despite hardware failure, due to the multiple copies of the data. If a machine or a piece of hardware crashes, the data will be accessed from another path.

There are many more features to learn about Apache Hadoop; follow the link: Hadoop Features and Design Principles

Ritika Prasad

Ritika Prasad, PMP / Big Data / AWS / Azure / Trainer

Answered 51w ago · The author has 233 answers and 249.5k answer views

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes so that they process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;
  • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing (a minimal job-driver sketch follows this list).
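
To show how these modules fit together, here is a hedged job-driver sketch: the input and output live in HDFS, the map and reduce classes implement the MapReduce model, and YARN schedules the job on the cluster. The WordCount.TokenizerMapper and WordCount.IntSumReducer names refer to the WordCount classes sketched earlier in this thread and are assumptions, not code from this answer.

// Illustrative MapReduce job driver; input and output paths are HDFS directories.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are taken from the command line (hypothetical here).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit to the cluster (scheduled by YARN when mapreduce.framework.name=yarn) and wait.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}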

Some of the reasons organizations use Hadoop are its ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low cost.

  • Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.
  • Reliability – large computing clusters are prone to failure of individual nodes. Hadoop is fundamentally resilient: when a node fails, processing is redirected to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
  • Flexibility – unlike traditional relational database management systems, you don't have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply a schema to the data when it is read (a short sketch follows this list).
  • Low cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.
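
To illustrate the schema-on-read idea from the flexibility point, here is a small plain-Java sketch, not tied to any particular Hadoop API: raw comma-separated lines are stored exactly as they arrived and only parsed into typed fields at read time. The record layout and values are made-up examples.

// Schema-on-read sketch: no schema is enforced at write time; it is applied when reading.
import java.util.List;

public class SchemaOnRead {
    // A typed view applied only when the raw data is read.
    record PageView(String userId, String productId, long timestamp) {}

    public static void main(String[] args) {
        // Raw data kept exactly as it arrived.
        List<String> rawLines = List.of(
                "u1,prod42,1700000000",
                "u2,prod17,1700000060");

        // Parse and apply the schema at read time.
        for (String line : rawLines) {
            String[] f = line.split(",");
            PageView view = new PageView(f[0], f[1], Long.parseLong(f[2]));
            System.out.println(view);
        }
    }
}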

More Information about Hadoop certification exam

Here I am providing some useful information for preparing for the Hadoop certification exam.


I hope this helps!

If you are looking for any help in preparing for the Hadoop certification exams, please send me a message

Sangeeta Vishkarma

Sangeeta Vishkarma, Digital Marketing at Gurugram, Haryana, India (2016-present)

Answered 59w ago


Object storage system provider Cloudian Inc. and deep learning company Skymind Inc. are teaming up to provide data management solutions for artificial intelligence systems.

Emerging technologies such as AI and machine learning are seeing rapid adoption as companies bet on their potential to tackle problems that simply couldn’t be solved before. But Cloudian and Skymind say AI and machine learning present their own challenges, as the massive data sets needed to train these systems become increasingly difficult to manage.


To tackle this, Cloudian said, it’s joining with Skymind to offer a number of integrated and enterprise-ready AI and machine learning solutions based on its object storage platform. The plan is to integrate Cloudian’s HyperStore object storage systems for on-premises deployments with Skymind’s Deeplearning4j framework, which is used to train deep learning models and provide actionable intelligence from the data it crunches.

The Deeplearning4j framework is an alternative to Google Inc.’s better-known TensorFlow deep learning library. Both are based on the Java programming language, but Skymind said its offering is better-suited to enterprises because it comes with commercial support that’s not available with TensorFlow. In addition, the Deeplearning4j framework supports big-data technologies such as Apache Hadoop and Apache Spark, and it’s optimized for Intel Corp.’s newest-generation Xeon Phi server processors, IBM Corp.’s POWER series processors and Nvidia Corp.’s graphic processing units. Skymind claimed that this improves hardware utilization and therefore makes it possible to train artificial intelligence algorithms faster.

The new partnership centers around training data for AI and machine learning systems, the companies said. Basically, Cloudian provides the hardware to extract and store this data, before making it accessible to the Deeplearning4j framework, which trains the actual systems.


“AI and machine learning presents two storage challenges,” said Michael Tso, co-founder and chief executive officer of Cloudian. “First, you need lots of capacity, because to have effective pattern recognition, you need a lot of training data to refine your detection algorithm.”

The second problem is that this training data needs to be “tagged” in order to improve the accuracy of machine learning algorithms.

Tso cited an example use case last year, when Cloudian demonstrated its object storage platform’s suitability for AI workloads, deploying its systems inside advertising billboards in Japan in order to deliver targeted advertising to individual drivers based on the likely preferences of each car owner’s demographic.

“If you have a car picture – a BMW in the rain at night, for example – you’ll want to tag the picture with the car type, weather conditions and time of day to improve your matching accuracy,” Tso explained. “Cloudian stores user-defined tags with the data to make that record keeping easy.”

The companies said their combined technologies should prove especially useful to industries such as healthcare, financial services, security and robotics.

They’re also touting a new security monitoring tool that they say is the first of several integrated AI- and machine learning-based solutions they plan to offer. The new tool is designed to identify attacks on corporate networks, using pattern recognition to identify suspicious activity. In this case, Cloudian’s HyperStore object storage systems take care of the data management, which in turn supports the training environment, algorithm development and ongoing refinement that’s handled by Deeplearning4j.

[Screenshot: Cloudian and Skymind’s new network intrusion detection tool]

“Deep learning can produce state-of-the-art accuracy on many problems, and it requires large datasets to do that,” said Skymind CEO Chris Nicholson. “That’s why it makes sense for a deep-learning company like Skymind to partner with a hyperscale storage provider like Cloudian. These are two crucial parts of a larger solution.”

