25

Introduction to Hadoop

 4 years ago
source link: https://www.tuicool.com/articles/Vf6ziub
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

What is Hadoop: Architecture, Modules, Advantages, History

Nov 3 ·3min read

RzuA32q.jpg!web

In this blog, I am going to provide a brief overview of Hadoop , which is the most popular frameworks for distributed processing of large data sets.

We will try to understand the architecture of Hadoop, its modules and will also look at the advantages of using Hadoop. In the end, we’ll take a look at its history to highlight some key people involved in the actual development of this great framework.

What is Hadoop ?

Hadoop is an open-source, Java-based framework from Apache which is used for storing, processing and analyzing data which are very huge in volume.

Hadoop is used for batch/ offline processing . It is a collection of software utilities which uses a network of many computers to solve problems involving large amounts of data and computation .

Architecture :

The architecture of Hadoop involves a package of the file system and operating system abstractions which is called the Hadoop Common package, containing a MapReduce engine (processing part) and the Hadoop Distributed File System (storage part).

The Hadoop Common package contains the JAR (Java ARchive) files and scripts needed to start Hadoop.

yIvUfaM.png!web

High Level Architecture of Hadoop

Every Hadoop cluster consists of a single master and multiple worker nodes. The Master node has a Job Tracker, Task Tracker, Name Node and Data Node while the Slave (worker node) can act as both a DataNode and TaskTracker. Also it is possible to have data-only and compute only worker nodes.

The standard startup and shutdown scripts require that Secure Shell (SSH) to be set up between nodes in the cluster. Apart from this Hadoop requires a JRE (Java Runtime Environment) of 1.6 or greater.

Modules of Hadoop :

The Hadoop framework is composed of the following modules :

  1. Hadoop Distributed File System (HDFS) : It includes the files that will be broken into blocks and will be stored in nodes over a distributed architecture. Using a distributed file system provides very high aggregate bandwidth across clusters
  2. Hadoop Yarn (Yet Another Resource Negotiator) : Used for job scheduling and managing the computing resources in clusters.
  3. Hadoop MapReduce : It is an algorithm which distributes the task into small pieces and assigns those pieces to many computers joined over the network, and assembles all the events to form the last event dataset.
  4. Hadoop Common : Includes Java Libraries that are used to start Hadoop and utilities which are needed by other Hadoop modules.

Advantages of Hadoop :

  1. Hadoop is fast , as it allows the users to quickly write and test distributed systems. It stores and processes huge amounts of any kind of data, quickly.
  2. Hadoop is efficient , and it automatically distributes the data and work across the machines. Through this, it utilizes the underlying parallelism of the CPU cores.
  3. Hadoop is Scalable , servers can be added or removed from the cluster dynamically. You can easily grow your system to handle more data by simply adding nodes.
  4. Hadoop is cost-effective in comparison to other traditional databases when it comes to storing and performing computations on data.
  5. Hadoop is an open-source framework and is compatible on all the platforms since it is based on Java.
  6. Hadoop library has been designed to detect and handle failures at the application layer. It does not rely on the hardware to provide fault-tolerance and high availability (FTHA).

History

Conclusion

So that’s it folks you’ve learned the basics of Hadoop framework. We saw the architecture of Hadoop and looked at the modules which make it the best framework for distributed processing. Also, we looked at some of the advantages of using Hadoop and tried to understand the timeline involved in its development process.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK