Licht Tech Limited, a Competency Enrichment and Skills Development Company is committed to empowering individuals and organisations with capacity building ...
Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. The platform works by distributing Hadoop big data and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Some key benefits of Hadoop are scalability, resilience and flexibility. The Hadoop Distributed File System (HDFS) provides reliability and resiliency by replicating any node of the cluster to the other nodes of the cluster to protect against hardware or software failures. Hadoop flexibility allows the storage of any data format including structured and unstructured data.
The Apache Hadoop has three components. They are:
Hadoop Common: contains libraries and utilities needed by other Hadoop modules
Hadoop HDFS: Hadoop Distributed File System (HDFS) is the storage unit.
Hadoop MapReduce: Hadoop MapReduce is the processing unit.
Hadoop YARN: Yet Another Resource Negotiator (YARN) is a resource management unit.
Hadoop HDFS
Data is stored in a distributed manner in HDFS. There are two components of HDFS - name node and data node. While there is only one name node, there can be multiple data nodes.
HDFS is specially designed for storing huge datasets in commodity hardware. Hadoop enables you to use commodity machines as your data nodes. The features of HDFS are:
Provides distributed storage
Can be implemented on commodity hardware
Provides data security
Highly fault-tolerant - If one machine goes down, the data from that machine goes to the next machine
Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node. A data-containing code is used to process the entire data. This coded data is usually very small in comparison to the data itself. You only need to send a few kilobytes worth of code to perform a heavy-duty process on computers.
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2. Hadoop YARN acts like an OS to Hadoop. It is a file system that is built on top of HDFS. It is responsible for managing cluster resources to make sure you don't overload one machine. It performs job scheduling to make sure that the jobs are scheduled in the right place.
Speed: Hadoop’s concurrent processing, MapReduce model, and HDFS let users run complex queries in just a few seconds.
Diversity: Hadoop’s HDFS can store different data formats, like structured, semi-structured, and unstructured.
Cost-Effective: Hadoop is an open-source data framework.
Resilient: Data stored in a node is replicated in other cluster nodes, ensuring fault tolerance.
Scalable: Since Hadoop functions in a distributed environment, you can easily add more servers.
There’s a steep learning curve. Running a query in Hadoop’s file system will require you to write MapReduce functions with Java, a process that is non-intuitive. Also, the ecosystem is made up of lots of components.
Not every dataset can be handled the same. Hadoop doesn’t give you a “one size fits all” advantage. Different components run things differently, and you need to sort them out with experience.
MapReduce is limited. Yes, it’s a great programming model, but MapReduce uses a file-intensive approach that isn’t ideal for real-time interactive iterative tasks or data analytics.
Security is an issue. There is a lot of data out there, and much of it is sensitive. Hadoop still needs to incorporate proper authentication, data encryption, provisioning, and frequent auditing practices.