Hadoop

HDFS Notes for beginners


CDP version 7

HDFS is a core component of the Hadoop ecosystem. It has two key components:

NameNode:

  • Manages the DataNodes
  • Stores the metadata
  • Preferably runs on enterprise-grade machines with plenty of CPU and RAM

DataNode:

  • Stores the actual data
  • Can run on commodity servers

Metadata on the NameNode is maintained in two files: the edit log and the fsimage. In Hadoop, the directory structure of the filesystem is called the namespace, and it is managed by the NameNode.

The fsimage is a point-in-time snapshot of the filesystem metadata. The edit log keeps track of the changes made to HDFS after that fsimage was written, so the current namespace is the fsimage with the edit log applied on top of it.

The active NameNode maintains both the fsimage and the edit log, and any modification to its namespace is written to the shared edits on the JournalNodes. A standby NameNode reads those edits from the JournalNodes and applies them to its own namespace.

Now imagine that your active NameNode fails. To make the NameNode resilient, it is ideal to run a standby NameNode on a separate rack of hardware.
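To make the relationship between the two files concrete, here is a minimal toy sketch in Python. It is not how HDFS stores metadata (the real fsimage and edit log are binary on-disk formats), and every name in it (`ToyNameNodeMetadata`, `log_edit`, `checkpoint`) is made up for illustration. It only shows the idea: the live namespace is the snapshot plus the replayed edits, and a checkpoint folds the edits into a new snapshot.

```python
# Toy model only: real HDFS keeps the fsimage and edit log in binary files on disk.
# All class and method names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ToyNameNodeMetadata:
    # fsimage: snapshot of the namespace as of transaction `fsimage_txid`
    fsimage: dict = field(default_factory=dict)
    fsimage_txid: int = 0
    # edit log: (txid, op, path) records written after the snapshot
    edit_log: list = field(default_factory=list)

    def log_edit(self, op, path):
        """Append one namespace change to the edit log."""
        txid = self.fsimage_txid + len(self.edit_log) + 1
        self.edit_log.append((txid, op, path))
        return txid

    def current_namespace(self):
        """Rebuild the live namespace: load the snapshot, then replay the edits."""
        namespace = dict(self.fsimage)
        for _txid, op, path in self.edit_log:
            if op == "create":
                namespace[path] = {"type": "file"}
            elif op == "mkdir":
                namespace[path] = {"type": "dir"}
            elif op == "delete":
                namespace.pop(path, None)
        return namespace

    def checkpoint(self):
        """Fold the edit log into a new fsimage and start an empty log."""
        self.fsimage = self.current_namespace()
        if self.edit_log:
            self.fsimage_txid = self.edit_log[-1][0]
        self.edit_log = []


meta = ToyNameNodeMetadata()
meta.log_edit("mkdir", "/data")
meta.log_edit("create", "/data/a.txt")
meta.checkpoint()                        # fsimage now covers txid 1 and 2
meta.log_edit("delete", "/data/a.txt")   # recorded only in the edit log

print(sorted(meta.fsimage))              # the snapshot still lists /data/a.txt
print(sorted(meta.current_namespace()))  # replaying the edit removes it
```

The longer the edit log grows, the more work a restart has to do to replay it, which is why the edits are periodically merged into a fresh fsimage.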

The standby NameNode keeps a recent fsimage by reading the shared edit log from the JournalNodes through the Quorum Journal Manager (QJM). In the event of a failover, the standby NameNode makes sure it has read all the edits from the JournalNodes and then promotes itself to active NameNode. To keep block location information accurate, DataNodes are configured with the location of all the NameNodes, and they send block location information and heartbeats to every NameNode machine.
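The failover sequence can be sketched with a small toy model in Python. This is an illustration under simplifying assumptions, not the HDFS implementation: all names (`ToyJournalNode`, `ToyJournalQuorum`, `ToyNameNode`) are invented, an edit is treated as durable once a majority of JournalNodes accept it, and the log recovery and fencing that real HDFS performs during failover are left out.

```python
# Toy model only: names are hypothetical, and real HDFS adds fencing and
# edit log recovery that this sketch skips.
class ToyJournalNode:
    def __init__(self):
        self.edits = {}   # txid -> (op, path)
        self.up = True

    def write(self, txid, edit):
        if not self.up:
            return False
        self.edits[txid] = edit
        return True


class ToyJournalQuorum:
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, txid, edit):
        """Treat an edit as durable once a majority of JournalNodes accept it."""
        acks = sum(node.write(txid, edit) for node in self.nodes)
        if acks <= len(self.nodes) // 2:
            raise RuntimeError(f"no quorum for txid {txid}")

    def read_after(self, last_txid):
        """Return edits newer than last_txid, merged from the reachable nodes."""
        merged = {}
        for node in self.nodes:
            if node.up:
                merged.update(node.edits)
        return [(t, merged[t]) for t in sorted(merged) if t > last_txid]


class ToyNameNode:
    def __init__(self, name, journal):
        self.name = name
        self.journal = journal
        self.state = "standby"
        self.namespace = set()
        self.last_txid = 0

    def apply(self, txid, edit):
        op, path = edit
        if op == "create":
            self.namespace.add(path)
        elif op == "delete":
            self.namespace.discard(path)
        self.last_txid = txid

    def write_edit(self, op, path):
        """Active only: persist to the journal quorum, then apply locally."""
        if self.state != "active":
            raise RuntimeError(f"{self.name} is not active")
        txid = self.last_txid + 1
        self.journal.write(txid, (op, path))
        self.apply(txid, (op, path))

    def catch_up(self):
        """Standby: read the shared edits and apply them to the local namespace."""
        for txid, edit in self.journal.read_after(self.last_txid):
            self.apply(txid, edit)

    def promote(self):
        """Failover: read every outstanding edit before becoming active."""
        self.catch_up()
        self.state = "active"


journal = ToyJournalQuorum([ToyJournalNode() for _ in range(3)])
nn1 = ToyNameNode("nn1", journal)
nn2 = ToyNameNode("nn2", journal)
nn1.state = "active"

nn1.write_edit("create", "/a")
journal.nodes[0].up = False      # one JournalNode down, a majority still accepts
nn1.write_edit("create", "/b")

nn1.state = "failed"             # simulate losing the active NameNode
nn2.promote()
print(nn2.state, sorted(nn2.namespace))
```

Because the standby replays everything in the journal before it takes over, it ends up with the same namespace the failed active NameNode had, even though one JournalNode missed the last edit.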

References: https://blog.cloudera.com/a-guide-to-checkpointing-in-hadoop/

https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/fault-tolerance/topics/cr-namenode-architecture.html
