HDFS

What is HDFS?

• HDFS is Hadoop’s own file system, written in Java.
• HDFS is modeled on the Google File System (GFS).
• HDFS stores the data on the cluster.
• Files in HDFS are WRITE ONCE and not suitable for random writes.
• HDFS is optimized for large, streaming reads of files and not suitable for random reads.
• HDFS is not a regular file system. It is an abstraction layer that sits on top of the native Unix file system.

From an interview perspective, if someone asks, define it as below:

HDFS Definition:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It is a distributed, scalable, and portable file system written in Java for the Hadoop framework, and it has many similarities with existing distributed file systems. HDFS is the primary storage system used by Hadoop applications. It creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, provides high-throughput access to application data, and is suitable for applications that have large data sets.

What is HDFS Architecture?

HDFS is responsible for storing data on the Hadoop cluster.
You can read files from HDFS and write files to HDFS.
Data files are split into blocks before being stored on the cluster.
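
Because HDFS is accessed through a client library rather than mounted as a regular file system, reads and writes go through the FileSystem API. Below is a minimal sketch of reading a file in Java; the NameNode URI and file path are illustrative assumptions, not values from these notes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; normally picked up from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/user/cloudvikas/hello.txt");   // illustrative path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // stream the file contents line by line
            }
        }
        fs.close();
    }
}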

What is the block size in HDFS and what is the default block size?

The typical size of each block is 64 MB or 128 MB.
The default block size is 64 MB (128 MB in Hadoop 2.x and later).

How will you configure the block size in HDFS?

You can configure the block size globally in hdfs-site.xml as follows (the value is in bytes; 134217728 = 128 MB; the property was named dfs.block.size in older releases):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>

What is default Replication factor and how it works?

Blocks belonging to one file are stored on different DataNodes.
Each block is replicated on multiple DataNodes to ensure high reliability and fault-tolerance.
The default replication is three-fold, which means each block exists on three different nodes. You can configure the replication factor globally in hdfs-site.xml as follows (here set to 2 instead of the default 3):

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

How Metadata and NameNode is linked?

• Metadata (the file name, the number of blocks, which DataNodes store each block, etc.) is stored in the NameNode.
• HDFS architecture is based on a Master/Slave architecture which consists of a NameNode and DataNodes.

What is Data pipelining?

  • Client retrieves a list of DataNodes on which to place replicas of a block
  • Client writes block to the first DataNode
  • The first DataNode forwards the data to the next DataNode in the Pipeline

When all replicas are written, the Client moves on to write the next block in the file.

What is NameNode?

• NameNode is the Master node, responsible for storing the metadata related to files, the blocks that make up the files, and the location of the blocks in the cluster.
• The NameNode must be running at all times.
• The NameNode is a critical single point of failure.
• If the NameNode fails then the cluster becomes inaccessible.
• For this very reason, we have the Secondary NameNode.

What is HDFS namenode federation?

In federation, instead of one single NameNode holding all the metadata, multiple NameNodes each hold the metadata (the block mapping of files and directories) for a subset of the entire HDFS. For example, if HDFS contains two directories inside a directory, there may be two NameNodes, each maintaining one of the two directories, so that the load on a single NameNode does not grow too large and, if one NameNode fails, the others keep serving their own portions.
The list of sub-directories maintained by a NameNode is called a namespace volume.
The blocks for files belonging to a namespace are called a block pool.
For these reasons, a single NameNode is no longer responsible for the whole file system, although each namespace still depends on its own NameNode (see the next question).

What is name node high availability?

A. The problem with federation is that if one NameNode goes down, you cannot access the portion of the data that that NameNode is taking care of.
In HDFS high availability, you maintain two NameNodes: one active and one standby. Each NameNode contains the file system tree and block mapping of the entire HDFS, and the edits are shared across both NameNodes. In case of failure, the other NameNode takes charge.
Architectural changes:
- The NameNodes must use highly available shared storage to share the edit log. The edit log is read by the standby NameNode when it takes over the responsibility of the active NameNode.
- DataNodes should send block reports to both NameNodes.
- Checkpointing is done by the standby NameNode.

How can you control block size and replication factor at file level?

If you want to upload a file into HDFS with some specific block size and with some specific replication factor, you can do that by providing the configuration and its value while writing the file into HDFS.

Changing block size:
hadoop fs -Ddfs.block.size=1048576 -put file.txt /user/cloudvikas
hadoop fs -Ddfs.blocksize=1048576 -put file.txt /user/cloudvikas
(dfs.block.size is the older property name; dfs.blocksize is the current one. The value is in bytes; 1048576 = 1 MB.)
Changing replication factor:
hadoop fs -setrep -w 2 /my/file
or
hadoop fs -Ddfs.replication=2 -put file.txt /user/cloudvikas
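
The same per-file overrides can also be applied from the Java client when the file is created. A minimal sketch, assuming the cluster settings come from the classpath configuration; the path and values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileOverrides {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path target = new Path("/user/cloudvikas/file.txt");   // illustrative path
        short replication = 2;                  // override the default of 3
        long blockSize = 64L * 1024 * 1024;     // 64 MB instead of the configured default

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(target, true, 4096, replication, blockSize)) {
            out.writeBytes("sample content\n");
        }

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(target, (short) 3);
        fs.close();
    }
}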

How to control the number of reducers in a map reduce program?

A. The default number of reducers for a MapReduce job is controlled by the mapreduce.job.reduces property (1 by default); tools built on top of MapReduce, such as Hive, typically size this at roughly one reducer per 1 GB of input. You can also override it per job by using the method below:
job.setNumReduceTasks(int n)
The above call sets the number of reducers to the integer you pass to the function as a parameter.
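
For completeness, a minimal sketch of where that call sits in a job driver; the mapper/reducer wiring is omitted and the class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent property-based override: conf.setInt("mapreduce.job.reduces", 4);
        Job job = Job.getInstance(conf, "reducer count example");

        job.setNumReduceTasks(4);   // run exactly 4 reduce tasks for this job

        // ...set mapper, reducer, input/output formats and paths as usual...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}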

How does Hadoop know how many mappers have to be started?

A. The number of mappers equals the number of input splits.
Number of input splits (for a single file) = ceil(size of file / size of input split)
For example, if you have 1 GB (1024 MB) of data and the input split size is 128 MB, then 1024/128 gives you 8, so 8 mappers will be started.
By default, the input split size equals the block size, so the number of input splits equals the number of blocks. So you can say that the number of mappers equals the number of blocks.
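
If you need more (or fewer) mappers than the block count gives you, the split size can be capped per job. A hedged sketch; the 64 MB value is illustrative:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Cap each input split at 64 MB: a 1 GB input then yields
        // ceil(1024 / 64) = 16 splits, and therefore 16 map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}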

What is Secondary NameNode?

• The Secondary NameNode performs periodic checkpoints.
• The Secondary NameNode performs the following tasks periodically:
-> Downloads the current NameNode image and edit log files, and merges them into a new image.
-> Uploads the new image back to the primary NameNode.
• The Secondary NameNode is not exactly a hot backup of the actual NameNode, because DataNodes cannot connect to the Secondary NameNode in the case of a NameNode failure.
• It is just used for recovery of the NameNode in the case of a NameNode failure.

What is DataNode?

• Datanode is the Slave node that stores the blocks of data on its local file system.
• Each DataNode sends a Heartbeat (every 3 seconds by default) and a Blockreport to the NameNode periodically.
• Receipt of a Heartbeat means that the DataNode is functioning properly.
• A Blockreport contains a list of all the blocks available on a DataNode.

How do you write file on the cluster?

The user configures the file replication factor (default 3) and block size (default 64 MB) and asks the Hadoop client to write the file to the cluster. The Hadoop client splits the file into blocks and then contacts the NameNode. The NameNode returns a list of available DataNodes to the Hadoop client, and the client sends the first block to the first DataNode. After receiving the block, the first DataNode forwards the same block to the next DataNode, and so on, forming the replication pipeline. Acknowledgements flow back up the pipeline once each DataNode has received the block successfully, and the DataNodes report their stored blocks to the NameNode.

The Hadoop client repeats the same process for all the other blocks of the file the user is writing to the cluster. When all the blocks are written, the NameNode stores the metadata information.
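
From the application's point of view, this whole pipeline is hidden behind a single call. A minimal sketch, with illustrative local and HDFS paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client splits the file into blocks, asks the NameNode where to
        // place each block, and drives the DataNode replication pipeline
        // internally; the application only sees this one call.
        fs.copyFromLocalFile(new Path("/home/training/purchases.txt"),    // illustrative local path
                             new Path("/user/cloudvikas/purchases.txt")); // illustrative HDFS path
        fs.close();
    }
}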

What are the concepts used in the Hadoop Framework?

Answer: The Hadoop Framework functions on two core concepts:

HDFS : Short for Hadoop Distributed File System, it is a Java-based file system for scalable and reliable storage of large datasets. HDFS itself works on the Master/Slave architecture and stores all its data in the form of blocks.

MapReduce : This is the programming model and the associated implementation for processing and generating large data sets. A Hadoop job is basically divided into two tasks: the map task breaks the data set down into key-value pairs (tuples), and the reduce task takes the output of the map task and combines the data tuples into a smaller set of tuples.
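
As a concrete (and standard) illustration of the two phases, here is a hedged word-count sketch: the map phase emits (word, 1) pairs and the reduce phase combines them into per-word totals. The job driver wiring is omitted.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));   // combine into a smaller set of tuples
        }
    }
}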

How many Input Formats are there in Hadoop? Explain.

Answer: There are the following three input formats in Hadoop –

  1. Text Input Format: the default input format in Hadoop; each line of a text file is a record.
  2. Sequence File Input Format: used to read Hadoop sequence files (binary key-value files).
  3. Key Value Input Format: used for plain text files where each line is split into a key and a value (by a tab character, by default).

Why are nodes added and removed frequently in a Hadoop cluster?

Answer: The following features of the Hadoop framework lead a Hadoop administrator to add (commission) and remove (decommission) DataNodes in a Hadoop cluster. The Hadoop framework utilizes commodity hardware, which is one of its important features; this results in frequent DataNode crashes in a Hadoop cluster. Ease of scaling is another important feature of the Hadoop framework, exercised as the data volume grows rapidly.

What is fault tolerance strategy?

♦ When the NameNode fails, the Secondary NameNode comes into the picture. The NameNode then has to be restored with the help of the merged copy of the NameNode image (the checkpoint kept by the Secondary NameNode).
♦ A DataNode sends a heartbeat message to the NameNode every 3 seconds to inform the NameNode that it is alive. If the NameNode doesn't receive a heartbeat message from a DataNode in 10 minutes, it considers that DataNode to be dead. It then accesses the replica of the block on some other DataNode.

What is Replication strategy in Hadoop?

The default replication factor is 3

The cluster is split in terms of racks where each rack contains DataNodes.

• The 1st replica is placed on the same node where the client is running; if that is not possible, it is placed on another node in the same rack.
• The 2nd replica is placed on a different rack from the 1st.
• The 3rd replica is placed in the same rack as the 2nd, but on a different node.

What do you understand by “Rack Awareness”?

Answer: In Hadoop, Rack Awareness is the algorithm through which the NameNode decides how blocks and their replicas are placed in the Hadoop cluster. This is done via rack definitions that minimize the traffic between DataNodes in different racks. Let's take an example: we know that the default value of the replication factor is 3. According to the "Replica Placement Policy", two replicas of every block are stored in a single rack whereas the third copy is stored in a different rack.

What are Shell command lines in HDFS?

1)Shell Command Line

1. hadoop fs {args}

2. hadoop dfs {args}

3. hdfs dfs {args}

How will you check whether the Hadoop daemons started or not?

Check whether the Hadoop daemons started or not:
[cloudvikas@localhost ~]$ jps
It shows the following output:
3849 Jps

If the Hadoop daemons are not started, then what will you do?

If the Hadoop daemons are not started, then start them as follows:
[cloudvikas@localhost ~]$ start-dfs.sh
[cloudvikas@localhost ~]$ start-yarn.sh

How will you check whether the Hadoop daemons started or not?

Check whether the Hadoop daemons started or not:
[cloudvikas@localhost ~]$ jps
It shows the following output:
4679 NodeManager
3969 NameNode
4393 ResourceManager
4088 DataNode
4800 Jps
4243 SecondaryNameNode

How will you stop the Hadoop daemons?

Stop the Hadoop daemons as follows:
[cloudvikas@localhost ~]$ stop-dfs.sh
[cloudvikas@localhost ~]$ stop-yarn.sh

How will you check whether the Hadoop daemons stopped or not?

Check whether the Hadoop daemons stopped or not:
[cloudvikas@localhost ~]$ jps
It shows the following output:
3849 Jps

What do you know about the Speculative Execution?

Answer: In Hadoop, Speculative Execution is a process that takes place when a task runs slowly on a node. In this process, the master node starts executing another instance of the same task on another node. The task which finishes first is accepted, and the execution of the other is stopped by killing it.
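
Speculative execution can be turned on or off per job. A minimal sketch using the standard Hadoop 2.x property names (both default to true):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationToggle {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution for both map and reduce tasks.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "job without speculative execution");
        // ...configure mapper, reducer and paths as usual...
    }
}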

Which Hadoop command lists the directory?

hdfs dfs -ls shows the directory listing.

How will you create directory in HDFS?

hdfs dfs -mkdir creates the directory in HDFS.

How will you copy a file from the local file system to HDFS?
hdfs dfs -put

How will you display the content of file in HDFS?
hdfs dfs -cat

How will you copy the file from HDFS to local file system?
hdfs dfs -get

How will you remove files?
hdfs dfs-rm

How will you remove a directory?
hdfs dfs -rmdir

How will you remove a directory and its contents?
hdfs dfs -rm -r

What do you know about active and passive NameNodes?

Answer: In a high-availability Hadoop architecture, two NameNodes are present.
Active NameNode – the NameNode that actively serves the Hadoop cluster.
Passive NameNode – the standby NameNode that stores the same data as the Active NameNode.
On the failure of the active NameNode, the passive NameNode takes charge. In this way, there is always a running NameNode in the cluster, so the cluster never becomes unavailable.

1. Show the directory listing of the HDFS root directory
$ hdfs dfs -ls /
2. Show the directory listing of the /user directory
$ hdfs dfs -ls /user
3. Show the directory listing of the /user/india directory
$ hdfs dfs -ls /user/india
4. Show the directory listing of the user's home directory in HDFS
$ hdfs dfs -ls
5. Show the directory listing of the Labs directory in HDFS
$ hdfs dfs -ls Labs
$ hdfs dfs -ls /user/india/Labs

  1. Create lc1 & lc2 directories in the user's home directory in HDFS
    $ hdfs dfs -mkdir lc1
    $ hdfs dfs -mkdir /user/india/lc2
  2. Copy hello.txt from local disk to the lc1 directory
    $ hdfs dfs -put MyData/hello.txt lc1
  3. Copy hai.txt from local disk to the lc1 directory in HDFS
    $ hdfs dfs -put MyData/hai.txt lc1
  4. Copy hellohai.txt from local disk to the lc1 directory
    $ hdfs dfs -put MyData/hellohai.txt lc1
  5. Copy all text files from local disk to the lc2 directory
    $ hdfs dfs -put MyData/*.txt lc2
  6. Display the content of the hello.txt file
    $ hdfs dfs -cat lc1/hello.txt
  7. Display the content of the hai.txt file
    $ hdfs dfs -cat lc1/hai.txt
  8. Copy the students.txt file from HDFS to local disk
    $ hdfs dfs -get lc1/students.txt students.txt
  9. Copy the students.txt file from HDFS to local disk as mystudents.txt
    $ hdfs dfs -get lc1/students.txt MyData/mystudents.txt
  10. Delete all the files from the lc1 directory
    $ hdfs dfs -rm lc1/*
  11. Delete the lc1 directory
    $ hdfs dfs -rmdir lc1
  12. Delete the user2 directory and its contents
    $ hdfs dfs -rm -r user2

How will you print the Hadoop version?

hadoop version

How will you list the Contents of the root directory in HDFS?

hadoop fs -ls /

How will you Report the amount of space used and available on currently mounted filesystem?

hadoop fs -df hdfs:/

How will you count the number of directories,files and bytes under the paths that match the specified file pattern?

hadoop fs -count hdfs:/

How will you run the file system checking utility?

hadoop fsck /

How do you run a cluster balancing utility?

hadoop balancer

How do you create a new directory named “hadoop” below the /user/training directory in HDFS?

hadoop fs -mkdir /user/training/hadoop

How will you add a sample text file from the local directory named “data” to the new directory you created in HDFS?

hadoop fs -put data/sample.txt /user/training/hadoop

How will you list the contents of this new directory in HDFS?

hadoop fs -ls /user/training/hadoop

How will you add the entire local directory called "retail" to the hadoop directory you created under /user/training in HDFS?

hadoop fs -put data/retail /user/training/hadoop

Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory.

The next command will therefore list your home directory, and should show the items you’ve just added there.

hadoop fs -ls

See how much space this directory occupies in HDFS.

hadoop fs -du -s -h hadoop/retail

Delete a file ‘customers’ from the “retail” directory.

hadoop fs -rm hadoop/retail/customers

Ensure this file is no longer in HDFS.

hadoop fs -ls hadoop/retail/customers

Delete all files from the “retail” directory using a wildcard.

hadoop fs -rm hadoop/retail/*

To empty the trash

hadoop fs -expunge

Finally, remove the entire retail directory and all of its contents in HDFS.

hadoop fs -rm -r hadoop/retail

Add the purchases.txt file from the local directory named “/home/training/” to the hadoop directory you created in HDFS

hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/

To view the contents of your text file purchases.txt which is present in your hadoop directory.

hadoop fs -cat hadoop/purchases.txt

Move a directory from one location to other

hadoop fs -mv hadoop apache_hadoop

Copy the purchases.txt file from the "hadoop" directory in HDFS to the "data" directory under /home/training on your local disk.

hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data

The default owner and group names of a file are training:training.

Use ‘-chown’ to change owner name and group name simultaneously

hadoop fs -ls hadoop/purchases.txt

sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt

The default replication factor for a file is 3.

Use ‘-setrep’ command to change replication factor of a file

hadoop fs -setrep -w 2 apache_hadoop/sample.txt

Copy a directory from one cluster (or path) to another

Use the 'distcp' command to copy.

Use the '-overwrite' option to overwrite existing files.

Use the '-update' option to synchronize both directories.

hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

Command to make the NameNode leave safe mode

sudo -u hdfs hdfs dfsadmin -safemode leave

List all the hadoop file system shell commands

hadoop fs

Last but not least, always ask for help!

hadoop fs -help

How is the DataNode failure handled by NameNode?
Answer: The NameNode continuously receives a signal (heartbeat) from every DataNode in the Hadoop cluster that indicates the DataNode is functioning properly. The list of all the blocks present on a DataNode is stored in a block report. If a DataNode fails to send this signal to the NameNode, it is marked dead after a specific time period. The NameNode then replicates the blocks of the dead node to other DataNodes using the previously created replicas.

Explain the NameNode recovery process.
Answer: The process of NameNode recovery helps to keep the Hadoop cluster running, and can be explained by the following steps –
Step 1: To start a new NameNode, utilize the file system metadata replica (FsImage).
Step 2: Configure the clients and DataNodes to acknowledge the new NameNode.
Step 3: Once the new NameNode completes loading the last checkpoint FsImage and receives block reports from the DataNodes, it starts serving clients.

Define “Checkpointing”. What is its benefit?

Answer: Checkpointing is a procedure that compacts the FsImage and edit log into a new FsImage. In this way, the NameNode loads its final in-memory state directly from the FsImage, instead of replaying an edit log. The Secondary NameNode is responsible for performing the checkpointing process.

Name the modes in which Hadoop code can be run.

Answer: There are different modes to run Hadoop code –

1. Fully-distributed mode

2. Pseudo-distributed mode

3. Standalone mode

Why is HDFS used for applications with large data sets, and not for multiple small files?

Answer: HDFS is more efficient for a large data set maintained in a single file than for the same data stored as small chunks across many files. Because the NameNode keeps the file system metadata in RAM, the amount of memory limits the number of files an HDFS file system can hold. In simple words, more files generate more metadata, which in turn requires more memory (RAM). As a rule of thumb, the metadata for each block, file, or directory takes about 150 bytes; for example, 10 million small files (one block each) would need roughly 20 million objects × 150 bytes ≈ 3 GB of NameNode memory.

Is HDFS fault-tolerant? If yes, how?

Answer: Yes, HDFS is highly fault-tolerant. Whenever data is stored on HDFS, it is replicated (copied) to multiple DataNodes. The default replication factor is 3, and it can be changed as per your requirements. If a DataNode goes down, the NameNode copies the data from the remaining replicas to another node, thus making the data available automatically. In this way, HDFS provides fault tolerance and is known as fault-tolerant.

Differentiate HDFS Block and Input Split.

Answer: The main difference between an HDFS Block and an Input Split is that the HDFS Block is the physical division of the data whereas the Input Split is the logical division of the data. For storage, HDFS divides the data into blocks and distributes them across the cluster, while for processing MapReduce first divides the data into input splits and then assigns each input split to a mapper function.