Sunday, November 28, 2010

Hadoop Distributed File System (HDFS) Interview Questions

Q1. What is HDFS?  HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to it. Files are stored redundantly across multiple machines to ensure durability against failure and high availability to highly parallel applications.

Q2. What does the statement "HDFS is a block-structured file system" mean?  It means that in HDFS individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity.
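The idea can be illustrated with a minimal sketch (plain Python, not Hadoop code): a file's bytes are cut into fixed-size chunks, and only the last chunk may be smaller than the block size.

```python
# Illustrative sketch of block-structured storage (not real Hadoop code).
# A tiny block size is used for demonstration; HDFS blocks are 64-128 MB.
def split_into_blocks(data: bytes, block_size: int):
    """Return the fixed-size blocks for a file's contents.
    Only the last block may be smaller than block_size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"hello world!!!", 4)
print(blocks)  # [b'hell', b'o wo', b'rld!', b'!!']
```

In real HDFS each of these blocks would then be placed (and replicated) on different datanodes.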

Q3. What does the term "replication factor" mean?  The replication factor is the number of copies of a file (more precisely, of each of its blocks) that HDFS maintains.

Q4. What is the default replication factor in HDFS?  3

Q5. What is the typical size of an HDFS block?  64 MB to 128 MB
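Both of these defaults are cluster-wide settings. In Hadoop 2.x and later they can be overridden in hdfs-site.xml via the dfs.replication and dfs.blocksize properties (the block-size property was named dfs.block.size in older releases); the values below are a sketch, not a recommended configuration.

```xml
<!-- hdfs-site.xml: where the replication factor and block size are configured -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>            <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>    <!-- 128 MB, in bytes -->
  </property>
</configuration>
```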

Q6. What is the benefit of such a big block size (compared to the block size of a Linux file system like ext)?  It decreases the amount of metadata storage required per file (the list of blocks per file is smaller when individual blocks are larger). Furthermore, it allows for fast streaming reads of data, by keeping large amounts of data sequentially laid out on the disk.

Q7. Why is it recommended to have a few very large files instead of a lot of small files in HDFS?  The Namenode holds the metadata for every file in HDFS, and it loads all of that metadata into memory for speed. More files mean more metadata, so a very large number of files can make the metadata big enough to exceed the memory available on the Namenode.
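A back-of-the-envelope calculation makes the point. The Namenode keeps roughly one in-memory object per file plus one per block; the ~150 bytes per object used below is a commonly cited rough estimate, not a specification.

```python
import math

# Rough estimate, not a spec: in-memory cost per Namenode object.
BYTES_PER_OBJECT = 150

def namenode_objects(num_files: int, file_size: int, block_size: int) -> int:
    """Approximate Namenode object count: one per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file)

MB = 1024 ** 2
# Roughly 1 TB of data, stored two ways (128 MB blocks):
few_large  = namenode_objects(10_000, 100 * MB, 128 * MB)         # 20,000 objects
many_small = namenode_objects(10_000_000, 100 * 1024, 128 * MB)   # 20,000,000 objects
print(few_large * BYTES_PER_OBJECT)    # ~3 MB of Namenode heap
print(many_small * BYTES_PER_OBJECT)   # ~3 GB of Namenode heap
```

Same total data, but the small-file layout needs roughly a thousand times more Namenode memory.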

Q8. True/false question. What is the lowest granularity at which you can apply a replication factor in HDFS?
- You can choose the replication factor per directory: True
- You can choose the replication factor per file in a directory: True
- You can choose the replication factor per block of a file: False

Q9. What is a datanode in HDFS?  Individual machines in the HDFS cluster that hold blocks of data are called datanodes.

Q10. What is a Namenode in HDFS?  The Namenode stores all the metadata for the file system.

Q11. What alternate way does HDFS provide to recover data in case a Namenode, without backup, fails and cannot be recovered?  There is none. If the Namenode dies and there is no backup of its metadata, there is no way to recover the data.

Q12. Describe how an HDFS client reads a file in HDFS. Will it talk to the datanode or the Namenode? How does the data flow?  To open a file, a client contacts the Namenode and retrieves a list of locations for the blocks that comprise the file. These locations identify the datanodes which hold each block. Clients then read file data directly from the datanode servers, possibly in parallel. The Namenode is not directly involved in this bulk data transfer, keeping its overhead to a minimum.
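The read path above can be sketched as a toy simulation. Everything here (the dictionaries, hosts, block IDs, and the read_file helper) is hypothetical illustration, not the real Hadoop API; it only shows the two-step flow: one metadata round-trip to the Namenode, then bulk reads straight from datanodes.

```python
# Toy simulation of the HDFS read path (all names hypothetical, not Hadoop APIs).
NAMENODE_METADATA = {  # filename -> ordered list of (block_id, [datanode hosts])
    "/logs/app.log": [("blk_1", ["dn1", "dn2", "dn3"]),
                      ("blk_2", ["dn2", "dn3", "dn4"])],
}
DATANODE_STORAGE = {  # (host, block_id) -> block contents
    ("dn1", "blk_1"): b"first block|", ("dn2", "blk_1"): b"first block|",
    ("dn2", "blk_2"): b"second block", ("dn3", "blk_2"): b"second block",
}

def read_file(path: str) -> bytes:
    # Step 1: a single metadata round-trip to the Namenode.
    block_locations = NAMENODE_METADATA[path]
    data = b""
    # Step 2: bulk data is fetched directly from a datanode holding each block;
    # the Namenode takes no part in this transfer.
    for block_id, hosts in block_locations:
        replica_host = next(h for h in hosts if (h, block_id) in DATANODE_STORAGE)
        data += DATANODE_STORAGE[(replica_host, block_id)]
    return data

print(read_file("/logs/app.log"))  # b'first block|second block'
```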

Q13. Using the Linux command line, how will you
- List the files in an HDFS directory: hadoop fs -ls
- Create a directory in HDFS: hadoop fs -mkdir
- Copy a file from your local directory to HDFS: hadoop fs -put localfile hdfsfile OR hadoop fs -copyFromLocal localfile hdfsfile
