Friday, November 26, 2010

Hadoop Interview questions - Part 2

Q11. Give an example scenario where a combiner can be used and where it cannot be used
There are several examples; the following are the most common ones
- Scenario where you can use a combiner
  Getting the list of distinct words in a file

- Scenario where you cannot use a combiner
  Calculating the mean of a list of numbers, because a mean of partial means is generally not the overall mean
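
To make the mean case concrete, here is a minimal sketch in plain Python (not the Hadoop API). The partition data and the function names are made up for illustration. It shows that averaging per-node averages gives the wrong answer, and that a combiner becomes usable if it emits partial (sum, count) pairs instead of means.

```python
def mean_of_means(partitions):
    """Naive approach: reuse 'mean' as the combiner, then average the local means."""
    local_means = [sum(p) / len(p) for p in partitions]
    return sum(local_means) / len(local_means)


def combine(partition):
    """Combiner-style step: emit a partial (sum, count) pair instead of a mean."""
    return (sum(partition), len(partition))


def reduce_mean(partials):
    """Reducer-style step: merge the partial pairs, then divide exactly once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical map outputs sitting on two different nodes.
partitions = [[1, 2, 3], [10]]

print(mean_of_means(partitions))                      # 6.0, not the true mean
print(reduce_mean([combine(p) for p in partitions]))  # 4.0, the true mean
```

The distinct-words case needs no such trick, since removing duplicates locally and then again in the reducer gives the same result as removing them once.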

Q12. What is job tracker
Job Tracker is the service within Hadoop that runs Map Reduce jobs on the cluster

Q13. What are some typical functions of Job Tracker
The following are some typical tasks of Job Tracker
- Accepts jobs from clients
- It talks to the NameNode to determine the location of the data
- It locates TaskTracker nodes with available slots at or near the data
- It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker 

Q14. What is task tracker
Task Tracker is a node in the cluster that accepts tasks, such as Map, Reduce and Shuffle operations, from a JobTracker

Q15. What's the relationship between Jobs and Tasks in Hadoop
One job is broken down into one or many tasks in Hadoop. 

Q16. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other task tracker, and only if the task fails 4 times (the default setting, which can be changed) will it kill the job
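
The retry behaviour can be pictured with a small simulation. This is only a sketch in plain Python, assuming a limit of four attempts per task. The tracker names, `run_with_retries` and `flaky_task` are hypothetical and are not part of Hadoop.

```python
MAX_ATTEMPTS = 4  # assumed per-task attempt limit, for illustration only


def run_with_retries(task, trackers, max_attempts=MAX_ATTEMPTS):
    """Try the task on successive trackers and return (tracker, result) on success.

    Raises RuntimeError once max_attempts attempts have failed, which stands in
    for the job being killed.
    """
    for attempt in range(max_attempts):
        tracker = trackers[attempt % len(trackers)]  # move on to another tracker
        try:
            return tracker, task(tracker)
        except Exception as err:
            print(f"attempt {attempt + 1} on {tracker} failed: {err}")
    raise RuntimeError(f"task failed {max_attempts} times; job is killed")


def flaky_task(tracker):
    """Hypothetical task that only succeeds on node3."""
    if tracker != "node3":
        raise IOError("disk error")
    return "ok"


print(run_with_retries(flaky_task, ["node1", "node2", "node3"]))
```

Here the first two attempts fail and the third, on a different node, succeeds, so the job carries on with the other 99 tasks.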

Q17. Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative Execution 

Q18. How does speculative execution work in Hadoop
Job tracker makes different task trackers process the same input. When tasks complete, they announce this fact to the Job Tracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the Task Trackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.
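
The "first copy to finish wins" rule can be sketched as a tiny deterministic simulation. This is plain Python, not Hadoop code. The tracker names and finish times are invented, and `speculative_run` is a hypothetical helper.

```python
def speculative_run(attempts):
    """Pick the winning copy of a task.

    attempts maps a tracker name to the (hypothetical) time in seconds at which
    its copy of the task finishes. The earliest finisher becomes the definitive
    copy; every other copy is abandoned and its output discarded.
    """
    winner = min(attempts, key=attempts.get)
    killed = sorted(t for t in attempts if t != winner)
    return winner, killed


winner, killed = speculative_run({"slow-node": 95.0, "backup-node": 40.0})
print(winner, killed)  # backup-node ['slow-node']
```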

Q19. Using the command line in Linux, how will you
- see all jobs running in the Hadoop cluster
- kill a job
Use the following commands, respectively
- hadoop job -list
- hadoop job -kill jobid

Q20. What is Hadoop Streaming  
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations 

Q21. What is the characteristic of the streaming API that makes it flexible enough to run Map Reduce jobs in languages like perl, ruby, awk etc.
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a Map Reduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
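
As an illustration of the stdin/stdout contract, here is a minimal word-count sketch in Python. It assumes tab-separated key/value lines and that the reducer sees its input sorted by key, which is how Streaming normally delivers it. The file name `wc.py` and the `map` / `reduce` argument are choices made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts of consecutive lines that share a key (input must be sorted)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
        # Streaming mode: read records from stdin, write records to stdout.
        step = mapper if sys.argv[1] == "map" else reducer
        for record in step(sys.stdin):
            print(record)
    else:
        # No role given: simulate map -> sort -> reduce locally on sample text.
        shuffled = sorted(mapper(["to be or", "not to be"]))
        print(list(reducer(shuffled)))
```

Saved as `wc.py`, the same script could be tried without a cluster using `cat input.txt | python wc.py map | sort | python wc.py reduce`, and then passed to the streaming jar as the `-mapper` and `-reducer` commands.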


  1. I guess the answer to Q11 can be generalized: a combiner can be used where the operation in the Mapper and Reducer is commutative and associative. That would be more precise.

  2. Yes, that's true. Thanks for the feedback, Vaibhav.