Big Data and Hadoop Interview Questions
Q. What is Big Data?
A. Big Data is a term for data sets so large and complex that it is difficult to capture, store, process, retrieve, and analyze them with traditional database management tools or data processing techniques.
Q. What are the characteristics of Big Data?
1. Volume - The sheer amount of data organizations collect from a variety of sources, including social media, stock markets, aircraft sensors, and e-commerce websites.
2. Variety - The different types and formats of the data (audio, images, video, text).
3. Velocity - The high speed at which the data is generated and must be processed.
Q. How is analysis of Big Data useful for organizations?
A. The major goal of Big Data analysis is better decision making. Organizations learn which areas to focus on and which areas are less important, and they gain early indicators that can prevent large losses. Big Data can be analyzed with software tools such as data mining tools, text analytics, statistical methods, mainstream BI software, and visualization tools.
Q. What are the challenges in handling big data?
1. Difficulties - capture, storage, search, sharing, and analytics
2. Data storage - physical storage, acquisition, and space and power costs
3. Data processing - information and content management
Q. What is the basic difference between traditional RDBMS and Hadoop?
A. RDBMS is a traditional row-and-column database used in transactional systems to report and archive data, whereas Hadoop is an approach for storing huge amounts of data in a distributed file system and processing it. Hadoop can handle much larger data sets than a relational database. RDBMS works on structured data, whereas Hadoop also works with unstructured and semi-structured data.
Q. What is Hadoop?
A. When Big Data emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework that provides various services and tools to store and process Big Data. It was inspired by Google's papers on MapReduce and the Google File System, and it is used by major players including Yahoo, Facebook, and IBM.
Q. In what format does Hadoop handle data?
A. Hadoop's MapReduce layer handles data as key/value pairs.
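For illustration, here is a minimal sketch of how a record reaches a mapper as a key/value pair, using the org.apache.hadoop.mapreduce API. The LineLengthMapper class and its output pairing are hypothetical examples, not part of Hadoop itself.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the default TextInputFormat, the framework hands each record to
// map() as a key/value pair: the line's byte offset (key) and the line
// contents (value).
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit a new key/value pair: the line text and its length.
        context.write(line, new LongWritable(line.getLength()));
    }
}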
Q. Why are files stored in a redundant manner in HDFS?
A. To ensure durability against failure. Each block is replicated (three copies by default) across different datanodes, so the data remains available even if a node fails.
Q. What is HDFS?
A. HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment, and it follows a master/slave architecture.
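As a sketch of how a client interacts with HDFS, the following uses the standard FileSystem API to write and then read back a file. The path /tmp/hello.txt is a hypothetical example, and the namenode address is assumed to come from a core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the namenode address) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            // The client writes the block data directly to datanodes.
            out.writeBytes("hello hdfs\n");
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}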
Q. What is a Namenode?
A. The namenode is the master of the system. It runs on reliable, high-end hardware, yet in Hadoop 1.x it is a single point of failure for HDFS. The namenode holds the metadata for HDFS, such as namespace information and block locations.
Q. What happens when a datanode fails?
A. The namenode detects the failure through missed heartbeats and re-replicates the blocks that were on the failed node to other datanodes. The JobTracker re-schedules any tasks that were running on the failed node onto other nodes.
Q. What is a Datanode?
A. Datanodes are where the actual data is stored. Clients read and write blocks directly to the datanodes; the namenode only tells them which datanodes to use.
Q. What is the default block size in HDFS?
A. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later).
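If a different block size is needed, it can be overridden per client. A minimal sketch, assuming a Hadoop 1.x cluster where the property is named dfs.block.size (it is dfs.blocksize in Hadoop 2.x); the file path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the block size for files created by this client.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB
        FileSystem fs = FileSystem.get(conf);
        // Files created through this FileSystem now use 128 MB blocks.
        fs.create(new Path("/tmp/large-file.dat")).close();
    }
}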
Q. What is MapReduce?
A. MapReduce is the heart of Hadoop. It is a programming paradigm that processes large data sets across hundreds or thousands of servers in a Hadoop cluster, and a framework with which we can write applications that process huge amounts of data in parallel.
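The canonical illustration is word count: the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts per word. A compact version against the org.apache.hadoop.mapreduce API, with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: split each input line into words, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}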
Q. How many daemon processes run on a Hadoop cluster?
A. Hadoop 1.x comprises five separate daemons: the NameNode, Secondary NameNode, and JobTracker run on the master node, while a DataNode and a TaskTracker run on each slave node.
Q. What is a job tracker?
A. The job tracker is a daemon that runs on the master node alongside the namenode. It assigns tasks to the different task trackers; there is only one job tracker per cluster but many task trackers, one per data node. It is a single point of failure: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which it decides whether an assigned task is completed.
Q. What is a task tracker?
A. The task tracker is a daemon that runs on each datanode and manages the execution of individual tasks on that slave node. When a client submits a job, the job tracker initializes it and divides the work among the task trackers, which perform the MapReduce tasks while simultaneously sending heartbeats to the job tracker. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes the task tracker has crashed and assigns its tasks to another task tracker in the cluster.
Q. What is a heartbeat in HDFS?
A. A heartbeat is a periodic signal indicating that a node is alive. Each datanode sends heartbeats to the namenode, and each task tracker sends heartbeats to the job tracker. If the namenode or job tracker stops receiving heartbeats, it concludes that the datanode or task tracker has a problem and is unable to perform its assigned tasks.
Q. What is a block in HDFS?
A. A block is the minimum amount of data that can be read or written. In HDFS the default block size is 64 MB (128 MB in Hadoop 2.x), and files are broken into block-sized chunks, which are stored as independent units.
Q. Are the task trackers and the job tracker present on separate machines?
A. Yes, they run on different machines: the job tracker runs on the master (name node) machine, while the task trackers run on the data node machines. The job tracker is a single point of failure for the Hadoop MapReduce service.
Q. What is a Secondary Namenode?
A. The namenode stores its metadata in an fsimage file plus an edit log of recent changes. The Secondary NameNode is not a hot standby: it periodically downloads these files, merges the edit log into a new fsimage (a checkpoint), and copies the result back to the namenode. This keeps the edit log from growing too large and shortens namenode restart time.
Q. Name some companies that use Hadoop?
A. Some companies that use Hadoop are Yahoo (one of the biggest users and contributor of more than 80% of the Hadoop code), Facebook, Cloudera, Amazon, eBay, Twitter, IBM, etc.
Q. Differentiate between Structured, Unstructured and Semi-structured data?
A. Data that can be stored in traditional database systems in the form of rows and columns is referred to as structured data. Data that can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganized, raw data that cannot be categorized as structured or semi-structured is referred to as unstructured data. Text files, images, videos, email, customer service interactions, webpages, PDF files, etc. are all examples of unstructured data.
Q. What are the main components of a Hadoop Application?
A. The core components of a Hadoop application are:
1. Hadoop Common
2. HDFS
3. Hadoop MapReduce
4. YARN
Q. What are the default port numbers for the NameNode, Task Tracker and Job Tracker?
A. The default web UI ports are:
1. NameNode - 50070
2. Job Tracker - 50030
3. Task Tracker - 50060
Q. What happens when a user submits a Hadoop job while the NameNode is down? Does the job go on hold, or does it fail?
A. The Hadoop job fails when the NameNode is down; it is not held.
Q. Can Hadoop handle streaming data?
A. Yes. Through technologies like Apache Kafka, Apache Flume, and Apache Spark, large-scale streaming ingestion and processing are possible alongside Hadoop.
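As a sketch of the ingestion side, here is a minimal Kafka producer publishing an event that a downstream consumer (for example, Spark Streaming) could process and land in HDFS. The broker address, the "clicks" topic, and the record contents are all hypothetical.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickstreamProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer =
                 new KafkaProducer<>(props)) {
            // Publish one event to the hypothetical "clicks" topic.
            producer.send(new ProducerRecord<>("clicks", "user42", "page=/home"));
        }
    }
}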
Q. What platform and Java version is required to run Hadoop?
A. Java 1.6.x or a higher version is good for Hadoop. Linux and Windows are the supported operating systems for Hadoop.
Q. What kind of Hardware is best for Hadoop?
A. The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4 GB or 8 GB of RAM that use ECC memory.
Q. What are the most common input formats defined in Hadoop?
1. TextInputFormat (the default)
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
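A minimal sketch of switching a job from the default TextInputFormat to KeyValueTextInputFormat (which splits each line at the first tab into a key and a value), assuming the newer org.apache.hadoop.mapreduce API; the job name is arbitrary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "kv-input-demo");
        // TextInputFormat is assumed unless overridden; switch to
        // KeyValueTextInputFormat for tab-delimited key/value lines.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}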
Q. Is it necessary to know Java to learn Hadoop?
A. A background in any programming language like Java, C, C++, or PHP is really helpful. If you know no Java at all, you will need to learn it, along with basic SQL.
Q. Which data storage components are used by Hadoop?
A. HDFS is Hadoop's own storage layer. HBase, a column-oriented store that runs on top of HDFS, is the data storage component most commonly used with Hadoop.