Thursday 29 June 2017

What is Avro Serialization

* Avro serialization is the process of translating data structures or object state into binary or textual form so that the data can be transported over a network or stored on persistent storage.

* Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is also termed marshalling and deserialization is termed unmarshalling.

Features of Avro

Features of Avro

* Avro is a language-neutral data serialization system.

* It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

* Avro creates a binary structured format that is both compressible and splittable. Hence it can be used efficiently as the input to Hadoop MapReduce jobs.

* Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.

* Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.

* Avro creates a self-describing file called the Avro Data File, in which it stores data along with its schema in the metadata section.

* Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.

What is Avro

* Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

* Avro is a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes the data along with its built-in schema into a compact binary format, which can be deserialized by any application.

* Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
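As a small illustration (the record name and fields below are assumed for the example, not taken from this post), an Avro schema declared in JSON might look like this:

    {
      "type": "record",
      "name": "Employee",
      "namespace": "example.avro",
      "fields": [
        {"name": "name",   "type": "string"},
        {"name": "id",     "type": "int"},
        {"name": "skills", "type": {"type": "array", "items": "string"}}
      ]
    }

Because the schema is stored with the data in the Avro Data File, any language that has an Avro library can read records written against it.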

Thursday 22 June 2017

How to stop SCCM from installing Files on System or C drive

* By default, SCCM installs site server role components on the first available NTFS drive.

* To prevent these files from being installed on a particular drive, we need to create a special file on that drive so that SCCM understands it should ignore that drive.

* The magic file name is "no_sms_on_drive.sms".

* You need to create this empty file and place it in the root of the drive that you want SCCM to ignore.
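As a quick sketch (assuming D: is the drive you want SCCM to skip; substitute your own drive letter), the empty file can be created from a Command Prompt or from PowerShell:

    rem Command Prompt - creates a zero-byte file in the root of D:
    type nul > D:\no_sms_on_drive.sms

    # PowerShell equivalent
    New-Item -Path D:\no_sms_on_drive.sms -ItemType File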

Monday 19 June 2017

Types of Big Data

There are 3 types of Big Data:

1. Structured Data:
Data that can be stored and processed in a table (rows and columns) format is called structured data. Structured data is relatively simple to enter, store and analyze.
Example - Relational database management systems (RDBMS).

2. Unstructured Data:
Data with no known form or structure is called unstructured data. The fundamental challenge of unstructured data sources is that they are difficult for non-technical users and data analysts to understand and process.
Example - Text files, images, videos, email, customer service interactions, web pages, PDF files, PPTs, social media data etc.

3. Semi-structured Data:
Semi-structured data is data that is neither raw data nor organized in a relational model like a table. It may be organized in a tree pattern, which is easier to analyze in some cases. XML and JSON documents are semi-structured documents.
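For instance, a small JSON document (the customer record below is a made-up example) is organized as a tree of named fields rather than as fixed rows and columns:

    {
      "customer": "Ravi",
      "orders": [
        {"id": 101, "items": ["bread", "butter"]},
        {"id": 102, "items": ["shampoo", "conditioner"]}
      ]
    }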

What is Sqoop and why is it used?

Sqoop:

Sqoop is an open source framework provided by Apache. It is a command-line interface application for efficiently transferring bulk data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.

Sqoop Import:
The import tool imports individual tables from an RDBMS into HDFS. Each row in a table is treated as a record in HDFS.

Sqoop Export:
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table.
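As a rough sketch of both tools (the connection string, credentials, table names and HDFS paths below are placeholders, not a real system), a typical import and export look like this on the command line:

    sqoop import \
      --connect jdbc:mysql://dbserver/salesdb \
      --username dbuser -P \
      --table customers \
      --target-dir /user/hadoop/customers \
      -m 4

    sqoop export \
      --connect jdbc:mysql://dbserver/salesdb \
      --username dbuser -P \
      --table daily_report \
      --export-dir /user/hadoop/report

Here -P prompts for the database password and -m sets how many parallel map tasks Sqoop uses for the transfer.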

Why Sqoop is used?
For Hadoop developers, the real work starts after data is loaded into HDFS. For this, the data residing in relational database management systems needs to be transferred to HDFS, and may need to be transferred back to the relational database afterwards. Developers can always write custom scripts to move data in and out of Hadoop, but Apache Sqoop provides an alternative.
Sqoop uses the MapReduce framework to import and export the data, which provides parallelism as well as fault tolerance. Sqoop makes developers' lives easier by providing a command-line interface. Developers just need to provide basic information like the source, the destination and the database authentication details in the sqoop command.

Sqoop Connectors:
All the existing Database Management Systems are designed with SQL standard in mind. However, each DBMS differs with respect to dialect to some extent. So, this difference poses challenges when it comes to data transfers across the systems. Sqoop Connectors are components which help overcome these challenges. Data transfer between Sqoop and external storage system is made possible with the help of Sqoop's connectors. Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java's JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.

Architecture of Hadoop

Architecture of Hadoop:

Hadoop follows a master-slave architecture, meaning there is one master machine and multiple slave machines. The data that you give to Hadoop is stored across these machines in the cluster.

Two important components of Hadoop are:

1. HDFS (data storage)
2. MapReduce (analysis and processing)

1. HDFS:
The Hadoop Distributed File System is a distributed file system used to store very large amounts of data. HDFS follows a master-slave architecture, with one master machine (Name Node) and multiple slave machines (Data Nodes). The data that you give to Hadoop is stored across these machines in the cluster.
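As a quick sketch of how a client puts data into HDFS (the paths and file name below are placeholders), the hadoop fs command is the usual entry point:

    hadoop fs -mkdir -p /user/hadoop/input           # create a directory in HDFS
    hadoop fs -put localdata.txt /user/hadoop/input  # copy a local file into HDFS
    hadoop fs -ls /user/hadoop/input                 # list the directory
    hadoop fs -cat /user/hadoop/input/localdata.txt  # print the file contents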

Various components of HDFS are:

a) Blocks:
HDFS is a block-structured file system in which each file is split into blocks of equal size and stored across one or more machines in a cluster. HDFS blocks are 64 MB by default in Apache Hadoop and 128 MB by default in Cloudera Hadoop, but the block size can be increased as per need. If a file in HDFS is smaller than the block size, it does not occupy a full block. Example - If the file size is 10 MB and the HDFS block size is 128 MB, the file takes only 10 MB of space.
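As a minimal configuration sketch (assuming a cluster where you want a 128 MB block size), the block size is set in hdfs-site.xml; the property is named dfs.blocksize in newer releases and dfs.block.size in older ones:

    <configuration>
      <property>
        <!-- block size in bytes: 128 * 1024 * 1024 = 134217728 -->
        <name>dfs.blocksize</name>
        <value>134217728</value>
      </property>
    </configuration>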

b) Name Node:
The name node is the controller/master of the system. It decides how data is distributed to the data nodes and stores the metadata of all the files in HDFS. This metadata includes the file name, the location of each block, the block size and the file permissions.

c) Data Node-
Data nodes are the places where the actual data is stored. They store and retrieve blocks when requested by the client or the name node, and they perform operations such as block creation, deletion and replication as instructed by the name node.

d) Secondary Name Node-
Many people think that the Secondary Namenode is just a backup of the primary Namenode in Hadoop, but it is not a backup node. The Name Node is the primary node, and it stores all the metadata in the fsimage and editlog files. The Secondary Namenode periodically copies these files from the Name Node, merges the editlog into the fsimage in a temporary working area (a checkpoint), and copies the merged fsimage back to the Name Node so that the Name Node's editlog does not grow too large. It cannot take over and serve clients when the Name Node is down.

2. Map reduce:
MapReduce is a framework and processing technique with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. MapReduce programs are written in Java by default, but other languages and tools, such as Apache Pig, can also be used.
The MapReduce algorithm contains two important tasks.

a) Map-
In the map stage, the mapper takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

b) Reduce-
In the reduce phase, the reducer takes the output of the map as its input and combines those data tuples into a smaller set of tuples. The reduce job is always performed after the map job.
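A minimal sketch of the classic word-count job in Java (class names and the input/output paths passed on the command line are illustrative) shows how the two stages fit together:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map stage: break each input line into words and emit (word, 1)
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce stage: sum the counts for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The mapper emits (word, 1) pairs, Hadoop groups them by key, and the reducer sums the counts for each word.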

Components of MapReduce: 

a) JobTracker:
The job tracker is a daemon that runs on the name node. There is only one job tracker for the name node but many task trackers for the data nodes. It assigns tasks to the different task trackers. It is a single point of failure: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task has completed.
JobTracker process: 
1. JobTracker receives the requests from the client.
2. JobTracker talks to the NameNode to determine the location of the data.
3. JobTracker finds the best TaskTracker nodes to execute tasks.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals then work is scheduled on a different TaskTracker.
6. When the work is completed, the JobTracker updates its status and reports the overall status of the job back to the client.
7. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. 

b) TaskTracker:
The task tracker is also a daemon, and it runs on the data nodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job and divides the work amongst the different task trackers to perform MapReduce tasks. While performing this work, a task tracker simultaneously communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.

TaskTracker process: 
1. The JobTracker submits the work to the TaskTracker nodes.
2. TaskTrackers run the tasks and report the status of each task to the JobTracker.
3. Its function is to follow the orders of the JobTracker and to update the JobTracker with its progress status periodically.
4. TaskTracker will be in constant communication with the JobTracker.
5. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will assign the task to another node.

Characteristics of Hadoop

Characteristics of Hadoop:

1. Robust: 
Handles hardware failure, as data is stored on multiple nodes.

2. Scalable:
Cluster size can be increased by adding more nodes.

3. Simple:
We can write parallel code easily. Hadoop focuses on moving code to the data rather than moving the data.

4. Portable:
Analyzes structured, semi-structured and unstructured data. (Structured - in table format; semi-structured - not in table format but in a well-organised format such as XML or JSON; unstructured - has no fixed format, such as text, images and videos.)

5. Cost Effective:
Hadoop is open source and uses commodity hardware to store data so it is really cost effective as compared to traditional RDBMS.

6. Fault Tolerance:
If any node fails, its tasks are automatically redirected to other nodes. Multiple copies of all data are stored automatically, so even if one node fails, the same data is still available on another node.

What are Daemons in Hadoop and types of daemons

Daemons:
Daemons are the resident programs that constitute a running Hadoop system. In computing terms, a daemon is a process that runs in the background.
Hadoop has five such daemons:
1. NameNode
2. Secondary NameNode
3. DataNode
4. JobTracker
5. TaskTracker

Each daemon runs separately in its own JVM. We discuss these daemons in this post because they are associated with HDFS and MapReduce.

Is Hadoop a Database?

Hadoop is not a database: Hadoop is a distributed file system (plus a processing framework), not a database. It simply uses the native filesystem provided by Linux on each node to store its data.

What is Hadoop

What is Hadoop?

Apache Hadoop is an open-source Java framework used to store, analyze and process Big Data in a distributed environment across clusters of computers. It is used by Facebook, Yahoo, YouTube, Twitter, LinkedIn and many more. Hadoop is a top-level Apache project originally developed by Doug Cutting. It works as a distributed system and is inspired by Google's MapReduce algorithm for running distributed applications. Hadoop can be a good choice if you have a lot of data that you do not yet know what to do with and do not want to lose.
Hadoop development was initiated and led at Yahoo. Hadoop is written in the Java programming language, with some code in C and some commands in shell scripts. Companies that offer Hadoop services include IBM, Amazon Web Services, Cloudera, Microsoft Azure and Dell.

3V's of Big Data

3V's of Big Data:

1.Volume:
The amount of data that we deal with is very large, on the order of petabytes.

2.Variety:
Data comes in all types of formats (text, audio, image, video).

3.Velocity:
Data is generated at a very fast rate. Velocity is the measure of how fast the data is coming in. For time-critical applications, faster processing is very important. Example - Share market data, video streaming.

Applications of Big Data

Applications of Big Data:

Some applications of big data are as follows:

1. Healthcare Providers: 
Big data is used in the field of medicine and healthcare. It is a great help for physicians to keep track of every patient's history.

2. Google Search: 
When we search for anything, Google uses data science algorithms to deliver the best results for our query in a fraction of a second. The next time we search for something, Google gives us recommendations based on our previous searches.

3. Education: 
Big data has a great influence in the education world too. Today almost every course of learning is available online. Along with online learning, there are many other examples of the use of big data in the education industry.

4. Recommender Systems:
A lot of companies use recommender systems to promote their products and suggestions in accordance with a user's interests and the relevance of information. Internet giants like Amazon, Google, Flipkart and many more use this system to improve the user experience. The recommendations are made based on a user's previous search results.
Example - When we search for a product on Amazon, we always get recommendations for similar products. These not only help you find relevant products from the billions available, but also add a lot to the user experience.

5. Banking Zones and Fraud Detection:
Big data is heavily used for fraud detection in the banking sector. It helps uncover suspicious activity, such as misuse of credit and debit cards, and supports business transparency, public analytics for business, and IT strategy fulfilment analytics.

6. Super Market:
Big data analysis is also used in supermarkets for market basket analysis. Market basket analysis is one of the most common and useful types of data analysis for marketing and retailing. Its purpose is to determine which products customers purchase together. A store can use this information to place products frequently sold together in the same area.
Example - People who buy bread also buy butter, or people who buy shampoo might also buy conditioner.

7. Security Enforcement:
Big data is applied to improve national security enforcement. These techniques are used to detect and prevent cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity.

Advantages of Big Data

Advantages of Big Data:

1. Access to large volumes of data.
2. Allows businesses to develop more effective strategies against competitors in less time.
3. Improves decision-making capabilities.
4. Makes it easier to analyse data.
5. Allows businesses to detect errors and fraud quickly.
6. Offers businesses a chance to improve profits and customer service.
7. Integration of both structured and unstructured data.
8. Helps implement new strategies and improve services dramatically.

Big Data and Hadoop - Introduction

What is Big Data?

Big Data refers to data that is very large in size and yet growing exponentially with time, together with the process of storing and analysing that data to extract something meaningful for the organization.

Why do we need Big Data?

For any application that deals with a limited amount of data, we normally use SQL databases such as PostgreSQL, Oracle or MySQL. But what about large applications like Facebook, Google or YouTube? Their data is so large and complex that none of the traditional data management systems is able to store and process it.

Facebook generates 500+ TB of data per day as people upload images, videos, posts etc. Similarly, sending text/multimedia messages, updating Facebook/WhatsApp statuses, comments etc. generates huge amounts of data. If we use traditional data processing applications (SQL/Oracle/MySQL) to handle it, we lose efficiency. So, in order to handle the exponential growth of data, data analysis becomes a required task, and to overcome this problem we use Big Data, which includes both structured and unstructured data.
Traditional data management systems and existing tools face difficulties in processing such big data. R is one of the main computing tools used in statistical education and research, and it is also widely used for data analysis and numerical computing in scientific research.

Where does Big Data come from?

1. Social data: Data coming from social media services such as Facebook likes, photo and video uploads, comments, tweets and YouTube views.
2. Share market: Stock exchanges generate huge amounts of data through their daily transactions.
3. E-commerce sites: Sites like Flipkart, Amazon and Snapdeal generate huge amounts of data.
4. Airplanes: A single airplane can generate 10+ TB of data in 30 minutes of flight time.

What is the need for storing such huge amounts of data?

The main reason for storing data is analysis. Data analysis is a process used to clean, transform and remodel data with a view to reaching a certain conclusion for a given situation. More accurate analysis leads to better decision making, and better decision making leads to increased efficiency and reduced risk.

Example-
1. When we search for anything on e-commerce websites (Flipkart, Amazon), we get recommendations related to the products we searched for. These websites analyse the data we enter and then display the related products accordingly.
Example - When we search for a smartphone, we get recommendations to buy back covers, screen guards etc.

2. Similarly, why does Facebook store our images and videos? The reason is advertisement.
There are two types of marketing:
a) Global marketing - Show an advertisement to all users.
b) Target marketing - Show an advertisement to particular groups/people. In target marketing, Facebook analyses its data and shows advertisements to selected people.
Example - If an advertiser wants to advertise a cricket kit and show that advertisement only to an interested set of people, Facebook keeps a record of all the people who are members of cricket groups or post anything related to cricket, and displays the advertisement to them.

Big Data and Hadoop Interview Questions

Big Data and Hadoop Interview Questions

Q. What is Big Data? 
A. Big Data is a term that describes data sets so large that they are very difficult to capture, store, process, retrieve and analyze with the help of database management tools or traditional data processing techniques.

Q. What are the characteristics of Big Data? 
1. Volume - Organizations collect very large amounts of data from a variety of sources, including social media, the share market, airplanes and e-commerce websites.
2. Variety - The type and nature of the data (audio, image, video, text).
3. Velocity - The speed at which the data is generated is very high.

Q. How is analysis of Big Data useful for organizations? 
A. The major goal of Big Data analysis is good decision making. Organizations learn which areas to focus on and which areas are less important. It provides early indicators that can prevent the company from making a huge loss. Big data can be analyzed with the help of software tools such as data mining tools, text analytics, statistical methods, mainstream BI software and visualization tools.

Q. What are the challenges in handling big data? 
1. Difficulties - capture, storage, search, sharing, analytics
2. Data storage - physical storage, acquisition, space and power costs
3. Data processing - information and content management.

Q. What is the basic difference between traditional RDBMS and Hadoop? 
A. RDBMS is a traditional row-column database used for transactional systems to report and archive data, whereas Hadoop is an approach for storing huge amounts of data in a distributed file system and processing it. Hadoop can handle much bigger data than a relational DB. RDBMS works on structured data, whereas Hadoop can also work on unstructured data.

Q. What is Hadoop? 
A. When Big Data emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services and tools to store and process Big Data. Hadoop is used by major players including Yahoo, Facebook and IBM.

Q. In what format does Hadoop handle data? 
A. Hadoop handles data in key/value format.

Q. Why are files stored in a redundant manner in HDFS? 
A. To ensure durability against failure.

Q. What is HDFS ? 
A. HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows a master-slave architecture.

Q. What is a Namenode? 
A. The name node is the master of the system. It should run on a highly reliable machine because it is a single point of failure in HDFS. The namenode holds the metadata for HDFS, such as namespace information and block information.

Q. What happens when a datanode fails? 
A. When a datanode fails, the namenode first detects the failure. All the tasks that were running on the failed datanode are re-scheduled, and the JobTracker assigns them to another datanode.

Q. What is a Datanode? 
A. The data node is the place where the actual data is stored. Blocks of data are stored on, and served from, the data nodes as directed by the name node.

Q. What is the default block size in HDFS? 
A. The default block size is 64 MB.

Q. What is MapReduce? 
A. MapReduce is the heart of Hadoop. It is a programming paradigm that processes large data sets across hundreds or thousands of servers in a Hadoop cluster. It is a framework with which we can write applications to process huge amounts of data in parallel.

Q. How many daemon processes run on a Hadoop cluster? 
A. Hadoop comprises five separate daemons: NameNode, Secondary NameNode and JobTracker run on the master node, while DataNode and TaskTracker run on the slave nodes.

Q. What is a job tracker?
A. The job tracker is a daemon that runs on the name node. It assigns tasks to the different task trackers. There is only one job tracker for the name node but many task trackers for the data nodes. It is a single point of failure: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task has completed.

Q. What is a task tracker? 
A. The task tracker is also a daemon, and it runs on the data nodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job and divides the work amongst the different task trackers to perform MapReduce tasks. While performing this work, a task tracker simultaneously communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.

Q. What is a heartbeat in HDFS? 
A. A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the namenode, and a task tracker sends heartbeats to the job tracker. If the namenode or job tracker does not receive a heartbeat, it decides that there is some problem with the datanode or task tracker and that it is unable to perform the assigned task.

Q. What is a block in HDFS? 
A. A block is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.

Q. Are task trackers and job tracker present in separate machines? 
A. Yes, the task trackers and the job tracker run on different machines. The job tracker runs on the name node, while the task trackers run on the data nodes. The job tracker is a single point of failure for the Hadoop MapReduce service.

Q. What is a Secondary Namenode? 
A. The Name Node is the primary node, and it stores all the metadata in the fsimage and editlog files. The Secondary Namenode is not a backup node: it periodically copies the fsimage and editlog files from the Name Node, merges the editlog into the fsimage in a temporary working area (a checkpoint), and copies the merged fsimage back to the Name Node so that the Name Node's editlog does not grow too large. It cannot take over and serve clients when the Name Node is down.

Q. Name some companies that use Hadoop? 
A. Some companies that use Hadoop are Yahoo (one of the biggest users and a contributor of more than 80% of the Hadoop code), Facebook, Cloudera, Amazon, eBay, Twitter, IBM etc.

Q. Differentiate between Structured, Unstructured and Semi-structured data? 
A. Data that can be stored in traditional database systems in the form of rows and columns can be referred to as structured data. Data that can be stored only partially in traditional database systems, for example data in XML records, can be referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured is referred to as unstructured data. Text files, images, videos, email, customer service interactions, web pages, PDF files etc. are all examples of unstructured data.

Q. What are the main components of a Hadoop Application?
A. Core components of a Hadoop application are-
1. Hadoop Common
2. HDFS
3. Hadoop MapReduce
4. YARN

Q. What is the port number for NameNode, Task Tracker and Job Tracker? 
1. NameNode-50070
2. Job Tracker-50030
3. Task Tracker-50060

Q. What happens when a user submits a Hadoop job when the NameNode is down. Does the job get in to hold or does it fail? 
A. The Hadoop job fails when the NameNode is down.

Q. Can Hadoop handle streaming data?
A. Yes, through technologies like Apache Kafka, Apache Flume and Apache Spark it is possible to do large-scale streaming.

Q. What platform and Java version is required to run Hadoop?
A. Java 1.6.x or a higher version is good for Hadoop. Linux and Windows are the supported operating systems for Hadoop.

Q. What kind of Hardware is best for Hadoop?
A. The best configuration for executing Hadoop jobs is dual core machines or dual processors with 4GB or 8GB RAM that use ECC memory.

Q. What are the most common input formats defined in Hadoop?
1. Key Value Input Format
2. Text Input Format
3. Sequence File Input Format

Q. What happens when a data node fails?
A. If a data node fails, the job tracker and the name node detect the failure. All tasks that were running on the failed node are then re-scheduled on other nodes, and the name node replicates the user's data to another node.

Q. Is it necessary to know java to learn Hadoop?
A. A background in any programming language like Java, C, C++ or PHP can be really helpful, but if you know no Java at all, it is necessary to learn Java and also to get a basic knowledge of SQL.

Q. Which data storage components are used by Hadoop?
A. HBase is the data storage component used with Hadoop.