Friday 7 July 2017

Features of Apache Cassandra

* Elastic scalability - Cassandra is highly scalable; it lets you add more hardware to accommodate more customers and more data as your requirements grow.

* Always on architecture - Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.

* Fast linear-scale performance - Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.

* Flexible data storage - Cassandra accommodates all possible data formats including: structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.

* Easy data distribution - Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers (a keyspace replication example is sketched after this list).

* Transaction support - Cassandra offers atomicity, isolation, and durability for writes within a partition, plus tunable consistency and lightweight (compare-and-set) transactions; it does not provide full multi-row ACID transactions in the relational sense.

* Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.
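
For instance, multi-data-center replication is configured per keyspace. Below is a minimal sketch using cqlsh; the keyspace name and the data center names DC1/DC2 are placeholders and must match your cluster's snitch configuration.

# 'demo', 'DC1', and 'DC2' are placeholder names
cqlsh -e "CREATE KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};"
cqlsh -e "DESCRIBE KEYSPACE demo;"

This asks Cassandra to keep 3 replicas of every row in DC1 and 2 replicas in DC2.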

Advantages of Apache Cassandra

* It is scalable, fault-tolerant, and consistent.

* It is a column-oriented database.

* Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.

* Created at Facebook, it differs sharply from relational database management systems.

* Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.

* Cassandra is being used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.

What is Apache Cassandra

What is Apache Cassandra?

* Apache Cassandra is an open source, distributed, and decentralized storage system (database) for managing very large amounts of structured data spread out across the world.

* It provides highly available service with no single point of failure.

What is MongoDB

MongoDB:

* MongoDB is a cross-platform, document-oriented database system that avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.
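
As a quick illustration of the dynamic-schema idea, documents in the same collection can carry different fields. A minimal sketch using the mongo shell; the collection name and sample fields are made up for illustration, and a local MongoDB instance is assumed.

# 'users' collection and its fields are hypothetical
mongo --eval 'db.users.insertOne({name: "alice", roles: ["admin", "dev"]})'
mongo --eval 'db.users.insertOne({name: "bob", age: 30, address: {city: "Pune"}})'
mongo --eval 'printjson(db.users.find({name: "alice"}).toArray())'

No schema change is needed before inserting the second, differently shaped document.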

What is Apache HBase

Apache HBase :

* HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java.

* It is developed as part of the Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.
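
A minimal sketch of the column-family model using the HBase shell; the table, column family, and values are made up for illustration, and a running HBase installation is assumed.

# 'employee' table and 'personal' column family are hypothetical
echo "create 'employee', 'personal'" | hbase shell
echo "put 'employee', 'row1', 'personal:name', 'alice'" | hbase shell
echo "get 'employee', 'row1'" | hbase shell

Here 'personal' is a column family; individual columns such as personal:name can be added on the fly without altering the table.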

Differences between NoSQL and relational databases

NoSQL vs. Relational Database

The following points differentiate a relational database from a NoSQL database.

Relational Database:

1. Supports a powerful query language.

2. It has a fixed schema.

3. Follows ACID (Atomicity, Consistency, Isolation, and Durability).

4. Supports transactions.

NoSql:

1. Supports very simple query language.

2. No fixed schema.

3. It is typically only “eventually consistent”.

4. Generally does not support multi-row ACID transactions.

Tuesday 4 July 2017

What is HAL or Hardware Abstraction Layer

* The HAL (hardware abstraction layer) is used by the operating system to interact with the hardware.

* It is implemented as a DLL called HAL.dll.

* What HAL.dll does is provide the layer your operating system needs to interact with the hardware.

* The problem with HAL.dll was that when you restored your operating system, you needed identical hardware for the restore to succeed; this was a huge drawback of HAL.dll.

* In newer operating systems such as Windows 10, this dependency on a machine-specific HAL.dll is removed.

* Now when you restore your operating system, the existing hardware or the new hardware on which the operating system is being restored is detected at boot.

* To configure this dynamic HAL detection, you use a tool called BCDEdit:

Bcdedit /set {current} detecthal yes

Sunday 2 July 2017

What is the Data Model in CouchDB

Data Model

* Database is the outermost data structure/container in CouchDB.

* Each database is a collection of independent documents.

* Each document maintains its own data and self-contained schema.

* Document metadata contains revision information, which makes it possible to merge any differences that occurred while the databases were disconnected.

* CouchDB implements multi-version concurrency control (MVCC) to avoid the need to lock the database file during writes.
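
A minimal sketch of this structure using CouchDB's HTTP API with curl; it assumes CouchDB is listening on the default port 5984 and that you have permission to create databases, and the database and document names are made up for illustration.

# demo_db and doc1 are placeholder names
curl -X PUT http://127.0.0.1:5984/demo_db
curl -X PUT http://127.0.0.1:5984/demo_db/doc1 -H 'Content-Type: application/json' -d '{"name": "alice", "role": "admin"}'
curl http://127.0.0.1:5984/demo_db/doc1

The GET response carries the document's _id and _rev fields; the _rev revision token is what CouchDB's MVCC uses to detect conflicting updates instead of locking.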

Types of NoSQL DB

NoSQL databases are classified into three types, which are explained below.

1. Key-value Store − These databases are designed for storing data in key-value pairs and these databases will not have any schema. In these databases, each data value consists of an indexed key and a value for that key.

Examples − BerkeleyDB, Cassandra, DynamoDB, Riak.

2. Column Store − In these databases, data is stored in cells grouped in columns of data, and these columns are further grouped into Column families. These column families can contain any number of columns.

Examples − BigTable, HBase, and HyperTable.

3. Document Store − These are the databases developed on the basic idea of key-value stores where "documents" contain more complex data. Here, each document is assigned a unique key, which is used to retrieve the document. These are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data.

Examples − CouchDB and MongoDB.

What is a NoSQL DB

NoSQL Databases

* A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases.

* These databases are schema-free, support easy replication, have simple APIs, are eventually consistent, and can handle huge amounts of data (big data).

* The primary objectives of a NoSQL database are the following −

1. Simplicity of design,
2. Horizontal scaling, and
3. Finer control over availability.

Advantages of CouchDB for Big Data

Why CouchDB?

* CouchDB has an HTTP-based REST API, which makes it easy to communicate with the database. The simple structure of HTTP resources and methods (GET, PUT, DELETE) is easy to understand and use.

* As we store data in the flexible document-based structure, there is no need to worry about the structure of the data.

* Users are provided with powerful data mapping, which allows querying, combining, and filtering the information.

* CouchDB provides easy-to-use replication, using which you can copy, share, and synchronize the data between databases and machines.
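
As an illustration of the replication point, a one-off replication can be triggered through the same HTTP API. This is only a sketch; the database names and the remote host are placeholders.

# database names and remote host are placeholders
curl -X POST http://127.0.0.1:5984/_replicate -H 'Content-Type: application/json' -d '{"source": "demo_db", "target": "http://remote-host:5984/demo_db", "create_target": true}'

CouchDB then copies the documents from the local demo_db to the remote one, and the same call can be run again later to synchronize new changes.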

What is CouchDB

What is CouchDB?

* CouchDB is an open source database developed by the Apache Software Foundation. The focus is on ease of use, embracing the web. It is a NoSQL document store database.

* It uses JSON to store data (documents), JavaScript as its query language to transform the documents, and HTTP as the API protocol to access the documents and query the indices from a web browser.

Thursday 29 June 2017

What is Avro Serialization

* Serialization is the process of translating data structures or object state into a binary or textual form so the data can be transported over a network or stored on some persistent storage.

* Once the data is transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is also termed marshalling, and deserialization is termed unmarshalling.

Features of Avro

Features of Avro

* Avro is a language-neutral data serialization system.

* It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

* Avro creates binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.

* Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.

* Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.

* Avro creates a self-describing file named Avro Data File, in which it stores data along with its schema in the metadata section.

* Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.

What is Avro

* Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

* Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes the data which has a built-in schema. Avro serializes the data into a compact binary format, which can be deserialized by any application.

* Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
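
A minimal sketch of such a JSON schema, together with the avro-tools command-line utility for turning JSON records into an Avro data file; the schema, file names, and the avro-tools jar version are assumptions for illustration.

# hypothetical schema and file names; adjust the avro-tools jar version to the one you have
cat > user.avsc << 'EOF'
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
EOF
echo '{"name": "alice", "age": 30}' > user.json
java -jar avro-tools-1.8.2.jar fromjson --schema-file user.avsc user.json > user.avro
java -jar avro-tools-1.8.2.jar getschema user.avro

The last command prints the schema embedded in the Avro data file, which is what makes the file self-describing.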

Thursday 22 June 2017

How to stop SCCM from installing Files on System or C drive

* SCCM by default installs site server role components on the first available NTFS drive.

* To prevent files from being installed on a drive, we need to create a magic file on that drive so that SCCM understands it should ignore it.

* The magic file name is "no_sms_on_drive.sms".

* You need to create this empty file and place it on the drive which you want SCCM to ignore.
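
For example, to protect the D: drive you could create the empty file from PowerShell (D: is just an assumed example drive here):

# D: is an example drive; use the drive you want SCCM to ignore
New-Item -Path 'D:\no_sms_on_drive.sms' -ItemType File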

Monday 19 June 2017

Types of Big Data

There are 3 types of Big Data:

1. Structured Data:
Data which can be stored and processed in a table (rows and columns) format is called structured data. Structured data is relatively simple to enter, store, and analyze.
Example - Relational database management systems.

2. Unstructured Data:
Data with an unknown form or structure is called unstructured data. The fundamental challenge of unstructured data sources is that they are difficult for non-technical users and data analysts to understand and process.
Example - Text files, images, videos, email, customer service interactions, web pages, PDF files, PPTs, social media data, etc.

3. Semi-structured Data:
Semi-structured data is data that is neither raw data nor organized in a relational model like a table. It may be organized in a tree pattern, which is easier to analyze in some cases. XML and JSON documents are semi-structured documents.

What is Sqoop and why is it used?

Sqoop:

Sqoop is an open source framework provided by Apache. It is a command-line interface application for efficiently transferring bulk data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. 

Sqoop Import:
The import tool imports individual tables from an RDBMS to HDFS. Each row in a table is treated as a record in HDFS. 

Sqoop Export:
The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which are called rows in the table.

Why Sqoop is used?
For Hadoop developers, the real work starts after data is loaded into HDFS. To get there, the data residing in relational database management systems needs to be transferred to HDFS and might need to be transferred back to relational database management systems. Developers can always write custom scripts to transfer data in and out of Hadoop, but Apache Sqoop provides an alternative.
Sqoop uses the MapReduce framework to import and export the data, which provides parallelism as well as fault tolerance. Sqoop makes developers' lives easier by providing a command-line interface. Developers just need to provide basic information like the source, destination, and database authentication details in the sqoop command, as in the sketch below. 
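
A minimal sketch of such commands; the MySQL connection string, credentials, table names, and HDFS paths are placeholders for illustration.

# connection string, credentials, table names, and paths are placeholders
sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser --password dbpass --table orders --target-dir /user/hadoop/orders -m 4
sqoop export --connect jdbc:mysql://dbhost/sales --username dbuser --password dbpass --table orders_copy --export-dir /user/hadoop/orders

The first command imports the orders table into HDFS using 4 parallel map tasks; the second exports the HDFS files back into a relational table.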

Sqoop Connectors:
All the existing Database Management Systems are designed with SQL standard in mind. However, each DBMS differs with respect to dialect to some extent. So, this difference poses challenges when it comes to data transfers across the systems. Sqoop Connectors are components which help overcome these challenges. Data transfer between Sqoop and external storage system is made possible with the help of Sqoop's connectors. Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java's JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.

Architecture of Hadoop

Architecture of Hadoop:

Hadoop follows a master-slave architecture: there is one master machine and multiple slave machines. The data that you give to Hadoop is stored across these machines in the cluster. 

Two important components of Hadoop are:

1. HDFS (data storage)
2. MapReduce (analyzing and processing)

1. HDFS:
Hadoop Distributed File System (HDFS) is a distributed file system used to store very large amounts of data. HDFS follows a master-slave architecture: there is one master machine (NameNode) and multiple slave machines (DataNodes). The data that you give to Hadoop is stored across these machines in the cluster. 

Various components of HDFS are:

a) Blocks:
HDFS is a block-structured file system in which each file is split into blocks of equal size and stored across one or more machines in a cluster. HDFS blocks are 64 MB by default in Apache Hadoop 1.x and 128 MB by default in Cloudera Hadoop, but the block size can be increased as needed. If a file in HDFS is smaller than the block size, it does not occupy a full block. Example - If the file size is 10 MB and the HDFS block size is 128 MB, the file takes only 10 MB of space.
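
For instance, you can override the block size for a single upload and then see how HDFS split the file into blocks. This is only a sketch; the file name and paths are placeholders, and dfs.blocksize is the Hadoop 2.x property name (dfs.block.size in 1.x).

# bigfile.csv and /data are placeholders; 134217728 bytes = 128 MB
hdfs dfs -D dfs.blocksize=134217728 -put bigfile.csv /data/bigfile.csv
hdfs fsck /data/bigfile.csv -files -blocks

The -D option sets a 128 MB block size just for this copy, and fsck lists each block of the file.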

b) Name Node:
The NameNode is the controller/master of the system. It decides where data is placed on the DataNodes and stores the metadata of all the files in HDFS. This metadata includes the file name, the location of each block, the block size, and the file permissions.

c) Data Node-
The DataNodes are where the actual data is stored. Data written to HDFS is stored on the DataNodes. They store and retrieve blocks when requested by the client or the NameNode, and they perform operations such as block creation, deletion, and replication as instructed by the NameNode.

d) Secondary Name Node-
Many people think that the Secondary NameNode is just a backup of the primary NameNode in Hadoop, but it is not a backup node. The NameNode is the primary node, and it stores all the metadata in the fsimage and edit log files. The Secondary NameNode periodically reads these fsimage and edit log files (it has read access to them, not write access), merges them in its own working folder, and copies the merged checkpoint back to the NameNode, which then updates its fsimage and edit log files. 
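
To see this division of roles on a running cluster, the NameNode can be asked for a summary of the DataNodes it is tracking; the command below assumes your HDFS client is configured to reach the cluster.

# prints the NameNode's summary of all DataNodes
hdfs dfsadmin -report

The report shows the NameNode's view of the cluster: configured capacity, remaining space, and the list of live and dead DataNodes.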

2. Map reduce:
MapReduce is a framework and processing technique with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. MapReduce programs are written in Java by default, but we can also use other languages and tools such as Apache Pig.
The MapReduce algorithm contains two important tasks.

a) Map-
In the map stage, the mapper takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

b) Reduce-
In the reduce phase, the reducer takes the output from a map as input and combines those data tuples into a smaller set of tuples. The reduce job is always performed after the map job, as shown in the word-count sketch below.
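
A quick way to see a map and a reduce step end to end is the word-count example that ships with Hadoop. This is a sketch; the jar path and the HDFS input/output directories are placeholders and vary by distribution.

# jar path and HDFS directories are placeholders
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000

The mappers emit (word, 1) pairs for every word in the input files, and the reducers sum the counts per word into the output directory.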

Components of MapReduce: 

a) JobTracker:
The job tracker is a daemon that runs on the namenode machine. There is only one job tracker per cluster, but many task trackers on the data nodes. It assigns tasks to the different task trackers. It is a single point of failure: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task has completed or not.
JobTracker process: 
1. JobTracker receives the requests from the client.
2. JobTracker talks to the NameNode to determine the location of the data.
3. JobTracker finds the best TaskTracker nodes to execute tasks.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals then work is scheduled on a different TaskTracker.
6. When the work is completed, the JobTracker updates its status and reports the overall status of the job back to the client.
7. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. 

b) TaskTracker:
The task tracker is also a daemon, and it runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job and divides the work amongst different task trackers to perform MapReduce tasks. While performing this work, each task tracker simultaneously communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster. 

TaskTracker process: 
1. The JobTracker submits the work to the TaskTracker nodes.
2. TaskTrackers run the tasks and report the status of each task to the JobTracker. 
3. Its function is to follow the orders of the JobTracker and update the JobTracker with its progress status periodically.
4. TaskTracker will be in constant communication with the JobTracker.
5. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will assign the task to another node.

Characteristics of Hadoop

Characteristics of Hadoop:

1. Robust: 
Handles hardware failure, as data is stored on multiple nodes. 

2. Scalable:
The cluster size can be increased by adding more nodes.

3. Simple:
We can write parallel code. Hadoop focuses on moving code to the data rather than moving the data.

4. Portable:
Analyzes structured, semi-structured, and unstructured data. (Structured - in table format; semi-structured - not in table format but in a well-organised format such as XML or JSON; unstructured - no fixed format, e.g., text, images, videos.)

5. Cost Effective:
Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional RDBMS.

6. Fault Tolerance:
If any node fails, its tasks are automatically redirected to other nodes. Multiple copies of all data are stored automatically, so even if one node fails, the same data is available on another node.

What are Daemons in Hadoop and the types of daemons

Daemons:
Daemons are resident programs that together make up a running Hadoop installation. In computing terms, a daemon is a process that runs in the background.
Hadoop has five such daemons:
1. NameNode
2. Secondary NameNode
3. DataNode
4. JobTracker
5. TaskTracker

Each daemon runs separately in its own JVM. We discuss these daemons in this post as they are associated with HDFS and MapReduce.
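
On a running single-node installation you can check which daemon JVMs are up with the JDK's jps tool (daemon names differ between Hadoop 1.x, described in this post, and YARN-based clusters):

# lists the Java daemon processes running on this node
jps

On a Hadoop 1.x pseudo-distributed node this typically lists NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker, each in its own JVM.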

Is Hadoop a Database ?

Hadoop is not a database: Hadoop is a distributed file system plus a processing framework, not a database. It simply uses the file system provided by Linux to store data.

What is Hadoop

What is Hadoop?

Apache Hadoop is an open-source Java framework that is used to store, analyze, and process Big Data in a distributed environment across clusters of computers. It is used by Google, Facebook, Yahoo, YouTube, Twitter, LinkedIn, and many more. Hadoop is a top-level Apache project developed by Doug Cutting. It works as a distributed system and is inspired by Google's MapReduce algorithm for running distributed applications. Hadoop can be a good fit if you have a lot of data that you do not yet know what to do with and do not want to lose. 
Hadoop was initiated and led by Yahoo. Hadoop is written in the Java programming language, with some code in C and some commands in shell scripts. Companies that offer Hadoop services include IBM, Amazon Web Services, Cloudera, Microsoft Azure, and Dell. 

3V's of Big Data

3V's of Big Data:

1. Volume:
The amount of data we deal with is very large, on the order of petabytes. 

2. Variety:
Data comes in all types of formats (text, audio, images, video).

3. Velocity:
Data is generated at a very fast rate. Velocity is the measure of how fast the data is coming in. For time-critical applications, faster processing is very important. Example - share markets, video streaming.

Applications of Big Data

Application of Big data:

Some applications of big data are as follows:

1. Healthcare Providers: 
Big data is used in the field of medicine and healthcare. It is a great help for physicians to keep track of each patient's history. 

2. Google Search: 
When we search for anything, Google uses data science algorithms to deliver the best results for our query in a fraction of a second. The next time we search for something, Google also gives us recommendations based on our previous searches. 

3. Education: 
Big data has a great influence in the education world too. Today almost every course of learning is available online. Along with online learning, there are many other examples of the use of big data in the education industry. 

4. Recommender Systems:
A lot of companies use recommender systems to promote their products and suggestions in accordance with a user's interests and the relevance of information. Internet giants like Amazon, Google, Flipkart, and many more use this system to improve the user experience. The recommendations are made based on a user's previous search results. 
Example - When we search for any product on Amazon, we always get recommendations for similar products. These not only help you find relevant products from the billions of products available, but also add a lot to the user experience. 

5. Banking Zones and Fraud Detection:
Big data is hugely used for fraud detection in the banking sector. It helps find fraudulent activity: it detects the misuse of credit and debit cards, supports business clarity, and powers public analytics for business and IT strategy fulfillment. 

6. Super Market:
Big data analysis is also used in supermarkets for market basket analysis. Market basket analysis is one of the most common and useful types of data analysis for marketing and retailing. Its purpose is to determine which products customers purchase together. A store could use this information to place products frequently sold together in the same area. 
Example - People who buy bread also buy butter, or people who buy shampoo might also buy conditioner. 

7. Security Enforcement:
Big data is applied to improving national security enforcement. These techniques are used to detect and prevent cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity. 

Advantages of Big Data

Advantages of Big Data:

1. Access to large volumes of data.
2. Allows businesses to develop more effective strategies against competitors in less time.
3. Improves decision-making capabilities.
4. Makes data easier to analyze.
5. Allows businesses to detect errors and fraud quickly. 
6. Offers businesses a chance to improve profits and customer service.
7. Integration of both structured and unstructured data.
8. Helps in implementing new strategies and improving service dramatically.

Big Data and Hadoop - Introduction

What is Big Data?

The process of storing and analysing data to make sense of it for the organization is called big data. In simple terms, data which is very large in size and yet growing exponentially with time is called big data. 

Why we need Big Data?

For any application that contains a limited amount of data, we normally use SQL databases such as PostgreSQL, Oracle, or MySQL, but what about large applications like Facebook, Google, and YouTube? Their data is so large and complex that no traditional data management system is able to store and process it. 

Facebook generates 500+ TB of data per day as people upload images, videos, posts, etc. Similarly, sending text/multimedia messages, updating Facebook/WhatsApp statuses, comments, etc. generates huge amounts of data. If we use traditional data processing applications (SQL/Oracle/MySQL) to handle it, it will lead to a loss of efficiency. So, in order to handle the exponential growth of data, data analysis becomes a required task. To overcome this problem, we use big data tools. Big data includes both structured and unstructured data. 
Traditional data management systems and existing tools face difficulties processing such big data. R is one of the main computing tools used in statistical education and research, and it is also widely used for data analysis and numerical computing in scientific research.

Where does Big Data come from?

1. Social data: This could be data coming from social media services, such as Facebook likes, photo and video uploads, comments, tweets, and YouTube views.
2. Share market: Stock exchanges generate huge amounts of data through their daily transactions.
3. E-commerce sites: E-commerce sites like Flipkart, Amazon, and Snapdeal generate huge amounts of data.
4. Airplanes: A single airplane can generate 10+ TB of data in 30 minutes of flight time. 

What is the need for storing such huge amount of data?

The main reason for storing data is analysis. Data analysis is a process used to clean, transform, and remodel data with a view to reaching a conclusion for a given situation. More accurate analysis leads to better decision making, and better decision making leads to increased efficiency and reduced risk.

Example-
1. When we search for anything on e-commerce websites (Flipkart, Amazon), we get recommendations for products related to what we searched. These websites analyze the data we enter and then display the related products accordingly.
Example - When we search for a smartphone, we get recommendations to buy back covers, screen guards, etc.

2. Similarly, why does Facebook store our images and videos? The reason is advertising.
There are two types of marketing -
a) Global marketing - Show advertisements to all users.
b) Target marketing - Show advertisements to particular groups/people. In target marketing, Facebook analyses its data and shows advertisements to selected people. 
Example - If an advertiser wants to advertise a cricket kit and wants to show that advertisement only to interested people, Facebook keeps a record of all the people who are members of cricket groups or post anything related to cricket, and displays the advertisement to them.

Big Data and Hadoop Interview Questions

Big Data and Hadoop Interview Questions

Q. What is a Big Data? 
A. Big data is a term that describes large volumes of data that are very difficult to capture, store, process, retrieve, and analyze with the help of traditional database management tools or data processing techniques.

Q. What are the characteristics of Big Data? 
1. Volume - Organizations collect data from a variety of sources, including social media, share markets, airplanes, and e-commerce websites.
2. Variety - The type and nature of the data (audio, images, video).
3. Velocity - The speed at which the data is generated is very high.

Q. How is analysis of Big Data useful for organizations? 
A. The major goal of Big Data analysis is better decision making. Organizations learn which areas to focus on and which areas are less important. It provides early indicators that can prevent the company from suffering a huge loss. Big data can be analyzed with the help of software tools such as data mining tools, text analytics, statistical methods, mainstream BI software, and visualization tools.

Q. What are the challenges in handling big data? 
1. Difficulties - capture, storage, search, sharing, analytics
2. Data storage - physical storage, acquisition, space and power cost
3. Data processing - information and content management.

Q. What is the basic difference between traditional RDBMS and Hadoop? 
A. An RDBMS is a traditional row-and-column database used for transactional systems to report and archive data, whereas Hadoop is an approach to store huge amounts of data in a distributed file system and process them. Hadoop can handle bigger data than a relational DB. An RDBMS works on structured data, whereas Hadoop also works on unstructured data.

Q. What is Hadoop? 
A. When Big Data emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services or tools to store and process Big Data. Hadoop is used by major players including Google,IBM,Yahoo.

Q. In what format does hadoop handle data? 
A. Hadoop handles data in key/value format.

Q. Why files are stored in redundant manner in HDFS? 
A. To ensure durability against failure.

Q. What is HDFS ? 
A. HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows master and slave architecture.

Q. What is a Namenode? 
A. The name node is the master of the system. It should run on highly reliable hardware, and it is a single point of failure in HDFS. The namenode holds the metadata for HDFS, like namespace information, block information, etc.

Q.What happens when datanode fails? 
A. When a datanode fails: first, the namenode detects the failure; then the tasks that were running on the failed datanode are re-scheduled, and the JobTracker assigns those tasks to another datanode.

Q.What is a Datanode? 
A. The data node is where the actual data is stored. Data written to HDFS is stored on the data nodes.

Q.What is the default block size in hdfs? 
A. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).

Q.What is a MapReduce? 
A. MapReduce is the heart of Hadoop. It is a programming paradigm that processes large data sets across hundreds or thousands of servers in a Hadoop cluster. It is a framework with which we can write applications to process huge amounts of data in parallel.

Q.How many daemon processes run on a hadoop cluster? 
A. Hadoop comprises five separate daemons. The NameNode, Secondary NameNode, and JobTracker run on the master node, while the DataNode and TaskTracker run on the slave nodes.

Q. What is a job tracker?
A. The job tracker is a daemon that runs on the namenode machine. It assigns tasks to the different task trackers. There is only one job tracker per cluster, but many task trackers on the data nodes. It is a single point of failure: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task has completed or not.

Q. What is a task tracker? 
A. The task tracker is also a daemon, and it runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job and divides the work amongst different task trackers to perform MapReduce tasks. While performing this work, each task tracker simultaneously communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.

Q. What is a heartbeat in HDFS? 
A. A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the namenode, and a task tracker sends heartbeats to the job tracker. If the namenode or job tracker does not receive a heartbeat, they conclude that there is a problem with the datanode or task tracker and that it is unable to perform the assigned tasks.

Q. What is a block in HDFS? 
A block is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB (128 MB in Hadoop 2.x). Files in HDFS are broken down into block-sized chunks, which are stored as independent units.

Q. Are task trackers and job tracker present in separate machines? 
A. Yes, the task trackers and the job tracker can be present on different machines. The job tracker runs on the name node machine, while task trackers run on the data nodes. The job tracker is a single point of failure for the Hadoop MapReduce service.

Q. What is a Secondary Namenode? 
A. The NameNode is the primary node, and it stores all the metadata in the fsimage and edit log files. The Secondary NameNode is not a backup of the NameNode: it periodically reads the fsimage and edit log files (it has read access only, not write access), merges them in its own working folder, and copies the merged checkpoint back to the NameNode, which then updates its fsimage and edit log files.

Q. Name some companies that use Hadoop? 
A. Some companies that use Hadoop are Yahoo (one of the biggest users and a major code contributor to Hadoop), Facebook, Cloudera, Amazon, eBay, Twitter, IBM, etc.

Q. Differentiate between Structured, Unstructured and Semi-structured data? 
A. Data which can be stored in traditional database systems in the form of rows and columns can be referred to as structured data. Data which can only be partially stored in traditional database systems, for example data in XML records, can be referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured data is referred to as unstructured data. Text files, images, videos, email, customer service interactions, web pages, PDF files, etc. are all examples of unstructured data.

Q. What are the main components of a Hadoop Application?
A. Core components of a Hadoop application are-
1. Hadoop Common
2. HDFS
3. Hadoop MapReduce
4. YARN

Q. What is the port number for NameNode, Task Tracker and Job Tracker? 
1. NameNode-50070
2. Job Tracker-50030
3. Task Tracker-50060

Q. What happens when a user submits a Hadoop job when the NameNode is down. Does the job get in to hold or does it fail? 
A. The Hadoop job fails when the NameNode is down.

Q. Can Hadoop handle streaming data?
A. Yes, through Technologies like Apache Kafka, Apache Flume, and Apache Spark it is possible to do large-scale streaming.

Q. What platform and Java version is required to run Hadoop?
A. Java 1.6.x or a higher version is required to run Hadoop. Linux and Windows are the supported operating systems for Hadoop.

Q. What kind of Hardware is best for Hadoop?
A. The best configuration for executing Hadoop jobs is dual core machines or dual processors with 4GB or 8GB RAM that use ECC memory.

Q. What are the most common input formats defined in Hadoop?
1. Key Value Input Format
2. Text Input Format
3. Sequence File Input Format

Q. What happens when a data node fails?
A. If a data node fails, the job tracker and the name node detect the failure. The tasks that were running on the failed node are then re-scheduled on other nodes, and the name node re-replicates that node's data to another node.

Q. Is it necessary to know java to learn Hadoop?
A. A background in any programming language like Java, C, C++, or PHP can be really helpful, but if you know no Java at all, it is necessary to learn Java and also get a basic knowledge of SQL.

Q. Which data storage components are used by Hadoop?
A. HBase is the data storage component used with Hadoop.

Sunday 28 May 2017

Configure PSTN Conferencing for Office 365 Skype User - Step by Step Guide

Configure PSTN Conferencing for Office 365 Skype User - Step by Step Guide


  • The PSTN Conferencing facility allows users to dial in to a Skype conference without the need to have internet access. Users can dial in from their traditional phone and attend the Skype conference.
  • Today we will see the step-by-step guide to achieve it.
Step by Step:

  • Login to Office 365 portal as tenant administrator.



Wednesday 26 April 2017

Install Bash Shell on Windows to Execute Linux Commands from Windows

Install Bash Shell on Windows to Execute Linux Commands from Windows


  • To install Bash, your PC should be running a 64-bit version of Windows 10 Anniversary Update build 14393 or later
  • Now Open Settings -> Update and Security -> For developers
  • Select the Developer Mode radio button

  • Then from Start, search for "Turn Windows features on or off" (type 'turn')
  • Select Windows Subsystem for Linux (beta)

  • Click OK
  • Now Open a PowerShell prompt as administrator and run:
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux


  • Restart your computer when prompted

Install Azure CLI 2.0 on Apple mac OS Step by Step Tutorial

Install Azure CLI 2.0 on Apple mac OS Step by Step Tutorial


  • To install Azure CLI on your macOS, type the following command:
curl -L https://aka.ms/InstallAzureCli | bash

  • Then restart your command shell for some changes to take effect.
exec -l $SHELL


Create Linux VM Using Powershell in Azure

Create Linux VM Using Powershell in Azure


  • Azure is the leading cloud platform today.
  • A lot of customers have a hybrid setup with the cloud.
  • With PowerShell now so popular, a lot of companies have deployed Azure CLI 2.0 to extend their command-line skills to the cloud.
  • In order to create a Linux VM in the cloud using Azure CLI 2.0 installed on your system, use this command:

az vm create -n LINUX01 -g LinuxGroup --image UbuntuLTS

  • This will create the Linux Machine called "LINUX01" with Ubuntu OS

Saturday 1 April 2017

What is Row Level Security or RLS

1. Row Level Security is a new feature in SQL 2016.

2. It gives access control at the row level in the database.

3. Now you can store information for different customers in the same table using RLS.

4. This is commonly achieved by storing the user ID in SESSION_CONTEXT and referencing it from a filter predicate (a sketch follows below).

5. The predicate acts as a filter so that a user sees only the rows assigned to their user ID.
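
A minimal sketch of how these pieces fit together, run here through sqlcmd; the database, table, function, and policy names are made up for illustration, and the table is assumed to have a UserID column.

# hypothetical database, table, function, and policy names; run from a shell with sqlcmd on the PATH
sqlcmd -S localhost -d SalesDb -Q "CREATE FUNCTION dbo.fn_userFilter(@UserID int) RETURNS TABLE WITH SCHEMABINDING AS RETURN SELECT 1 AS allowed WHERE @UserID = CAST(SESSION_CONTEXT(N'UserID') AS int);"
sqlcmd -S localhost -d SalesDb -Q "CREATE SECURITY POLICY dbo.CustomerPolicy ADD FILTER PREDICATE dbo.fn_userFilter(UserID) ON dbo.Customers WITH (STATE = ON);"
sqlcmd -S localhost -d SalesDb -Q "EXEC sp_set_session_context @key = N'UserID', @value = 42; SELECT * FROM dbo.Customers;"

The final SELECT returns only the rows whose UserID matches the value stored in SESSION_CONTEXT for that session.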

Wednesday 1 March 2017

SCCM Query find All Server count with OS Name

SCCM Query find All Server count with OS Name



  • Recently I was asked for a count of servers by OS.
  • It was quite challenging, as we need to understand the SCCM structure very well to write the SQL queries.
  • Following is the query to find servers by OS name and the count of servers:
SELECT VR.operatingSystem0,COUNT(VR.operatingSystem0) FROM v_R_System AS VR
WHERE VR.operatingSystem0 LIKE '%Windows Server%'
GROUP BY VR.operatingSystem0

Tuesday 21 February 2017

SQL server is blank while adding Additional Management Server in SCOM 2012 R2

SQL server is blank while adding Additional Management Server in SCOM 2012 R2


  • I faced a weird issue today. As I was trying to add an additional SCOM management server, the SQL page where we select the OperationsManager DB was blank.
  • After some investigation I found that the management server was already showing in the SCOM management server view as Unmonitored.
  • Once I deleted it and tried again, it worked and I was able to see the OperationsManager DB.

Saturday 28 January 2017

What is Just Enough Administration or JEA in Windows 2016

What is Just Enough Administration or JEA in Windows 2016


  • JEA is a new kind of administration model in Windows 2016.
  • It is a role-based administration model where you can fine-tune custom admin settings.
  • To understand more, look at the video below.

Video:

Tuesday 17 January 2017

Install a Linux SQL Server VM in Azure Step by Step Tutorial

 Install a Linux SQL Server VM in Azure Step by Step Tutorial



To Create a Linux VM with SQL Server installed

  • Open the Azure portal.
  • Click New on the left.
  • In the New blade, click Compute.
  • Click See All next to the Featured Apps heading.
  • In the search box, type SQL Server vNext, and press Enter to start the search.


  • Select a SQL Server vNext image from the search results.
  • Click Create.
  • On the Basics blade, fill in the details for your Linux VM.
  • Click OK.
  • On the Size blade, choose a machine size. For development and functional testing, we recommend a VM size of DS2 or higher. For performance testing, use DS13 or higher.

  • Click Select.
  • On the Settings blade, you can make changes to the settings or keep the default settings.
  • Click OK.
  • On the Summary page, click OK to create the VM.

Install SQL Server tools on Red Hat Enterprise Linux Step by Step tutorial

Install SQL Server tools on Red Hat Enterprise Linux


  • Enter superuser mode.

sudo su


  • Download the Microsoft Red Hat repository configuration file.


curl https://packages.microsoft.com/config/rhel/7/prod.repo > /etc/yum.repos.d/msprod.repo


  • Exit superuser mode.


exit


  • Run the following commands to install 'mssql-tools' with the unixODBC developer package.


sudo yum update
sudo yum install mssql-tools unixODBC-utf16-devel

Uninstall SQL Server on Red Hat Enterprise Linux Step by Step Tutorial

Uninstall SQL Server on Red Hat Enterprise Linux Step by Step Tutorial


  • In order to remove the mssql-server package, follow these steps:

  • Run the remove command. This will delete the package and remove the files under /opt/mssql/. However, this command will not affect user-generated and system database files, which are located under /var/opt/mssql.

sudo yum remove mssql-server

  • Removing the package will not delete the generated database files. If you want to delete the database files use the following command:

sudo rm -rf /var/opt/mssql/

Upgrade SQL Server on Red Hat Enterprise Linux Step by Step Tutorial

Upgrade SQL Server on Red Hat Enterprise Linux Step by Step Tutorial


  • In order to upgrade the mssql-server package, execute the following command:

sudo yum update mssql-server


  • This command will download the newest package and replace the binaries located under /opt/mssql/. User-generated databases and system databases will not be affected by this operation.

Install SQL Server on Red Hat Enterprise Linux 7.3 Step by Step Tutorial

Install SQL Server on Red Hat Enterprise Linux 7.3 Step by Step Tutorial


Video:



  • You need at least 3.25GB of memory to run SQL Server on Linux. 
  • To install the mssql-server package on RHEL, follow these steps:
  • Enter superuser mode.

sudo su

  • Download the Microsoft SQL Server Red Hat repository configuration file:

curl https://packages.microsoft.com/config/rhel/7/mssql-server.repo > /etc/yum.repos.d/mssql-server.repo

  • Exit superuser mode.

exit

  • Run the following commands to install SQL Server:

sudo yum install -y mssql-server

  • After the package installation finishes, run the configuration script and follow the prompts. Make sure to specify a strong password for the SA account (Minimum length 8 characters, including uppercase and lowercase letters, base 10 digits and/or non-alphanumeric symbols).

sudo /opt/mssql/bin/sqlservr-setup

  • Once the configuration is done, verify that the service is running:

systemctl status mssql-server