
KAFKA CLUSTER SETUP GUIDE

Recent posts

Apache Spark & Apache KAFKA

APACHE SPARK AND KAFKA
Apache Spark is a framework that does not have its own file system, so Spark takes advantage of Apache Hadoop and YARN, the cluster resource management system that is part of the Hadoop ecosystem. Do you think Kafka and Spark are competitors? In my view Spark is different from Apache Kafka, so let us discuss the differences between Apache Spark and Apache Kafka.
1- Apache Kafka is a distributed messaging framework that can handle a large volume of messages.
2- Spark is a framework with several components that you use for big data analysis.
3- Kafka's messaging system is based on producers and consumers: a producer sends a message to a broker, and the broker broadcasts messages to multiple consumers.
4- Internally, Kafka uses socket programming.
5- Apache Spark has a Spark Streaming module where you can deal with real-time data. You can create ...
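A minimal sketch of the producer/broker/consumer flow from point 3, assuming a local broker on localhost:9092, a recent Kafka release whose scripts accept --bootstrap-server, and a topic name (demo-events) chosen here purely for illustration:

    # create a topic on the broker (assumes Kafka's bin/ scripts are on the PATH)
    kafka-topics.sh --create --topic demo-events --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

    # producer: send messages to the broker (type lines, Ctrl+C to stop)
    kafka-console-producer.sh --topic demo-events --bootstrap-server localhost:9092

    # consumer: read the same messages back from the broker
    kafka-console-consumer.sh --topic demo-events --from-beginning --bootstrap-server localhost:9092

Any process running the consumer script against the same topic receives the messages, which is the broadcast-to-multiple-consumers behaviour described above.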

Apache Spark and Apache Zeppelin Visualization Tool

Apache Spark and Apache Zeppelin
Step-1: Install and configure Apache Zeppelin: https://zeppelin.apache.org/download.html
Step-2: Extract Apache Zeppelin and move it to the /usr/lib directory.
    sudo tar xvf zeppelin-*-bin-all.tgz
    move zeppelin to the /usr/lib directory
Step-3: Install the Java Development Kit on Ubuntu and set the JAVA_HOME variable.
    echo $JAVA_HOME
Create zeppelin-env.sh and zeppelin-site.xml from the template files, open zeppelin-env.sh, and set JAVA_HOME=/path/ and SPARK_HOME=/path/ ...
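A minimal sketch of Step-2 and Step-3, assuming the archive sits in the current directory, an install target of /usr/lib/zeppelin, and a JDK under /usr/lib/jvm (these exact paths are assumptions, not from the post):

    # extract the downloaded Zeppelin archive and move it under /usr/lib
    sudo tar xvf zeppelin-*-bin-all.tgz
    sudo mv zeppelin-*-bin-all /usr/lib/zeppelin

    # create the config files from the bundled templates
    cd /usr/lib/zeppelin/conf
    sudo cp zeppelin-env.sh.template zeppelin-env.sh
    sudo cp zeppelin-site.xml.template zeppelin-site.xml

    # point Zeppelin at the JDK and Spark installations (illustrative paths)
    echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' | sudo tee -a zeppelin-env.sh
    echo 'export SPARK_HOME=/usr/lib/spark' | sudo tee -a zeppelin-env.sh

After this, bin/zeppelin-daemon.sh start launches the notebook UI, which listens on port 8080 by default.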

Hadoop distributed file System-HDFS

Hadoop Distributed File System
HDFS manages the Hadoop file system. Every operating system has a file system: in Windows, for example, NTFS and FAT32 manage metadata about your directories and files. In the same way, in HDFS the master node [namenode] manages metadata about all the files and directories present in the whole cluster. Hadoop is not a single tool; it ships with a distributed file system. When a user uploads data to HDFS, Hadoop distributes the data across multiple nodes. HDFS and MapReduce are the two base components of the Hadoop ecosystem. Let us understand the concept of the Hadoop distributed file system.
1- Hadoop works on the cluster computing concept, which has a master-slave architecture.
2- Master and slaves are machines serving a couple of services.
Master:
    1- NameNode
    2- Job Tracker
    3- Secondary NameNode
Slave:
    1- DataNode
    2- Task Tracker
    3- Child JVM
1- Job Tracker ...
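A minimal sketch, assuming a running cluster and a local file sample.txt (a hypothetical name), showing how an upload is handed to HDFS: the namenode records the metadata while the datanodes store the blocks.

    # upload a local file into HDFS; the namenode records the metadata,
    # the datanodes store the actual blocks
    hdfs dfs -mkdir -p /user/hadoop/input
    hdfs dfs -put sample.txt /user/hadoop/input/

    # list the file, then ask where its blocks landed across the cluster
    hdfs dfs -ls /user/hadoop/input
    hdfs fsck /user/hadoop/input/sample.txt -files -blocks -locations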

Apache Hadoop cluster setup guide

APACHE HADOOP CLUSTER SETUP, UBUNTU 16 64-bit
Step-1: Install the Ubuntu OS for the master and slave nodes.
    1- Install VMware Workstation 14: https://www.vmware.com/in/products/workstation-pro/workstation-pro-evaluation.html
    2- Install Ubuntu 16 OS (64-bit) for the master node using VMware.
    3- Install Ubuntu 16 OS (64-bit) for the slave node using VMware.
Step-2: Update the root password so that you can perform all admin-level operations.
    sudo passwd root   (command to set a new root password)
Step-3: Create a user from the root user for the Hadoop ecosystem.
    It is recommended to create a separate user for Hadoop to isolate the Hadoop file system from ...
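A minimal sketch of Step-2 and Step-3 on each node, assuming the dedicated user is called hadoopuser (the name is an illustration, not from the post):

    # set a new root password so admin-level operations are possible
    sudo passwd root

    # create a dedicated user for the Hadoop ecosystem
    sudo adduser hadoopuser
    sudo usermod -aG sudo hadoopuser

    # switch to the new user before installing Hadoop
    su - hadoopuser

Running Hadoop under its own user keeps the Hadoop file system separate from the other accounts on the machine, which is the isolation the step describes.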

What is Big data & Apache Hadoop.

WHAT IS BIG DATA & APACHE HADOOP Big data and Hadoop are emerging technologies in the current IT sector, and many professionals are looking for a career in data science. Apache Hadoop is a powerful framework that deals with big data. Big data is a problem! Now the question is how data can be a problem... It can, because we now have data in PBs: the Square Kilometre Array (SKA) is expected to generate 20,000 PB of data per day, and Facebook, Google, Yahoo, Twitter, and other top organizations are creating big data - huge amounts of data. We can define big data by volume, velocity, and variety, since we are no longer dealing with structured data only; we now have unstructured and semi-structured data as well, such as audio, video, text, JSON, etc. Apache Hadoop is an open source product that is a solution to the big data problem. You would not want to wait for a result the way you wait for a web response; just think what it would be like if you got your recommended YouTube videos after two days. Hadoop is not a single product; it is a framework...