Hadoop for Developers – 4 Days

Course Description

Apache Hadoop is one of the most popular frameworks for processing Big Data on clusters of servers. This course introduces developers to the Hadoop ecosystem.

Requirements

Students do not need to install Hadoop on their own machines; a working Hadoop cluster will be provided. Participants need only the following:

  • SSH client – Linux and macOS already include an SSH client; PuTTY is recommended for Windows
  • Browser to access the cluster – we recommend Firefox

Prerequisites

Students should be familiar with the Java programming language (most programming exercises are in Java) and comfortable in a Linux environment (i.e., able to navigate the Linux command line and edit files using vi or nano).

Outline

  • Section 1: Introduction to Hadoop
    • Hadoop history, concepts
    • ecosystem
    • distributions
    • high-level architecture
    • Hadoop myths
    • Hadoop challenges
    • hardware / software
    • Lab: first look at Hadoop
  • Section 2: HDFS
    • Design and architecture
    • concepts (horizontal scaling, replication, data locality, rack awareness)
    • Daemons: NameNode, Secondary NameNode, DataNode
    • communications / heartbeats
    • data integrity
    • read / write path
    • NameNode High Availability (HA), Federation
    • Labs: interacting with HDFS
  • Section 3: MapReduce
    • concepts and architecture
    • daemons (MRv1): JobTracker / TaskTracker
    • phases: driver, mapper, shuffle/sort, reducer
    • MapReduce Version 1 and Version 2 (YARN)
    • Internals of MapReduce
    • Introduction to Java MapReduce programs
    • Labs: running a sample MapReduce program
  • Section 4: Pig
    • Pig vs. Java MapReduce
    • Pig job flow
    • Pig Latin language
    • ETL with Pig
    • Transformations & Joins
    • User-defined functions (UDFs)
    • Labs: writing Pig scripts to analyze data
  • Section 5: Hive
    • architecture and design
    • data types
    • SQL support in Hive
    • Creating Hive tables and querying
    • partitions
    • joins
    • text processing
    • Labs: processing data with Hive
  • Section 6: HBase
    • concepts and architecture
    • HBase vs. RDBMS vs. Cassandra
    • HBase Java API
    • Time series data on HBase
    • schema design
    • Labs: interacting with HBase using the shell; programming with the HBase Java API; schema design exercise

MindIQ
