Hadoop for Administrators – 3 to 4 Days

Course Description

Apache Hadoop is one of the most widely used frameworks for processing Big Data on clusters of servers. In this course, attendees will learn about the business benefits and use cases for Hadoop and its ecosystem, how to plan cluster deployment and growth, and how to install, maintain, monitor, troubleshoot, and optimize Hadoop.

What You Will Learn

  • Hadoop & Big Data
  • Installing Hadoop
  • Managing and Monitoring Hadoop
  • Loading data into HDFS
  • Managing the Hadoop ecosystem
  • Securing Hadoop

Intended Audience

Hadoop administrators

Format: 60% lecture, 40% hands-on labs

Lab Environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following:

Prerequisites

  • Comfortable with basic Linux system administration
  • Basic scripting skills
  • Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained in the course.

Outline

  1. Scala primer
    • A quick introduction to Scala
    • Labs : Getting to know Scala
  2. Spark Basics
    • Big Data, Hadoop, Spark
    • Spark concepts and architecture
    • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
    • Labs : Installing and running Spark
  3. First Look at Spark
    • Spark shell
    • Spark web UIs
    • Analyzing dataset – part 1
    • Labs: Spark shell exploration
  4. RDDs (condensed coverage)
    • RDD concepts
    • Partitions
    • RDD Operations / transformations
    • Labs : Unstructured data analytics using RDDs
  5. Dataframes / Datasets
    • Understanding the newer Dataset API
    • Dataframes
    • Loading structured data using Dataframes
    • Caching and persistence
    • Labs : Dataframes, Datasets, Caching
  6. Spark SQL
    • Spark SQL concepts and overview
    • Defining tables and importing datasets
    • Querying data using SQL
    • Handling various storage formats : JSON / Parquet / ORC
    • Labs : querying structured data using SQL; evaluating data formats
  7. Spark and Hadoop
    • Hadoop Primer : HDFS / YARN
    • Hadoop + Spark architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark
    • Spark & Hive
  8. Machine Learning (ML) (day 3)
    • Machine Learning primer
    • Machine Learning in Spark : MLlib / ML
    • Spark ML overview (newer Spark 2 version)
    • Algorithms : Clustering, Classification, Recommendations
    • Labs : Writing ML applications
  9. GraphX (day 3)
    • GraphX library overview
    • GraphX APIs
    • Labs : Processing graph data using Spark

MindIQ