Spark V2 for Data Analysts – 2 to 3 Days
Course Description
This course introduces Apache Spark. Students will learn how Spark fits into the Big Data ecosystem and how to use it for data analysis.
What You Will Learn
- Scala primer
- Spark Shell
- Spark Data structures (RDD / Dataframe / Dataset)
- Spark SQL
- Spark & Hadoop
- Spark MLlib (3rd day)
- Spark GraphX (3rd day)
Intended Audience
Data and Business Analysts
Lab Environment
We provide the complete lab environment in the cloud. No need to install Spark on your laptop.
See below for what to bring.
What to Bring
- A reasonably modern laptop that can connect to cloud services (laptops with overly restrictive firewalls are not recommended)
- SSH client (on Windows, use PuTTY or SecureCRT; Mac and Linux come with SSH clients)
- Chrome browser with the Markdown Preview Plus extension
- Nice to have : a programmer’s editor
Prerequisites
- Analyst background (familiarity with SQL, scripting, etc.)
- Basic understanding of the Linux development environment (basic command line navigation / editing files / running programs)
Outline
- Scala primer
  - A quick introduction to Scala
  - Labs : Getting to know Scala
- Spark Basics
  - Big Data, Hadoop, Spark
  - Spark concepts and architecture
  - Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
  - Labs : Installing and running Spark
- First Look at Spark
  - Spark shell
  - Spark web UIs
  - Analyzing a dataset – part 1
  - Labs : Spark shell exploration
- RDDs (condensed coverage)
  - RDD concepts
  - Partitions
  - RDD operations / transformations
  - Labs : Unstructured data analytics using RDDs
- Dataframes / Datasets
  - Understanding the newer Dataset API
  - Dataframes
  - Loading structured data using Dataframes
  - Caching and persistence
  - Labs : Dataframes, Datasets, Caching
- Spark SQL
  - Spark SQL concepts and overview
  - Defining tables and importing datasets
  - Querying data using SQL
  - Handling various storage formats : JSON / Parquet / ORC
  - Labs : Querying structured data using SQL; evaluating data formats
- Spark and Hadoop
  - Hadoop primer : HDFS / YARN
  - Hadoop + Spark architecture
  - Running Spark on Hadoop YARN
  - Processing HDFS files using Spark
  - Spark & Hive
- Machine Learning (ML) (3rd day)
  - Machine Learning primer
  - Machine Learning in Spark : MLlib / ML
  - Spark ML overview (newer Spark 2 version)
  - Algorithms : Clustering, Classification, Recommendation
  - Labs : Writing ML applications
- GraphX (3rd day)
  - GraphX library overview
  - GraphX APIs
  - Labs : Processing graph data using Spark