Machine Learning with Spark – 4 Days
Course Description
Spark is a new and very popular Big Data processing engine. Spark MLlib is a de facto standard for machine learning on Big Data.
This course is intended for data scientists and software engineers. It balances theory and practice. For each machine learning concept, we first discuss its foundations, applicability, and limitations. Then we explain its implementation and use, along with specific use cases. This is achieved through a combination of about 50% lecture and 50% lab work.
What You Will Learn
- attain a thorough understanding of popular machine learning algorithms, their applicability and limitations
- practice the application of these methods in the Spark machine learning environment
- gain clarity on the real-world use of machine learning, with each method illustrated by practical use cases
Intended Audience
Data Scientists and Software Engineers
Lab Environment
A working Spark environment will be provided for students. Students need only an SSH client and a browser.
Zero Install: There is no need to install software on students’ machines.
Prerequisites
- familiarity with programming in at least one language
- ability to navigate the Linux command line
- basic knowledge of Linux command-line editors (vi / nano)
Outline
Section 1: Introductions and overviews
- Machine learning: goals, results, supervised/unsupervised
- Spark as a tool for Big Data
- Scala as the language of Spark (together with Python, Java and R)
If students lack a Spark/Scala background, this section includes a thorough introduction to both.
Section 2: SVM (Support Vector Machines)
- Theory
- Lab
- Use case: anomaly detection
Section 3: Logistic Regression
- Theory
- Lab
- Use case: healthcare prediction
Section 4: Linear Regression
- Theory
- Lab
- Use case: financial modelling
Section 5: Naive Bayes
- Theory
- Lab
- Use case: spam filtering
Section 6: Decision Trees
- Theory
- Lab
- Use case: vessel shipment planning
Section 7: Clustering (K-Means)
- Theory
- Lab
- Use case: topic grouping
Section 8: LDA (Latent Dirichlet Allocation)
- Theory
- Lab
- Use case: unsupervised topic discovery
Section 9: Principal Component Analysis (PCA)
- Theory
- Lab
- Use case: stock analysis
Section 10: Recommendation (Collaborative Filtering)
- Theory
- Lab
- Use case: dating
Section 11: Graphs – graph operations
- Theory
- Lab
- Use case: finding followers
Section 12: Graphs – optimizations with Pregel
- Theory
- Lab
- Use case: shortest routes, PageRank