Applications Open now for Jan 2023 Batch | Applications Close: Jan 15, 2023 | Exam: Feb 26, 2023

Degree Level Course

Introduction to Big Data

This course introduces students to the practical aspects of analytics at large scale, i.e., big data. It starts with a basic introduction to big data and cloud concepts spanning hardware, systems, and software, and then delves into the details of algorithm design and execution at large scale.

by Rangarajan Vasudevan

Course ID: BSCCS3006

Course Credits: 4

Course Type: Elective

Pre-requisites: None

What you’ll learn

Introduction to Cloud Concepts: Cloud-Native architecture, serverless computing, message queues, PaaS, SaaS, IaaS
Introduction to Big Data concepts: divide-and-conquer, parallel algorithms, distributed virtualized storage, distributed resource management, real-time processing
Technology deep-dive on GCP as the vehicle for the experiments: Google Cloud Storage, GCP Dataflow, DataProc, Google Pub/Sub, Cloud Functions
Analytics at Large Scale: PySpark, BigQuery, integration with TensorFlow/PyTorch

Course structure & Assessments

For details of the standard course structure and assessments, visit the Academics page.

WEEK 1 Introduction: Big data concepts & GCP Platform Setup
WEEK 2 Cloud concepts: Cloud-Native architecture, serverless computing, message queues, PaaS, SaaS, IaaS. Assignment: Spin up a VM and write a Python program to count the lines of a file placed in GCS.
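The Week 2 assignment might look roughly like the sketch below. The bucket and object names are placeholders, and the `google-cloud-storage` client is assumed to be available on the VM:

```python
def count_lines(data: bytes) -> int:
    """Count the lines in a file's raw bytes.

    splitlines() handles \n, \r\n, and a missing trailing newline uniformly.
    """
    return len(data.decode("utf-8").splitlines())


def count_gcs_file_lines(bucket_name: str, blob_name: str) -> int:
    """Download an object from GCS and count its lines."""
    # Imported lazily so the pure helper above works without the GCP SDK.
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return count_lines(blob.download_as_bytes())


if __name__ == "__main__":
    # Placeholder bucket/object names; substitute your own.
    print(count_gcs_file_lines("my-bucket", "data.txt"))
```

Keeping the counting logic in a pure function makes it easy to test locally before touching the cloud.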
WEEK 3 Serverless: Google Cloud Functions. Assignment: Write a Python program that counts the lines of a file placed in GCS, using Google Cloud Functions.
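One way to approach the Week 3 assignment is a background Cloud Function on a `google.storage.object.finalize` trigger, whose event payload carries the bucket and object name of the uploaded file. The injectable `client` parameter is an assumption added here to allow local testing:

```python
def count_lines_gcs(event, context, client=None):
    """Entry point for a GCS-triggered (object.finalize) Cloud Function.

    `event` is the storage-object payload; `client` is injectable for
    local testing and defaults to a real google.cloud.storage.Client.
    """
    if client is None:
        from google.cloud import storage  # available in the Cloud Functions runtime
        client = storage.Client()
    blob = client.bucket(event["bucket"]).blob(event["name"])
    n = len(blob.download_as_bytes().decode("utf-8").splitlines())
    print(f"{event['name']}: {n} lines")  # shows up in Cloud Logging
    return n
```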
WEEK 4 Big Data Engineering: Hadoop and PySpark. Assignment: Write Spark code for the Hash example discussed in the lab.
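The lab's exact Hash example is not reproduced on this page; as a generic illustration of the idea, the sketch below hash-partitions key/value records with PySpark. `crc32` is used instead of Python's built-in `hash` so bucket assignment is deterministic across processes:

```python
import zlib


def hash_bucket(key: str, num_partitions: int) -> int:
    """Deterministic bucket id for a key (crc32 avoids Python's per-process hash salt)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(pairs, num_partitions):
    """Pure helper: how many records land in each bucket."""
    counts = [0] * num_partitions
    for key, _ in pairs:
        counts[hash_bucket(key, num_partitions)] += 1
    return counts


if __name__ == "__main__":
    # Requires pyspark; run e.g. via `spark-submit` on a Dataproc cluster.
    from pyspark import SparkContext

    sc = SparkContext(appName="hash-example")
    rdd = sc.parallelize([("apple", 1), ("banana", 1), ("cherry", 1)])
    by_bucket = rdd.partitionBy(4, lambda k: hash_bucket(k, 4))
    print(by_bucket.glom().map(len).collect())  # records per partition
    sc.stop()
```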
WEEK 5 Big Data ML: DataProc with ML, including Spark ML (batch processing). Assignment: Train a classification model on an ML dataset. Report the details of your data exploration and feature engineering steps.
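A minimal shape for the Week 5 assignment, assuming a CSV dataset in GCS with a numeric `label` column (the dataset path and column names are placeholders, and the choice of logistic regression is just one option):

```python
def accuracy(pairs):
    """Pure helper: fraction of (label, prediction) pairs that match."""
    if not pairs:
        return 0.0
    return sum(1 for y, p in pairs if y == p) / len(pairs)


if __name__ == "__main__":
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("week5-ml").getOrCreate()
    df = spark.read.csv("gs://my-bucket/dataset.csv", header=True, inferSchema=True)

    # Feature engineering: assemble all non-label columns into one vector.
    features = [c for c in df.columns if c != "label"]
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=features, outputCol="features"),
        LogisticRegression(labelCol="label", featuresCol="features"),
    ])

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)
    pred = model.transform(test).select("label", "prediction").collect()
    print(f"accuracy: {accuracy([(r['label'], r['prediction']) for r in pred]):.3f}")
    spark.stop()
```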
WEEK 6 Quiz #1: 1-hour graded quiz on the syllabus covered so far, to be worked on ahead of the lab. The quiz is followed by a walkthrough of the correct answers.
WEEK 7 Streaming: Message queues: Pub/Sub. Assignment: Count the number of lines in a file uploaded to a GCS bucket in real time using Google Cloud Functions and Pub/Sub. Write a Google Cloud Function that is triggered whenever a file is added to a bucket and publishes the file name to a Pub/Sub topic. Write a Python script that subscribes to this topic and prints the number of lines in the file in real time.
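The two halves of the Week 7 assignment could be sketched as below. The project, topic, and bucket names are placeholders, and the injectable `publisher`/`client` parameters are assumptions added to allow local testing:

```python
def publish_filename(event, context, publisher=None,
                     topic="projects/my-project/topics/uploads"):
    """Cloud Function entry point for a storage object.finalize trigger."""
    if publisher is None:
        from google.cloud import pubsub_v1  # pip install google-cloud-pubsub
        publisher = pubsub_v1.PublisherClient()
    # Pub/Sub payloads are bytes; the file name travels as UTF-8.
    future = publisher.publish(topic, event["name"].encode("utf-8"))
    return future.result()  # message id once the publish is acknowledged


def line_count_callback(message, client=None):
    """Subscriber callback: fetch the named file from GCS and print its line count."""
    name = message.data.decode("utf-8")
    if client is None:
        from google.cloud import storage
        client = storage.Client()
    blob = client.bucket("my-bucket").blob(name)
    n = len(blob.download_as_bytes().decode("utf-8").splitlines())
    print(f"{name}: {n} lines")
    message.ack()
    return n
```

The subscriber script would pass `line_count_callback` to `pubsub_v1.SubscriberClient().subscribe(...)` and block on the returned future.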
WEEK 8 Streaming: Event processing: Spark Streaming. Assignment: Stream the data stored in the GCS bucket into Kafka. Use Spark Streaming to read the data and make real-time predictions using a pre-trained logistic regression model.
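For the Week 8 assignment, one possible shape is Spark Structured Streaming reading from Kafka and scoring each micro-batch with a saved model. The broker address, topic name, and model path are placeholders, and the sketch assumes the saved pipeline can consume the parsed records:

```python
def parse_features(line: str):
    """Pure helper: turn one CSV record into a list of floats."""
    return [float(x) for x in line.strip().split(",")]


if __name__ == "__main__":
    from pyspark.sql import SparkSession
    from pyspark.ml import PipelineModel

    spark = SparkSession.builder.appName("week8-streaming").getOrCreate()
    # A logistic-regression PipelineModel saved earlier with model.save(...).
    model = PipelineModel.load("gs://my-bucket/models/logreg")

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "test-data")
           .load())
    lines = raw.selectExpr("CAST(value AS STRING) AS line")

    def score(batch_df, batch_id):
        # Assumes the saved pipeline turns the raw "line" column into features.
        model.transform(batch_df).select("prediction").show(truncate=False)

    query = lines.writeStream.foreachBatch(score).start()
    query.awaitTermination()
```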
WEEK 9 Deep Learning on the cloud. Assignment: Deploy a deep learning model of your choice using Keras and PySpark. Report the model's performance measures on the MNIST dataset.
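A starting point for the Week 9 assignment: the single-node Keras part of the task, with distribution via PySpark left to the lab. The architecture below is just one reasonable choice, not the required one:

```python
def scale_pixels(images):
    """Pure helper: map uint8 pixel values [0, 255] to floats in [0, 1]."""
    return [[p / 255.0 for p in img] for img in images]


if __name__ == "__main__":
    from tensorflow import keras  # pip install tensorflow

    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=3)

    loss, acc = model.evaluate(x_test, y_test)
    print(f"test accuracy: {acc:.3f}")
```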
WEEK 10 Quiz #2: 1-hour graded quiz on the syllabus covered so far, to be worked on ahead of the lab. The lab will go through the answers and will also be open for any clarifications on the syllabus.
WEEK 11 Final Project: Use a DataProc cluster and submit a Spark job for data pre-processing and model training. Store the model in your GCS bucket. Submit a Spark job for evaluation on validation data stored in a GCS bucket. Stream the test data stored in the GCS bucket into Kafka. Use Spark Streaming to read the data and make real-time predictions using your stored model. Compare against an out-of-the-box deep learning model for accuracy on the same data.
WEEK 12 Optional: Q&A week (may also be scheduled earlier in the course).

About the Instructors

Rangarajan Vasudevan
Co-Founder & Chief Data Officer, Lentra.ai

Rangarajan Vasudevan is the Co-Founder & CDO of Lentra.ai, India’s fastest-growing lending cloud. He did “big data” and “data science” before they were fashionable, building data-native applications across industries and geographies for 15+ years.

Ranga joined Lentra by way of the June 2022 acquisition of his company TheDataTeam, creators of the Cadenz.ai customer intelligence platform. Prior to founding TheDataTeam, Ranga served as Director, Big Data with Teradata Corporation’s international business unit. Ranga joined Teradata via the acquisition of Aster Data Systems, where he was a founding engineer and co-invented a company-defining, patented pattern recognition algorithm. While at Teradata, he received both the Distinguished Engineer (R&D) and Consulting Excellence awards.

Ranga has degrees in Computer Science from the University of Michigan and IIT Madras.
