DE by DL.AI - C4 W3 - Data Transformations & Technical Considerations

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

Batch transformations

Week 3 overview

  • This week explores advanced data transformation frameworks beyond Pandas to address scalability and performance needs.
  • This week's plan:
    • Batch transformation
    • Transformation use cases.
    • Distributed processing frameworks:
      • Hadoop MapReduce: disk-based storage and processing ← many consider this a legacy technology due to its complexity, high-cost scaling, and significant maintenance requirements ← still worth understanding because its concepts appear in many distributed systems today.
      • Spark: in-memory processing framework.
    • Compare SQL-based transformation vs Python-based transformation.
    • Lab: in the previous course we used dbt to run transformations inside the data warehouse; this time, we implement them outside the data warehouse using Spark.
    • Streaming transformation
    • Transformation use cases.
    • Micro-batch vs true streaming processing tools.
    • Lab: implement CDC pipeline (change data capture) using Kafka and Flink.

Batch transformation patterns & Use cases

  • A lot of DE’s work is batch processing.
  • Some transformation patterns:
    • The first 2 we've already considered in Course 1 (ETL) and Course 4 (ELT).
  • We've already considered ETL and ELT with Spark and dbt.
  • Data Wrangling: → should use tools like AWS Glue DataBrew.
  • Transform data for updating
    • Approach 1: Truncate and Reload ← only suitable for small datasets.
    • Approach 2: CDC (Change Data Capture): identify the changes → insert / update / delete
      • Insert only → insert new records next to the old ones
      • Upsert/merge: based on ids → update if the record exists, insert if it doesn't (see the sketch after this list)
      • Capture deletes: hard delete (remove completely) or soft delete (mark as deleted, then filter out)
    • Single-row inserts → OK for row-oriented databases but not for OLAP systems!
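A minimal PySpark sketch of the upsert/merge pattern (the DataFrame names, schema, and id column here are made up for illustration; on table formats like Delta Lake or Iceberg the same logic is usually written as a MERGE INTO statement):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

# Hypothetical data: the current table and a batch of captured changes, keyed on `id`
current_df = spark.createDataFrame(
    [(1, "alice", 100), (2, "bob", 200)], ["id", "name", "amount"]
)
changes_df = spark.createDataFrame(
    [(2, "bob", 250), (3, "carol", 300)], ["id", "name", "amount"]
)

# Upsert = keep the current rows whose id does NOT appear in the changes (anti join),
# then append the changed/new rows
unchanged_df = current_df.join(changes_df.select("id"), on="id", how="left_anti")
merged_df = unchanged_df.unionByName(changes_df)

merged_df.show()
```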

Distributed processing framework - Hadoop

  • Google developed GFS (2003) and MapReduce (2004) for distributed data processing, leading to Yahoo's creation of Hadoop in 2006.
    • HDFS powers modern big data engines like EMR and Spark
    • MapReduce's concepts remain influential in distributed systems
  • HDFS combines compute and storage on the same nodes, whereas object storage offers only limited compute support for internal processing.
  • MapReduce:
    • → Weakness: intermediate results are written to disk → Spark uses RAM instead! (See the word-count sketch below.)
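To make the map → shuffle → reduce idea concrete, here is the classic word-count example written with PySpark's RDD API; the input lines are made up, and the key difference from Hadoop MapReduce is that intermediate results stay in memory instead of being written to disk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["the quick brown fox", "the lazy dog"])

counts = (
    lines.flatMap(lambda line: line.split())   # map: emit one record per word
         .map(lambda word: (word, 1))          # map: (key, value) pairs
         .reduceByKey(lambda a, b: a + b)      # shuffle + reduce: sum counts per key
)

print(counts.collect())  # e.g. [('the', 2), ('quick', 1), ...]
```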

Distributed processing framework - Spark

Spark is written natively in Scala.
The driver talks to the cluster manager, which allocates worker nodes. All tasks “T” within a stage run in parallel, and independent stages (e.g., stages 2 & 3) can also run in parallel.
 

Spark dataframes

  • Spark DataFrames let you handle large distributed datasets while interacting with them like a single table, with the complex distribution details abstracted away.
  • We work with a high-level API.
  • 2 types of operations: transformations (lazy, they only build up the execution plan) and actions (they trigger the actual computation). See the sketch below.
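A minimal sketch of the two operation types on a made-up DataFrame: the transformations only build a plan, and nothing runs until the action at the end.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ops-sketch").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)], ["key", "value"]
)

# Transformations: lazy, nothing is executed yet
grouped = df.groupBy("key").agg(F.sum("value").alias("total"))
filtered = grouped.filter(F.col("total") > 1)

# Action: triggers the whole plan to run on the cluster
filtered.show()
```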

Demo: working with Spark DataFrames using Python & Spark SQL

  • Using Python
    • Since Spark runs natively on the JVM but Python UDFs require a separate Python process, data conversion between the two is expensive. For better performance, consider writing UDFs in Scala or Java so they run directly on the JVM.
  • Spark SQL
    • When working with PySpark, you can choose to manipulate your data using SQL code, Python code, or a mix of both.
    • Temporary view: a virtual table that persists as long as the Spark session is running; it provides an interface for you to work with SQL.
      transactions_df.createOrReplaceTempView('orders')
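A short sketch of mixing the DataFrame API and Spark SQL through a temporary view, assuming a hypothetical transactions_df with order_id and amount columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

transactions_df = spark.createDataFrame(
    [(1, 10.0), (2, 25.5), (3, 7.25)], ["order_id", "amount"]
)

# Register the DataFrame as a temporary view (lives only for this Spark session)
transactions_df.createOrReplaceTempView("orders")

# Query it with SQL; the result is again a DataFrame, so you can keep chaining Python
big_orders_df = spark.sql("SELECT order_id, amount FROM orders WHERE amount > 10")
big_orders_df.show()
```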

Amazon EMR

  • AWS Console → EMR → EMR Serverless → Get Started → Create and launch EMR Studio → leave the type as “Spark” → “Use default settings for interactive workloads” (enables the endpoint for EMR Studio) → Create & start application
  • In EMR Studio → Dashboard → Create workspace → error → Edit Studio → choose service role “emr” → Browse S3: “emr-workspaces-…” → Save changes → View Workspaces → Create Workspace → naming… → Create Workspace → open the Jupyter notebook
  • In the Jupyter notebook, click the Compute tab (left sidebar) → attach the EMR Serverless application and interactive runtime role → Attach → when selecting a kernel, choose PySpark
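If you prefer scripting the setup over console clicks, the same kind of application can be created with boto3 (a rough sketch; the application name and release label are placeholders, and you should check which EMR Serverless release labels are available in your region):

```python
import boto3

# Create and start an EMR Serverless Spark application, mirroring the console steps above
emr = boto3.client("emr-serverless", region_name="us-east-1")

app = emr.create_application(
    name="de-c4w3-spark-app",   # hypothetical name
    releaseLabel="emr-6.9.0",   # placeholder; pick a release available in your region
    type="SPARK",
)

# Start the application so it can accept jobs / interactive sessions
emr.start_application(applicationId=app["applicationId"])
print("Application ID:", app["applicationId"])
```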

AWS Glue

There are 3 options to create and run Glue job:
  1. AWS Glue DataBrew (no code/low code)
    1. It's like working with Excel but with the power of Spark. No understanding of Spark or coding is needed at all.
  2. Glue Studio → SQL code (for more advanced users) + can drag and drop
  3. Jupyter notebook (Spark code from scratch)
Check back the C1W2 Lab.
The purpose of this Glue job is: load normalized data from RDS → apply transformations (ETL) to a star schema → store in S3. ← build glue_job.py (a skeleton sketch is at the end of this section)
Instead of writing the code line by line in this glue_job.py file, we can open Glue Studio to do that.
To open Glue Studio: Console → AWS Glue → ETL jobs / Visual ETL → Visual ETL → open the canvas
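For reference, a minimal skeleton of what a glue_job.py typically looks like (the catalog database, table, column names, and S3 path are placeholders, not the lab's actual ones):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Glue passes job arguments (such as the job name) on the command line
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a normalized table cataloged from RDS (placeholder names)
orders_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db", table_name="orders"
)

# Transform: switch to the DataFrame API to shape a star-schema fact table
orders_df = orders_dyf.toDF()
fact_orders_df = orders_df.select("order_id", "customer_id", "order_date", "amount")

# Load: write the result to S3 as Parquet (placeholder bucket/path)
fact_orders_df.write.mode("overwrite").parquet("s3://my-bucket/star_schema/fact_orders/")

job.commit()
```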

Lab 1 — Data transformations with Spark

Technical considerations

Streaming transformations

Streaming processing