DE by DL.AI - C4 W3&4 - Data Transformations & Tech Considerations & Serving

Anh-Thi Dinh
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

Batch transformations

Week 3 overview

  • This week explores advanced data transformation frameworks beyond Pandas to address scalability and performance needs.
  • This week's plan:
    • Batch transformation
    • Transformation use cases.
    • Distributed processing frameworks:
      • Hadoop MapReduce: disk-based storage and processing ← many consider it a legacy tech due to its complexity, high-cost scaling, and significant maintenance requirements ← still worth understanding because its concepts are present in many distributed systems today.
      • Spark: in-memory processing framework.
    • Compare SQL-based transformation vs Python-based transformation.
    • Lab: in the previous course, we used dbt to run transformations inside the data warehouse; this time, we implement them outside the data warehouse using Spark.
    • Streaming transformation
    • Transformation use cases.
    • Micro-batch vs true streaming processing tools.
    • Lab: implement CDC pipeline (change data capture) using Kafka and Flink.

Batch transformation patterns & Use cases

  • A lot of a DE's work is batch processing.
  • Some transformation patterns:
    • The first 2, ETL and ELT, were already covered in Course 1 and Course 4.
  • We've already considered ETL and ELT with Spark and dbt.
  • Data wrangling → should use tools like AWS Glue DataBrew.
  • Transform data for updating
    • Approach 1: Truncate and Reload ← only suitable for small datasets.
    • Approach 2: CDC (Change Data Capture): identify the changes → insert / update / delete
      • Insert only → insert new records alongside the old ones
      • Upsert/merge: based on ids → update if it exists, add new if it doesn't (see the sketch after this list)
      • Capture deletes: hard (remove completely) or soft (mark as deleted and then filter out)
    • Single-row inserts → OK for row-oriented databases but not for OLAP systems!
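A minimal PySpark sketch of the upsert/merge + soft-delete idea (the `order_id`/`status`/`is_deleted` columns and the sample rows are made up for illustration; in a warehouse you would typically express this with a SQL `MERGE` instead):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdc-upsert-demo").getOrCreate()

# Hypothetical current target table and a batch of captured changes
target = spark.createDataFrame(
    [(1, "pending"), (2, "shipped"), (4, "shipped")], ["order_id", "status"])
changes = spark.createDataFrame(
    [(2, "delivered", False),   # update an existing row
     (3, "pending",   False),   # insert a new row
     (1, "pending",   True)],   # soft delete
    ["order_id", "status", "is_deleted"])

# Keep target rows that have no incoming change, then add the non-deleted changes
unchanged = target.join(changes, "order_id", "left_anti")
upserted = unchanged.unionByName(
    changes.filter(~F.col("is_deleted")).select("order_id", "status"))

upserted.show()  # order 4 untouched, order 2 updated, order 3 inserted, order 1 dropped
```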

Distributed processing framework - Hadoop

  • Google developed GFS (2003) and MapReduce (2004) for distributed data processing, leading to Yahoo's creation of Hadoop in 2006.
    • HDFS powers modern big data engines like EMR and Spark
    • MapReduce's concepts remain influential in distributed systems
  • HDFS (combines compute and storage on the same nodes) vs Object Storage (limited compute support for internal processing)
  • HDFS
  • MapReduce:
    • → Weakness: intermediate results are written to disk → Spark keeps them in RAM instead! (toy sketch below)
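To make the map/shuffle/reduce phases concrete, here is a toy single-machine word count in plain Python (no Hadoop involved, just the shape of the computation):

```python
from collections import defaultdict

# Toy input: each string stands in for one input split
documents = ["spark uses ram", "mapreduce writes to disk", "spark is fast"]

# Map phase: every record is turned into (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort phase: pairs are grouped by key (across the cluster in real Hadoop)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values per key. Hadoop writes intermediate results
# to disk between phases; Spark keeps them in memory, which is its main speedup.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'spark': 2, 'uses': 1, 'ram': 1, ...}
```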

Distributed processing framework - Spark

Spark is written natively in Scala.
The Driver talks to the Cluster Manager, which coordinates the Worker nodes. All tasks “T” in each stage run in parallel, and stages 2 & 3 run in parallel.
 

Spark dataframes

  • Spark DataFrames let you handle large distributed datasets while interacting with them like a single table, with the complex distribution details abstracted away.
  • We work with a high-level API.
  • 2 types of operations: transformations and actions.
    • Lazy evaluation: transformations aren't executed immediately; they're recorded as a lineage and only executed when an action is invoked.
    • This lazy evaluation allows Spark to optimize the execution plan by rearranging transformation operations for efficiency (see the sketch below).
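A small PySpark sketch of lazy evaluation (the DataFrame and column names are arbitrary):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-evaluation-demo").getOrCreate()

df = spark.range(1_000_000)  # distributed DataFrame with an `id` column

# Transformations: nothing runs yet; Spark only records the lineage
evens = df.filter(F.col("id") % 2 == 0)
doubled = evens.withColumn("double", F.col("id") * 2)

# Action: Spark now builds an optimized execution plan and runs the job
print(doubled.count())  # 500000
```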

Demo: working with Spark DataFrames using Python & Spark SQL

  • Using Python
    • Since Spark runs natively on JVM but Python UDFs require a separate Python process, data conversion between the two is expensive. For better performance, consider writing UDFs in Scala or Java to run directly on JVM.
  • Spark SQL
    • When working with PySpark, you can choose to manipulate your data using SQL code, Python code, or a mix of both.
    • Temporary view: a virtual table that persists as long as the Spark session is running; it provides an interface for you to work with SQL.
      transactions_df.createOrReplaceTempView('orders')
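A self-contained sketch of mixing the two APIs (the Spark session, sample data, and column names are assumptions, not the lab's actual data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical transactions data standing in for the lab's DataFrame
transactions_df = spark.createDataFrame(
    [(1, "c1", 120.0), (2, "c2", 80.0), (3, "c1", 45.0)],
    ["order_id", "customer_id", "amount"])

# Register the DataFrame as a temporary view so it can be queried with SQL
transactions_df.createOrReplaceTempView("orders")

# SQL and Python mix freely: the SQL result is a regular DataFrame
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
""")
top_customers.show()
```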

Amazon EMR

  • AWS Console → EMR → EMR Serverless → Get Started → Create and launch EMR Studio → leave the type as “Spark” → “Use default settings for interactive workloads” (enables the endpoint for EMR Studio) → Create & start application
  • In EMR Studio → Dashboard → Create workspace → error → Edit Studio → choose service role “emr” → Browse S3: “emr-workspaces-…” → Save changes → View Workspaces → Create Workspace → naming… → Create Workspace → open Jupyter notebook
  • In Jupyter Notebook, click Compute tab (left sidebar) → attach EMR Serverless application and interactive runtime role → Attach → select kernel, choose PySpark

AWS Glue

There are 3 options to create and run a Glue job:
  1. AWS Glue DataBrew (no code/low code)
    • It's like working with Excel but with the power of Spark. No understanding of Spark or coding is needed at all.
  2. Glue Studio → SQL code (more advanced users) + can drag and drop
  3. Jupyter notebook (Spark code from scratch)
Check back C1W2 Lab
The purpose of this Glue job: load normalized data from RDS → apply transformations (ETL) to a star schema → store in S3. ← build glue_job.py (a rough sketch is shown below)
Instead of writing the code line by line as in this glue_job.py file, we can use Glue Studio to do that.
To open Glue Studio: Console → AWS Glue → ETL jobs / Visual ETL → Visual ETL → open the canvas
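For reference, a rough sketch of what such a Glue job script could look like (the job arguments, catalog database/table names, and selected columns are assumptions, not the lab's actual glue_job.py):

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Job parameters (names are assumptions, passed in as --JOB_NAME, --glue_database, --target_bucket)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "glue_database", "target_bucket"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a source table that a Glue crawler has catalogued from the RDS database
orders_dyf = glue_context.create_dynamic_frame.from_catalog(
    database=args["glue_database"], table_name="classicmodels_orders"
)
orders_df = orders_dyf.toDF()

# Transform toward the star schema (only a trivial projection shown here)
fact_orders = orders_df.select("orderNumber", "customerNumber", "orderDate", "status")

# Store the result in S3 as Parquet
(fact_orders.write.mode("overwrite")
    .parquet(f"s3://{args['target_bucket']}/star_schema/fact_orders/"))

job.commit()
```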

Lab 1 — Data transformations with Spark

Technical considerations

Depending on your transformation use case, you might find one of the approaches more suitable than the other.
When to use Spark over pandas: when the data is big and doesn't fit into memory.

Streaming transformations

Streaming processing

 
Differences between micro-batch stream processing (“near” true streaming) and true stream processing.
 
CDC Infrastructure
Here is a brief description of the components:
  • Source Database: You already interacted with the source database in previous labs. For the prototype, the team set up a MySQL database inside a Docker container with the classicmodels database.
  • Debezium: Distributed platform for CDC; it connects to the source system and captures the changes based on the configuration you set up. The UI has been provisioned on port 8080 of the EC2 instance.
  • Apache Kafka: The event streaming platform; it has already been configured to connect with Debezium. Based on your configuration in Debezium, a topic is created for each table, containing messages for each change.
  • Apache Zookeeper: Service that stores configuration information and allows the coordination of cloud applications.
  • Apache Zeppelin: Web-based notebook environment; it allows us to run Flink in a notebook alongside other possible interpreters. You can access the UI on port 8085 of the EC2 instance, and the Flink UI on port 8086.
  • Apache Flink: Distributed processing engine for data streams; you will use it to connect to the Kafka topics and update the target database accordingly (see the sketch after this list).
  • Target Database: For the prototype, the team set up a Postgres database in an RDS instance with the target schema classicmodels_star_schema.
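A rough PyFlink sketch of the Flink piece of this pipeline, assuming Debezium's default `<server>.<database>.<table>` topic naming and made-up connection details and columns:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment (in the lab this logic runs through Zeppelin's Flink interpreter)
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Debezium-formatted Kafka topic for the `orders` table
# (topic name, broker address, and columns are illustrative assumptions)
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        orderNumber INT,
        customerNumber INT,
        status STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'classicmodels.classicmodels.orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'debezium-json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink: a table in the Postgres target schema, upserted on the primary key
t_env.execute_sql("""
    CREATE TABLE fact_orders (
        orderNumber INT,
        customerNumber INT,
        status STRING,
        PRIMARY KEY (orderNumber) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://<rds-endpoint>:5432/postgres',
        'table-name' = 'classicmodels_star_schema.fact_orders',
        'username' = 'postgres',
        'password' = '<password>'
    )
""")

# Continuously propagate inserts/updates/deletes from the CDC stream to the target
t_env.execute_sql(
    "INSERT INTO fact_orders SELECT orderNumber, customerNumber, status FROM orders_cdc"
)
```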

Serving Data for Analytics and Machine Learning

  • 3 ways to serve data:
    • As files: text, tabular data, images/videos
    • From databases and data warehouses. Benefits: imposes order & structure through a schema, gives you fine-grained permission controls, offers high performance for queries.
    • From streaming systems: low latency; emerging systems effectively combine stream processing with the features of an OLAP database.
  • Object storage (a data lake) is the best method for serving large files of semi-structured or unstructured data.
  • One advantage of using a data-sharing platform is that it ensures a coherent and consistent version of the shared files.
    • VIEW → end users don't need to write many joins
      • VIEW vs CTE (common table expression) ← a CTE exists only within the scope of the main query where it is referenced. A VIEW is a database object; its definition is stored and persists in the database, but its query is re-run on each reference (← expensive for complex queries).
        MATERIALIZED VIEW ← caches the query results and allows you to refresh the data (see the sketch below).
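A small psycopg2 sketch contrasting the two on Postgres (connection details and the star-schema table/column names are made up):

```python
import psycopg2

# Hypothetical Postgres connection and star-schema tables, purely for illustration
conn = psycopg2.connect(
    host="<rds-endpoint>", dbname="postgres", user="postgres", password="<password>")
cur = conn.cursor()

# A VIEW stores only the query definition: the joins run again on every SELECT
cur.execute("""
    CREATE VIEW order_totals AS
    SELECT c.customer_name, SUM(f.amount) AS total
    FROM fact_orders f
    JOIN dim_customers c ON f.customer_key = c.customer_key
    GROUP BY c.customer_name
""")

# A MATERIALIZED VIEW caches the results and must be refreshed explicitly
cur.execute("""
    CREATE MATERIALIZED VIEW order_totals_mv AS
    SELECT c.customer_name, SUM(f.amount) AS total
    FROM fact_orders f
    JOIN dim_customers c ON f.customer_key = c.customer_key
    GROUP BY c.customer_name
""")
cur.execute("REFRESH MATERIALIZED VIEW order_totals_mv")
conn.commit()
```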

Lab 1 - Visualization with DBT and Superset

Summary of the program concepts

  • Think like a Data Engineer
  • Data Ingestion
  • Data Transformation
  • Serving
  • Undercurrents

Course Capstone Project

  • There are 2 parts:
    • Part 1: Create and configure the resources for the pipeline. ← code
    • Part 2: Integrate data quality checks and orchestration, & create data visualizations. ← code
  • Goal
  • Data model
  • Solution Architecture ← the 1st step mainly uses Terraform.
    • (Like the Medallion architecture) Landing Zone (bronze zone), Transformation Zone (silver zone), Serving Zone (gold zone)
  • Second lab → uses Airflow ← runs daily to update the modeled data

Program conclusion

  • The boundaries between DE-SE and DS-ML are blurry
  • Trends:
    • Shift toward simplified and easy-to-use tools.
    • Abstraction + a streamlined dev process lets engineers focus on complex, high-value problems rather than managing low-level infrastructure and code.
    • Streaming pipelines and real-time analytics
  • Advice:
    • Work on your own DE projects.
    • Ask questions and communicate.