List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
- This week explores advanced data transformation frameworks beyond Pandas to address scalability and performance needs.
- This week's plan:
- Transformation use cases.
- Distributed processing frameworks:
- Hadoop MapReduce: disk-based storage and processing ← many consider it legacy tech due to its complexity, high-cost scaling, and significant maintenance requirements ← still worth understanding because its ideas are present in many distributed systems today.
- Spark: in-memory processing framework.
- Compare SQL-based transformation vs Python-based transformation.
- Lab: in the previous course, we used dbt to run transformations inside the data warehouse; this time, we implement them outside the data warehouse using Spark.
- Transformation use cases.
- Micro-batch vs true streaming processing tools.
- Lab: implement CDC pipeline (change data capture) using Kafka and Flink.
Batch Transformation
Streaming transformation
- A lot of a DE's work is batch processing.
- Some transformation patterns:
- We've already considered ETL and ELT with Spark and dbt.
- Data Wrangling: → should use tools like AWS Glue DataBrew.
- Transform data for updating
- Approach 1: Truncate and Reload ← only suitable for small datasets.
- Approach 2: CDC (Change Data Capture): identify the changes → insert / update / delete
- Insert only → insert new records alongside the old ones.
- Upsert/merge: based on ids → update if the id exists, insert if it doesn't (see the sketch after this list).
- Capture deletes → hard delete (remove completely) or soft delete (mark as deleted and filter out later).
- Single-row inserts → OK for row-oriented databases but not for OLAP systems!
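A minimal sketch of the upsert/merge pattern in PySpark (ids, names, and dates are illustrative; the lab may use different tooling):

```python
# Minimal upsert/merge sketch in PySpark.
# Assumes a local SparkSession; the data is illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

existing = spark.createDataFrame(
    [(1, "alice", "2024-01-01"), (2, "bob", "2024-01-01")],
    ["id", "name", "updated_at"],
)
changes = spark.createDataFrame(
    [(2, "bobby", "2024-02-01"), (3, "carol", "2024-02-01")],  # update id=2, insert id=3
    ["id", "name", "updated_at"],
)

# Drop existing rows whose id appears in the change set (left anti join),
# then append the change set: matching ids are updated, new ids are inserted.
merged = existing.join(changes, on="id", how="left_anti").unionByName(changes)
merged.orderBy("id").show()
```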
- Google developed GFS (2003) and MapReduce (2004) for distributed data processing, leading to Yahoo's creation of Hadoop in 2006.
- HDFS powers modern big data engines like EMR and Spark
- MapReduce's concepts remain influential in distributed systems
- HDFS (combines compute and storage on the same nodes) vs Object Storage (limited compute support for internal processing)
- HDFS
- MapReduce:
→ Weakness: writes intermediate results to disk → Spark uses RAM! (See the word-count sketch below for the map → shuffle → reduce flow.)
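To make the map → shuffle → reduce flow concrete, here is a tiny in-memory word-count sketch in plain Python; real MapReduce distributes these phases across nodes and writes intermediate results to disk:

```python
# Conceptual word count in the MapReduce style (single process, in memory).
from collections import defaultdict

documents = ["spark uses ram", "mapreduce writes to disk", "spark is fast"]

# Map phase: emit (word, 1) pairs for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all emitted values by key (word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'spark': 2, 'uses': 1, ...}
```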
- Spark DataFrames let you handle large distributed datasets while interacting with them like a single table, with the complex distribution details abstracted away.
- We work with a high-level API.
- 2 types of operations: transformations and actions.
- Transformations aren't executed immediately; instead, they're recorded as a lineage and only executed when an action is invoked. This lazy evaluation allows Spark to optimize the execution plan by rearranging transformation operations for efficiency (see the sketch below).
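A minimal sketch of the two operation types, assuming a SparkSession named `spark` and illustrative data:

```python
# Transformations are lazy; actions trigger execution.
df = spark.createDataFrame([(1, 120.0), (2, 80.0), (3, 200.0)], ["order_id", "amount"])

# Transformations: nothing is computed yet, only the lineage is recorded.
large_orders = df.filter(df.amount > 100).select("order_id")

# Action: Spark now optimizes the plan and actually runs the job.
print(large_orders.count())  # 2
```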
- Using Python
- Since Spark runs natively on the JVM but Python UDFs require a separate Python process, data conversion between the two is expensive. For better performance, consider writing UDFs in Scala or Java so they run directly on the JVM.
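For reference, this is roughly what the (slower) Python UDF path looks like; a minimal sketch assuming a SparkSession named `spark` and illustrative column names. Where a built-in function such as `upper` exists, prefer it, since it stays entirely on the JVM:

```python
# A Python UDF forces rows to be serialized to a Python worker and back.
from pyspark.sql.functions import udf, upper
from pyspark.sql.types import StringType

shout = udf(lambda s: s.upper() if s else None, StringType())

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
df.withColumn("name_udf", shout("name")).show()      # Python UDF: JVM <-> Python overhead
df.withColumn("name_builtin", upper("name")).show()  # built-in: runs on the JVM
```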
- Spark SQL
- When working with PySpark, you can choose to manipulate your data using SQL code, Python code, or a mix of both.
- Temporary view: a virtual table that persists as long as the Spark session is running; it provides an interface for you to work with SQL.
```python
transactions_df.createOrReplaceTempView('orders')
```
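Once the view is registered, you can mix SQL and DataFrame code freely; a minimal sketch (the `amount` column is assumed for illustration):

```python
# Query the temporary view with Spark SQL; the result comes back as a DataFrame.
result = spark.sql("SELECT COUNT(*) AS n_orders FROM orders WHERE amount > 100")
result.show()
```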
- AWS Console → EMR → EMR Serverless → Get Started → Create and launch EMR Studio → leave the type as “Spark” → “Use default settings for interactive workloads” (enables the endpoint for EMR Studio) → Create & start application
- In EMR Studio → Dashboard → Create workspace → (an error appears) → Edit Studio → choose service role “emr” → Browse S3: “emr-workspaces-…” → Save changes → View Workspaces → Create Workspace → naming… → Create Workspace → open Jupyter notebook
- In the Jupyter notebook, click the Compute tab (left sidebar) → attach the EMR Serverless application and interactive runtime role → Attach → when selecting a kernel, choose PySpark
There are 3 options to create and run a Glue job:
- AWS Glue DataBrew (no code/low code)
It's like working with Excel but with the power of Spark. No understanding of Spark or coding is needed at all.
- Glue Studio → SQL code (more advanced users) + can drag and drop
- Jupyter notebook (Spark code from scratch)
Check back on the C1W2 lab.
The purpose of this Glue job: load normalized data from RDS → apply transformations (ETL) into a star schema → store in S3.
`glue_job.py`
Instead of writing the code line by line like in this `glue_job.py` file, we can open Glue Studio to do that. To open Glue Studio: Console → AWS Glue → ETL jobs / Visual ETL → Visual ETL → open the canvas
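At its core, such a job is a read → transform → write sequence. Below is a minimal plain-PySpark sketch of that flow (endpoints, credentials, bucket, and column names are placeholders; the actual `glue_job.py` uses Glue-specific APIs such as the GlueContext):

```python
# Minimal read -> transform -> write sketch in plain PySpark
# (requires a MySQL JDBC driver on the classpath; all names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rds-to-star-schema").getOrCreate()

# Read a normalized table from the source RDS database over JDBC.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<rds-endpoint>:3306/classicmodels")
    .option("dbtable", "orders")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Reshape into a star-schema fact table (illustrative column selection).
fact_orders = orders.select("orderNumber", "customerNumber", "orderDate", "status")

# Store the result in S3 in a columnar format.
fact_orders.write.mode("overwrite").parquet("s3://<bucket>/star_schema/fact_orders/")
```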
Check this notebook.
Depending on your transformation use case, you might find one of the approaches more suitable than the other.
When to use Spark over pandas: when the data is big and doesn't fit into memory.
Here is a brief description of the components:
- Source Database: You already interacted with the source database in previous labs. For the prototype, the team set up a MySQL database inside a Docker container with the `classicmodels` database.
- Debezium: Distributed platform for CDC; it connects to the source system and captures the changes based on the configuration you set up. The UI has been provisioned on port `8080` of the EC2 instance.
- Apache Kafka: The event streaming platform. It has already been configured to connect with Debezium; based on your Debezium configuration, topics are created for each table, containing a message for each change.
- Apache Zookeeper: Service that stores configuration information and allows the coordination of cloud applications.
- Apache Zeppelin: Web-based notebook environment; it allows us to run Flink in a notebook alongside other possible interpreters. You can access the UI by connecting to port `8085` of the EC2 instance; the Flink UI is available on port `8086`.
- Apache Flink: Distributed processing engine for data streams; you will use it to connect to the Kafka topics and update the target database accordingly (see the PyFlink sketch after this list).
- Target Database: For the prototype, the team set up a Postgres database in an RDS instance with the target schema `classicmodels_star_schema`.
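To give a flavor of the Flink side, here is a hedged PyFlink sketch that declares a table over a Debezium-formatted Kafka topic (topic name, broker address, and columns are placeholders; it also assumes the Kafka connector jars are available). In the lab itself, similar SQL is run from the Zeppelin notebook's Flink interpreter:

```python
# Declare a Flink table over a Debezium-formatted Kafka topic (PyFlink Table API).
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        orderNumber INT,
        status      STRING,
        orderDate   STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'classicmodels.orders',               -- placeholder topic name
        'properties.bootstrap.servers' = 'kafka:9092',  -- placeholder broker
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'debezium-json'                      -- interprets inserts/updates/deletes
    )
""")

# Downstream, a continuous INSERT INTO ... SELECT can keep the target tables in sync.
```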
Check this notebook.
- 3 ways to serve data:
- As files: text, tabular data, images/videos
- From databases and data warehouses. Benefits: imposes order & structure through schema, gives you fine-grained permission controls, offers high performance for queries.
- From streaming systems: low latency, effectively combining the features of an OLAP database.
- Object storage or data lakes is the best method for serving large files of semi-structured or unstructured data.
- One advantage of using a data sharing platform is to ensure a coherent and consistent version for the shared files.
`VIEW` → end users don't need to write many joins.
`VIEW` vs `CTE` (common table expression) ← a CTE exists only within the scope of the main query where it is referenced; a `VIEW` is a database object whose definition is stored and persists in the database, but its underlying query is re-run each time the view is accessed (← expensive for complex queries). → `MATERIALIZED VIEW` ← caches the query results and allows you to refresh the data.
Check these codes.
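A small sketch of the difference in Postgres via psycopg2 (connection details, table, and column names are placeholders, not the lab's actual schema):

```python
# Create a regular view and a materialized view in Postgres (illustrative names).
import psycopg2

conn = psycopg2.connect(host="<rds-endpoint>", dbname="postgres",
                        user="<user>", password="<password>")
cur = conn.cursor()

# Regular view: only the definition is stored; the join runs on every query.
cur.execute("""
    CREATE OR REPLACE VIEW customer_orders AS
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
""")

# Materialized view: the results are cached and must be refreshed explicitly.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS customer_orders_mv AS
    SELECT customer_id, SUM(amount) AS total_spent
    FROM customer_orders GROUP BY customer_id
""")
cur.execute("REFRESH MATERIALIZED VIEW customer_orders_mv")

conn.commit()
cur.close()
conn.close()
```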
- Think like a Data Engineer
- Data Ingestion
- Data Transformation
- Serving
- Undercurrents
- There are 2 parts:
- Goal
- Data model
- Solution Architecture ← the 1st step mainly uses Terraform.
- First lab:
- mainly focuses on using Terraform and Glue scripts.
- At the end of the lab, set up the dbt project folder + (optional) define the tables of the star schema model ← check DE by DL.AI - C3 W2 - Storage Abstractions (to see an example of a Terraform folder structure) + the dbt demo in DE by DL.AI - C4 W1 - Data Modeling & Transformations for Analytics
- Second lab: → using Airflow ← runs daily to update the model data (see the DAG sketch below).
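A minimal sketch of what a daily-scheduled DAG could look like (Airflow 2.x style; the DAG id, task, and callable are placeholders rather than the lab's actual code):

```python
# Minimal daily DAG skeleton (Airflow 2.x style).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def update_model_data():
    # Placeholder: trigger the transformation that refreshes the star-schema model.
    print("updating model data...")

with DAG(
    dag_id="daily_model_update",     # placeholder DAG id
    schedule_interval="@daily",      # run once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="update_model_data", python_callable=update_model_data)
```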
- The boundaries between DE-SE and DS-ML are blurry.
- Trend:
- Shift toward simplified and easy-to-use tools.
- Abstraction + streamlined dev processes allow engineers to focus on complex, high-value problems rather than managing low-level infrastructure and code.
- Streaming pipelines and real-time analytics.
- Advice:
- Work on your own DE project.
- Ask questions and communicate.