DE by DL.AI - an AIO note

This note summarizes the entire Data Engineering course on DeepLearning.AI. More detailed notes for each specific course can be found here.

⚠️

Note: this is a draft meant only for my own reading. It has to be read together with the main notes of each course.

The course is taught by Joe Reis, author of the book Fundamentals of Data Engineering. The course follows the book closely.

Information

Home page

Home page on Coursera.

Instructor: Joe Reis (the author of “Fundamentals of Data Engineering” ← free download)

Home page of Course 1 — Introduction to Data Engineering

Lecture notes

DeepLearning.AI community for this course.

My Github repository for resources in the course.

DE Lifecycle + overview

Overall, SE is quite close to DE. Almost all of the foundational DE knowledge consists of SE tasks, but applied to data.

Lifecycle: the work keeps coming back to these stages

History: SQL → Data Warehouses → internet boom → MapReduce and Hadoop → Cloud platforms (AWS, GCP, Azure) + streaming.

Upstream (SE) → DE → Downstream (Analytics / ML / …)

DE requires a lot of discussions with other departments (I experienced this at Dataswati)

Thinking like DE:

Step 3.3 is crucial to complete before investing too much time in implementation

A big picture of DE → get raw data and turn it into something useful.

The transformation stage of the DE lifecycle = queries, modeling, and transformation.

Undercurrents of DE Lifecycles

Security → Principle of Least Privilege

Data Management → Plans and practices that optimize data value throughout its lifecycle.

Data Architecture → roadmap or blueprint for your data systems

DataOps → From DevOps (test, deploy, and maintain code) + data → DataOps → Automation (CI/CD) + Monitoring

Orchestration: coordinating the pieces so they run together ← Airflow

DAG (Directed Acyclic Graph)
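
A minimal sketch of why orchestrators model a pipeline as a DAG: because the graph has no cycles, an execution order that respects every dependency always exists. The task names are hypothetical, and this uses Python's standard graphlib module rather than Airflow itself.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
execution_order = list(TopologicalSorter(pipeline).static_order())
print(execution_order)
```

If a dependency cycle were introduced (for example, making "extract" depend on "report"), static_order() would raise graphlib.CycleError instead of returning an order.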

AWS overview

AWS Global Infrastructure (Regions & Zones)

Official website

Regions → named after where they are located (e.g. Europe (Frankfurt), US East (N. Virginia), …)

Each region has multiple Availability Zones (AZs, at least 3); if one dies, the others are still available.

Each AZ contains one or more data centers.

Regions > AZ > Data Centers.

Region > VPC (Virtual Private Cloud) > Subnets (a VPC spans the AZs of its region; each subnet sits in a single AZ)

Example of names: us-east-1, us-east-1a.

AWS Core Services

Core services and additional services - Public Sector Cloud Transformation

Free Cloud Computing Services - AWS Free Tier

AWS Certified Data Engineer - Associate Certification | AWS Certification

AWS Well-Architected - Build secure, efficient cloud applications

Virtual machines → EC2, with many instance types such as t3a.micro (t: family name, 3: generation, a: optional capabilities, micro: size)
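
To make the naming scheme concrete, here is a small sketch that splits an instance type name into the four parts listed above. The function and class names are hypothetical, and the regular expression is a simplification that only covers the common family/generation/capabilities/size pattern, not every name AWS uses.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str
    generation: int
    capabilities: str
    size: str


def parse_instance_type(name: str) -> InstanceType:
    """Split a name such as 't3a.micro' into family, generation, capabilities, size."""
    match = re.fullmatch(r"([a-z]+)(\d+)([a-z0-9-]*)\.([a-z0-9]+)", name)
    if match is None:
        raise ValueError(f"unrecognized instance type name: {name!r}")
    family, generation, capabilities, size = match.groups()
    return InstanceType(family, int(generation), capabilities, size)


print(parse_instance_type("t3a.micro"))
```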

Serverless functions (trigger events) → AWS Lambda
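
A minimal sketch of an event-triggered function in the Python runtime, assuming it is wired to S3 "object created" notifications. The bucket and key values are made up, and the return payload is only an illustration; the event is hand-written so the handler can be tried locally.

```python
def lambda_handler(event, context):
    """Entry point that Lambda invokes once per event."""
    # S3 notification events carry one or more records, each naming an object.
    keys = [
        record["s3"]["object"]["key"]
        for record in event.get("Records", [])
    ]
    return {"statusCode": 200, "processed_keys": keys}


# Local usage with a hand-written event shaped like an S3 notification.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "raw/orders.csv"}}}
    ]
}
result = lambda_handler(sample_event, context=None)
print(result)
```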

Amazon VPC (Virtual Private Cloud)

Region > VPC (Virtual Private Cloud) > Subnets (a VPC spans the AZs of its region; each subnet sits in a single AZ)

Storage → Object (S3) / Block (EBS) / File (EFS)

Object → any kind of data, most used.
Block → low latency environments
File → like file system on your laptop ← hierarchical structure

Database: RDS (relational database) ← PostgreSQL, MySQL, MariaDB, …; Redshift ← data warehouse service

Amazon Redshift is designed to handle large-scale data analytics and provides fast query performance for analyzing massive amounts of data
Amazon DynamoDB (NoSQL)
Amazon Kinesis Data Streams / SQS / Apache Kafka ← MSK

Monitoring: Amazon CloudWatch

Messaging services: Amazon SQS (message queue), SNS (pub/sub notifications)

Security: AWS IAM (user and access management)

CDN → CDN Cloud Service - Amazon CloudFront - AWS

Examples:

Web app stack: EC2 for hosting, RDS for database, S3 for static content, CloudFront for content delivery.

Serverless architecture: Lambda for computing, API Gateway for endpoints, DynamoDB for data storage, CloudWatch for monitoring.

DE with AWS

Source systems: databases + streaming, see above.

Batch ingestion tools: Amazon EMR vs AWS Glue ETL

Streaming ingestion tools: Kinesis, MSK.

Ingestion: DMS, AWS Glue (most used in the course)

Streaming: Kinesis Data Streams, Firehose, SQS, MSK.

Storage: see above (Redshift, S3)

Transformation: AWS Glue, Spark, dbt.

Serving: Athena, Redshift, QuickSight,…

Data management: AWS Glue Data Catalog (discover, create and manage metadata for data stored in Amazon S3 or other storage and database systems)

IDE: AWS Cloud9, EC2.

AWS Glue (Course 4)

to be noted….

Terraform (Course 2 Week 3)

DE by DL.AI - C2 W3 - DataOps ← see the lab 1 notebook to understand each component

Infrastructure as Code (IaC) ← on AWS there is also AWS CloudFormation (native to AWS)

Home page.

Examples: Example code terraform ec2, c2-w3-lab1 (dataops terraform)

Uses the HCL language. Terraform is highly idempotent (if you repeatedly apply the same HCL configuration, your infrastructure keeps the same desired end state as after the first run)
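
A toy sketch of what idempotency means here, in plain Python rather than Terraform: a plan is the difference between the desired and the current state, so applying the same desired state a second time yields an empty plan. The resource names and attributes are hypothetical, and real Terraform state handling is far richer than this.

```python
def plan(desired: dict, current: dict) -> dict:
    """Compare desired and current state and list the changes needed."""
    return {
        "create": {k: v for k, v in desired.items() if k not in current},
        "update": {k: v for k, v in desired.items() if k in current and current[k] != v},
        "delete": [k for k in current if k not in desired],
    }


def apply(desired: dict, current: dict) -> dict:
    """Return the state after carrying out the plan; it always equals desired."""
    changes = plan(desired, current)
    new_state = {k: v for k, v in current.items() if k not in changes["delete"]}
    new_state.update(changes["create"])
    new_state.update(changes["update"])
    return new_state


# Hypothetical desired infrastructure.
desired = {
    "bastion_host": {"instance_type": "t3.micro"},
    "database": {"engine": "postgres"},
}

first_run = apply(desired, current={})          # creates both resources
second_run = apply(desired, current=first_run)  # nothing left to change
print(plan(desired, first_run))
```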

Splitting the configuration into separate .tf files is just for readability (it is not required). Terraform automatically loads all .tf files in the directory as a single configuration.

Browse Providers | Terraform Registry

Use the following 3 commands

# init the configs
terraform init

# preview the changes to infrastructure before applying them
# (compare desired state with the current one)
terraform plan

# apply the changes
terraform apply

Variables are stored in .tfvars files, with TF_VAR_ environment variables defined in the .env file as a fallback. ← after applying the configs, Terraform updates the state file (.tfstate), not the .tfvars files. ← the state is what should be shared with the team!

The .tfvars files should be listed in .gitignore.
For example, the state is stored on S3, with a DynamoDB table used to prevent simultaneous edits within the team.

# Configure the backend for Terraform using AWS
terraform {
  backend "s3" {
    bucket         = "de-c2w3lab1-211125601709-us-east-1-terraform-state" # The name of the S3 bucket to store the state file
    key            = "de-c2w3lab1/terraform.state" # The key in the bucket where the state file will be stored
    region         = "us-east-1" # AWS region where the S3 bucket is located
    dynamodb_table = "de-c2w3lab1-terraform-state-lock" # The name of the DynamoDB table to use for state locking
    encrypt        = true
  }
}
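
A rough sketch of the idea behind state locking, assuming the lock behaves like a conditional write (an item can be created only if no item with the same ID exists yet). The LockTable class, the lock ID, and the owner names are all hypothetical; this is an in-memory stand-in, not the DynamoDB API.

```python
class LockTable:
    """In-memory stand-in for a lock table: at most one item per state file."""

    def __init__(self):
        self._items = {}

    def acquire(self, lock_id: str, owner: str) -> bool:
        # Mimics a conditional write: succeed only if no item with this ID exists.
        if lock_id in self._items:
            return False
        self._items[lock_id] = owner
        return True

    def release(self, lock_id: str, owner: str) -> bool:
        # Only the current holder may release the lock.
        if self._items.get(lock_id) != owner:
            return False
        del self._items[lock_id]
        return True


table = LockTable()
lock_id = "example-bucket/terraform.state"  # hypothetical lock ID

alice_got_lock = table.acquire(lock_id, "alice")         # first writer wins
bob_got_lock = table.acquire(lock_id, "bob")             # blocked while alice holds it
table.release(lock_id, "alice")
bob_got_lock_later = table.acquire(lock_id, "bob")       # free again after release
print(alice_got_lock, bob_got_lock, bob_got_lock_later)
```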

Other terraform commands

terraform output db_host # in output.tf file

Example: create a bastion host

Check this lab md file (C2W3 lab 1) + the full code of this lab.
Requirement: use Terraform to create a bastion host (EC2) and an RDS database. Also create the SSH connection (key pair).
The .tf files can be placed in separate modules, which are then referenced from main.tf.
The resources that were generated by

Architecture

Always be architecting → build loosely coupled systems.

Every action requires authentication.

Plan for failure.

Example: ETL (Extract-Transform-Load) and ELT
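
A small sketch contrasting the two orderings, using Python's built-in sqlite3 as a stand-in for the target database. The sample rows and table names are made up. In ETL the cleaning happens in application code before loading; in ELT the raw rows are loaded first and cleaned inside the database with SQL.

```python
import sqlite3

# Hypothetical raw source rows: everything arrives as text, one row is dirty.
raw_orders = [("1", " 19.50 "), ("2", "5.00"), ("3", "n/a")]


def clean(rows):
    """Transform step: cast types and drop rows whose amount is not numeric."""
    cleaned = []
    for order_id, amount in rows:
        try:
            cleaned.append((int(order_id), float(amount)))
        except ValueError:
            continue
    return cleaned


# ETL: transform in application code first, then load only the clean rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
etl_db.executemany("INSERT INTO orders VALUES (?, ?)", clean(raw_orders))

# ELT: load the raw rows as-is, then transform inside the database with SQL.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_orders (id TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_orders)
elt_db.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(id AS INTEGER) AS id, CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*'
    """
)

etl_rows = etl_db.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
elt_rows = elt_db.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(etl_rows, elt_rows)
```

Both paths end with the same clean table; the difference is that the ELT database also keeps raw_orders, so the transformation can be rerun or changed later without going back to the source.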

Streaming frameworks: Apache Kafka, Apache Storm, Apache Samza.

Keep in mind the end goal! → Deliver high-quality data products. Focus on the data architecture (what, why, when); the tools are for the how.

On-premises = the company owns and maintains the hardware and software for its data stack.

Different from Cloud provider (GCP, Azure, AWS)

Random notes

Throughout the labs in the course → Terraform (Infrastructure as Code) is used to set up the services together programmatically.

Introduction to HashiCorp Terraform with Armon Dadgar - YouTube

awswrangler (AWS SDK for Pandas)

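import awswrangler as wr

# Run the query through Athena and load the result into a pandas DataFrame
# (GLUE_DATABASE is assumed to be defined beforehand)
products_df = wr.athena.read_sql_query(
    """
    SELECT * FROM dim_products
    """,
    database=GLUE_DATABASE,
)

products_df.head()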

Here the database has already been copied into S3; the user only needs to query it through Athena. 👇 See the figure below.

Glue is used for the ETL tasks + Amazon Athena is used for serving.

The PostgreSQL database is typically faster for complex queries and data analysis, while MySQL, which you used previously, is more efficient for simpler queries.

W3Schools.com

Boto3 = AWS SDK for Python

Labs

The links below all lead to the .md assignment file of each lab.

C2W1 Lab 1 — Interacting With a Relational Database Using SQL

C2W1 Lab 2 — Interacting With Amazon DynamoDB NoSQL Database

C2W1 Lab 3 — Interacting With Amazon S3 Object Storage

C2W1 Lab 4 — Networking and Troubleshooting Database Connectivity on AWS

C2W2 Lab 1 — Batch Data Processing from an API (working with Spotify APIs)

C3W1 Lab 1 — Comparing Cloud Data Storage Options (S3, File Storage, Block Storage, Memory)

C3W1 Lab 2 — Graph Databases and Vector Search with Neo4j + Cypher query language

C3W3 Lab 2 — Comparing the Query Performance Between Row-Oriented and Column-Oriented Databases

C4W1 Lab 2 — Data Modeling with DBT ← from normalized form to a star schema using dbt

C4W2 Lab 1 — Feature Engineering for ML

SQL (Course 2)

Notebook: Interacting With a Relational Database Using SQL

To interact with a relational database, we use a Relational Database Management System (RDBMS): MySQL, PostgreSQL, Oracle Database, SQL Server. ← all of them support the SQL language.

Run SQL commands in a Jupyter notebook using %load_ext sql, thanks to the ipython-sql extension

Streaming

❤️ Gentle introduction into Kafka: Gently down the stream

Airflow

C2W3 — DataOps Automation

C2W4 — Airflow

DAG = directed acyclic graph

Great Expectations (GX)

C2W3 - GX

Great Expectations enables data quality testing similar to software testing by allowing users to set “expectations” about data at specific points in a pipeline to ensure reliability.
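
To illustrate the idea of an "expectation", here is a plain-Python sketch. It deliberately does not use the Great Expectations API; the check functions, their result format, and the sample rows are all hypothetical. Each check returns a small result instead of raising, so a pipeline can decide what to do when data fails.

```python
def check_not_null(rows, column):
    """Expect every row to have a non-null value in the given column."""
    failures = [row for row in rows if row.get(column) is None]
    return {"success": not failures, "unexpected_count": len(failures)}


def check_between(rows, column, min_value, max_value):
    """Expect every value in the column to lie within [min_value, max_value]."""
    failures = [
        row for row in rows
        if row.get(column) is None or not (min_value <= row[column] <= max_value)
    ]
    return {"success": not failures, "unexpected_count": len(failures)}


# Hypothetical batch of rows arriving at one point in a pipeline.
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": None, "amount": 12.5},
]

null_check = check_not_null(rows, "order_id")
range_check = check_between(rows, "amount", 0, 1000)
print(null_check, range_check)
```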

Data Warehouse - Data Lake - Data Lakehouse

Data Warehouses:

store structured data for reporting
Amazon Redshift
schema-on-write
higher cost (compared with a data lake)
A good fit when a company only needs to analyze structured data for reports

Data Lakes

hold raw, unstructured data for analytics
Amazon S3 with Glue
schema-on-read
Risk: lead to data swamp (unusable, unorganized data).
A good fit when a company mainly stores raw data for future machine learning
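
A minimal sketch of the schema-on-write vs schema-on-read distinction from the two lists above, using in-memory lists as stand-ins for a warehouse and a lake. The schema, field names, and helper functions are hypothetical.

```python
import json

SCHEMA = {"user_id": int, "country": str}  # hypothetical schema


def validate(record: dict) -> dict:
    """Raise ValueError unless the record matches SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    return record


# Schema-on-write (warehouse style): validate before storing; bad records are rejected.
warehouse = []

def write_to_warehouse(record: dict) -> None:
    warehouse.append(validate(record))


# Schema-on-read (lake style): store the raw text as-is; apply the schema when reading.
lake = []

def write_to_lake(raw_line: str) -> None:
    lake.append(raw_line)

def read_from_lake() -> list:
    rows = []
    for raw_line in lake:
        data = json.loads(raw_line)
        rows.append({
            "user_id": int(data["user_id"]),
            "country": str(data.get("country", "unknown")),
        })
    return rows


# The same loosely shaped record is accepted by the lake but rejected by the warehouse.
write_to_lake('{"user_id": "42", "clicks": 3}')
lake_rows = read_from_lake()

try:
    write_to_warehouse({"user_id": "42", "clicks": 3})
    rejected = False
except ValueError:
    rejected = True

print(lake_rows, rejected)
```

The lake never refuses a write, which is also how it can drift into a data swamp if nobody maintains the read-side schema.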

Data Lakehouses

Combines the low-cost, flexible storage of a data lake with the ACID compliance and performance optimizations of a data warehouse
Amazon SageMaker Lakehouse
using Redshift Spectrum alongside S3.
ACID Compliance: Ensures transactional integrity (e.g., updates/deletes).
Databricks: the first company to introduce the notion of a data lakehouse.
Setting up a lakehouse can be costly and requires expertise to manage

Which one?

When it comes to choosing between a data warehouse, a data lake, or a data lakehouse, it's really about choosing the right storage abstraction to support your organization's needs.