DE by DL.AI - C1 W1&2 - Intro to DE & DE Lifecycle and Undercurrents

☝

List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

Information

Home page

Home page on Coursera.

Instructor: Joe Reis (the author of “Fundamentals of Data Engineering” ← free download)

Home page of Course 1 — Introduction to Data Engineering

Lecture notes

DeepLearning.AI community for this course.

My Github repository for resources in the course.

Introduction to Data Engineering

Data-Centric AI: The discipline of systematically engineering the data used to build an AI system.

This program is all about frameworks and principles, getting you to think like a data engineer, and building systems on AWS.

Program

Course 1: Intro to DE

Course 2: Source Systems, Data Ingestion, and Pipelines.

Course 3: Data Storage and Queries.

Course 4: Data Modeling, Transformation, and Serving.

Prerequisite

Intermediate Python, Pandas

Basic SQL

Basic AWS Cloud.

This program

What is unique about this program?

This program teaches you how to think like a data engineer
Hands-on practice.

Textbook: Fundamentals of Data Engineering

Scenario

Most developers focus only on the last stage → wasted time and less effective results

First course → a big picture. First week is only about how to think like a DE. No lab, no implementation.

Plan for course 1

Week 1: High level look at the field of DE

DE lifecycle
History of DE
The DE among other stakeholders
Business value
Translation of stakeholder needs into requirements

Week 2: DE lifecycle and undercurrents

Week 3: Principles of good data architecture

Week 4: Design and build out a data architecture

Software Engineering (SE) → DE

Definition (by the author of the book): Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering

→ Your job is to get raw data from somewhere, turn it into something useful, and then make it available for downstream use cases!

DE lifecycle

History of DE

1960s-1970s: Digital data emerges with computers. Relational databases and SQL are developed.

1980s-1990s: Data warehouses and BI tools emerge. Inmon and Kimball introduce data modeling approaches.

Mid-1990s-Early 2000s: Internet boom drives web app growth. MapReduce and Hadoop revolutionize data processing.

Late 2000s-2010s: Cloud platforms (AWS, Google Cloud, Azure) transform data applications. Shift to real-time processing and event streaming.

Present: Data engineering focuses on scalable systems, cloud-first solutions, and technology integration to serve business goals.

The DE among other stakeholders: 2 ways (upstream and downstream)

Business Value:

Focus on creating business value in data engineering. Don't chase every new technology. Prioritize solutions that deliver tangible benefits to the organization. Ultimately, business value is the driving force behind technological decisions in our industry. (Bill Inmon's advice)

System Requirements: Before we start writing any code or spinning up resources on the Cloud

The most important step is Requirements Gathering

Know how to translate high-level goals into requirements

Requirements Gathering Conversation

(mock conversation between a Data Scientist and a DE)

DS/DA receive requests from marketing for real-time dashboard and recommendations, but lack direct data access. They must process dumped data, 90% of which is irrelevant, spending 80% of time on formatting (2 days), eliminating real-time capability.

Continuous data structure changes further delay the process by 2 days.

An automated process for data formatting and handling is needed to allow data scientists to focus on analysis.

The DE should clarify marketing's objectives (e.g., "real-time" frequency), identify key requirements, and outline their proposed solution for DS confirmation.

Translate Stakeholder Needs into Specific Requirements

Key Elements of Requirements Gathering

Learn what existing data systems or solutions are in place.

Learn what pain points or problems there are with the existing solutions.

Learn what actions stakeholders plan to take with the data. Tip: Repeat what you learned back to your stakeholders.

Identify any other stakeholders you’ll need to talk to if you’re still missing information.

Thinking like a DE: below steps are in a circle

Step 3.3 is crucial to complete before investing too much time in implementation

Data Engineering on the Cloud

High level mental framework and way of thinking like a data engineer is important for everything that follows

As a data engineer, the actual set of tools and technologies you work with could be quite different from one company to the next.

Public cloud: AWS, GCP (Google Cloud Platform), MS Azure.

Intro to the AWS Cloud

Pay as you go pricing
IT Resources

Advantage of building on cloud

Cloud resources are scalable and elastic.
No need to worry about the exact storage capacity needed
No need to manage the scaling operations.

AWS data centers are all around the world → AWS regions (their names are the same as where they are located). ← AWS Global Infrastructure

Each region has Availability Zones: if one goes down, the others keep running. ← Regions & Availability Zones
A region consists of multiple availability zones and an availability zone contains one or more data centers.

Example of names:

us-east-1 = the first one created in the eastern US
us-east-1a = an availability zone in us-east-1 (Northern Virginia Region)
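A minimal sketch of the naming pattern above: an Availability Zone name is the region name plus a trailing letter. The function name `split_az_name` and the regular expression are my own illustration, not an AWS API.

```python
import re

def split_az_name(az_name: str) -> tuple[str, str]:
    """Split an Availability Zone name into (region, zone_letter)."""
    # Region = lowercase words joined by hyphens, ending in a number (e.g. us-east-1);
    # the AZ adds a single trailing letter (e.g. us-east-1a).
    match = re.fullmatch(r"([a-z]+(?:-[a-z]+)+-\d+)([a-z])", az_name)
    if match is None:
        raise ValueError(f"not an Availability Zone name: {az_name!r}")
    return match.group(1), match.group(2)

region, zone = split_az_name("us-east-1a")
print(region, zone)
```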

To host your applications or data pipelines, you need to choose an AWS region. Consider these four main factors:

Latency: choose a region close to where your end users are located to minimize latency;
Cost: the resource costs may differ between regions;
Compliance: certain regulations may require hosting your data in a specific geographic region;
Service availability: not all services are available in all regions.
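One way to think about the four factors: compliance and service availability are hard constraints, while latency and cost are things to trade off. The sketch below encodes that reading; the `RegionOption` class, the tie-breaking rule (latency first, then cost), and all numbers are made-up assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class RegionOption:
    name: str
    latency_ms: float            # latency to end users (made-up numbers below)
    monthly_cost_usd: float      # estimated workload cost in this region
    compliant: bool              # satisfies data-residency regulations
    has_required_services: bool  # every service you need is offered here

def choose_region(options):
    """Filter on hard constraints, then prefer low latency, then low cost."""
    eligible = [o for o in options if o.compliant and o.has_required_services]
    if not eligible:
        raise ValueError("no region satisfies the hard constraints")
    return min(eligible, key=lambda o: (o.latency_ms, o.monthly_cost_usd))

candidates = [
    RegionOption("us-east-1", 120, 900, compliant=False, has_required_services=True),
    RegionOption("eu-west-1", 35, 1000, compliant=True, has_required_services=True),
    RegionOption("eu-central-1", 30, 1100, compliant=True, has_required_services=False),
]
print(choose_region(candidates).name)
```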

AWS Core Services

COMPUTE

EC2 (Amazon Elastic Compute Cloud): The service that provides virtual machines, or VMs, on AWS.

instance type naming: t3a.micro (t: family name, 3: generation, a: optional capabilities, micro: size)
Amazon EC2 Instance types
Amazon EC2 instance type naming conventions
Amazon EC2 billing and purchasing options
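The naming convention above (family, generation, optional capabilities, size) can be sketched as a small parser. The function name and regular expression are my own simplification and will not cover every instance type AWS offers.

```python
import re

def parse_instance_type(instance_type: str) -> dict:
    """Split a name like 't3a.micro' into its parts (simplified pattern)."""
    match = re.fullmatch(r"([a-z]+)(\d+)([a-z-]*)\.([a-z0-9]+)", instance_type)
    if match is None:
        raise ValueError(f"unrecognized instance type: {instance_type!r}")
    family, generation, capabilities, size = match.groups()
    return {
        "family": family,
        "generation": int(generation),
        "capabilities": capabilities,  # empty string when there are none
        "size": size,
    }

print(parse_instance_type("t3a.micro"))
```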

Virtual Machines or servers, where you can run any operating system and applications (as a virtual computer that runs OS)
Each virtual machine is called an EC2 instance (you can use multiple instances for horizontal scaling)
EC2 can be used as a dev machine for programming or to run a web server, container, or ML workload.
AWS Lambda: serverless functions → host code that runs in response to triggers or events.
Container hosting services: Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS)

NETWORK

Whenever you create an EC2 instance or many other types of AWS resources, you need to place it into a network of some kind → Amazon Virtual Private Cloud (VPC)

VPCs are isolated from other networks.

You choose the size of the private IP space.

Partition space into smaller networks called subnetworks or subnets.

Your data and resources don’t leave the region unless you specifically build your solutions to behave that way

→ Whenever you create certain AWS resources, like EC2 instances or instance based databases, you need to select which VPC you want and which AZ you want to place it in
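To make "choose the size of the private IP space" and "partition it into subnets" concrete, here is a small sketch using Python's standard `ipaddress` module. The CIDR block `10.0.0.0/16` and the `/24` subnet size are example values I picked, not anything prescribed by the course.

```python
import ipaddress
from itertools import islice

def plan_subnets(vpc_cidr: str, new_prefix: int, count: int) -> list[str]:
    """Return the first `count` subnets of the given prefix length inside a VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return [str(subnet) for subnet in islice(vpc.subnets(new_prefix=new_prefix), count)]

# Example: carve three /24 subnets (256 addresses each) out of a /16 VPC range.
print(plan_subnets("10.0.0.0/16", 24, 3))
```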

STORAGE: 3 types

Object Storage: most often used for storing unstructured data (logs, documents, photos, videos… or any kind of data) ← Amazon Simple Storage Service (S3)
Block Storage: used for database storage, virtual machine file systems, and other low-latency environments. ← Amazon Elastic Block Store (EBS)
File Storage: (the most familiar type of storage for non tech user) Data is organized into files and directories in a hierarchical structure (like file system on your laptop) ← Amazon Elastic File System (EFS)

DATABASES: use block storage behind the scenes + provide special functionality for managing structured data (complex querying, data indexing,…). In these courses, you’re going to become very familiar with

Amazon Relational Database Service (RDS): a cloud-based relational database service
Amazon Redshift: a data warehouse service that allows you to store, transform, and serve data for end use cases.

SECURITY (ref) → Shared Responsibility Model → AWS is responsible for security OF the cloud (like a skyscraper equipped with plenty of security technology), and you are responsible for security IN the cloud (like you still having to lock your own door and comply with the requirements)

Some resources

AWS getting started guide

Trying services using AWS Free Tier (AWS Free Tier)

Create a billing alarm to monitor your estimated AWS charges

☝

IMPORTANT: Don’t forget to stop or delete any resources when you are not using them to avoid getting billed for them.

EC2 → when an instance is stopped, you no longer pay for compute but still get charged for the EBS volumes attached to the instance.

Account ID + regions

⭐ AWS Certified Data Engineer - Associate Certification | AWS Certification

The Data Engineering Lifecycle

Data Generation in Source Systems

Databases — Relational Databases or NoSQL Databases (Key-Value, Document Stores)

Files — Text, MP3, MP4

API — request and get back data formatted as .xml, .json, etc

Data Sharing Platform — Internal Data User or Third Party

IoT devices (internet of things) — “swarm” of IoT, streaming data

In the real world, source systems are unpredictable

Systems go down
Change in format/schema of data
Change in data

When accessing the source systems:

How are the systems set up?
What kinds of changes should you expect?

It’s good to work directly with source system owners to know: ← good relation is the crucial part of successful DE

How they generate data
How the data may change over time
How the changes will impact the downstream systems

Ingestion

Means moving raw data from source systems into your data pipeline for further processing.

Source systems and data ingestion represent the biggest bottlenecks of DE. ← work with the owners

Frequency of ingestion: how often you need to move data from source systems into your data pipeline.

Batch ingestion: ingest data in batches, e.g., once every hour or day.
Streaming ingestion: ingest data as a constant stream of events in real time. Events include clicks on websites, sensor measurements,…

Data is available to downstream systems a short time after it's produced. ← use tools like an event-streaming platform or a message queue
Costs more than batch ingestion: time, money, maintenance, downtime
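A toy contrast between the two ingestion styles, using only the standard library. The event records, field names, and the hourly window are assumptions for illustration; a real pipeline would read from a source system rather than an in-memory list.

```python
from datetime import datetime
from itertools import groupby

# Made-up click events, already sorted by time.
events = [
    {"ts": "2024-01-01T09:05:00", "action": "click"},
    {"ts": "2024-01-01T09:40:00", "action": "view"},
    {"ts": "2024-01-01T10:15:00", "action": "click"},
]

def hour_of(event):
    """Truncate the event timestamp to the start of its hour."""
    return datetime.fromisoformat(event["ts"]).replace(minute=0, second=0)

def batch_ingest(sorted_events):
    """Batch style: deliver one group of events per hourly window."""
    return [list(group) for _, group in groupby(sorted_events, key=hour_of)]

def stream_ingest(sorted_events):
    """Streaming style: hand over each event as soon as it arrives."""
    for event in sorted_events:
        yield event

print([len(batch) for batch in batch_ingest(events)])
print(next(stream_ingest(events)))
```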

Other considerations: change data capture (CDC), which ingests only the changes made in a source system, and push vs. pull (whether a source system pushes data to you or you actively pull it from the source).

Storage

Raw hardware ingredients:

Solid-state storage (USB drive, SD card, SSD)
Magnetic disk (HDD): backbone of modern data storage systems. 2-3x cheaper than solid-state storage.
RAM (Random Access Memory): faster reads and writes, 30-50x more expensive than solid-state storage, volatile.

In most modern architectures, data will pass through: magnetic → solid state → memory

Storage systems: As a DE, you work with storage systems like database management systems, object storage like S3, Apache Iceberg, cache / memory-based storage, or streaming storage.

Storage Abstractions: combinations of storage systems arranged into storage abstractions like data warehouses, data lakes, and data lakehouses.

Choose configuration params: latency, scalability, cost.

From the bottom to the top: Raw storage ingredients > Storage systems > Storage abstractions.

Queries, Modeling, and Transformation

Recall: a big picture of DE → get raw data, turn it into something useful and then make it available to end users.

Transformation = turn it into something useful!

The transformation stage of the DE lifecycle = queries, modeling, and transformation.

Query: issuing a request to read records from a database or other storage systems. In this course, we focus on SQL.

Poor query: negative impact on the source database, cause row explosion, cause downstream delays,…
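Row explosion is easiest to see on a tiny example: joining on a key that is not unique on one side multiplies rows. The sketch below uses an in-memory SQLite database; the table and column names are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE contacts (customer_id INTEGER, email TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 10), (3, 20);
    -- customer 10 has two contact rows, so customer_id is not unique here
    INSERT INTO contacts VALUES (10, 'a@example.com'), (10, 'b@example.com'), (20, 'c@example.com');
""")

order_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM orders JOIN contacts USING (customer_id)"
).fetchone()[0]

# Each order of customer 10 is repeated once per contact row.
print(order_rows, joined_rows)
```

Any sum or count computed over the joined result would double-count the orders of customer 10, which is how a careless join quietly corrupts downstream reports.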

Data modeling: choosing a coherent structure for your data to make it useful for the business.

Data transformation: Data manipulated, enhanced and saved for downstream use.

Manipulate data at the source, e.g., adding a timestamp,…
At any stage (before, in flight, or after ingestion), e.g., mapping to correct types, standard formats,…
Enrich records with additional fields and calculations,..
Even in the downstream: apply large-scale aggregation for reporting or featurize data for ML.
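A small sketch of the kinds of transformation listed above: mapping to correct types, standardizing a format, adding a timestamp, and enriching with a calculated field. The record layout and field names are hypothetical.

```python
from datetime import datetime, timezone

def transform(record: dict, ingested_at: datetime) -> dict:
    """Clean one raw record (all values arrive as strings)."""
    quantity = int(record["quantity"])        # map to correct types
    unit_price = float(record["unit_price"])
    return {
        "order_id": int(record["order_id"]),
        "country": record["country"].strip().upper(),  # standard format
        "quantity": quantity,
        "unit_price": unit_price,
        "total": round(quantity * unit_price, 2),       # enrich with a calculation
        "ingested_at": ingested_at.isoformat(),         # add a timestamp
    }

raw = {"order_id": "42", "country": " fr ", "quantity": "3", "unit_price": "19.99"}
clean = transform(raw, datetime(2024, 1, 1, tzinfo=timezone.utc))
print(clean)
```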

Serving Data

Final stage of DE Lifecycle.

Analytics: the process of identifying key insights and patterns within data.

3 common forms: business intelligence (BI), operational analytics, embedded analytics.
BI: explore historical and current business data to discover insights.

Operational Analytics: monitoring real-time data for immediate action.

Embedded Analytics (new trend): External or customer-facing analytics. As a DE, your job would be serving real-time and historical data for use in user-facing applications

Machine Learning is treated separately from the other serving use cases because it involves additional complexity.

Reverse ETL (Extract, Transform, Load): take transformed data as well as analytics and perhaps machine learning model output and feed it back into source systems.

The Undercurrents of the Data Engineering Lifecycle

Introduction to the Undercurrents

DE now encompasses far more than just tools and technologies.

Security

Clients trust you with their information and private data. DE must follow set of principles, protocols and best practices.

Principle of Least Privilege: Give users or applications access to only the essential data and resources they need for only the duration required.

Don’t grant or operate with root or superuser permissions when not necessary!
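As an illustration of "only the essential data and resources", here is a sketch that builds an IAM policy document allowing read-only access to a single S3 prefix. The bucket name `example-datalake` and the prefix `marketing` are placeholders I made up; the "only for the duration required" part (e.g., temporary credentials) is not covered here.

```python
import json

def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document that only allows reading objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],  # read objects only: no write, no delete
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }

policy = read_only_policy("example-datalake", "marketing")
print(json.dumps(policy, indent=2))
```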

Data sensitivity: mask sensitive fields (e.g., hide most of the digits of a credit card number,…). Better yet, don't ingest sensitive data into your system in the first place if you don't need it.
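A minimal masking sketch for the credit card example: keep the last few digits visible and hide the rest, leaving separators untouched. The function name and the choice of four visible digits are my own assumptions.

```python
def mask_card_number(card_number: str, visible: int = 4) -> str:
    """Replace every digit except the last `visible` ones with '*'."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_hide = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_hide > 0:
            masked.append("*")
            to_hide -= 1
        else:
            masked.append(ch)  # separators and the visible tail pass through
    return "".join(masked)

print(mask_card_number("4111 1111 1111 1234"))
```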

Security in the Cloud: Identity and Access Management (IAM), Encryption Methods, Networking Protocols.

Security is also about people! → defensive mindset (be cautious with sensitive data, design for potential attacks).

Data Management

DAMA International provides resources for effective data management. Their DAMA-DMBOK guide is a key reference.

Data Management: Plans and practices that optimize data value throughout its lifecycle.

Data Quality: High (accurate, complete, timely) vs Low (inaccurate, incomplete, delayed).

Data Architecture (DA)

DA = roadmap or blueprint for your data systems.

Being able to think like an architect will make you more successful in your role as a DE.

Principle of Good Data Architecture

Choose common components wisely (CC → used across your org)

Plan for failure!

Architect for scalability

Architecture is leadership

Always be architecting (constantly evaluating your systems)

Build loosely coupled systems

Make reversible decisions

Prioritize security (Principle of least privilege, zero-trust principle)

Embrace FinOps (Finance and DataOps/DevOps) → optimize cost and revenue

DataOps

DevOps ← Software dev team (write and test code) & software deployment team (deploy and maintain code). → The DevOps movement has resulted in faster release cycles and enhanced quality for software products.

The same idea applied to data → DataOps: improves the development process and quality of data products. It’s a set of cultural habits and practices: Communication & Collaboration, Continuous Improvement, Rapid Iteration.

DevOps practices ← Agile methodology

Pillars of DataOps:

Automation: CI/CD (Continuous Integration & Continuous Delivery) → example: Airflow

Observability & Monitoring: keep in mind that “Everything fails all the time” (Werner Vogels, CTO of AWS) ← crucial aspect of the data systems you build
Incident Response: As a data engineer, you should be proactively finding issues before they are reported to you by other stakeholders in your organization.

→ Goal: provide high-quality data products.

Orchestration

Pure scheduling: get specific tasks to run automatically at set times.

Problem:

Orchestration Frameworks: Apache Airflow, Dagster, Prefect, Mage.

Automate pipeline with complex dependencies.
Monitor pipeline.
Set up monitoring & alerts.

Directed Acyclic Graph (DAG)
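To show what a DAG buys you, here is a sketch using `graphlib` from the Python standard library (3.9+): tasks run only after their dependencies, and a cycle is rejected. The task names are invented; this is not how Airflow or the other frameworks are implemented, just the underlying idea.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "load": {"transform"},
}

# A valid execution order: every task appears after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

def has_cycle(graph) -> bool:
    """Return True if the dependency graph is not acyclic."""
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False

print(has_cycle({"a": {"b"}, "b": {"a"}}))
```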

Software Engineering

SE: the design, dev, deployment and maintenance of software applications.

SE becomes DE

A DE writes much less code than an SE does, but it's more important than ever that the code you write is of top quality.

Write core data processing code at all stages using SQL, Spark, Kafka.
Languages: Python, Java, Scala, Bash, R, Rust, Go.
In this specialization, we focus on: Python, SQL, Bash.

Practical Examples on AWS (week 2)

The DE Lifecycle on AWS

SOURCE SYSTEMS

Databases:

Amazon Relational Database Service (RDS): MySQL, PostgreSQL.
Amazon DynamoDB: serverless NoSQL database options.

virtually unlimited in their total size
suited for low-latency access to large volumes of data, e.g., gaming, IoT, mobile apps, and real-time analytics
flexible schema

Streaming sources:

Amazon Kinesis Data Streams: set up as a source system streaming real-time user activities from a sales platform log.
Amazon Simple Queue Service (SQS): handle messages when building your own data pipelines outside of these courses.
Apache Kafka ← Amazon Managed Streaming for Apache Kafka (MSK)

INGESTION

From a Database:

AWS Database Migration Service (DMS): can migrate and replicate data from a source to a target in an automated way.
AWS Glue (most in these courses): Offers features that support data integration processes.

From a streaming source: Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon SQS, Amazon MSK.

STORAGE

Traditional data warehouse: Amazon Redshift
Object storage for a data lake: Amazon Simple Storage Service (S3)

→ Combine both: Lakehouse Arrangement (Access structured data in your data warehouse and unstructured data in an object storage data lake)

TRANSFORMATION → data processing tools: AWS Glue, Spark, dbt

SERVING → 2 use cases

Business Intelligence or Analytics

Amazon Athena, Amazon Redshift: for querying structured and unstructured data. Also work with Jupyter notebooks,…
Amazon QuickSight, Superset, Metabase: Dashboarding tools

AI or Machine Learning: serve batch data for model training, and work with some vector database → product recommenders and large language models.

Undercurrents on AWS

Undercurrents aspects on AWS are more conceptual and more tools oriented.

SECURITY

Identity and Access Management (IAM): set up roles and permissions.
Amazon Virtual Private Cloud (VPC), Security Groups (Instance level firewalls)

DATA MANAGEMENT

AWS Glue, AWS Glue Crawler, AWS Glue Data Catalog → discover, create and manage metadata for data stored in Amazon S3 or other storage and database systems.
AWS Lake Formation → Centrally manage and scale fine-grained data access permissions.

DATAOPS

Amazon CloudWatch: Collects metrics and provides monitoring features for cloud resources, applications and on-premises resources.

Amazon CloudWatch Logs: Store and analyze operational logs.

Amazon Simple Notification Service (SNS): Sets up notifications between applications or via text/email that are triggered by events within your system.
Third-party tools: Monte Carlo, Bigeye.

ORCHESTRATION: Airflow (main in industry), Dagster, Prefect, Mage.

ARCHITECTURE: AWS Well-Architected (a set of principles and practices developed by AWS that can help you build systems with an eye towards operational efficiency, security, scalability, and sustainability)

SOFTWARE ENGINEERING

AWS Cloud9 (IDE for devs) hosted on Amazon Elastic Compute Cloud (EC2)
AWS CodeDeploy (automate code deployment)
Git, Github.

⚠️

Make sure to log out of your personal account before practicing the lab in these courses!

Lab Walkthrough

mooc-de/c1-w2 at main · dinhanhthi/mooc-de

The main goal of this lab is to help you get started interacting with a data pipeline on AWS.

Pipeline Scenario

You are a DE working for a retailer of scale models of classic cars and other vehicles.
The retailer stores its data in a relational database.

You’re asked to build a pipeline that transforms the data and serves it to a data analyst on the marketing team.

Data Modeling (course 4): Transform the data into a structure that is easier to understand and faster to query.

In general, what we will do:

Amazon RDS: the source system contains the SQL tables (provided)
Glue ETL: a tool that allows you to ingest data from the source database and apply transformations on the fly to the ingested data

Glue job: connecting to the RDS database → Extracting the raw data + Transforming the data by modeling it using the provided star schema, and finally loading the transformed data into AWS object storage in an S3 bucket

ETL = Extract + Transform + Load
Glue Crawler: crawl over S3 and write metadata to a data catalog.
Amazon Athena: query service to retrieve data from S3.
We can create the last 3 resources (Glue ETL, S3, Glue Crawler) manually in the AWS console or programmatically with Terraform (Infrastructure as Code, IaC). (The Terraform files are given in this lab; we learn more in Course 2.)

It enables users to define and provision infrastructure using a declarative language, describing components without specifying detailed implementation steps.
Introduction to HashiCorp Terraform with Armon Dadgar - YouTube

Jupyter notebook (AWS Cloud9) to perform some DA tasks.

Lab technical notes

mooc-de/c1-w2 at main · dinhanhthi/mooc-de

Cloud9: open the IDE (a VS Code-like environment). Choose the t3.small machine and enable SSH.

Download required resources into IDE (don’t forget to “Allow all cookies”)

aws s3 cp --recursive s3://dlai-data-engineering/labs/c1w2-187976/ ./
# then install
source scripts/setup.sh

Database: AWS Console → AWS RDS → Databases → check the “DB identifier”, eg. de-c1w2-rds

aws rds describe-db-instances --db-instance-identifier de-c1w2-rds --output text --query "DBInstances[].Endpoint.Address"

# returns the endpoint, something like
# de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com

Connect the database / Establish the connection to the RDS instance

mysql --host=de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com --user=admin --password=adminpwrd --port=3306

Check the database

# Don't forget the semicolon ";"
use classicmodels;
show tables;

# exit the sql env
exit;

ETL Process Overview

Extract: AWS Glue Job retrieves data from the OLTP database in RDS.
Transform: Glue reshapes data into a star schema, improving readability and query efficiency for analysts. This may involve denormalization and aggregation.
Load: Transformed data is stored in Amazon S3 as Parquet files, optimized for analytics in data lakes and warehouses.
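The three steps can be sketched end to end on toy data. Everything here is a stand-in: an in-memory SQLite database plays the role of the RDS source, a Python dict plays the role of the S3 data lake, and the table and column names are invented (loosely inspired by classicmodels). The real lab uses a Glue job and writes Parquet files.

```python
import sqlite3

def extract(conn):
    """Extract: read raw rows from the source tables."""
    conn.row_factory = sqlite3.Row
    details = [dict(r) for r in conn.execute("SELECT * FROM orderdetails ORDER BY order_number")]
    products = [dict(r) for r in conn.execute("SELECT * FROM products ORDER BY product_code")]
    return details, products

def transform(details, products):
    """Transform: build one dimension table and one fact table (a tiny star schema)."""
    dim_products = [{"product_key": i, **p} for i, p in enumerate(products, start=1)]
    key_by_code = {p["product_code"]: p["product_key"] for p in dim_products}
    fact_orders = [
        {
            "order_number": d["order_number"],
            "product_key": key_by_code[d["product_code"]],
            "quantity": d["quantity"],
            "order_amount": d["quantity"] * d["price_each"],
        }
        for d in details
    ]
    return {"fact_orders": fact_orders, "dim_products": dim_products}

def load(tables, target):
    """Load: write each table to the target store (a dict standing in for S3)."""
    for name, rows in tables.items():
        target[name] = rows

source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE products (product_code TEXT, product_name TEXT, product_line TEXT);
    CREATE TABLE orderdetails (order_number INTEGER, product_code TEXT, quantity INTEGER, price_each REAL);
    INSERT INTO products VALUES ('S10', 'Model T', 'Vintage Cars');
    INSERT INTO orderdetails VALUES (100, 'S10', 2, 50.0), (101, 'S10', 1, 50.0);
""")

data_lake = {}
load(transform(*extract(source)), data_lake)
print(data_lake["fact_orders"])
```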

Terraform: init → plan → apply

cd infrastructure/terraform
terraform init
terraform plan
terraform apply

plan: Previews infrastructure changes. Terraform analyzes configs, compares desired and current states, and calculates necessary actions.
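The idea of comparing desired and current state can be shown with a toy diff. This is only an illustration of the concept, not Terraform's actual algorithm; the resource names and settings are made up.

```python
def plan(desired: dict, current: dict) -> dict:
    """Compare desired and current state and list the actions needed."""
    actions = {"create": [], "update": [], "delete": []}
    for name, config in desired.items():
        if name not in current:
            actions["create"].append(name)   # declared but does not exist yet
        elif current[name] != config:
            actions["update"].append(name)   # exists with different settings
    for name in current:
        if name not in desired:
            actions["delete"].append(name)   # exists but is no longer declared
    return actions

desired_state = {
    "glue_job": {"workers": 2},
    "s3_bucket": {"versioning": True},
    "glue_crawler": {},
}
current_state = {
    "glue_job": {"workers": 1},
    "old_bucket": {"versioning": False},
}
print(plan(desired_state, current_state))
```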

Check Glue jobs in AWS Glue → ETL jobs → tab “Runs”

# Start the Glue job
aws glue start-job-run --job-name de-c1w2-etl-job | jq -r '.JobRunId'
# returns the JobRunId

# Check the status
aws glue get-job-run --job-name de-c1w2-etl-job --run-id <JobRunID> --output text --query "JobRun.JobRunState"
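The same start-then-check flow can be scripted instead of re-running the status command by hand. The sketch below takes a Glue client as a parameter so it can be tried without AWS access; with boto3 you would pass `boto3.client("glue")`. The function name, the polling interval, and the set of states treated as final are my own choices.

```python
import time

# Job run states treated as final in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}

def run_glue_job(glue, job_name: str, poll_seconds: float = 30) -> str:
    """Start a Glue job run and poll until it reaches a final state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)

# Usage with real AWS credentials (not executed here):
#   import boto3
#   final_state = run_glue_job(boto3.client("glue"), "de-c1w2-etl-job")
```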

In jupyter notebook

# Interact with AWS
import awswrangler as wr

# Interactive widgets
import ipywidgets as widgets

S3 → Buckets → ...-datalake-...

👉

DE by DL.AI - C1 W3&4 - Data Architecture & Translating Requirements to DA