DE by DL.AI - Course 1: Introduction to DE (W1&2)

Anh-Thi Dinh

Information

Introduction to Data Engineering

  • Data-Centric AI: The discipline of systematically engineering the data used to build an AI system.
  • This program is all about frameworks, principles, and getting you to think like a data engineer + building systems on AWS.

Program

  1. Course 1: Intro to DE.
  2. Course 2: Source Systems, Data Ingestion, and Pipelines.
  3. Course 3: Data Storage and Queries.
  4. Course 4: Data Modeling, Transformation, and Serving.

Prerequisite

  • Intermediate Python, Pandas
  • Basic SQL
  • Basic AWS Cloud.

This program

  • What is unique about this program?
    • This program teaches you how to think like a data engineer
    • Hands-on practice.
  • Scenario
    • Most developers focus only on the last stage → wasted time and less effective work
  • First course → a big picture. First week is only about how to think like a DE. No lab, no implementation.

Plan for course 1

  • Week 1: High-level look at the field of DE
    • DE lifecycle
    • History of DE
    • The DE among other stakeholders
    • Business value
    • Translation of stakeholder needs into requirements
  • Week 2: DE lifecycle and undercurrents
  • Week 3: Principles of good data architecture
  • Week 4: Design and build out a data architecture

Software Engineering (SE) → DE

  • Definition (by the author of the book): Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering
    • → Your job is to get raw data from somewhere, turn it into something useful, and then make it available for downstream use cases!

DE lifecycle

History of DE

  • 1960s-1970s: Digital data emerges with computers. Relational databases and SQL are developed.
  • 1980s-1990s: Data warehouses and BI tools emerge. Inmon and Kimball introduce data modeling approaches.
  • Mid-1990s-Early 2000s: Internet boom drives web app growth. MapReduce and Hadoop revolutionize data processing.
  • Late 2000s-2010s: Cloud platforms (AWS, Google Cloud, Azure) transform data applications. Shift to real-time processing and event streaming.
  • Present: Data engineering focuses on scalable systems, cloud-first solutions, and technology integration to serve business goals.
  • The DE among other stakeholders: 2 ways (upstream and downstream)
  • Business Value:
    • Focus on creating business value in data engineering. Don't chase every new technology. Prioritize solutions that deliver tangible benefits to the organization. Ultimately, business value is the driving force behind technological decisions in our industry. (Bill Inmon's advice)
  • System Requirements: Before we start writing any code or spinning up resources on the Cloud
    • The most important step is Requirements Gathering
      Know how to translate high-level goals into requirements

Requirements Gathering Conversation

(mock conversation between a Data Scientist and a DE)
  • DS/DA receive requests from marketing for real-time dashboard and recommendations, but lack direct data access. They must process dumped data, 90% of which is irrelevant, spending 80% of time on formatting (2 days), eliminating real-time capability.
  • Continuous data structure changes further delay the process by 2 days.
  • An automated process for data formatting and handling is needed to allow data scientists to focus on analysis.
  • The DE should clarify marketing's objectives (e.g., "real-time" frequency), identify key requirements, and outline their proposed solution for DS confirmation.

Translate Stakeholder Needs into Specific Requirements

Key Elements of Requirements Gathering
  1. Learn what existing data systems or solutions are in place.
  2. Learn what pain points or problems there are with the existing solutions.
  3. Learn what actions stakeholders plan to take with the data. Tip: Repeat what you learned back to your stakeholders.
  4. Identify any other stakeholders you’ll need to talk to if you’re still missing information.
  • Thinking like a DE: the steps below form a cycle
    • Step 3.3 is crucial to complete before investing too much time in implementation

Data Engineering on the Cloud

  • A high-level mental framework and the habit of thinking like a data engineer are important for everything that follows.
  • As a data engineer, the actual set of tools and technologies you work with could be quite different from one company to the next.
  • Public cloud: AWS, GCP (Google Cloud Platform), MS Azure.
  • Intro to the AWS Cloud
    • Pay as you go pricing
    • IT Resources
    • Advantage of building on cloud
      • Cloud resources are scalable and elastic.
      • No need to worry about the exact storage capacity needed
      • No need to manage the scaling operations.
  • AWS data centers are located all around the world → AWS regions (named after where they are located). ← AWS Global Infrastructure
    • Each region has multiple Availability Zones: if one goes down, the others keep running. ← Regions & Availability Zones
    • A region consists of multiple availability zones and an availability zone contains one or more data centers.
  • Example of names:
    • us-east-1 = the first region created in the eastern US
    • us-east-1a = an availability zone in us-east-1 (Northern Virginia Region)
  • To host your applications or data pipelines, you need to choose an AWS region. Consider these four main factors:
    • Latency: choose a region close to where your end users are located to minimize latency;
    • Cost: the resource costs may differ between regions;
    • Compliance: certain regulations may require hosting your data in a specific geographic region;
    • Service availability: not all services are available in all regions.
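The naming convention and the four selection factors above can be sketched in a few lines of Python. This is only an illustration: the candidate list, its latency and cost numbers, and the region name `ap-example-1` are made up, and `split_az_name` and `choose_region` are hypothetical helpers, not part of any AWS SDK. Compliance and service availability are treated as hard constraints, latency and cost as ranking criteria.

```python
import re

# Hypothetical candidate data: every number and service list below is made up
# for illustration and is NOT real AWS pricing, latency, or availability.
CANDIDATES = [
    {"region": "us-east-1", "latency_ms": 40, "relative_cost": 1.00,
     "services": {"s3", "glue", "athena"}},
    {"region": "eu-west-1", "latency_ms": 110, "relative_cost": 1.08,
     "services": {"s3", "glue", "athena"}},
    {"region": "ap-example-1", "latency_ms": 25, "relative_cost": 1.15,
     "services": {"s3"}},
]


def split_az_name(az_name):
    """Split an AZ name such as 'us-east-1a' into (region, zone_letter)."""
    match = re.fullmatch(r"([a-z]+(?:-[a-z]+)+-\d+)([a-z])", az_name)
    if match is None:
        raise ValueError(f"not an availability zone name: {az_name!r}")
    return match.group(1), match.group(2)


def choose_region(candidates, required_services, allowed_regions=None):
    """Apply the hard constraints first, then rank by latency, then cost."""
    eligible = [
        c for c in candidates
        # Service availability: every required service must exist in the region.
        if required_services <= c["services"]
        # Compliance: if a whitelist is given, the region must be on it.
        and (allowed_regions is None or c["region"] in allowed_regions)
    ]
    if not eligible:
        raise LookupError("no region satisfies the hard constraints")
    best = min(eligible, key=lambda c: (c["latency_ms"], c["relative_cost"]))
    return best["region"]
```

For example, `split_az_name("us-east-1a")` separates the region from the zone letter, and passing `allowed_regions={"eu-west-1"}` to `choose_region` models a regulation that forces data to stay in one geographic area.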

AWS Core Services

  • NETWORK
      • VPCs are isolated from other networks.
      • You choose the size of the private IP space.
      • Partition space into smaller networks called subnetworks or subnets.
      • Your data and resources don’t leave the region unless you specifically build your solutions to behave that way
      → Whenever you create certain AWS resources, like EC2 instances or instance based databases, you need to select which VPC you want and which AZ you want to place it in
  • STORAGE: 3 types
    • Object Storage: most often used for storing unstructured data (logs, documents, photos, videos… or any kind of data) ← Amazon Simple Storage Service (S3)
    • Block Storage: used for database storage, virtual machine file systems, and other low-latency environments. ← Amazon Elastic Block Store (EBS)
    • File Storage: (the most familiar type of storage for non-technical users) Data is organized into files and directories in a hierarchical structure (like the file system on your laptop) ← Amazon Elastic File System (EFS)
  • DATABASES: use block storage behind the scenes + provide special functionality for managing structured data (complex querying, data indexing,…). In these courses, you’re going to become very familiar with
  • SECURITY: Shared Responsibility Model → AWS is responsible for security OF the cloud (like a skyscraper equipped with plenty of security technology), and you are responsible for security IN the cloud (like you still having to lock your own door + comply with the requirements)
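To make the NETWORK notes above more concrete (choosing a private IP space and partitioning it into subnets), here is a minimal sketch using Python's standard `ipaddress` module. The CIDR block `10.0.0.0/16`, the `/24` subnet size, and the helper name `partition_vpc` are assumptions chosen for illustration; the sketch only does the address arithmetic and does not create any AWS resources.

```python
import ipaddress


def partition_vpc(vpc_cidr, subnet_prefix, zones):
    """Carve one subnet per availability zone out of a VPC-sized address block."""
    network = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    subnets = network.subnets(new_prefix=subnet_prefix)
    return {zone: str(subnet) for zone, subnet in zip(zones, subnets)}


# Hypothetical layout: one /24 subnet in each of two availability zones.
layout = partition_vpc("10.0.0.0/16", 24, ["us-east-1a", "us-east-1b"])
```

Placing one subnet in each availability zone mirrors the note that you pick both a VPC and an AZ when you create resources such as EC2 instances.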
 

Some resources

IMPORTANT: Don’t forget to stop or delete any resources when you are not using them to avoid getting billed for them.
  • EC2 → a stopped instance still gets charged for the EBS volumes attached to it.
  • Account ID + regions

The Data Engineering Lifecycle

Data Generation in Source Systems

  • Databases — Relational Databases or NoSQL Databases (Key-Value, Document Stores)
  • Files — Text, MP3, MP4
  • API — request and get back data formatted as .xml, .json, etc
  • Data Sharing Platform — Internal Data User or Third Party
  • IoT devices (internet of things) — “swarm” of IoT, streaming data
  • In the real world, source systems are unpredictable:
    • Systems go down
    • Change in format/schema of data
    • Change in data
  • When accessing the source systems:
    • How are the systems set up?
    • What kinds of changes should you expect?
  • It’s good to work directly with source system owners to know: ← a good relationship is a crucial part of successful DE
    • How they generate data
    • How the data may change over time
    • How the changes will impact the downstream systems
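Because source schemas can change without warning, a pipeline often checks incoming records before passing them downstream. The sketch below is one simple way to do that in plain Python; the field names in `EXPECTED_SCHEMA` and the helper `schema_problems` are hypothetical and not taken from the course.

```python
# Hypothetical schema agreed with the source system owner.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}


def schema_problems(record, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable schema problems found in one record."""
    problems = []
    for field, expected_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # Fields the source started sending that nobody told us about.
    for field in sorted(record.keys() - expected.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

An empty list means the record matches the agreed schema; anything else is a signal to talk to the source system owner before the change breaks downstream systems.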

Ingestion

Means moving raw data from source systems into your data pipeline for further processing.
  • Source systems and data ingestion represent the biggest bottlenecks of DE. ← work with the owners
  • Frequency of ingestion: how often you need to move data from source systems into your data pipeline.
    • Batch ingestion: In batches, once every hour or day
    • Streaming ingestion: Ingest data as a constant stream of events in real time. Events like clicks on websites, sensor measurements,…
      • Data is available to downstream systems a short time after it's produced. ← use tools like an event-streaming platform or a message queue
      • Costs more than batch ingestion: time, money, maintenance, downtime
  • Push vs. pull: does a source system push data to you, or will you actively pull it from the source? Change data capture (CDC) is a related ingestion pattern in which ingestion is triggered by changes in the source data.
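The difference between batch and streaming ingestion can be sketched with plain Python iterables. This is a toy model under stated assumptions: real batch jobs are usually triggered on a schedule rather than by a fixed batch size, and real streams arrive through an event-streaming platform or a message queue. Both function names are hypothetical.

```python
from itertools import islice


def ingest_in_batches(events, batch_size):
    """Batch ingestion: group events and hand them downstream one batch at a time."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest_as_stream(events, handle):
    """Streaming ingestion: hand each event downstream as soon as it arrives."""
    count = 0
    for event in events:
        handle(event)
        count += 1
    return count
```

With batching, downstream consumers wait until a batch is full (or, in practice, until the schedule fires); with streaming, `handle` sees each event right away, which is what makes data available shortly after it is produced.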

Storage

  • Raw hardware ingredients:
    • Solid-state storage (USB drives, SD cards, SSDs)
    • Magnetic disk (HDD): backbone of modern data storage systems. 2-3x cheaper than solid-state storage.
    • RAM (Random Access Memory): faster read and write, 30-50x more expensive than solid-state, volatile.
    • In most modern architectures, data will pass through: magnetic → solid state → memory
  • Storage systems: As a DE, you work with storage systems like Database Management Systems, Object Storage like S3, Apache Iceberg, Cache / Memory-based Storage, or Streaming Storage.
  • Storage Abstractions: combinations of storage systems arranged into storage abstractions such as a data warehouse, data lake, or data lakehouse.
    • Choose configuration params: latency, scalability, cost.
  • From the bottom to the top: Raw storage ingredients > Storage systems > Storage abstractions.

Queries, Modeling, and Transformation

  • Recall: a big picture of DE → get raw data, turn it into something useful and then make it available to end users.
  • Transformation = turn it into something useful!
  • The transformation stage of the DE lifecycle = queries, modeling, and transformation.
  • Query: issuing a request to read records from a database or other storage systems. In this course, we focus on SQL.
    • Poorly written queries: can negatively impact the source database, cause row explosion, cause downstream delays,…
  • Data modeling: choosing a coherent structure for your data to make it useful for the business.
  • Data transformation: data is manipulated, enhanced, and saved for downstream use.
    • At the source: manipulate the data, e.g. by adding a timestamp,…
    • At any stage (before, in flight, or after ingestion): e.g. map fields to the correct types and standard formats,…
    • Enrich records with additional fields and calculations,…
    • Even downstream: apply large-scale aggregation for reporting or featurize data for ML.
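The transformation steps listed above (adding a timestamp, mapping to correct types and standard formats, enriching with calculated fields, aggregating downstream) can be illustrated with a small Python sketch. The record layout (`product`, `quantity`, `unit_price`) and both helper names are assumptions made for this example only.

```python
from collections import defaultdict
from datetime import datetime, timezone


def clean_record(raw, ingested_at):
    """Map raw string fields to proper types and enrich the record."""
    quantity = int(raw["quantity"])            # map to the correct type
    unit_price = float(raw["unit_price"])
    return {
        "product": raw["product"].strip().lower(),   # standard format
        "quantity": quantity,
        "unit_price": unit_price,
        "total": round(quantity * unit_price, 2),    # enrichment: calculated field
        "ingested_at": ingested_at.isoformat(),      # added timestamp
    }


def total_by_product(records):
    """Downstream aggregation for reporting: sum the totals per product."""
    totals = defaultdict(float)
    for record in records:
        totals[record["product"]] += record["total"]
    return dict(totals)


# Hypothetical raw record, as it might arrive from a source system.
example = clean_record(
    {"product": " Car ", "quantity": "2", "unit_price": "10.50"},
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the function, keeps the transformation deterministic and easy to test.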

Serving Data

  • Final stage of DE Lifecycle.
  • Analytics: the process of identifying key insights and patterns within data.
    • 3 common forms: business intelligence (BI), operational analytics, embedded analytics.
    • BI: explore historical and current business data to discover insights.
    • Operational Analytics: monitoring real-time data for immediate action.
    • Embedded Analytics (new trend): External or customer-facing analytics. As a DE, your job would be serving real-time and historical data for use in user-facing applications
  • Machine Learning will be treated separately from other serving use cases because it involves additional complexities.
  • Reverse ETL (Extract, Transform, Load): take transformed data as well as analytics and perhaps machine learning model output and feed it back into source systems.

The Undercurrents of the Data Engineering Lifecycle

Introduction to the Undercurrents

DE now encompasses far more than just tools and technologies.

Security

  • Clients trust you with their information and private data. A DE must follow a set of principles, protocols, and best practices.
  • Principle of Least Privilege: Give users or applications access to only the essential data and resources they need for only the duration required.
  • Don’t grant or operate with root or superuser permissions when not necessary!
  • Data sensitivity (e.g. hide most of the digits of a credit card number,…). Avoid ingesting the full data (with sensitive information) into your system in the first place.
  • Security in the Cloud: Identity and Access Management (IAM), Encryption Methods, Networking Protocols.
  • Security is also about people! → defensive mindset (be cautious with sensitive data, design for potential attacks).
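The data sensitivity point above can be sketched as two small helpers: one that hides all but the last digits of a card number, and one that drops sensitive fields before a record enters your system. Both functions and the field names are hypothetical illustrations, not a complete or compliant security control.

```python
def mask_card_number(card_number, visible=4):
    """Replace all but the last `visible` digits of a card number with '*'."""
    digits = [ch for ch in card_number if ch.isdigit()]
    if len(digits) <= visible:
        raise ValueError("card number is too short to mask")
    return "*" * (len(digits) - visible) + "".join(digits[-visible:])


def drop_sensitive_fields(record, sensitive_fields):
    """Return a copy of the record without the fields that should never be ingested."""
    return {key: value for key, value in record.items() if key not in sensitive_fields}
```

Dropping the field at ingestion time is the stronger option, since data that never enters the system cannot leak from it; masking is for cases where part of the value is still needed downstream.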

Data Management

  • DAMA International provides resources for effective data management. Their DAMA-DMBOK guide is a key reference.
  • Data Management: Plans and practices that optimize data value throughout its lifecycle.
  • Data Quality: High (accurate, complete, timely) vs Low (inaccurate, incomplete, delayed).
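Two of the data quality dimensions above, completeness and timeliness, are easy to express as checks. The following is a rough sketch with hypothetical helper names and thresholds; accuracy usually needs domain-specific rules and is not covered here.

```python
from datetime import datetime, timedelta, timezone


def completeness(records, required_fields):
    """Fraction of records in which every required field is present and not None."""
    if not records:
        return 0.0
    complete = sum(
        all(record.get(field) is not None for field in required_fields)
        for record in records
    )
    return complete / len(records)


def is_timely(last_updated, now, max_age):
    """True if the data was refreshed within the allowed age."""
    return now - last_updated <= max_age
```

Checks like these can run after each ingestion so that low-quality data is flagged before stakeholders see it.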

Data Architecture (DA)

  • DA = roadmap or blueprint for your data systems.
  • Being able to think like an architect will make you more successful in your role as a DE.
  • Principles of Good Data Architecture
      1. Choose common components wisely (CC → used across your org)
      2. Plan for failure!
      3. Architect for scalability
      4. Architecture is leadership
      5. Always be architecting (constantly evaluating your systems)
      6. Build loosely coupled systems
      7. Make reversible decisions
      8. Prioritize security (Principle of least privilege, zero-trust principle)
      9. Embrace FinOps (Finance and DataOps/DevOps) → optimize cost and revenue

DataOps

  • DevOps ← Software dev team (write and test code) & software deployment team (deploy and maintain code). → The DevOps movement has resulted in faster release cycles and enhanced quality for software products.
  • Similar idea as DevOps when data comes in → DataOps: improves the dev process and quality of data products. It’s a set of cultural habits and practices: Communication & Collaboration, Continuous Improvement, Rapid Iteration.
  • DevOps practices ← Agile methodology
  • Pillars of DataOps:
    • Automation: CI/CD (Continuous Integration & Continuous Delivery) → example: Airflow
    • Observability & Monitoring: keep in mind that “Everything fails all the time” (Werner Vogels, CTO of Amazon) ← crucial aspect of the data systems you build
    • Incident Response: As a data engineer, you should be proactively finding issues before they are reported to you by other stakeholders in your organization.
    • Goal: provide high-quality data products.

Orchestration

  • Pure scheduling: get some specific tasks to run automatically.
  • Problem:
  • Directed Acyclic Graph (DAG)
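A DAG lets an orchestrator run each task only after the tasks it depends on have finished. As a minimal sketch of that idea, Python's standard `graphlib` module (Python 3.9+) can compute a valid run order. The task names in `PIPELINE` are hypothetical, and this is not how any particular orchestrator such as Airflow is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "crawl_metadata": {"load"},
    "refresh_dashboard": {"crawl_metadata"},
}


def run_order(dag):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def is_valid_dag(dag):
    """A dependency graph with a cycle can never be scheduled."""
    try:
        run_order(dag)
    except CycleError:
        return False
    return True
```

The "acyclic" part matters: if `a` depends on `b` and `b` depends on `a`, no valid order exists, which is why `is_valid_dag` rejects such a graph.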

Software Engineering

  • SE: the design, dev, deployment and maintenance of software applications.
  • SE becomes DE
  • A DE writes much less code than an SE does, but it's more important than ever that you can write great code and that the code you write is of top quality.
    • Write core data processing code at all stages using SQL, Spark, Kafka.
    • Languages: Python, Java, Scala, Bash, R, Rust, Go.
    • In this specialization, we focus on: Python, SQL, Bash.

Practical Examples on AWS (week 2)

The DE Lifecycle on AWS

  • STORAGE
    • Traditional data warehouse: Amazon Redshift
    • Object storage for a data lake: Amazon Simple Storage Service (S3)
    • → Combine both: Lakehouse Arrangement (Access structured data in your data warehouse and unstructured data in an object storage data lake)
  • SERVING → 2 use cases
    • Business Intelligence or Analytics
    • AI or Machine Learning: serve batch data for model training, and work with some vector database → product recommenders and large language models.

Undercurrents on AWS

  • Undercurrents aspects on AWS are more conceptual and more tools oriented.
  • ARCHITECTURE: AWS Well-Architected (a set of principles and practices developed by AWS that can help you build systems with an eye towards operational efficiency, security, scalability, and sustainability)
⚠️
Make sure to log out of your personal account before practicing the lab in these courses!

Lab Walkthrough

  • The main goal of this lab is to help you get started interacting with a data pipeline on AWS.
  • Pipeline Scenario
    • You are a DE working with a retailer of scale models of classic cars and other vehicles.
    • Customer stores data in a relational database.
    • You’re asked to build a pipeline to transform the data and serve it to a Data Analyst on the marketing team.
  • Data Modeling (course 4): Transform the data into a structure that is easier to understand and faster to query.
  • In general, what we will do:
    • Amazon RDS: the source system that contains the SQL tables (provided)
    • Glue ETL: a tool that allows you to ingest data from the source database and apply transformations on the fly to the ingested data
      • Glue job: connecting to the RDS database → Extracting the raw data + Transforming the data by modeling it using the provided star schema, and finally loading the transformed data into AWS object storage in an S3 bucket
    • ETL = Extract + Transform + Load
    • Glue Crawler: crawl over S3 and write metadata to a data catalog.
    • Amazon Athena: query service to retrieve data from S3.
    • We can create the bottom 3 resources (Glue ETL, S3, Glue Crawler) manually using the AWS console or programmatically using Terraform (Infrastructure as Code, IaC). (The Terraform files are given, and we learn more in Course 2.)
    • Jupyter notebook (AWS Cloud9) to perform some DA tasks.
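To get a feel for what the Glue job does (extract from the relational source, transform into a star schema, load the result), here is a self-contained sketch that imitates the three steps with Python's standard library. It is only an analogy under loud assumptions: an in-memory SQLite database stands in for RDS, CSV text stands in for Parquet files on S3, and the table and column names are simplified inventions, not the lab's actual schema.

```python
import csv
import io
import sqlite3


def build_source():
    """Create a tiny in-memory stand-in for the OLTP source database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
        CREATE TABLE order_details (order_id INTEGER, product TEXT, quantity INTEGER, price REAL);
        INSERT INTO customers VALUES (1, 'Atelier Mini', 'France'), (2, 'Toy Depot', 'USA');
        INSERT INTO orders VALUES (10, 1, '2024-01-05'), (11, 2, '2024-01-06');
        INSERT INTO order_details VALUES
            (10, 'classic car', 2, 50.0), (10, 'truck', 1, 80.0), (11, 'classic car', 3, 50.0);
    """)
    return conn


def extract_and_transform(conn):
    """Extract from the source and reshape into one dimension and one fact table."""
    dim_customers = conn.execute(
        "SELECT customer_id, name, country FROM customers ORDER BY customer_id"
    ).fetchall()
    # The fact table joins orders with their line items and adds a computed amount.
    fact_orders = conn.execute("""
        SELECT o.order_id, o.customer_id, o.order_date,
               d.product, d.quantity, d.quantity * d.price AS amount
        FROM orders AS o
        JOIN order_details AS d ON d.order_id = o.order_id
        ORDER BY o.order_id, d.product
    """).fetchall()
    return dim_customers, fact_orders


def load(rows, header):
    """Serialize rows as CSV text; the lab writes Parquet files to S3 instead."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `extract_and_transform(build_source())` yields a customer dimension and an order fact table, which is the shape an analyst can query with simple joins and aggregations.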

Lab technical notes

  • Cloud9: to open the IDE (a VS Code-like environment). Choose machine t3.small and enable SSH.
  • Download required resources into IDE (don’t forget to “Allow all cookies”)
    • aws s3 cp --recursive s3://dlai-data-engineering/labs/c1w2-187976/ ./
      # then install
      source scripts/setup.sh
  • Database: AWS Console → AWS RDS → Databases → check the “DB identifier”, eg. de-c1w2-rds
    • aws rds describe-db-instances --db-instance-identifier de-c1w2-rds --output text --query "DBInstances[].Endpoint.Address"

      # return the endpoint, something like
      # de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com
  • Connect the database / Establish the connection to the RDS instance
    • mysql --host=de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com --user=admin --password=adminpwrd --port=3306
  • Check the database
    • # Don't forget the semicolon ";"
      use classicmodels;
      show tables;

      # exit the sql env (MySQL replies with "Bye")
      exit;
  • ETL Process Overview
    • Extract: AWS Glue Job retrieves data from the OLTP database in RDS.
    • Transform: Glue reshapes data into a star schema, improving readability and query efficiency for analysts. This may involve denormalization and aggregation.
    • Load: Transformed data is stored in Amazon S3 as Parquet files, optimized for analytics in data lakes and warehouses.
  • Terraform: init → plan → apply
    • cd infrastructure/terraform
      terraform init
      terraform plan
      terraform apply
    • plan: Previews infrastructure changes. Terraform analyzes configs, compares desired and current states, and calculates necessary actions.
  • Check Glue jobs in AWS Glue → ETL jobs → tab “Runs”
    • # Start the Glue job
      aws glue start-job-run --job-name de-c1w2-etl-job | jq -r '.JobRunId'
      # return JobRunID

      # Check the status
      aws glue get-job-run --job-name de-c1w2-etl-job --run-id <JobRunID> --output text --query "JobRun.JobRunState"
  • In jupyter notebook
    • # Interact with AWS
      import awswrangler as wr

      # Interactive widgets
      import ipywidgets as widgets
  • S3 → Buckets → ...-datalake-...
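Instead of re-running the status command for the Glue job by hand, the check can be wrapped in a small polling loop. The sketch below is an assumption-laden illustration: `wait_for_job` and `glue_state_getter` are hypothetical helpers, the set of terminal states reflects my understanding of Glue's `JobRunState` values and should be checked against the AWS documentation, and the boto3 part needs boto3 installed plus valid AWS credentials.

```python
import time

# Assumed terminal values of Glue's JobRunState; verify against the AWS docs.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job(get_state, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Call get_state() repeatedly until the job reaches a terminal state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


def glue_state_getter(job_name, run_id):
    """Build a get_state callable backed by boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the polling logic can be tested without AWS

    glue = boto3.client("glue")

    def get_state():
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        return response["JobRun"]["JobRunState"]

    return get_state
```

In the lab environment this could be used as `wait_for_job(glue_state_getter("de-c1w2-etl-job", run_id))`, where `run_id` is the JobRunId returned when the job was started. Keeping the state lookup behind a callable means the loop itself can be exercised with a fake state sequence.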