DE by DL.AI - Course 1: Introduction to DE (W1&2)

Anh-Thi Dinh

Information

Introduction to Data Engineering

  • Data-Centric AI: The discipline of systematically engineering the data used to build an AI system.
  • This program is about frameworks and principles: learning to think like a data engineer, plus building systems on AWS.

Program

  1. Course 1: Intro to DE
  2. Course 2: Source Systems, Data Ingestion, and Pipelines.
  3. Course 3: Data Storage and Queries.
  4. Course 4: Data Modeling, Transformation, and Serving.

Prerequisite

  • Intermediate Python, Pandas
  • Basic SQL
  • Basic AWS Cloud.

This program

  • What is unique about this program?
    • This program teaches you how to think like a data engineer
    • Hands-on practice.
  • Scenario
    • Most developers focus only on the last stage → wasted time and less effective work
  • First course → a big picture. First week is only about how to think like a DE. No lab, no implementation.

Plan for course 1

  • Week 1: High-level look at the field of DE
    • DE lifecycle
    • History of DE
    • The DE among other stakeholders
    • Business value
    • Translation of stakeholder needs into requirements
  • Week 2: DE lifecycle and undercurrents
  • Week 3: Principles of good data architecture
  • Week 4: Design and build out a data architecture

Software Engineering (SE) → DE

  • Definition (by the author of the book): Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering
    • → Your job is to get raw data from somewhere, turn it into something useful, and then make it available for downstream use cases!

DE lifecycle

History of DE

  • 1960s: The advent of computers marks the beginning of digital data. Computerized databases are introduced.
  • 1970s: Relational databases emerge, leading to the development of SQL (Structured Query Language) by IBM.
  • 1980s: The first data warehouse is developed by Bill Inmon, enabling data transformation for analytical decision making.
  • 1990s: Dedicated tools and data pipelines for reporting and business intelligence are developed. Data modeling approaches for analytics, such as Ralph Kimball and Bill Inmon's approaches, are introduced.
  • Mid-1990s: The Internet goes mainstream, leading to the growth of web applications and the need for backend systems like servers, databases, and storage solutions.
  • Early 2000s: The dotcom boom and subsequent bust highlight the need for handling large volumes of data. Google's publication on MapReduce inspires the development of Apache Hadoop by Yahoo, revolutionizing data technologies.
  • Late 2000s: Amazon creates Amazon Web Services (AWS), offering scalable computing and storage solutions. Public cloud platforms like AWS, Google Cloud, and Microsoft Azure become popular, transforming the way data applications are developed and deployed.
  • 2010s: The transition from batch computing to event streaming enables handling real-time data. The term "big data" loses momentum as data processing becomes more accessible and every company aims to derive value from their data.
  • Present: Data engineering plays a crucial role in building powerful, scalable data systems using tools and technologies developed by pioneers. Cloud-first, open-source, and third-party products simplify working with data at scale. Data engineering is increasingly focused on interoperation and connecting technologies to serve business goals.
  • The DE among other stakeholders: 2 ways (upstream and downstream)
  • Business Value:
    • "I'm going to give them the same advice as if they were a bank robber: go to where the money is. If you want to have long-term, great success in our industry, find business value. Don't get hung up on every technology that comes out. Every newfangled thing that comes out, go to where there's business value. Because at the end of the day, business value drives everything we do in technology." (Bill Inmon’s advice)
  • System Requirements: Before we start writing any code or spinning up resources on the Cloud
    • The most important step is Requirements Gathering
      Know how to translate high-level goals into requirements.

Requirements Gathering Conversation

(mock conversation between a Data Scientist and a DE)
  • The DS/DA receive requests from the marketing team, who want dashboards and recommendations in real time. However, the DS are not allowed to access the data directly; they are only permitted to dump the data and process it themselves. 90% of that data is useless to the DS, and the DS spend 80% of their time getting the data into the correct format (which takes two days) → there is no longer any real-time capability!
  • On top of that, the data structure keeps changing → it takes another 2 days → ….
  • I wish there were a process that could automate the formatting and handling of this data so that data scientists could focus on their core expertise: analysis.
  • The Data Engineer should ask for more clarity on the marketing team’s objectives (does "real time" mean hourly / daily /… ?), what they really want, and what the Data Scientists really need. Then, the DE should summarize the key points and explain what they will and can do (input and output) for the Data Scientists to confirm one last time.

Translate Stakeholder Needs into Specific Requirements

Key Elements of Requirements Gathering
  1. Learn what existing data systems or solutions are in place.
  2. Learn what pain points or problems there are with the existing solutions.
  3. Learn what actions stakeholders plan to take with the data. Tip: Repeat what you learned back to your stakeholders.
  4. Identify any other stakeholders you’ll need to talk to if you’re still missing information.
  • Thinking like a DE: below steps are in a circle
    • Step 3.3 is crucial to complete before investing too much time in implementation

Data Engineering on the Cloud

  • High level mental framework and way of thinking like a data engineer is important for everything that follows
  • As a data engineer, the actual set of tools and technologies you work with could be quite different from one company to the next.
  • Public cloud: AWS, GCP (Google Cloud Platform), MS Azure.
  • Intro to the AWS Cloud
    • Pay as you go pricing
    • IT Resources
    • Advantage of building on cloud
      • Cloud resources are scalable and elastic.
      • No need to worry about the exact storage capacity needed
      • No need to manage the scaling operations.
  • AWS data centers are all around the world → AWS regions (their names are the same as where they are located). ← AWS Global Infrastructure
    • Each region has multiple Availability Zones: if one goes down, the others keep running. ← Regions & Availability Zones
    • A region consists of multiple availability zones and an availability zone contains one or more data centers.
  • Example of names:
    • us-east-1 = the first one created in the eastern US
    • us-east-1a = an availability zone in us-east-1 (Northern Virginia Region)
  • To host your applications or data pipelines, you need to choose an AWS region. Consider these four main factors:
    • Latency: choose a region close to where your end users are located to minimize latency;
    • Cost: the resource costs may differ between regions;
    • Compliance: certain regulations may require hosting your data in a specific geographic region;
    • Service availability: not all services are available in all regions.
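The four factors above can be sketched as a small selection routine. This is only an illustration: the `RegionOption` type, the latency and cost figures, and the rule "filter on compliance and service availability first, then prefer low latency, then low cost" are all assumptions made for the example, not an AWS API or a recommendation from the course.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RegionOption:
    """Hypothetical description of one candidate region."""
    name: str                    # e.g. "eu-west-1"
    latency_ms: float            # latency measured from your end users (made-up numbers below)
    relative_cost: float         # cost relative to some baseline region (made-up numbers below)
    allowed_by_compliance: bool  # may your data legally live here?
    has_required_services: bool  # are all services you need offered here?


def region_of(az_name: str) -> str:
    """Drop the trailing zone letter of a standard AZ name: 'us-east-1a' -> 'us-east-1'."""
    if not az_name or not az_name[-1].isalpha():
        raise ValueError(f"not a standard availability zone name: {az_name!r}")
    return az_name[:-1]


def choose_region(options: Iterable[RegionOption]) -> RegionOption:
    """Treat compliance and service availability as hard constraints,
    then pick the lowest latency, breaking ties on cost."""
    eligible = [o for o in options if o.allowed_by_compliance and o.has_required_services]
    if not eligible:
        raise ValueError("no region satisfies the hard constraints")
    return min(eligible, key=lambda o: (o.latency_ms, o.relative_cost))


# Example candidates with invented numbers.
candidates = [
    RegionOption("us-east-1", 80.0, 1.00, allowed_by_compliance=False, has_required_services=True),
    RegionOption("eu-west-1", 25.0, 1.10, allowed_by_compliance=True, has_required_services=True),
    RegionOption("eu-central-1", 20.0, 1.15, allowed_by_compliance=True, has_required_services=False),
    RegionOption("eu-west-3", 30.0, 1.12, allowed_by_compliance=True, has_required_services=True),
]
best = choose_region(candidates)
print(best.name)
```

In practice the weighting between latency and cost is a business decision; the ordering used here is just one possible choice.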

AWS Core Services

  • NETWORK
      • VPCs are isolated from other networks.
      • You choose the size of the private IP space.
      • Partition space into smaller networks called subnetworks or subnets.
      • Your data and resources don’t leave the region unless you specifically build your solutions to behave that way
      → Whenever you create certain AWS resources, like EC2 instances or instance based databases, you need to select which VPC you want and which AZ you want to place it in
  • STORAGE: 3 types
    • Object Storage: most often used for storing unstructured data (logs, documents, photos, videos… or any kind of data) ← Amazon Simple Storage Service (S3)
    • Block Storage: used for database storage, virtual machine file systems, and other low-latency environments. ← Amazon Elastic Block Store (EBS)
    • File Storage: (the most familiar type of storage for non tech user) Data is organized into files and directories in a hierarchical structure (like file system on your laptop) ← Amazon Elastic File System (EFS)
  • DATABASES: use block storage behind the scenes + provide special functionality for managing structured data (complex querying, data indexing,…). In these courses, you’re going to become very familiar with
  • SECURITY (ref): Shared Responsibility Model → AWS is responsible for security OF the cloud (like a skyscraper equipped with lots of security technology), and you are responsible for security IN the cloud (like you still having to lock your own door + comply with the requirements)
 

Some resources

IMPORTANT: Don’t forget to stop or delete any resources when you are not using them to avoid getting billed for them.
  • EC2 → when an instance is stopped, you only get charged for the EBS volumes attached to the instance.
  • Account ID + regions

The Data Engineering Lifecycle

Data Generation in Source Systems

  • Databases — Relational Databases or NoSQL Databases (Key-Value, Document Stores)
  • Files — Text, MP3, MP4
  • API — request and get back data formatted as .xml, .json, etc
  • Data Sharing Platform — Internal Data User or Third Party
  • IoT devices (internet of things) — “swarm” of IoT, streaming data
  • In the real world, source systems are unpredictable systems
    • Systems go down
    • Change in format/schema of data
    • Change in data
  • When accessing the source systems:
    • How are the systems set up?
    • What kinds of changes should you expect?
  • It’s good to work directly with source system owners to know: ← good relation is the crucial part of successful DE
    • How they generate data
    • How the data may change over time
    • How the changes will impact the downstream systems

Ingestion

Means moving raw data from source systems into your data pipeline for further processing.
  • Source systems and data ingestion represent the biggest bottlenecks of DE. ← work with the owners
  • Frequency of ingestion: how often you need to move data from source systems into your data pipeline.
    • Batch ingestion: ingest data in batches, e.g. once every hour or day.
    • Streaming ingestion: ingest data as a constant stream of events in real time. Events like clicks on websites, sensor measurements,…
      • Data is available to downstream systems a short time after it's produced. ← use tools like an event-streaming platform or a message queue
      • Costs more than batch ingestion: time, money, maintenance, downtime
  • Other considerations: change data capture (CDC), i.e. ingestion triggered by changes in the source data, and push vs pull: whether a source system pushes data to you or you actively pull it from the source.
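The difference between batch and streaming ingestion can be sketched in a few lines. This is a simplified illustration: `source_events` is a made-up stand-in for a source system, and batches are cut by size rather than by time (hourly, daily) to keep the example short.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, object]


def source_events() -> Iterator[Event]:
    """Hypothetical source system emitting five click events."""
    for i in range(5):
        yield {"event_id": i, "type": "click"}


def ingest_streaming(events: Iterable[Event]) -> Iterator[Event]:
    """Streaming: pass each event downstream as soon as it arrives."""
    for event in events:
        yield event


def ingest_batch(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Batch: accumulate events and pass them downstream in chunks."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


streamed = list(ingest_streaming(source_events()))
batches = list(ingest_batch(source_events(), batch_size=2))
print(len(streamed), [len(b) for b in batches])  # prints: 5 [2, 2, 1]
```

With streaming, downstream consumers see five separate deliveries; with batching they see three, and the first event waits until its batch is full.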

Storage

  • Raw hardware ingredients:
    • Solid-state storage (usb, sd card, ssd)
    • Magnetic disk (hdd): backbone of modern data storage systems. 2-3x cheaper than solid-state.
    • RAM (Random Access Memory): faster read and write, 30-50x more expensive than solid-state, volatile.
    • In most modern architectures, data will pass through: magnetic → solid state → memory
  • Storage systems: As a DE, you work with storage systems like Database Management Systems, Object Storage like S3, Apache Iceberg, Cache / Memory-based Storage or Streaming Storage.
  • Storage abstractions: combinations of storage systems arranged into storage abstractions like
    • Choose configuration params: latency, scalability, cost.
  • From the bottom to the top: Raw storage ingredients > Storage systems > Storage abstractions.
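As a quick piece of arithmetic on the cost ratios noted above (magnetic disk 2-3x cheaper than solid-state, RAM 30-50x more expensive than solid-state), the implied RAM-to-magnetic ratio can be worked out as follows. Normalising the solid-state price to 1 is an assumption made purely for the calculation; no real prices are involved.

```python
from fractions import Fraction

# Normalise solid-state cost per GB to 1 (an arbitrary unit for this calculation).
SSD = Fraction(1)
HDD_RANGE = (SSD / 3, SSD / 2)    # magnetic disk: 2-3x cheaper than solid-state
RAM_RANGE = (30 * SSD, 50 * SSD)  # RAM: 30-50x more expensive than solid-state


def ratio_range(numerator_range, denominator_range):
    """Smallest and largest possible ratio between two cost ranges."""
    low = numerator_range[0] / denominator_range[1]
    high = numerator_range[1] / denominator_range[0]
    return low, high


ram_vs_hdd = ratio_range(RAM_RANGE, HDD_RANGE)
print(f"RAM costs roughly {ram_vs_hdd[0]}x to {ram_vs_hdd[1]}x as much as magnetic disk")
```

Under those figures RAM comes out at roughly 60x to 150x the cost of magnetic disk, which is one reason data mostly rests on cheaper tiers and only passes through memory.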

Queries, Modeling, and Transformation

  • Recall: a big picture of DE → get raw data, turn it into something useful and then make it available to end users.
  • Transformation = turn it into something useful!
  • DE Life cycle transformation = query, modeling and transformation.
  • Query: issuing a request to read records from a database or other storage systems. In this course, we focus on SQL.
    • Poor query: negative impact on the source database, cause row explosion, cause downstream delays,…
  • Data modeling: choosing a coherent structure for your data to make it useful for the business.
  • Data transformation: Data manipulated, enhanced and saved for downstream use.
    • Manipulate the data at the source, e.g. adding a timestamp,…
    • At any stage (before, in flight, or after ingestion): e.g. map fields to correct types, standard formats,…
    • Enrich records with additional fields and calculations,..
    • Even in the downstream: apply large-scale aggregation for reporting or featurize data for ML.
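The kinds of transformation listed above (adding a timestamp, mapping to correct types and standard formats, enriching with calculated fields, aggregating for reporting) can be sketched on a few in-memory records. The order records and field names are invented for the example, and plain Python is used instead of SQL or Spark only to keep it self-contained.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw records: everything arrives as strings, with inconsistent casing.
raw_orders = [
    {"order_id": "1", "country": "us", "quantity": "2", "unit_price": "10.50"},
    {"order_id": "2", "country": "FR", "quantity": "1", "unit_price": "99.00"},
    {"order_id": "3", "country": "Us", "quantity": "3", "unit_price": "5.00"},
]


def transform(record: dict, ingested_at: datetime) -> dict:
    """Map to correct types, standardise formats, and enrich the record."""
    quantity = int(record["quantity"])
    unit_price = float(record["unit_price"])
    return {
        "order_id": int(record["order_id"]),
        "country": record["country"].upper(),      # standard format
        "quantity": quantity,                      # correct type
        "unit_price": unit_price,                  # correct type
        "total": round(quantity * unit_price, 2),  # calculated field
        "ingested_at": ingested_at,                # added timestamp
    }


ingested_at = datetime.now(timezone.utc)
orders = [transform(r, ingested_at) for r in raw_orders]

# Downstream aggregation for reporting: revenue per country.
revenue_by_country = defaultdict(float)
for order in orders:
    revenue_by_country[order["country"]] += order["total"]

print(dict(revenue_by_country))  # prints: {'US': 36.0, 'FR': 99.0}
```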

Serving Data

  • Final stage of DE Lifecycle.
  • Analytics: the process of identifying key insights and patterns within data.
    • 3 common forms: business intelligence (BI), operational analytics, embedded analytics.
    • BI: explore historical and current business data to discover insights.
    • Operational Analytics: monitoring real-time data for immediate action.
    • Embedded Analytics (new trend): External or customer-facing analytics. As a DE, your job would be serving real-time and historical data for use in user-facing applications
  • Machine Learning will be treated separately from other serving because it involves additional complexities.
  • Reverse ETL (Extract, Transform, Load): take transformed data as well as analytics and perhaps machine learning model output and feed it back into source systems.

The Undercurrents of the Data Engineering Lifecycle

Introduction to the Undercurrents

DE now encompasses far more than just tools and technologies.

Security

  • Clients trust you with their information and private data. A DE must follow a set of principles, protocols and best practices.
  • Principle of Least Privilege: Give users or applications access to only the essential data and resources they need for only the duration required.
  • Don’t grant or operate with root or superuser permissions when not necessary!
  • Data sensitivity: mask sensitive values (e.g. hide most of the digits of a credit card number,…). Avoid ingesting the full data (with sensitive information) into your system in the first place.
  • Security in the Cloud: Identity and Access Management (IAM), Encryption Methods, Networking Protocols.
  • Security is also about people! → defensive mindset (be cautious with sensitive data, design for potential attacks).
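The data-sensitivity point above (hide card digits, and avoid bringing sensitive fields into your system at all) can be sketched as a sanitising step applied before ingestion. The field names `card_number` and `cvv` and the choice of keeping the last four digits are assumptions made for the example, not a compliance rule.

```python
SENSITIVE_FIELDS = {"cvv"}  # hypothetical fields that should never enter the pipeline


def mask_card_number(card_number: str, visible: int = 4) -> str:
    """Keep only the last `visible` digits; replace the rest with '*'."""
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) <= visible:
        raise ValueError("card number too short to mask")
    return "*" * (len(digits) - visible) + "".join(digits[-visible:])


def sanitize(record: dict) -> dict:
    """Drop sensitive fields and mask the card number before ingestion."""
    clean = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    if "card_number" in clean:
        clean["card_number"] = mask_card_number(clean["card_number"])
    return clean


raw = {"name": "A. Customer", "card_number": "4111 1111 1111 1111", "cvv": "123"}
print(sanitize(raw))
# prints: {'name': 'A. Customer', 'card_number': '************1111'}
```

Masking at the boundary means downstream systems never hold the full value, which limits what a later breach or over-broad permission can expose.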

Data Management

  • Data Management is so important that there is an international organization (the Data Management Association International, or DAMA) to provide resources for companies and individuals to get data management right! ← Book DAMA-DMBOK
  • Definition: “Data management as the development, execution, and supervision of plans, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their life cycles”
  • Data Quality: High DQ (accurate, complete, discoverable and available in a timely manner ← exactly what stakeholders expect) vs Low DQ (inaccurate, incomplete, hard to find, late ← unusable).
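Some of the data-quality dimensions above (accurate, complete, timely) can be turned into simple per-record checks. The schema (`customer_id`, `amount`, `updated_at`), the "no negative amounts" rule, and the 24-hour freshness threshold are all hypothetical; real checks depend on what your stakeholders expect.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("customer_id", "amount", "updated_at")  # hypothetical schema


def quality_issues(record: dict, now: datetime, max_age: timedelta = timedelta(hours=24)) -> list:
    """Return a list of human-readable quality problems for one record."""
    issues = []
    # Completeness: every required field must be present and non-null.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"missing {field}")
    # Accuracy (a simple sanity rule): amounts should not be negative.
    amount = record.get("amount")
    if amount is not None and amount < 0:
        issues.append("negative amount")
    # Timeliness: the record should have been updated recently enough.
    updated_at = record.get("updated_at")
    if updated_at is not None and now - updated_at > max_age:
        issues.append("stale record")
    return issues


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
good = {"customer_id": 7, "amount": 12.5, "updated_at": now - timedelta(hours=1)}
bad = {"customer_id": None, "amount": -5, "updated_at": now - timedelta(days=3)}

print(quality_issues(good, now))  # prints: []
print(quality_issues(bad, now))   # prints: ['missing customer_id', 'negative amount', 'stale record']
```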

Data Architecture (DA)

  • DA = roadmap or blueprint for your data systems.
  • Being able to think like an architect will make you more successful in your role as a DE.
  • Principle of Good Data Architecture
      1. Choose common components wisely (CC → used across your org)
      2. Plan for failure!
      3. Architect for scalability
      4. Architecture is leadership
      5. Always be architecting (constantly evaluating your systems)
      6. Build loosely coupled systems
      7. Make reversible decisions
      8. Prioritize security (Principle of least privilege, zero-trust principle)
      9. Embrace FinOps (Finance and DataOps/DevOps) → optimize cost and revenue

DataOps

  • DevOps ← Software dev team (write and test code) & software deployment team (deploy and maintain code). → The DevOps movement has resulted in faster release cycles and enhanced quality for software products.
  • Similar idea as DevOps when data comes in → DataOps: improves the dev process and quality of data products. It’s a set of cultural habits and practices: Communication & Collaboration, Continuous Improvement, Rapid Iteration.
  • DevOps practices ← Agile methodology
  • Pillars of DataOps:
    • Automation: CI/CD (Continuous Integration & Continuous Delivery) → example: Airflow
    • Observability & Monitoring: keep in mind that “Everything fails all the time” (Werner Vogels, CTO of Amazon) ← crucial aspect of the data systems you build
    • Incident Response: As a data engineer, you should be proactively finding issues before they are reported to you by other stakeholders in your organization.
    • Goal: provide high-quality data products.

Orchestration

  • Pure scheduling: get some specific tasks to run automatically.
  • Problem:
  • Directed Acyclic Graph (DAG)
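A DAG records which task depends on which, so an orchestrator can derive a valid run order and reject circular dependencies. The sketch below uses Python's standard-library `graphlib` (Python 3.9+); the task names are invented, and a real orchestrator such as Airflow adds scheduling, retries and monitoring on top of this idea.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "load": {"transform"},
    "refresh_dashboard": {"load"},
}


def run_order(graph: dict) -> list:
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(dag)
print(order)  # both extract tasks come first, then transform, load, refresh_dashboard

# A graph with a cycle is not a DAG and cannot be ordered.
try:
    run_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected")
```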

Software Engineering

  • SE: the design, dev, deployment and maintenance of software applications.
  • SE becomes DE
  • A DE writes much less code than an SE does, but it's more important than ever that you can write great code and that the code you write is of top quality.
    • Write core data processing code at all stages using SQL, Spark, Kafka.
    • Languages: Python, Java, Scala, Bash, R, Rust, Go.
    • In this specialization, we focus on: Python, SQL, Bash.

Practical Examples on AWS (week 2)

The DE Lifecycle on AWS

  • STORAGE
    • Traditional data warehouse: Amazon Redshift
    • Object storage for a data lake: Amazon Simple Storage Service (S3)
    • → Combine both: Lakehouse Arrangement (Access structured data in your data warehouse and unstructured data in an object storage data lake)
  • SERVING → 2 use cases
    • Business Intelligence or Analytics
    • AI or Machine Learning: serve batch data for model training, and work with some vector database → product recommenders and large language models.

Undercurrents on AWS

  • Undercurrents aspects on AWS are more conceptual and more tools oriented.
  • ARCHITECTURE: AWS Well-Architected (a set of principles and practices developed by AWS that can help you build systems with an eye towards operational efficiency, security, scalability, and sustainability)
⚠️
Make sure to log out of your personal account before practicing the lab in these courses!

Lab Walkthrough

  • The main goal of this lab is to help you get started interacting with a data pipeline on AWS.
  • Pipeline Scenario
    • You are a DE working for a retailer of scale models of classic cars and other vehicles.
    • The retailer stores its data in a relational database.
    • You’re asked to build a pipeline that transforms the data and serves it to the data analysts in the marketing team.
  • Data Modeling (course 4): Transform the data into a structure that is easier to understand and faster to query.
  • In general, what we will do:
    • Amazon RDS: the source system that contains the SQL tables (provided)
    • Glue ETL: a tool that allows you to ingest data from the source database and apply transformations on the fly to the ingested data
      • Glue job: connecting to the RDS database → Extracting the raw data + Transforming the data by modeling it using the provided star schema, and finally loading the transformed data into AWS object storage in an S3 bucket
    • ETL = Extract + Transform + Load
    • Glue Crawler: crawl over S3 and write metadata to a data catalog.
    • Amazon Athena: query service to retrieve data from S3.
    • We can create the bottom 3 resources (Glue ETL, S3, Glue Crawler) manually using the AWS console, or programmatically using Terraform (Infrastructure as Code, IaC). (The Terraform files are given; we learn more in Course 2.)
      • It allows users to define and provision infrastructure using a declarative configuration language (declarative means that users only need to describe the components of the data pipeline without instructing on the detailed steps needed to build the data pipeline)
      • Introduction to HashiCorp Terraform with Armon Dadgar - YouTube
    • Jupyter notebook (AWS Cloud9) to perform some DA tasks.
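The declarative idea described above (you state which components should exist, and the tool works out the steps) can be illustrated with a toy diff between a desired state and a current state. This is not how Terraform is implemented; the resource names and attributes are invented, and the function only mimics the spirit of previewing create/update/delete actions.

```python
def plan(desired: dict, current: dict) -> list:
    """Compare desired and current resources and list the actions needed."""
    actions = []
    for name in sorted(desired.keys() | current.keys()):
        if name not in current:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
        # identical resources need no action
    return actions


# Hypothetical desired state: what the configuration files describe.
desired = {
    "glue_job": {"script": "etl.py"},
    "glue_crawler": {"target": "datalake"},
    "s3_bucket": {"versioning": True},
}
# Hypothetical current state: what already exists.
current = {
    "s3_bucket": {"versioning": False},
    "old_bucket": {},
}

for action, resource in plan(desired, current):
    print(action, resource)
# prints:
# create glue_crawler
# create glue_job
# delete old_bucket
# update s3_bucket
```

The user never writes the individual steps; changing the `desired` description is enough to change the plan.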

Lab technical notes

  • Cloud9: to open the IDE (a VSCode-like environment). Choose machine t3.small and enable SSH.
  • Download required resources into IDE (don’t forget to “Allow all cookies”)
    • aws s3 cp --recursive s3://dlai-data-engineering/labs/c1w2-187976/ ./
      # then install
      source scripts/setup.sh
  • Database: AWS Console → AWS RDS → Databases → check the “DB identifier”, eg. de-c1w2-rds
    • aws rds describe-db-instances --db-instance-identifier de-c1w2-rds --output text --query "DBInstances[].Endpoint.Address"

      # returns the endpoint, something like
      # de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com
  • Connect the database / Establish the connection to the RDS instance
    • mysql --host=de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com --user=admin --password=adminpwrd --port=3306
  • Check the database
    • # Don't forget the semicolon ";"
      use classicmodels;
      show tables;

      # exit the sql env (mysql prints "Bye")
      exit;
  • Terraform: init → plan → apply
    • cd infrastructure/terraform
      terraform init
      terraform plan
      terraform apply
    • plan: lets you preview the changes to infrastructure before applying them. When you run the corresponding command, Terraform analyzes configuration files to check the desired state, compares the desired state with the current infrastructure state, and calculates actions needed to achieve it.
  • Check Glue jobs in AWS Glue → ETL jobs → tab “Runs”
    • # Start the Glue job
      aws glue start-job-run --job-name de-c1w2-etl-job | jq -r '.JobRunId'
      # returns the JobRunId

      # Check the status
      aws glue get-job-run --job-name de-c1w2-etl-job --run-id <JobRunID> --output text --query "JobRun.JobRunState"
  • In jupyter notebook
    • # Interact with AWS
      import awswrangler as wr

      # Interactive widgets
      import ipywidgets as widgets
  • S3 → Buckets → ...-datalake-...
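The "start the Glue job, then check the status" step above is usually repeated until the run finishes. A minimal polling sketch follows, assuming that the states `SUCCEEDED`, `FAILED`, `STOPPED`, `TIMEOUT` and `ERROR` can be treated as final; the helper names, the polling interval and the retry limit are illustrative choices, not part of the lab. The boto3-backed getter is only built on request, so the polling logic itself runs without AWS access.

```python
import time
from typing import Callable

# Assumption: these Glue job run states are treated as final in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job(get_state: Callable[[], str], poll_seconds: float = 30.0, max_polls: int = 60) -> str:
    """Call get_state() repeatedly until it returns a terminal state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


def glue_state_getter(job_name: str, run_id: str) -> Callable[[], str]:
    """Build a getter backed by boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    glue = boto3.client("glue")

    def get_state() -> str:
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        return response["JobRun"]["JobRunState"]

    return get_state


# Offline demonstration with a fake sequence of states.
fake_states = iter(["STARTING", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0)
print(final_state)  # prints: SUCCEEDED
```

Inside the lab environment, something like `wait_for_job(glue_state_getter("de-c1w2-etl-job", run_id))` would poll the real job, where `run_id` is the value returned when the job was started.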