List of notes for this specialization + Lecture notes & Repository + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

References

Batch Queries

Overview

  • As a data engineer, you’ll write queries to extract data and set up storage solutions for stakeholders to query directly, considering how storage choices affect query speed and performance.
  • Query: A statement that you write in a specific query language to retrieve or
    act on data.
  • Week 3 plan:

A life of a query

  • Database Management System (DBMS)
      1. Transport System: when you send a query to DBMS, your request arrives in TS and then to Query Processor.
      1. Query Processor: Query Parser + Query Optimizer
        1. Query Parser
          Query Optimizer
      1. Execution Engine: excution plan → query result
      1. Storage Engine: already mentioned in prev courses.
  • Use EXPLAIN keyword before any query to see the cost.

Advanced SQL queries

  • Common vs Advanced SQL statements
  • The data we will work with
  • Use DISTINCT and SQL Functions
    • 1SELECT DISTINCT
      2    fact_rental.staff_id,
      3    CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
      4    fact_rental.customer_id
      5FROM fact_rental
      6JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id
      Get which staff member served which customer.
  • CASE
    • 1CASE
      2    WHEN cond1 THEN result1
      3    WHEN cond2 THEN result2
      4    ELSE result3
      5END
      General structure.
  • Use CASE and IN
    • 1SELECT
      2    fact_rental.customer_id,
      3    fact_rental.rental_id,
      4    (CASE
      5        WHEN payment_date < return_date THEN 1
      6        ELSE 0
      7    END) AS on_time_payment
      8FROM fact_rental
      9JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
      10WHERE dim_customer.country IN ("United States", "Canada")
      11	AND (fact_retal.rental_date between "2005-05-24" and "2005-07-26")
      12LIMIT 5
      Check whether a customer made an on-time payment.
  • Common Table Expressions (CTE) → Use WITH...AS
    • 1WITH staff_customer_pairs AS (
      2		-- start: previous query
      3    SELECT DISTINCT fact_rental.staff_id,
      4    CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
      5    fact_rental.customer_id
      6    FROM fact_rental
      7    JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id
      8    -- end: prev query
      9)
      10SELECT staff_name, COUNT(customer_id)
      11FROM staff_customer_pairs
      12GROUP BY staff_name
      From previous query (”Get which staff member served which customer.”), we want to find the tootal number of customers served by each staff member.
  • Compute the percentage of on-time payments for each customer (from the prev query)
    • 1WITH customer_payment_info AS (
      2	-- start: previous query
      3  SELECT fact_rental.customer_id, fact_rental.rental_id,
      4  (CASE
      5      WHEN payment_date < return_date THEN 1
      6      ELSE 0
      7  END) AS on_time_payment
      8  FROM fact_rental
      9  JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
      10  WHERE dim_customer.country IN ("United States", "Canada")
      11  AND fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-06-24'
      12  -- end: prev query
      13)
      14SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
      15FROM customer_payment_info
      16GROUP BY customer_id
  • → Improve prev codes: find the maximum of the “percent on time payment” column
    • 1WITH customer_payment_info AS (
      2    SELECT fact_rental.customer_id, fact_rental.rental_id,
      3    (CASE
      4        WHEN payment_date < return_date THEN 1
      5        ELSE 0
      6    END) AS on_time_payment
      7    FROM fact_rental
      8    JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
      9    WHERE dim_customer.country IN ("United States", "Canada")
      10    AND (fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-07-26')
      11), customer_percent_on_time_payment AS ( -- 👈 new from here
      12    SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
      13    FROM customer_payment_info
      14    GROUP BY customer_id
      15)
      16SELECT MAX(percent_on_time_payment)
      17FROM customer_percent_on_time_payment
      It looks like no customer paid on time
  • You can also incorporate some temporary results within your main query using subqueries
    • 1SELECT film_id, length
      2FROM dim_film
      3WHERE length > (SELECT AVG(length) FROM dim_film) -- 👈 subquery
      Get the ids of the films that have length greater than the average length.
  • Compute the average duration in days that a customer spent on a film category
    • 1SELECT
      2    fact_rental.customer_id,
      3    dim_category.name,
      4    AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
      5FROM fact_rental
      6JOIN dim_category ON fact_rental.category_id = dim_category.category_id
      7GROUP BY fact_rental.customer_id, dim_category.name
      8ORDER BY fact_rental.customer_id, average_rental_days DESC
  • SQL Window Functions: Allows you to apply an aggregate or ranking function over a
    particular window or a set of rows. Does not group rows into a single output row: each row remains separate
    • Window functions apply to aggregate and ranking functions over a particular window (set of rows). OVER clause is used with window functions to define that window. OVER clause does two things : 
    • Partitions rows to form a set of rows. (PARTITION BY clause is used)
    • Orders rows within those partitions into a particular order. (ORDER BY clause is used)
    • Note: If partitions aren’t done, then ORDER BY orders all rows of the table. 
  • Add a column that shows for each customer the running sum over each window
    • 1WITH customer_info AS (
      2	-- start: previous query
      3  SELECT
      4      fact_rental.customer_id,
      5      dim_category.name,
      6      AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
      7  FROM fact_rental
      8  JOIN dim_category ON fact_rental.category_id = dim_category.category_id
      9  GROUP BY fact_rental.customer_id, dim_category.name
      10  ORDER BY fact_rental.customer_id, average_rental_days DESC
      11  -- end: previous query
      12)
      13SELECT customer_id, name, average_rental_days,
      14    RANK() OVER ( -- 👈 SQL Window Function
      15        PARTITION BY customer_id ORDER BY average_rental_days DESC
      16    ) AS rank_category
      17FROM customer_info
      18ORDER BY customer_id, rank_category

Lab 1 — Advanced SQL queries

  • After this assignment you'll be able to:
    • Build SQL queries using several techniques to manipulate data.
    • Distinguish between Common Table Expressions and Subqueries.
    • Apply the DATE_PART function to manipulate date values.
    • Make use of the LAG and LEAD functions to access values from other rows when making a calculation.
    • Utilize the CASE statement to return different values based on conditions.
    • Use window functions on SQL to aggregate data within a specific window.
    • Understand pivot tables and identify how they are built.

Index deep dive

  • As a data engineer, understanding how indexes are implemented can help you design better index structures and optimize your queries to improve SQL performance.
  • Index: A separate data structure that has its own disk space and contains information that refers to the actual table.
  • Balanced Search Tree (B-Tree)
    • Starting from root node, if not found, based on alphabet, take the 2nd and keep going down the tree. If found multiple rows, move along the leaf nodes.
  • CREATE INDEX to improve the query performance
    • 1CREATE INDEX rental_idx ON payment (rental_id);
      “Seq Scan” means full table scan.
  • Columnar Storage. Cloud DB like Amazon Redshift use Sort Key to sort one or more column and stores the sorted data on disk.
  • If employee_id is the primary key , then an index is automatically created on the primary key.

Retrieving only the data you need

  • It’s worst when you write a query that not only scans the entire table, but also returns everything. Eg: SELECT * FROM orderavoid with no WHERE
    • Large amounts of data need to be transferred from disk.
    • SELECT * on cloud pas-as-you-go databases can be expensive.
  • TIP → Using Pruning (tỉa, lược bớt)
    • Row-based pruning
      • 1-- index before select
        2CREATE INDEX rental_idx ON payment (rental_id);
        3-- then
        4SELECT * FROM payment WHERE rental_id = 1;
    • Column-based pruning
      • 1SELECT customer_id, rental_id FROM payment;
    • Partition pruning

The JOIN statement

  • JOIN operation is one of the most time-consuming query operations.
  • JOIN can use several methods:
    • Nested loop join: For each row in “orders”`, go throw all rows in “customers”
    • Index-based nested loop:
    • Hash join: using hash to map “customer_id” in to bucket
  • Another challenge is when 2 tables have a many-to-many relationship. ← row explosion (return more rows than expected)

Aggregate queries

  • Aggregate queries: used to compute summary value of a column like sum, average, max, min, count,…
  • Using index if available.
  • It’s quicker if using columnar storage vs row storage.

Amazon Redshift cloud data warehouse

  • We'll look at how Redshift queries data and some considerations to keep in mind for table design that can optimize query performance.
  • Amazon Redshift is a columnar storage ← speed up query performance ← best suit for Data Warehousing and large-scale data analytics.
  • Redshift uses Data compression. ← save storage space + read less data from disk.
  • Support Massive Parallel Processing (MPP) → depend much on distribution style and sort key.

Lab 2 — Comparing the query performance between row and columnar storage

  • Compare 2 types of storages.
  • Row-based storage → PostgreSQL, column-based → Amazon Redshift
  • TPC-H benchmark: The goal of this benchmark is to simulate a set of basic scenarios to examine a large dataset and execute queries to answer business questions. It's designed to evaluate the performance of various database systems and how they execute complex queries.
    •  

Additional query strategies

  • Leverage Query Caching:
  • Prioritize Readability:
    • Using CTE (should) vs using nested subqueries (not good)
  • Data resouces → table bloat (data size on disk exceeds the actual data size)
    • Use technique called Vacuuming to remove dead records

Streaming Queries

Queries on streaming data

  • We can use system like Apache Flink or Apache Spark to apply SQL queries continuously on streaming data.
  • To query on streaming data, we can use Windowed Query:
    • Session window
      Fixed-time window (tumbling window)
      Sliding window
  • To join data streams, we have several ways
  • You can use several services on AMZ to work with Apache Flink or you can use Amz Managed Service for Apache Flink.
  • AMSAF creates the hosted environment for you to run your applications.
  • On AMSAF (Console), when creating a new environment with AMSAF:
      • Streaming applications: allows you to host your Apache Flink application and run it in a production environment.
      • Studio notebooks: Ideal for development and interactive data exploration. ← Apache Zeppelin
  • AMSAF uses connectors to interface with databases, storages, message queues,…
  • Notebook → Apache Zeppelin: code written in SQL, Python, Java. Data visualization as tables and charts. Integration with tools like Flink.
  • In case you wanna setup notebook in your way (not using blueprint)
Check the codes for this lab.