DE by DL.AI - C3 W3 - Queries

☝

List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

References

Use the index, Luke!

Amazon Redshift Documentation

Batch Queries

Overview

As a data engineer, you’ll write queries to extract data and set up storage solutions for stakeholders to query directly, considering how storage choices affect query speed and performance.

Query: A statement that you write in a specific query language to retrieve or
act on data.

In previous course, you’ve worked with some kinds of queries:

Course 2 Lab 1

Course 2 Lab 3

Course 3 Week 1 Lab 2

Week 3 plan:

A life of a query

Database Management System (DBMS)

Transport System: when you send a query to DBMS, your request arrives in TS and then to Query Processor.

Query Processor: Query Parser + Query Optimizer

Query Parser

Query Optimizer

Execution Engine: excution plan → query result

Storage Engine: already mentioned in prev courses.

Use EXPLAIN keyword before any query to see the cost.

Advanced SQL queries

Common vs Advanced SQL statements

SQL Server Functions

The data we will work with

Use DISTINCT and SQL Functions

1SELECT DISTINCT
2    fact_rental.staff_id,
3    CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
4    fact_rental.customer_id
5FROM fact_rental
6JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id

Get which staff member served which customer.

CASE

1CASE
2    WHEN cond1 THEN result1
3    WHEN cond2 THEN result2
4    ELSE result3
5END

General structure.

Use CASE and IN

1SELECT
2    fact_rental.customer_id,
3    fact_rental.rental_id,
4    (CASE
5        WHEN payment_date < return_date THEN 1
6        ELSE 0
7    END) AS on_time_payment
8FROM fact_rental
9JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
10WHERE dim_customer.country IN ("United States", "Canada")
11	AND (fact_retal.rental_date between "2005-05-24" and "2005-07-26")
12LIMIT 5

Check whether a customer made an on-time payment.

Common Table Expressions (CTE) → Use WITH...AS

1WITH staff_customer_pairs AS (
2		-- start: previous query
3    SELECT DISTINCT fact_rental.staff_id,
4    CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
5    fact_rental.customer_id
6    FROM fact_rental
7    JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id
8    -- end: prev query
9)
10SELECT staff_name, COUNT(customer_id)
11FROM staff_customer_pairs
12GROUP BY staff_name

From previous query (”Get which staff member served which customer.”), we want to find the tootal number of customers served by each staff member.

Compute the percentage of on-time payments for each customer (from the prev query)

1WITH customer_payment_info AS (
2	-- start: previous query
3  SELECT fact_rental.customer_id, fact_rental.rental_id,
4  (CASE
5      WHEN payment_date < return_date THEN 1
6      ELSE 0
7  END) AS on_time_payment
8  FROM fact_rental
9  JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
10  WHERE dim_customer.country IN ("United States", "Canada")
11  AND fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-06-24'
12  -- end: prev query
13)
14SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
15FROM customer_payment_info
16GROUP BY customer_id

→ Improve prev codes: find the maximum of the “percent on time payment” column

1WITH customer_payment_info AS (
2    SELECT fact_rental.customer_id, fact_rental.rental_id,
3    (CASE
4        WHEN payment_date < return_date THEN 1
5        ELSE 0
6    END) AS on_time_payment
7    FROM fact_rental
8    JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
9    WHERE dim_customer.country IN ("United States", "Canada")
10    AND (fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-07-26')
11), customer_percent_on_time_payment AS ( -- 👈 new from here
12    SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
13    FROM customer_payment_info
14    GROUP BY customer_id
15)
16SELECT MAX(percent_on_time_payment)
17FROM customer_percent_on_time_payment

It looks like no customer paid on time

You can also incorporate some temporary results within your main query using subqueries

1SELECT film_id, length
2FROM dim_film
3WHERE length > (SELECT AVG(length) FROM dim_film) -- 👈 subquery

Get the ids of the films that have length greater than the average length.

Compute the average duration in days that a customer spent on a film category

1SELECT
2    fact_rental.customer_id,
3    dim_category.name,
4    AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
5FROM fact_rental
6JOIN dim_category ON fact_rental.category_id = dim_category.category_id
7GROUP BY fact_rental.customer_id, dim_category.name
8ORDER BY fact_rental.customer_id, average_rental_days DESC

SQL Window Functions: Allows you to apply an aggregate or ranking function over a
particular window or a set of rows. Does not group rows into a single output row: each row remains separate

Window functions apply to aggregate and ranking functions over a particular window (set of rows). OVER clause is used with window functions to define that window. OVER clause does two things :

Partitions rows to form a set of rows. (PARTITION BY clause is used)
Orders rows within those partitions into a particular order. (ORDER BY clause is used)

Note: If partitions aren’t done, then ORDER BY orders all rows of the table.

Add a column that shows for each customer the running sum over each window

1WITH customer_info AS (
2	-- start: previous query
3  SELECT
4      fact_rental.customer_id,
5      dim_category.name,
6      AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
7  FROM fact_rental
8  JOIN dim_category ON fact_rental.category_id = dim_category.category_id
9  GROUP BY fact_rental.customer_id, dim_category.name
10  ORDER BY fact_rental.customer_id, average_rental_days DESC
11  -- end: previous query
12)
13SELECT customer_id, name, average_rental_days,
14    RANK() OVER ( -- 👈 SQL Window Function
15        PARTITION BY customer_id ORDER BY average_rental_days DESC
16    ) AS rank_category
17FROM customer_info
18ORDER BY customer_id, rank_category

Lab 1 — Advanced SQL queries

⭐ Check the notebook.

After this assignment you'll be able to:

Build SQL queries using several techniques to manipulate data.
Distinguish between Common Table Expressions and Subqueries.
Apply the DATE_PART function to manipulate date values.
Make use of the LAG and LEAD functions to access values from other rows when making a calculation.
Utilize the CASE statement to return different values based on conditions.
Use window functions on SQL to aggregate data within a specific window.
Understand pivot tables and identify how they are built.

Index deep dive

As a data engineer, understanding how indexes are implemented can help you design better index structures and optimize your queries to improve SQL performance.

Index: A separate data structure that has its own disk space and contains information that refers to the actual table.

Balanced Search Tree (B-Tree)

Starting from root node, if not found, based on alphabet, take the 2nd and keep going down the tree. If found multiple rows, move along the leaf nodes.

CREATE INDEX to improve the query performance

1CREATE INDEX rental_idx ON payment (rental_id);

“Seq Scan” means full table scan.

Columnar Storage. Cloud DB like Amazon Redshift use Sort Key to sort one or more column and stores the sorted data on disk.

If employee_id is the primary key , then an index is automatically created on the primary key.

Retrieving only the data you need

It’s worst when you write a query that not only scans the entire table, but also returns everything. Eg: SELECT * FROM order ← avoid with no WHERE

Large amounts of data need to be transferred from disk.
SELECT * on cloud pas-as-you-go databases can be expensive.

TIP → Using Pruning (tỉa, lược bớt)

Row-based pruning

1-- index before select
2CREATE INDEX rental_idx ON payment (rental_id);
3-- then
4SELECT * FROM payment WHERE rental_id = 1;

Column-based pruning

1SELECT customer_id, rental_id FROM payment;

Partition pruning

The `JOIN` statement

JOIN operation is one of the most time-consuming query operations.

JOIN can use several methods:

Nested loop join: For each row in “orders”`, go throw all rows in “customers”

Index-based nested loop:

Hash join: using hash to map “customer_id” in to bucket

Another challenge is when 2 tables have a many-to-many relationship. ← row explosion (return more rows than expected)

Aggregate queries

Aggregate queries: used to compute summary value of a column like sum, average, max, min, count,…

Using index if available.

It’s quicker if using columnar storage vs row storage.

Amazon Redshift cloud data warehouse

We'll look at how Redshift queries data and some considerations to keep in mind for table design that can optimize query performance.

Amazon Redshift is a columnar storage ← speed up query performance ← best suit for Data Warehousing and large-scale data analytics.

Redshift uses Data compression. ← save storage space + read less data from disk.

Support Massive Parallel Processing (MPP) → depend much on distribution style and sort key.

Lab 2 — Comparing the query performance between row and columnar storage

Check the codes for this lab.

Compare 2 types of storages.

Row-based storage → PostgreSQL, column-based → Amazon Redshift

TPC-H benchmark: The goal of this benchmark is to simulate a set of basic scenarios to examine a large dataset and execute queries to answer business questions. It's designed to evaluate the performance of various database systems and how they execute complex queries.

Additional query strategies

Leverage Query Caching:

Prioritize Readability:

Using CTE (should) vs using nested subqueries (not good)

Data resouces → table bloat (data size on disk exceeds the actual data size)

Use technique called Vacuuming to remove dead records

Streaming Queries

Queries on streaming data

We can use system like Apache Flink or Apache Spark to apply SQL queries continuously on streaming data.

To query on streaming data, we can use Windowed Query:

Session window

Fixed-time window (tumbling window)

Sliding window

To join data streams, we have several ways

Deploying an application with Amazon Mangaed Service for Apache Flink (AMSAF)

You can use several services on AMZ to work with Apache Flink or you can use Amz Managed Service for Apache Flink.

AMSAF creates the hosted environment for you to run your applications.

On AMSAF (Console), when creating a new environment with AMSAF:

Streaming applications: allows you to host your Apache Flink application and run it in a production environment.

Studio notebooks: Ideal for development and interactive data exploration. ← Apache Zeppelin

AMSAF uses connectors to interface with databases, storages, message queues,…

Notebook → Apache Zeppelin: code written in SQL, Python, Java. Data visualization as tables and charts. Integration with tools like Flink.

In case you wanna setup notebook in your way (not using blueprint)

Lab 3 — streaming queries with Apache Flink

Check the codes for this lab.

👉

DE by DL.AI - C4 W1 - Data Modeling & Transformations for Analytics