
DE by DL.AI - C3 W3 - Queries

Anh-Thi Dinh
Tags: DeepLearning.AI, mooc-de, Data Science, Data Engineering, MOOC, AWS, API & Services
List of notes for this specialization + Lecture notes & Repository & Quizzes + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.

References

  • Use the index, Luke!
  • Amazon Redshift Documentation

Batch Queries

Overview

  • As a data engineer, you’ll write queries to extract data and set up storage solutions for stakeholders to query directly, considering how storage choices affect query speed and performance.
  • Query: A statement that you write in a specific query language to retrieve or
    act on data.
  • In previous courses, you’ve worked with several kinds of queries:
    • Course 2 Lab 1
    • Course 2 Lab 3
    • Course 3 Week 1 Lab 2
  • Week 3 plan:

The life of a query

  • Database Management System (DBMS)
      1. Transport System: when you send a query to the DBMS, your request arrives at the Transport System and is then passed to the Query Processor.
      2. Query Processor: Query Parser + Query Optimizer.
      3. Execution Engine: turns the execution plan into the query result.
      4. Storage Engine: already covered in previous courses.
  • Use the EXPLAIN keyword before any query to see its estimated cost.
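A quick sketch of checking a query plan, using Python's built-in sqlite3 as a stand-in for the course's PostgreSQL setup (SQLite's variant is EXPLAIN QUERY PLAN, and the payment table here is invented for the demo):

```python
import sqlite3

# Toy table, invented for this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (rental_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO payment VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(1000)])

# Without an index, the plan is a full table scan ("Seq Scan" in PostgreSQL).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM payment WHERE rental_id = 1").fetchall()
print(plan)  # the detail column mentions a SCAN of payment

# With an index on rental_id, the planner switches to an index search.
conn.execute("CREATE INDEX rental_idx ON payment (rental_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM payment WHERE rental_id = 1").fetchall()
print(plan)  # the detail column mentions a SEARCH using rental_idx
```

PostgreSQL's EXPLAIN additionally prints estimated cost numbers; SQLite only reports the chosen strategy.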

Advanced SQL queries

  • Common vs Advanced SQL statements
  • SQL Server Functions
  • The data we will work with
  • Use DISTINCT and SQL Functions
  • CASE
  • Use CASE and IN
  • Common Table Expressions (CTE) → Use WITH...AS
  • Compute the percentage of on-time payments for each customer (from the previous query).
  • Improving the previous query: find the maximum of the “percent on-time payment” column.
    • It looks like no customer paid on time.
  • You can also incorporate some temporary results within your main query using subqueries
  • Compute the average duration in days that a customer spent on a film category
  • SQL Window Functions: apply an aggregate or ranking function over a particular window (a set of rows). Unlike GROUP BY, rows are not grouped into a single output row: each row remains separate.
    • The OVER clause is used with window functions to define that window. It does two things:
      • Partitions rows to form sets of rows (with the PARTITION BY clause).
      • Orders rows within each partition (with the ORDER BY clause).
    • Note: if no partitioning is done, ORDER BY orders all rows of the table.
  • Add a column that shows for each customer the running sum over each window
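The running-sum bullet above can be sketched with a window function; a minimal demo using Python's built-in sqlite3 (the payment rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment"
             " (customer_id INTEGER, payment_date TEXT, amount REAL)")
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)", [
    (1, "2005-05-24", 2.0), (1, "2005-05-25", 3.0),
    (2, "2005-05-24", 5.0), (2, "2005-05-26", 1.0),
])

# SUM(...) OVER (...) keeps every row separate (no GROUP BY) and adds a
# cumulative total inside each customer's window.
rows = conn.execute("""
    SELECT customer_id, payment_date, amount,
           SUM(amount) OVER (
               PARTITION BY customer_id   -- one window per customer
               ORDER BY payment_date      -- running order inside the window
           ) AS running_total
    FROM payment
    ORDER BY customer_id, payment_date
""").fetchall()

for row in rows:
    print(row)  # (customer_id, payment_date, amount, running_total)
```

With ORDER BY inside OVER, the default frame is cumulative, so each row shows the sum of its partition up to that row.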

Lab 1 — Advanced SQL queries

  • ⭐ Check the notebook.
  • After this assignment you'll be able to:
    • Build SQL queries using several techniques to manipulate data.
    • Distinguish between Common Table Expressions and Subqueries.
    • Apply the DATE_PART function to manipulate date values.
    • Make use of the LAG and LEAD functions to access values from other rows when making a calculation.
    • Utilize the CASE statement to return different values based on conditions.
    • Use window functions on SQL to aggregate data within a specific window.
    • Understand pivot tables and identify how they are built.

Index deep dive

  • As a data engineer, understanding how indexes are implemented can help you design better index structures and optimize your queries to improve SQL performance.
  • Index: A separate data structure that has its own disk space and contains information that refers to the actual table.
  • Balanced Search Tree (B-Tree)
    • Starting from the root node, if the value isn’t found, follow the appropriate branch (e.g. by alphabetical order) down the tree. If multiple matching rows exist, move along the leaf nodes.
  • CREATE INDEX to improve the query performance
    • “Seq Scan” means full table scan.
  • Columnar storage: cloud databases like Amazon Redshift use a Sort Key to sort one or more columns and store the sorted data on disk.
  • If employee_id is the primary key, then an index is automatically created on it.
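A small check of that last point, again with sqlite3 as a stand-in (the employee table is invented): the PRIMARY KEY lookup uses its implicit index, while a filter on an unindexed column falls back to a scan.

```python
import sqlite3

# Invented employee table; employee_id is the PRIMARY KEY, so an index
# on it exists automatically without any CREATE INDEX.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name TEXT
    )
""")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(i, f"emp{i}") for i in range(100)])

pk_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employee WHERE employee_id = 42"
).fetchone()
name_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employee WHERE name = 'emp42'"
).fetchone()
print(pk_plan[-1])    # a SEARCH using the primary key
print(name_plan[-1])  # a SCAN: name has no index
```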

Retrieving only the data you need

  • The worst case is a query that not only scans the entire table but also returns everything. E.g. SELECT * FROM orders with no WHERE clause.
    • Large amounts of data need to be transferred from disk.
    • SELECT * on cloud pay-as-you-go databases can be expensive.
  • TIP → Use pruning:
    • Row-based pruning
    • Column-based pruning
    • Partition pruning
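A toy illustration of row and column pruning with sqlite3 (the orders table is invented): ask only for the rows and columns you need instead of SELECT *.

```python
import sqlite3

# Invented orders table with a wide "notes" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(order_id INTEGER, customer_id INTEGER, status TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 10, "shipped" if i % 2 else "pending", "x" * 100)
     for i in range(1000)])

# Anti-pattern: every row and every column comes back.
everything = conn.execute("SELECT * FROM orders").fetchall()

# Pruned: only two columns (column pruning) and only matching rows
# (row pruning) are transferred.
pruned = conn.execute(
    "SELECT order_id, status FROM orders "
    "WHERE customer_id = 3 AND status = 'shipped'").fetchall()

print(len(everything), len(pruned))  # 1000 vs 100 rows
```

Partition pruning works the same way one level up: the engine skips whole partitions whose range can't match the WHERE clause.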

The JOIN statement

  • The JOIN operation is one of the most time-consuming query operations.
  • JOIN can use several methods:
    • Nested loop join: for each row in “orders”, go through all rows in “customers”.
    • Index-based nested loop join: use an index on the join column instead of scanning the inner table.
    • Hash join: use a hash function to map each “customer_id” into a bucket.
  • Another challenge arises when 2 tables have a many-to-many relationship. ← row explosion (the join returns more rows than expected)
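The nested-loop and hash join strategies above can be sketched in plain Python (tables and values are invented):

```python
# Invented toy tables as lists of dicts.
orders = [{"order_id": 1, "customer_id": 10},
          {"order_id": 2, "customer_id": 20},
          {"order_id": 3, "customer_id": 10}]
customers = [{"customer_id": 10, "name": "Ada"},
             {"customer_id": 20, "name": "Bob"}]

# Nested loop join: compare every order with every customer,
# O(len(orders) * len(customers)) comparisons.
nested = [(o["order_id"], c["name"])
          for o in orders
          for c in customers
          if o["customer_id"] == c["customer_id"]]

# Hash join: build a hash table on the smaller side once, then probe it,
# roughly O(len(orders) + len(customers)).
by_id = {c["customer_id"]: c for c in customers}            # build phase
hashed = [(o["order_id"], by_id[o["customer_id"]]["name"])  # probe phase
          for o in orders if o["customer_id"] in by_id]

print(nested == hashed)  # True: same result, fewer comparisons
```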

Aggregate queries

  • Aggregate queries: used to compute summary values of a column, like sum, average, max, min, count,…
  • An index is used if one is available.
  • Aggregation is quicker with columnar storage than with row storage.
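Why aggregates favor columnar layouts, as a toy sketch in plain Python with the same invented data laid out both ways:

```python
# The same invented data laid out twice.
rows = [(1, "Ada", 120.0), (2, "Bob", 80.0), (3, "Eve", 100.0)]  # row storage
columns = {                                                      # columnar storage
    "id": [1, 2, 3],
    "name": ["Ada", "Bob", "Eve"],
    "amount": [120.0, 80.0, 100.0],
}

# Row layout: every full row is touched just to reach the amount field.
avg_row = sum(r[2] for r in rows) / len(rows)

# Columnar layout: only the one needed column is read.
avg_col = sum(columns["amount"]) / len(columns["amount"])

print(avg_row, avg_col)  # 100.0 100.0
```

On disk the difference is the I/O: an AVG over one column of a columnar table reads only that column's blocks, while a row store must read every row in full.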

Amazon Redshift cloud data warehouse

  • We'll look at how Redshift queries data and some considerations to keep in mind for table design that can optimize query performance.
  • Amazon Redshift uses columnar storage ← speeds up query performance ← best suited for Data Warehousing and large-scale data analytics.
  • Redshift uses data compression. ← saves storage space + reads less data from disk.
  • Supports Massively Parallel Processing (MPP) → performance depends heavily on the distribution style and sort key.

Lab 2 — Comparing the query performance between row and columnar storage

  • Check the code for this lab.
  • Compare 2 types of storages.
  • Row-based storage → PostgreSQL, column-based → Amazon Redshift
  • TPC-H benchmark: The goal of this benchmark is to simulate a set of basic scenarios to examine a large dataset and execute queries to answer business questions. It's designed to evaluate the performance of various database systems and how they execute complex queries.

Additional query strategies

  • Leverage Query Caching:
  • Prioritize Readability:
    • Use CTEs (preferred) over deeply nested subqueries (harder to read).
  • Dead records accumulate → table bloat (the data size on disk exceeds the actual data size).
    • Use a technique called vacuuming to remove dead records.

Streaming Queries

Queries on streaming data

  • We can use systems like Apache Flink or Apache Spark to apply SQL queries continuously to streaming data.
  • To query streaming data, we can use windowed queries:
    • Session window
    • Fixed-time window (tumbling window)
    • Sliding window
  • To join data streams, there are several approaches.
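The fixed-time (tumbling) window above can be sketched in plain Python over a stream of (timestamp, value) events; a real deployment would use Flink or Spark as noted, and the events here are invented:

```python
from collections import defaultdict

events = [(0, 1), (3, 2), (7, 4), (11, 8), (14, 16)]  # (second, value)
WINDOW = 5  # tumbling windows: [0, 5), [5, 10), [10, 15), ...

# Each event lands in exactly one window: bucket by the window start time.
totals = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW
    totals[window_start] += value

print(dict(totals))  # {0: 3, 5: 4, 10: 24}
```

Sliding windows differ only in that an event can belong to several overlapping windows; session windows are bounded by gaps of inactivity instead of fixed sizes.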

Deploying an application with Amazon Managed Service for Apache Flink (AMSAF)

  • You can use several AWS services to work with Apache Flink, or you can use Amazon Managed Service for Apache Flink.
  • AMSAF creates the hosted environment for you to run your applications.
  • On AMSAF (Console), when creating a new environment with AMSAF:
      • Streaming applications: allows you to host your Apache Flink application and run it in a production environment.
      • Studio notebooks: Ideal for development and interactive data exploration. ← Apache Zeppelin
  • AMSAF uses connectors to interface with databases, storage, message queues,…
  • Notebook → Apache Zeppelin: code written in SQL, Python, or Java; data visualization as tables and charts; integration with tools like Flink.
  • In case you want to set up the notebook your own way (not using a blueprint).

Lab 3 — Streaming queries with Apache Flink

Check the code for this lab.
SELECT DISTINCT
    fact_rental.staff_id,
    CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
    fact_rental.customer_id
FROM fact_rental
JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id
Get which staff member served which customer.
CASE
    WHEN cond1 THEN result1
    WHEN cond2 THEN result2
    ELSE result3
END
General structure.
SELECT
    fact_rental.customer_id,
    fact_rental.rental_id,
    (CASE
        WHEN payment_date < return_date THEN 1
        ELSE 0
    END) AS on_time_payment
FROM fact_rental
JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
WHERE dim_customer.country IN ('United States', 'Canada')
    AND (fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-07-26')
LIMIT 5
Check whether a customer made an on-time payment.
WITH staff_customer_pairs AS (
    -- start: previous query
    SELECT DISTINCT fact_rental.staff_id,
    CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
    fact_rental.customer_id
    FROM fact_rental
    JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id
    -- end: previous query
)
SELECT staff_name, COUNT(customer_id)
FROM staff_customer_pairs
GROUP BY staff_name
From the previous query (”Get which staff member served which customer.”), we want to find the total number of customers served by each staff member.
WITH customer_payment_info AS (
    -- start: previous query
    SELECT fact_rental.customer_id, fact_rental.rental_id,
    (CASE
        WHEN payment_date < return_date THEN 1
        ELSE 0
    END) AS on_time_payment
    FROM fact_rental
    JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
    WHERE dim_customer.country IN ('United States', 'Canada')
    AND fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-06-24'
    -- end: previous query
)
SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
FROM customer_payment_info
GROUP BY customer_id
WITH customer_payment_info AS (
    SELECT fact_rental.customer_id, fact_rental.rental_id,
    (CASE
        WHEN payment_date < return_date THEN 1
        ELSE 0
    END) AS on_time_payment
    FROM fact_rental
    JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
    WHERE dim_customer.country IN ('United States', 'Canada')
    AND (fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-07-26')
), customer_percent_on_time_payment AS ( -- 👈 new from here
    SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
    FROM customer_payment_info
    GROUP BY customer_id
)
SELECT MAX(percent_on_time_payment)
FROM customer_percent_on_time_payment
SELECT film_id, length
FROM dim_film
WHERE length > (SELECT AVG(length) FROM dim_film) -- 👈 subquery
Get the ids of the films that have length greater than the average length.
SELECT
    fact_rental.customer_id,
    dim_category.name,
    AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
FROM fact_rental
JOIN dim_category ON fact_rental.category_id = dim_category.category_id
GROUP BY fact_rental.customer_id, dim_category.name
ORDER BY fact_rental.customer_id, average_rental_days DESC
WITH customer_info AS (
    -- start: previous query
    SELECT
        fact_rental.customer_id,
        dim_category.name,
        AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
    FROM fact_rental
    JOIN dim_category ON fact_rental.category_id = dim_category.category_id
    GROUP BY fact_rental.customer_id, dim_category.name
    ORDER BY fact_rental.customer_id, average_rental_days DESC
    -- end: previous query
)
SELECT customer_id, name, average_rental_days,
    RANK() OVER ( -- 👈 SQL Window Function
        PARTITION BY customer_id ORDER BY average_rental_days DESC
    ) AS rank_category
FROM customer_info
ORDER BY customer_id, rank_category
CREATE INDEX rental_idx ON payment (rental_id);
-- index before select
CREATE INDEX rental_idx ON payment (rental_id);
-- then
SELECT * FROM payment WHERE rental_id = 1;
SELECT customer_id, rental_id FROM payment;