List of notes for this specialization + Lecture notes & Repository + Home page on Coursera. Read this note alongside the lecture notes—some points aren't mentioned here as they're already covered in the lecture notes.
- As a data engineer, you’ll write queries to extract data and set up storage solutions for stakeholders to query directly, considering how storage choices affect query speed and performance.
- Query: A statement that you write in a specific query language to retrieve or
act on data.
- In previous course, you’ve worked with some kinds of queries:
- Week 3 plan:
- Database Management System (DBMS)
- Transport System: when you send a query to DBMS, your request arrives in TS and then to Query Processor.
- Query Processor: Query Parser + Query Optimizer
- Execution Engine: excution plan → query result
- Storage Engine: already mentioned in prev courses.
- Use
EXPLAIN
keyword before any query to see the cost.
- Common vs Advanced SQL statements
- The data we will work with
- Use
DISTINCT
and SQL Functions
1SELECT DISTINCT
2 fact_rental.staff_id,
3 CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
4 fact_rental.customer_id
5FROM fact_rental
6JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id
Get which staff member served which customer.
CASE
1CASE
2 WHEN cond1 THEN result1
3 WHEN cond2 THEN result2
4 ELSE result3
5END
General structure.
- Use
CASE
andIN
1SELECT
2 fact_rental.customer_id,
3 fact_rental.rental_id,
4 (CASE
5 WHEN payment_date < return_date THEN 1
6 ELSE 0
7 END) AS on_time_payment
8FROM fact_rental
9JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
10WHERE dim_customer.country IN ("United States", "Canada")
11 AND (fact_retal.rental_date between "2005-05-24" and "2005-07-26")
12LIMIT 5
Check whether a customer made an on-time payment.
- Common Table Expressions (CTE) → Use
WITH...AS
1WITH staff_customer_pairs AS (
2 -- start: previous query
3 SELECT DISTINCT fact_rental.staff_id,
4 CONCAT(dim_staff.first_name, ' ', dim_staff.last_name) AS staff_name,
5 fact_rental.customer_id
6 FROM fact_rental
7 JOIN dim_staff ON fact_rental.staff_id = dim_staff.staff_id
8 -- end: prev query
9)
10SELECT staff_name, COUNT(customer_id)
11FROM staff_customer_pairs
12GROUP BY staff_name
From previous query (”Get which staff member served which customer.”), we want to find the tootal number of customers served by each staff member.
- Compute the percentage of on-time payments for each customer (from the prev query)
1WITH customer_payment_info AS (
2 -- start: previous query
3 SELECT fact_rental.customer_id, fact_rental.rental_id,
4 (CASE
5 WHEN payment_date < return_date THEN 1
6 ELSE 0
7 END) AS on_time_payment
8 FROM fact_rental
9 JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
10 WHERE dim_customer.country IN ("United States", "Canada")
11 AND fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-06-24'
12 -- end: prev query
13)
14SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
15FROM customer_payment_info
16GROUP BY customer_id
- → Improve prev codes: find the maximum of the “percent on time payment” column
1WITH customer_payment_info AS (
2 SELECT fact_rental.customer_id, fact_rental.rental_id,
3 (CASE
4 WHEN payment_date < return_date THEN 1
5 ELSE 0
6 END) AS on_time_payment
7 FROM fact_rental
8 JOIN dim_customer ON dim_customer.customer_id = fact_rental.customer_id
9 WHERE dim_customer.country IN ("United States", "Canada")
10 AND (fact_rental.rental_date BETWEEN '2005-05-24' AND '2005-07-26')
11), customer_percent_on_time_payment AS ( -- 👈 new from here
12 SELECT customer_id, AVG(on_time_payment) AS percent_on_time_payment
13 FROM customer_payment_info
14 GROUP BY customer_id
15)
16SELECT MAX(percent_on_time_payment)
17FROM customer_percent_on_time_payment
- You can also incorporate some temporary results within your main query using subqueries
1SELECT film_id, length
2FROM dim_film
3WHERE length > (SELECT AVG(length) FROM dim_film) -- 👈 subquery
Get the ids of the films that have length greater than the average length.
- Compute the average duration in days that a customer spent on a film category
1SELECT
2 fact_rental.customer_id,
3 dim_category.name,
4 AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
5FROM fact_rental
6JOIN dim_category ON fact_rental.category_id = dim_category.category_id
7GROUP BY fact_rental.customer_id, dim_category.name
8ORDER BY fact_rental.customer_id, average_rental_days DESC
- SQL Window Functions: Allows you to apply an aggregate or ranking function over a
particular window or a set of rows. Does not group rows into a single output row: each row remains separate - Partitions rows to form a set of rows. (
PARTITION BY
clause is used) - Orders rows within those partitions into a particular order. (
ORDER BY
clause is used)
Window functions apply to aggregate and ranking functions over a particular window (set of rows).
OVER
clause is used with window functions to define that window. OVER
clause does two things : Note: If partitions aren’t done, then
ORDER BY
orders all rows of the table. - Add a column that shows for each customer the running sum over each window
1WITH customer_info AS (
2 -- start: previous query
3 SELECT
4 fact_rental.customer_id,
5 dim_category.name,
6 AVG(DATEDIFF(return_date, rental_date)) AS average_rental_days
7 FROM fact_rental
8 JOIN dim_category ON fact_rental.category_id = dim_category.category_id
9 GROUP BY fact_rental.customer_id, dim_category.name
10 ORDER BY fact_rental.customer_id, average_rental_days DESC
11 -- end: previous query
12)
13SELECT customer_id, name, average_rental_days,
14 RANK() OVER ( -- 👈 SQL Window Function
15 PARTITION BY customer_id ORDER BY average_rental_days DESC
16 ) AS rank_category
17FROM customer_info
18ORDER BY customer_id, rank_category
- ⭐ Check the notebook.
- After this assignment you'll be able to:
- Build SQL queries using several techniques to manipulate data.
- Distinguish between Common Table Expressions and Subqueries.
- Apply the
DATE_PART
function to manipulate date values. - Make use of the
LAG
andLEAD
functions to access values from other rows when making a calculation. - Utilize the
CASE
statement to return different values based on conditions. - Use window functions on SQL to aggregate data within a specific window.
- Understand pivot tables and identify how they are built.
- As a data engineer, understanding how indexes are implemented can help you design better index structures and optimize your queries to improve SQL performance.
- Index: A separate data structure that has its own disk space and contains information that refers to the actual table.
- Balanced Search Tree (B-Tree)
CREATE INDEX
to improve the query performance
1CREATE INDEX rental_idx ON payment (rental_id);
- Columnar Storage. Cloud DB like Amazon Redshift use Sort Key to sort one or more column and stores the sorted data on disk.
- If
employee_id
is the primary key , then an index is automatically created on the primary key.
- It’s worst when you write a query that not only scans the entire table, but also returns everything. Eg:
SELECT * FROM order
← avoid with noWHERE
- Large amounts of data need to be transferred from disk.
SELECT *
on cloud pas-as-you-go databases can be expensive.
- TIP → Using Pruning (tỉa, lược bớt)
- Row-based pruning
- Column-based pruning
- Partition pruning
1-- index before select
2CREATE INDEX rental_idx ON payment (rental_id);
3-- then
4SELECT * FROM payment WHERE rental_id = 1;
1SELECT customer_id, rental_id FROM payment;
JOIN
operation is one of the most time-consuming query operations.
JOIN
can use several methods:- Nested loop join: For each row in “orders”`, go throw all rows in “customers”
- Index-based nested loop:
- Hash join: using hash to map “customer_id” in to bucket
- Another challenge is when 2 tables have a many-to-many relationship. ← row explosion (return more rows than expected)
- Aggregate queries: used to compute summary value of a column like sum, average, max, min, count,…
- Using index if available.
- It’s quicker if using columnar storage vs row storage.
- We'll look at how Redshift queries data and some considerations to keep in mind for table design that can optimize query performance.
- Amazon Redshift is a columnar storage ← speed up query performance ← best suit for Data Warehousing and large-scale data analytics.
- Redshift uses Data compression. ← save storage space + read less data from disk.
- Support Massive Parallel Processing (MPP) → depend much on distribution style and sort key.
- Check the codes for this lab.
- Compare 2 types of storages.
- Row-based storage → PostgreSQL, column-based → Amazon Redshift
- TPC-H benchmark: The goal of this benchmark is to simulate a set of basic scenarios to examine a large dataset and execute queries to answer business questions. It's designed to evaluate the performance of various database systems and how they execute complex queries.
- Leverage Query Caching:
- Prioritize Readability:
- Data resouces → table bloat (data size on disk exceeds the actual data size)
- We can use system like Apache Flink or Apache Spark to apply SQL queries continuously on streaming data.
- To query on streaming data, we can use Windowed Query:
- To join data streams, we have several ways
- You can use several services on AMZ to work with Apache Flink or you can use Amz Managed Service for Apache Flink.
- AMSAF creates the hosted environment for you to run your applications.
- On AMSAF (Console), when creating a new environment with AMSAF:
- Streaming applications: allows you to host your Apache Flink application and run it in a production environment.
- Studio notebooks: Ideal for development and interactive data exploration. ← Apache Zeppelin
- AMSAF uses connectors to interface with databases, storages, message queues,…
- Notebook → Apache Zeppelin: code written in SQL, Python, Java. Data visualization as tables and charts. Integration with tools like Flink.
- In case you wanna setup notebook in your way (not using blueprint)
Check the codes for this lab.