👉 Lecture notes & Repository.
- Storage and queries are more complex than they appear.
- Various decisions in data storage impact performance and efficiency.
- Storage solution considerations: data type, data size, data format, access and update pattern.
- Storage hierarchy:
- Management system: organizes data on the raw physical components and lets you interact with the stored data
- OLTP (Online Transaction Processing) systems: focus on performing read and write queries with low latency ← row-oriented storage is more suitable
- OLAP (Online Analytical Processing) systems: focus on applying analytical operations to data (e.g. aggregation, summarization) ← column-oriented storage is more suitable
- Plan:
- Week 1: Trade-off between storage cost and performance.
- Cloud storage paradigms (block, object and file storage)
- Data storage in databases
- Row vs column-oriented databases
- Graph and vector databases
- Characteristics of physical components
- Serialization and compression
- Week 2: How to choose the appropriate abstractions for storing your data.
- Week 3: Queries
- How queries work
- How different storage solutions affect query performance
- Techniques for improving query performance
- Persistent storage media: HDD, SSD.
- Volatile Memory: RAM, CPU cache.
- Serialization: data needs to be serialized before being saved to storage and deserialized before being read (see the sketch after this list).
- Serialization formats
- Compression
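- A minimal sketch of the serialize → compress → store round trip, using only the Python standard library (the record is made up for illustration):

```python
import gzip
import json
import pickle

record = {"id": 1, "name": "Ada", "scores": [0.9, 0.8]}

# Serialization: turn the in-memory object into bytes before storing it.
as_json = json.dumps(record).encode("utf-8")  # human-readable, portable
as_pickle = pickle.dumps(record)              # Python-specific binary format

# Compression: shrink the serialized bytes before writing them to disk.
compressed = gzip.compress(as_json)

# Reading goes the other way: decompress, then deserialize.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
assert restored == record
```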
- File Storage:
- Organizes data in a hierarchical directory structure.
- Suitable for centralized access and easy sharing among users.
- Lower read/write performance due to the overhead of traversing the file hierarchy.
- Block Storage:
- Stores data in fixed-size blocks.
- High performance and low latency for frequent read/write operations.
- Ideal for transactional workloads and virtual machine storage.
- Object Storage:
- Stores data as immutable objects in a flat structure.
- Highly scalable, supporting petabytes of data.
- Best for analytical workloads, data lakes, and large unstructured data storage.
- Which one to choose?
- Storage as a cluster (a group of nodes), where each node is an HDD or SSD.
- Easy horizontal scaling, plus higher fault tolerance and data durability (if one node dies, the others still hold the data).
- Methods for distributing data: replication (duplicate the same data across many nodes) or partitioning (split the data and put the pieces on multiple nodes).
- Weakness of distributing data: coordination takes time (e.g., keeping replicas in sync adds latency).
- CAP Theorem: you can only choose 2 of 3: consistency, availability, and partition tolerance. Partition tolerance is effectively fixed (network partitions will happen), so the real choice is between consistency and availability!
- RDBMS → consistency, NoSQL → availability
- You want to distribute your data across multiple nodes → you need to split your data into partitions or shards. You can use a database sharding method or rule to construct a shard key that indicates how the data will be partitioned.
- Common sharding methods (a small sketch follows this list):
- Range-based sharding: Splits rows based on a range of values, such as first letters of customer names, but may lead to unbalanced data distribution.
- Hashed sharding: Uses a hash function to partition data, resulting in a more balanced distribution across nodes.
- Geo sharding: Partitions data by geographical location, allowing data to be stored closer to customers for faster retrieval.
- Other methods: split data based on meaningful attributes like occupation or favorite color.
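- A toy sketch contrasting range-based and hashed sharding (the shard count and letter ranges are made up; a stable hash like MD5 is used because Python's built-in `hash()` changes between runs):

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size for this sketch

def hashed_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hashed sharding: a hash of the shard key picks the node,
    spreading keys roughly evenly across shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(name: str) -> int:
    """Range-based sharding on the first letter of a customer name;
    simple, but popular letters can overload a single shard."""
    first = name[:1].upper()
    if first <= "F":
        return 0
    if first <= "L":
        return 1
    if first <= "R":
        return 2
    return 3

for customer in ["Alice", "Bob", "Santos", "Zoe"]:
    print(customer, "-> range shard:", range_shard(customer),
          "| hashed shard:", hashed_shard(customer))
```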
- Check the code examples below.
- Object storage:
- Using Amazon S3 buckets and `boto3` to upload data.
- An object key looks like it contains folders and subfolders, but the structure is actually flat (all objects live in the same namespace).
- Enable versioning: if you upload to the same object key, a new version of that file is created → use `list_object_versions` to check. A sketch of the whole flow follows below.
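- A sketch of that flow with `boto3` (the bucket name and local file are hypothetical, and AWS credentials are assumed to be configured):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket name

# Turn on versioning so overwrites create new versions instead of replacing data.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# The key looks hierarchical, but S3 stores all objects in a flat namespace.
s3.upload_file(Filename="report.csv", Bucket=bucket, Key="sales/2024/report.csv")

# Uploading to the same key again creates a second version of the object.
s3.upload_file(Filename="report.csv", Bucket=bucket, Key="sales/2024/report.csv")

# List every version stored under that key.
response = s3.list_object_versions(Bucket=bucket, Prefix="sales/2024/report.csv")
for version in response.get("Versions", []):
    print(version["Key"], version["VersionId"], version["IsLatest"])
```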
- File storage
- Block storage: a server emulates the behavior of block storage.
- Memory: using the `cache-pandas` package with the `@timed_lru_cache()` decorator to cache data in RAM (sketch below).
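- A minimal sketch of the RAM-caching idea; the decorator's signature is assumed from the `cache-pandas` README, and the URL is hypothetical:

```python
import pandas as pd
from cache_pandas import timed_lru_cache  # signature assumed from the package README

@timed_lru_cache(seconds=60, maxsize=None)  # keep results in RAM for 60 seconds
def load_data(url: str) -> pd.DataFrame:
    # The expensive read only runs on a cache miss; repeated calls within
    # the expiry window return the cached DataFrame straight from memory.
    return pd.read_csv(url)

df1 = load_data("https://example.com/data.csv")  # hypothetical URL; hits the source
df2 = load_data("https://example.com/data.csv")  # served from the in-memory cache
```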
- Databases store data via a DBMS (Database Management System). It contains several components:
- Transport System
- Query Processor
- Execution Engine
- Storage Engine
- The Storage Engine is responsible for:
- Serialization
- Arrangement of data on disk
- Indexing
- Indexing helps query data much faster: an index stores the locations of records keyed by a value, so the engine can jump to a row instead of scanning everything. Indexes can be built on key values or foreign key values.
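- A toy sketch (plain Python, not a real storage engine) of why an index speeds up lookups:

```python
# Rows as they might sit in storage.
rows = [
    {"id": 17, "name": "Ada"},
    {"id": 3, "name": "Grace"},
    {"id": 42, "name": "Edsger"},
]

# Build the index once: key value -> position of the row in storage.
index = {row["id"]: pos for pos, row in enumerate(rows)}

# Without an index: O(n) scan over every row.
scan_hit = next(r for r in rows if r["id"] == 42)

# With the index: jump straight to the row's location.
index_hit = rows[index[42]]
assert scan_hit == index_hit
```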
- Row-oriented storage: more suitable for transactional workloads that require low-latency read and write operations, but inefficient for analytical workloads.
- Column-oriented storage: suitable for analytical workloads that aggregate over columns, but not suitable for reading, writing, or updating individual rows.
- Hybrid approach: combine the column-oriented and row-oriented layouts by partitioning rows into groups and storing each row group in a column-wise format.
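- A toy illustration of the three physical layouts on the same made-up table:

```python
table = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": 12.5},
    {"id": 3, "price": 9.0},
    {"id": 4, "price": 11.0},
]

# Row-oriented: each row's values sit together (fast point reads and writes).
row_store = [list(row.values()) for row in table]

# Column-oriented: each column's values sit together (fast aggregations).
col_store = {col: [row[col] for row in table] for col in table[0]}
avg_price = sum(col_store["price"]) / len(col_store["price"])

# Hybrid (Parquet-style): partition rows into row groups,
# then store each group column-wise.
GROUP_SIZE = 2
row_groups = [
    {col: [row[col] for row in table[i:i + GROUP_SIZE]] for col in table[0]}
    for i in range(0, len(table), GROUP_SIZE)
]
```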
- Parquet can be used for storing tabular data as well as nested data such as JSON-like structures.
- Parquet's portability is another key advantage. It offers improved performance when working with external tools, unlike proprietary columnar formats from cloud data warehouses that need conversion for compatibility.
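- A small pandas example of writing and selectively reading Parquet (requires a Parquet engine such as pyarrow; the file name and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "category": ["Beverages", "Produce", "Beverages"],
    "price": [4.5, 3.0, 7.25],
})

# Write the table in Parquet's columnar format.
df.to_parquet("orders.parquet")

# Column pruning: read back only the columns the query needs.
prices = pd.read_parquet("orders.parquet", columns=["category", "price"])
print(prices.groupby("category")["price"].mean())
```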
- Plain key-value databases (NoSQL)
- Wide-column databases
- Wide-column databases don't enforce a strict table schema: adding columns is very flexible, and a column is only written if there's data for it.
- Amazon Keyspaces (for Apache Cassandra)
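- A toy sketch (plain Python dicts, not a real wide-column engine) of the flexible-schema idea:

```python
# Each row stores only the columns it actually has values for.
wide_column_table = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Grace", "phone": "555-0101"},  # no email column written
    "user:3": {"name": "Edsger", "email": "ew@example.com",
               "favorite_color": "blue"},  # new column, no schema migration
}

# A column is simply absent when there's no data for it.
print(wide_column_table["user:2"].get("email"))  # -> None
```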
- With relational databases, things get complicated when relationships are complex (lots of joins) → graph databases help.
- Use cases of Graph Database:
- Recommending products
- Modeling social networks
- Representing network and IT operations
- Simulating supply chain logistics
- Tracing data lineage
- Fraud detection
- Knowledge graphs (used with the RAG technique in LLMs)
- Graph databases: Neo4j, ArangoDB, Amazon Neptune.
- Vector databases enable you to efficiently query data based on semantic similarities.
- You could use a vector database for storing any kind of data that is in the form of numbers in an array.
- A vector database uses a distance metric to find similar vectors: Euclidean distance, cosine distance, or Manhattan distance.
- Similarity search: KNN (inefficient as data size increases; suffers from the curse of dimensionality), ANN (more efficient than KNN; vector databases are built to support ANN). A brute-force sketch follows below.
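- A brute-force KNN sketch with NumPy and cosine distance (the vectors are made up); this linear scan over every stored vector is exactly what stops scaling and what ANN indexes avoid:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A tiny "vector database": five 4-dimensional embeddings.
vectors = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.0],
    [0.85, 0.15, 0.05, 0.25],
    [0.0, 0.2, 0.9, 0.4],
    [0.3, 0.3, 0.3, 0.3],
])

query = np.array([0.88, 0.12, 0.02, 0.22])
k = 2

# KNN: compute the distance to every stored vector, then keep the k smallest.
distances = [cosine_distance(query, v) for v in vectors]
top_k = np.argsort(distances)[:k]
print("nearest indices:", top_k)
```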
Given a query point, the search (this is the procedure used by HNSW, Hierarchical Navigable Small World graphs) starts at the entry node of the top layer and navigates through the graph of that layer, each time moving to the neighboring node that is closest to the query point. It stops at a node that has no neighbor closer to the query point. The algorithm then jumps to that same node in the next lower layer and begins searching again, repeating the process until it finds the nearest node at the bottom layer.
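A toy sketch of that greedy layered search (not a real HNSW implementation: index construction and candidate-list refinements are omitted, and the layer graphs are hand-made):

```python
import numpy as np

def greedy_layer_search(query, entry, graph, vectors):
    """Greedy search within one layer: keep moving to the neighbor
    closest to the query until no neighbor improves on the current node."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current

def layered_search(query, layers, entry, vectors):
    """Descend from the top layer to the bottom, using the node found
    in each layer as the entry point of the layer below."""
    node = entry
    for graph in layers:  # layers ordered top -> bottom
        node = greedy_layer_search(query, node, graph, vectors)
    return node

vectors = {i: np.array(v) for i, v in enumerate(
    [[0.0, 0.0], [1.0, 0.0], [0.95, 0.05], [5.0, 5.0]])}
layers = [
    {0: [3], 3: [0]},                              # sparse top layer
    {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]},  # dense bottom layer
]
print(layered_search(np.array([0.95, 0.05]), layers, entry=0, vectors=vectors))  # -> 2
```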
- Graph databases such as Neo4j allow you to model your data as a graph and interact with it similar to how you interact with relational databases.
- This focuses on the Property Graph Model.
- The model describes what types of nodes exist in the graph and how they are linked together.
- Edges between nodes are called relationships, and each relationship has a type.
- Each relationship type has a source node and a target node.
- An example:
- To create a graph database in Neo4j:
- Cypher Query Language → use `MATCH pattern RETURN result`
```cypher
// Return all nodes
MATCH (n) RETURN n

// Get the total number of nodes
MATCH (n) RETURN count(n)

// Count only nodes with a specific label
MATCH (n:Order) RETURN count(n)

// Explore node labels
// DISTINCT ensures the returned labels aren't repeated
MATCH (n) RETURN DISTINCT labels(n)

// Explore the properties of a specific node
MATCH (n:Order) RETURN properties(n) LIMIT 1  // limit to the 1st node
```
Pattern notation: node → `()`, relationship → `[]`, path → `(source node)-[r]->(target node)`
```cypher
// Count all the directed paths
MATCH ()-[r]->() RETURN count(r)

// List the distinct types of all relationships in the graph
MATCH ()-[r]->() RETURN DISTINCT type(r)

// Specify the type of the relationship -> return properties
// AS defines an alias for the returned value
MATCH ()-[r:ORDERS]->()
RETURN avg(r.quantity * r.unitPrice) AS average_price

// Get the average price for all orders grouped by product category
MATCH ()-[r:ORDERS]->()-[part:PART_OF]->(c:Category)
RETURN c.categoryName, avg(r.quantity * r.unitPrice) AS average_price
```
```cypher
// Filter: retrieve the product name and unit price of all products
// that belong to the category "Meat/Poultry"
MATCH (p:Product)-[:PART_OF]->(c:Category)
WHERE c.categoryName = "Meat/Poultry"
RETURN p.productName, p.unitPrice

// OR: inline the property filter
MATCH (p:Product)-[:PART_OF]->(c:Category {categoryName: "Meat/Poultry"})
RETURN p.productName, p.unitPrice
```
```cypher
// Retrieve the product name of all products ordered by the
// customer with customerID "QUEDE"
MATCH (c1:Customer {customerID: "QUEDE"})-[:PURCHASED]->()-[:ORDERS]->(p:Product)
RETURN p.productName

// Get the IDs of other customers who ordered the same products as "QUEDE"
MATCH (c1:Customer {customerID: "QUEDE"})-[:PURCHASED]->()-[:ORDERS]->(p:Product)<-[:ORDERS]-()<-[:PURCHASED]-(c2:Customer)
RETURN c2.customerID
```
```cypher
// Retrieve the orders that contain at most 2 products
// STEP 1: get the total number of products for each order
MATCH (o:Order)-[:ORDERS]->(p:Product)
RETURN o.orderID AS ID, count(p) AS countProd

// STEP 2: filter the orders having at most 2 products
MATCH (o:Order)-[:ORDERS]->(p:Product)
WITH o.orderID AS ID, count(p) AS countProd  // replace RETURN with WITH
WHERE countProd <= 2
RETURN ID, countProd
```
- ⭐ Read the notebook; there are detailed explanations there!