Trino Query Time Merge (id)#
Handling Duplicates and Latest Updates with Trino
See also
In this guide, we demonstrate how to manage duplicates at query time and retain the latest updates using Trino SQL. Similar to Apache Spark’s window functions, Trino’s window functions enable you to:
Partition the data based on a unique identifier (
id
).Order the records by a timestamp (or another field to determine the most recent entry).
Assign row numbers to isolate and filter the latest record.
By leveraging this approach, we can efficiently handle large datasets while ensuring that only the most up-to-date information is processed.
1. Sample Data#
Let’s assume we have a table sample_data
with duplicate records:
CREATE TABLE sample_data (
id INTEGER,
name VARCHAR,
timestamp TIMESTAMP
);
INSERT INTO sample_data VALUES
(1, 'Alice', TIMESTAMP '2024-11-18 10:00:00'),
(2, 'Bob', TIMESTAMP '2024-11-18 11:00:00'),
(1, 'Alice Updated', TIMESTAMP '2024-11-18 12:00:00'), -- Duplicate ID with later timestamp
(3, 'Charlie', TIMESTAMP '2024-11-18 09:00:00'),
(2, 'Bob Updated', TIMESTAMP '2024-11-18 12:00:00'); -- Duplicate ID with later timestamp
2. Remove Duplicates and Keep the Latest Record#
To remove duplicates and retain the most recent record for each id
, we use Trino’s window functions.
WITH ranked_data AS (
SELECT
id,
name,
timestamp,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY timestamp DESC
) AS row_num
FROM sample_data
)
SELECT id, name, timestamp
FROM ranked_data
WHERE row_num = 1;
Result:
id |
name |
timestamp |
---|---|---|
1 |
Alice Updated |
2024-11-18 12:00:00 |
2 |
Bob Updated |
2024-11-18 12:00:00 |
3 |
Charlie |
2024-11-18 09:00:00 |
3. Why is the Window Function Required?#
a) Partitioning Data
The PARTITION BY id
groups the data by the id
column. This allows each id
to be processed independently, isolating duplicates within each group.
b) Sorting Within Partitions
The ORDER BY timestamp DESC
sorts records within each partition, ensuring the most recent record is at the top.
c) Row Number Assignment
Using ROW_NUMBER()
assigns a unique rank to each record based on the descending order of the timestamp. The most recent record for each id
is assigned row_num = 1
.
d) Filtering for the Latest Record
Finally, filtering where row_num = 1
retains only the most recent record for each id
, effectively eliminating duplicates.
4. Optimizing Data Processing with Filters#
Processing the entire dataset might not always be necessary. To optimize performance, apply filters before using window functions:
Filter by Time Window:
Process only records within a specific time range:SELECT * FROM sample_data WHERE timestamp BETWEEN TIMESTAMP '2024-01-01' AND TIMESTAMP '2024-01-31';
Filter by Entity:
Focus on data for a specific entity:SELECT * FROM sample_data WHERE id = 12345;
By applying these filters in your query, you reduce the dataset size and improve query performance.
Conclusion#
In this guide, we demonstrated how to handle duplicates and retain the latest updates using Trino’s window functions. The process involves:
Partitioning data by a unique identifier (
id
).Sorting records by a timestamp to identify the most recent entries.
Using
ROW_NUMBER()
to filter and keep only the latest record.
This approach is efficient and scalable, ensuring that large datasets are processed accurately to maintain only the most up-to-date information.