Trino Query Time Merge (non-id)#
Handling Duplicates and Latest Updates with Trino
See also
In this guide, we demonstrate how to manage duplicates at query time and retain the latest updates using Trino SQL. Similar to Apache Spark’s window functions, Trino’s window functions enable you to:
Partition the data based on a unique identifier or composite key (e.g.,
id
,first_name
,last_name
).Order the records by a timestamp (or another field to determine the most recent entry).
Assign row numbers to isolate and filter the latest record.
By leveraging this approach, we can efficiently handle large datasets while ensuring that only the most up-to-date information is processed.
1. Sample Data#
Let’s assume we have a table sample_data
with duplicate records:
CREATE TABLE sample_data (
id INTEGER,
first_name VARCHAR,
last_name VARCHAR,
timestamp TIMESTAMP
);
INSERT INTO sample_data VALUES
(1, 'Alice', 'Smith', TIMESTAMP '2024-11-18 10:00:00'),
(2, 'Bob', 'Johnson', TIMESTAMP '2024-11-18 11:00:00'),
(1, 'Alice', 'Smith', TIMESTAMP '2024-11-18 12:00:00'), -- Duplicate with later timestamp
(3, 'Charlie', 'Brown', TIMESTAMP '2024-11-18 09:00:00'),
(2, 'Bob', 'Johnson', TIMESTAMP '2024-11-18 12:00:00'); -- Duplicate with later timestamp
2. Remove Duplicates and Keep the Latest Record#
To remove duplicates and retain the most recent record for each unique combination of id
, first_name
, and last_name
, we use Trino’s window functions.
WITH ranked_data AS (
SELECT
id,
first_name,
last_name,
timestamp,
ROW_NUMBER() OVER (
PARTITION BY id, first_name, last_name
ORDER BY timestamp DESC
) AS row_num
FROM sample_data
)
SELECT id, first_name, last_name, timestamp
FROM ranked_data
WHERE row_num = 1;
Result:
id |
first_name |
last_name |
timestamp |
---|---|---|---|
1 |
Alice |
Smith |
2024-11-18 12:00:00 |
2 |
Bob |
Johnson |
2024-11-18 12:00:00 |
3 |
Charlie |
Brown |
2024-11-18 09:00:00 |
3. Handling Composite Keys#
If you need to identify duplicates based on a combination of fields (e.g., first_name
and last_name
instead of id
), you can adjust the PARTITION BY
clause accordingly.
For example, to handle duplicates based only on first_name
and last_name
:
WITH ranked_data AS (
SELECT
first_name,
last_name,
timestamp,
ROW_NUMBER() OVER (
PARTITION BY first_name, last_name
ORDER BY timestamp DESC
) AS row_num
FROM sample_data
)
SELECT first_name, last_name, timestamp
FROM ranked_data
WHERE row_num = 1;
This approach is useful when a single identifier (id
) is unavailable or if uniqueness is defined by a combination of attributes.
4. Why is the Window Function Required?#
a) Partitioning Data
The PARTITION BY
clause groups the data by the specified keys (id
, first_name
, last_name
or other composite fields).
b) Sorting Within Partitions
The ORDER BY timestamp DESC
ensures the latest records appear first within each partition.
c) Row Number Assignment
The ROW_NUMBER()
function assigns a unique rank to each record in the partition, starting with the most recent record.
d) Filtering for the Latest Record
Filtering where row_num = 1
isolates the most recent record for each unique key combination.
5. Optimizing Data Processing with Filters#
To optimize performance, apply filters before using window functions:
Filter by Time Window:
SELECT * FROM sample_data WHERE timestamp BETWEEN TIMESTAMP '2024-01-01' AND TIMESTAMP '2024-01-31';
Filter by Specific Entity:
SELECT * FROM sample_data WHERE id = 12345;
Conclusion#
In this guide, we demonstrated how to handle duplicates and retain the latest updates using Trino’s window functions. The process involves:
Partitioning data by a unique identifier or composite key (e.g.,
id
,first_name
,last_name
).Sorting records by a timestamp to identify the most recent entries.
Using
ROW_NUMBER()
to filter and keep only the latest record.
This approach is flexible, efficient, and scalable, ensuring that large datasets are processed accurately to maintain only the most up-to-date information.