Trino Query Time Merge (non-id)#

Handling Duplicates and Latest Updates with Trino

In this guide, we demonstrate how to manage duplicates at query time and retain only the latest update for each record using Trino SQL. Much like Apache Spark, Trino provides window functions that let you:

  • Partition the data based on a unique identifier or composite key (e.g., id, first_name, last_name).

  • Order the records by a timestamp (or another field that determines the most recent entry).

  • Assign row numbers to isolate and filter the latest record.

By leveraging this approach, we can efficiently handle large datasets while ensuring that only the most up-to-date information is processed.


1. Sample Data#

Let’s assume we have a table sample_data with duplicate records:

CREATE TABLE sample_data (
    id INTEGER,
    first_name VARCHAR,
    last_name VARCHAR,
    timestamp TIMESTAMP
);

INSERT INTO sample_data VALUES 
(1, 'Alice', 'Smith', TIMESTAMP '2024-11-18 10:00:00'),
(2, 'Bob', 'Johnson', TIMESTAMP '2024-11-18 11:00:00'),
(1, 'Alice', 'Smith', TIMESTAMP '2024-11-18 12:00:00'), -- Duplicate with later timestamp
(3, 'Charlie', 'Brown', TIMESTAMP '2024-11-18 09:00:00'),
(2, 'Bob', 'Johnson', TIMESTAMP '2024-11-18 12:00:00');   -- Duplicate with later timestamp
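
A quick sanity check lists the raw rows; you should see all five, including the two newer duplicates:

SELECT * FROM sample_data ORDER BY id, timestamp;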

2. Remove Duplicates and Keep the Latest Record#

To remove duplicates and retain the most recent record for each unique combination of id, first_name, and last_name, we use Trino’s window functions.

WITH ranked_data AS (
    SELECT 
        id,
        first_name,
        last_name,
        timestamp,
        ROW_NUMBER() OVER (
            PARTITION BY id, first_name, last_name
            ORDER BY timestamp DESC
        ) AS row_num
    FROM sample_data
)
SELECT id, first_name, last_name, timestamp
FROM ranked_data
WHERE row_num = 1;

Result:

id | first_name | last_name | timestamp
---|------------|-----------|--------------------
 1 | Alice      | Smith     | 2024-11-18 12:00:00
 2 | Bob        | Johnson   | 2024-11-18 12:00:00
 3 | Charlie    | Brown     | 2024-11-18 09:00:00
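
Because the partition key here covers every column except timestamp, the same result can also be produced with a plain aggregation. This is a minimal alternative sketch; the window-function form remains the general tool when columns outside the key must be carried along (Trino's max_by aggregate is one option for that):

SELECT id, first_name, last_name, max(timestamp) AS timestamp
FROM sample_data
GROUP BY id, first_name, last_name;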


3. Handling Composite Keys#

If you need to identify duplicates based on a combination of fields (e.g., first_name and last_name instead of id), you can adjust the PARTITION BY clause accordingly.

For example, to handle duplicates based only on first_name and last_name:

WITH ranked_data AS (
    SELECT 
        first_name,
        last_name,
        timestamp,
        ROW_NUMBER() OVER (
            PARTITION BY first_name, last_name
            ORDER BY timestamp DESC
        ) AS row_num
    FROM sample_data
)
SELECT first_name, last_name, timestamp
FROM ranked_data
WHERE row_num = 1;

This approach is useful when a single identifier (id) is unavailable or if uniqueness is defined by a combination of attributes.
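
One caveat with timestamp-only ordering: if two rows in a partition share the same timestamp, ROW_NUMBER() picks the surviving row arbitrarily. Adding a deterministic tie-breaker to the ORDER BY (here the higher id, an assumption you should adapt to your own data) makes the result reproducible:

WITH ranked_data AS (
    SELECT
        first_name,
        last_name,
        timestamp,
        ROW_NUMBER() OVER (
            PARTITION BY first_name, last_name
            ORDER BY timestamp DESC, id DESC -- id breaks timestamp ties deterministically
        ) AS row_num
    FROM sample_data
)
SELECT first_name, last_name, timestamp
FROM ranked_data
WHERE row_num = 1;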


4. Why Is the Window Function Required?#

a) Partitioning Data
The PARTITION BY clause groups the data by the specified keys (id, first_name, last_name or other composite fields).

b) Sorting Within Partitions
The ORDER BY timestamp DESC ensures the latest records appear first within each partition.

c) Row Number Assignment
The ROW_NUMBER() function assigns a sequential number to each record within its partition; with the descending sort, the most recent record receives number 1.

d) Filtering for the Latest Record
Filtering where row_num = 1 isolates the most recent record for each unique key combination.
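
This is also why ROW_NUMBER() is preferred over RANK() for this pattern: on a timestamp tie, RANK() assigns 1 to every tied row, so filtering on rank 1 could still return duplicates, whereas ROW_NUMBER() always yields exactly one row per partition. A short query to compare the two side by side:

SELECT
    id,
    timestamp,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS row_num,
    RANK() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rnk
FROM sample_data;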


5. Optimizing Data Processing with Filters#

To optimize performance, apply filters before the window function runs; this reduces the data Trino has to partition and sort, and lets connectors prune partitions where possible (a combined sketch follows the list):

  • Filter by Time Window (a half-open range avoids silently dropping most of the final day, which BETWEEN with a date-only upper bound would):

    SELECT *
    FROM sample_data
    WHERE timestamp >= TIMESTAMP '2024-11-01'
      AND timestamp < TIMESTAMP '2024-12-01';
    
  • Filter by Specific Entity:

    SELECT *
    FROM sample_data
    WHERE id = 2;
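
Putting the two together, placing the filter inside a CTE means it is applied before the window function runs, so Trino partitions and sorts only the reduced row set. The date bound below is illustrative and chosen to match the sample data:

WITH recent AS (
    SELECT *
    FROM sample_data
    WHERE timestamp >= TIMESTAMP '2024-11-01' -- prune early, before the window sort
),
ranked_data AS (
    SELECT
        id, first_name, last_name, timestamp,
        ROW_NUMBER() OVER (
            PARTITION BY id, first_name, last_name
            ORDER BY timestamp DESC
        ) AS row_num
    FROM recent
)
SELECT id, first_name, last_name, timestamp
FROM ranked_data
WHERE row_num = 1;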
    

Conclusion#

In this guide, we demonstrated how to handle duplicates and retain the latest updates using Trino’s window functions. The process involves:

  • Partitioning data by a unique identifier or composite key (e.g., id, first_name, last_name).

  • Sorting records by a timestamp to identify the most recent entries.

  • Using ROW_NUMBER() to filter and keep only the latest record.

This approach is flexible and scales well: because duplicates are resolved at query time, only the most up-to-date information is returned without rewriting the underlying data.