Query with ‘LIMIT n’
See also
The Vast DB SDK API Documentation is available here.
Overview
A common operation when working interactively with large datasets is to LIMIT the number of records returned, so that you can quickly view some data without processing the whole dataset.
Instructions
Suppose you have a large table with 10 million rows and want to run a query that returns only a small set of rows, similar to the SQL LIMIT clause (e.g. SELECT * FROM table LIMIT n).
Here’s how you might configure it:
from vastdb.config import QueryConfig

config = QueryConfig(
    num_splits=1,                 # Manually specify 1 split
    num_sub_splits=1,             # Each split will be divided into 1 sub-split
    limit_rows_per_sub_split=10,  # Each sub-split will process 10 rows at a time
)
Here’s how it can be used:
import pyarrow
import vastdb

# ENDPOINT, ACCESS_KEY, SECRET_KEY, DB_BUCKET, DB_SCHEMA and DB_TABLE
# are placeholders for your own cluster and table details.
session = vastdb.connect(
    endpoint=ENDPOINT, access=ACCESS_KEY, secret=SECRET_KEY
)

with session.transaction() as tx:
    table = tx.bucket(DB_BUCKET).schema(DB_SCHEMA).table(DB_TABLE)
    batches = table.select(config=config)
    first_batch = next(batches)
    assert first_batch.num_rows == 10
Note that table.select() returns a pyarrow.RecordBatchReader, from which we take the first batch. We then verify that this batch has 10 rows.
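The example above takes a single batch from the reader. If the configuration uses more than one split or sub-split, the limit is applied per sub-split, so the reader may (this is an assumption drawn from the parameter name, not a documented guarantee) return more than n rows in total. A minimal sketch of trimming the stream on the client side follows. The helper name take_rows is hypothetical and not part of the Vast DB SDK; it relies only on the num_rows attribute and slice(offset, length) method that pyarrow.RecordBatch provides.

```python
from typing import Any, Iterable, Iterator


def take_rows(batches: Iterable[Any], n: int) -> Iterator[Any]:
    """Yield batches until n rows have been produced in total.

    Hypothetical helper (not part of the Vast DB SDK). Each item must
    expose `num_rows` and `slice(offset, length)`, as pyarrow.RecordBatch
    does. The last batch is trimmed so that the total never exceeds n.
    """
    remaining = n
    if remaining <= 0:
        return
    for batch in batches:
        if batch.num_rows > remaining:
            # Keep only the rows still needed from this batch.
            batch = batch.slice(0, remaining)
        if batch.num_rows > 0:
            yield batch
            remaining -= batch.num_rows
        if remaining == 0:
            # Stop before pulling another batch from the reader.
            return
```

Inside the transaction shown earlier, it could be used as `for batch in take_rows(table.select(config=config), 10): ...`. Because the helper stops iterating as soon as n rows have been collected, it does not request further batches from the reader.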