Projections#
See also
The Vast DB SDK API Documentation is available here.
What is a projection?#
A projection is a subset of a database table’s columns that are copied into a separate table and sorted into chunks for accelerated queries
When the source table is queried for columns that have been projected, the query will access the projection for better performance.
A background process updates the projection, and if no projected value is available, the source table will be used.
How do projections work#
Create a projection with a subset of columns, choosing which column(s) to sort by
A background process creates the projection table and adds pointers to the source table and the projection table
The projection is sharded and sorted in 8*216 = 524,288 row chunks
The query engine see the pointers and can easily jump back and forth between the two
Updates and deletes don’t trigger a re-sort process (not needed)
Queries with the projected columns will use the projection table when possible, and if the projection doesn’t include a row, will use the source table instead.
End users shouldn’t need to understand the projection, they’ll just see that the query is accelerated.
Install and Import the Vast DB SDK#
Before doing anything else, we need to import the vastdb api library.
!pip install vastdb | tail -5
?25hDownloading markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.5/87.5 kB 6.8 MB/s eta 0:00:00
?25hDownloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: flatbuffers, xmltodict, sqlglot, parsy, mdurl, jmespath, bidict, backoff, atpublic, markdown-it-py, botocore, aws-requests-auth, s3transfer, rich, ibis-framework, boto3, vastdb
Successfully installed atpublic-4.1.0 aws-requests-auth-0.4.3 backoff-2.2.1 bidict-0.23.1 boto3-1.34.126 botocore-1.34.126 flatbuffers-24.3.25 ibis-framework-9.0.0 jmespath-1.0.1 markdown-it-py-3.0.0 mdurl-0.1.2 parsy-2.1 rich-13.7.1 s3transfer-0.10.1 sqlglot-23.12.2 vastdb-0.1.7 xmltodict-0.13.0
Note that below we import annotations. This mode makes Python’s behavior more strict, including raising errors for some cases where variables are accidentally undefined.
import vastdb
from __future__ import annotations # Enable stricter type checking
Creating the session#
In the code below, we read the connection details from environment variables to enable automated teesting of the notebook.
Change these to reflect your environment, e.g.
ENDPOINT = 'http://your_vast_endpoint:12345'
DATABASE_NAME = ...
import os
ENDPOINT = os.environ['ENDPOINT']
ACCESS_KEY = os.environ['ACCESS_KEY']
SECRET_KEY = os.environ['SECRET_KEY']
DATABASE_NAME = os.environ['DATABASE_NAME']
# Schema will get created if it doesn't exist
DATABASE_SCHEMA = os.environ['DATABASE_SCHEMA']
# The table name this notebook will create for projections
TABLE_NAME = os.environ['TABLE_NAME']
print(f"""
{ENDPOINT=}
{ACCESS_KEY=}
{DATABASE_NAME=}
{DATABASE_SCHEMA=}
{TABLE_NAME=}
""")
import pyarrow as pa
import vastdb
import os
session = vastdb.connect(
endpoint=ENDPOINT,
access=ACCESS_KEY,
secret=SECRET_KEY)
print("Vast Cluster version: ", session.api.vast_version)
Vast Cluster version: (5, 1, 0, 131)
Create a table for the projection#
import pyarrow as pa
from vastdb.errors import TableExists
ARROW_SCHEMA = pa.schema([
('Citizen_Age', pa.int64()),
('Citizen_Name', pa.string()),
('Citizen_experience', pa.float64()),
('Is_married', pa.bool_()),
])
with session.transaction() as tx:
try:
schema = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA, fail_if_missing=False)
if not schema:
schema = tx.bucket(DATABASE_NAME).create_schema(name=DATABASE_SCHEMA)
try:
table = schema.create_table(table_name=TABLE_NAME, columns=ARROW_SCHEMA)
print(f"Table created: {table.name}")
except TableExists as e:
raise RuntimeError("Couldn't create table because it already exists", e)
except Exception as e:
print("Couldn't create table:", e)
except Exception as e:
print("Schema doesn't exist:", e)
Table created: pythonsdkcitizen
Insert rows into the Citizen Table#
ROWS = {
'Citizen_Name': ['Alice', 'Bob', 'Koko', 'Menny'],
'Citizen_Age': [45, 38, 27, 51],
'Citizen_experience': [25.5, 17.9, 5.3, 28.2],
'Is_married': [True, False, False, True]
}
PA_RECORD_BATCH = pa.RecordBatch.from_pydict(ROWS)
with session.transaction() as tx:
try:
schema = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA, fail_if_missing=True)
try:
table = schema.table(name=TABLE_NAME)
table.insert(PA_RECORD_BATCH)
print("Data inserted.")
except Exception as e:
print("Couldn't insert data:", e)
except Exception as e:
print("Schema doesn't exist:", e)
Data inserted.
This uses the RecordBatch.from_pydict method. See also:
Query the table, listing all rows#
def list_rows():
print(f"Listing rows in: Database='{DATABASE_NAME}' Schema='{DATABASE_SCHEMA}' Table='{TABLE_NAME}'")
with session.transaction() as tx:
try:
schema = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA, fail_if_missing=True)
try:
table = schema.table(name=TABLE_NAME)
reader = table.select()
pyarrow_table = pa.Table.from_batches(reader)
df = pyarrow_table.to_pandas()
display(df)
except Exception as e:
print("Couldn't select data:", e)
except Exception as e:
print("Schema doesn't exist:", e)
list_rows()
Citizen_Age | Citizen_Name | Citizen_experience | Is_married | |
---|---|---|---|---|
0 | 45 | Alice | 25.5 | True |
1 | 38 | Bob | 17.9 | False |
2 | 27 | Koko | 5.3 | False |
3 | 51 | Menny | 28.2 | True |
create_projection
#
Usage: Creates a new semi-sorted projection on the table.
Parameters:
projection_name
(str): The name of the new projection.sorted_columns
(List[str]): A list of column names that should be sorted in the projection.unsorted_columns
(List[str]): A list of column names that should not be sorted in the projection.
Raises:
errors.NotSupportedCommand
: If the operation is not supported on the current table.
Returns:
Projection
: The newly created projection.
Note:
A projection is a subset of a table’s data, where some columns are sorted and others are not. This can be useful for optimizing certain types of queries.
PROJECTION_SORTED_COLUMNS = [ 'Citizen_Age' ]
PROJECTION_UNSORTED_COLUMNS = [ 'Citizen_Name', 'Is_married' ]
PROJECTION_NAME = 'demo-projection'
with session.transaction() as tx:
try:
table = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA).table(name=TABLE_NAME)
try:
projections = table.projections(PROJECTION_NAME)
if projections:
raise RuntimeError(f"Projection {PROJECTION_NAME} already exists.")
projection = table.create_projection(
PROJECTION_NAME,
sorted_columns=PROJECTION_SORTED_COLUMNS,
unsorted_columns=PROJECTION_UNSORTED_COLUMNS
)
print("Projection created:", projection)
except Exception as e:
print("Exception encountered:", e)
except Exception as e:
print("Schema doesn't exist:", e)
Projection created: Projection(name='demo-projection', table=Table(name='pythonsdkcitizen', schema=Schema(name='python-sdk-schema', bucket=Bucket(name='demo-database', tx=Transaction(id=0x000030000000002d))), handle=1877086195346196137, stats=TableStats(num_rows=6, size_in_bytes=169, is_external_rowid_alloc=False, endpoints=()), _imports_table=False), handle=8168954746074489682, stats=TableStats(num_rows=0, size_in_bytes=0, is_external_rowid_alloc=False, endpoints=()))
projections
#
Usage: Lists all semi-sorted projections of this table or a specific projection if a name is provided.
Parameters:
projection_name
(str, optional): The name of the projection to list. If not provided, all projections are listed.
Raises:
errors.NotSupportedCommand
: If the operation is not supported on the current table.
Returns:
List[Projection]
: A list of Projection objects representing the projections of this table.
Note:
A projection is a subset of a table’s data, where some columns are sorted and others are not. This can be useful for optimizing certain types of queries.
print(f"Executing: table.projections('{PROJECTION_NAME}')")
with session.transaction() as tx:
try:
table = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA).table(name=TABLE_NAME)
try:
projections = table.projections(PROJECTION_NAME)
for p in projections:
print(p)
except Exception as e:
print("Exception encountered:", e)
except Exception as e:
print("Schema doesn't exist:", e)
Projection created: [Projection(name='demo-projection', table=Table(name='pythonsdkcitizen', schema=Schema(name='python-sdk-schema', bucket=Bucket(name='demo-database', tx=Transaction(id=0x000030000000002e))), handle=1877086195346196137, stats=TableStats(num_rows=6, size_in_bytes=169, is_external_rowid_alloc=False, endpoints=()), _imports_table=False), handle=8168954746074489682, stats=TableStats(num_rows=0, size_in_bytes=0, is_external_rowid_alloc=False, endpoints=()))]
Projection Stats#
The projects are synced in the backend and it can take some time until you will see the stats.
The stats can be found in the reponse from table.projections()
. E.g.
Projection(
name='demo-projection',
table=Table(name='pythonsdkcitizen',
...
stats=TableStats(num_rows=6, size_in_bytes=169, is_external_rowid_alloc=False, endpoints=()),
...
)
rename
#
Usage: Renames this projection.
Parameters:
new_name
(str): The new name for the projection.
Raises:
errors.NotSupportedCommand
: If the operation is not supported on the current table.
Note:
This operation will change the name of the projection in the database. The change is reflected in the ‘name’ attribute of this object.
PROJECTION_NAME = 'demo-projection'
PROJECTION_NAME_RENAMED = 'demo-projection-renamed'
with session.transaction() as tx:
try:
table = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA).table(name=TABLE_NAME)
try:
projection = table.projection(PROJECTION_NAME)
print("Projection name before rename:", projection.name)
projection.rename(PROJECTION_NAME_RENAMED)
projection = table.projection(PROJECTION_NAME_RENAMED)
print("Projection name after rename:", projection.name)
except Exception as e:
print("Exception encountered:", e)
except Exception as e:
print("Schema doesn't exist:", e)
Projection name before rename: demo-projection
Projection name after rename: demo-projection-renamed
drop
#
Usage: Drops (deletes) this table from the VAST Database. This operation is irreversible and will permanently remove the table along with all its data and associated metadata from the database.
Parameters:
None
Raises:
VastDBError
: If there is an error communicating with the VAST server or if the table cannot be dropped due to permissions issues or other server-side constraints.
Note: This method will also remove any imports table associated with this table.
def list_projections():
with session.transaction() as tx:
try:
table = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA).table(name=TABLE_NAME)
try:
projections = table.projections()
print(f"Found {len(projections)} projections.")
for p in projections:
print(f'>>> {p.name}')
except Exception as e:
import sys, traceback
traceback.print_exc(file=sys.stdout)
print("Exception encountered:", e)
except Exception as e:
print("Schema doesn't exist:", e)
list_projections()
Found 1 projections.
>>> demo-projection-renamed
PROJECTION_NAME_RENAMED = 'demo-projection-renamed'
with session.transaction() as tx:
try:
table = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA).table(name=TABLE_NAME)
try:
projection = table.projection(PROJECTION_NAME_RENAMED)
projection.drop()
except vastdb.errors.MissingProjection:
print("Projection not found.")
except Exception as e:
import sys, traceback
traceback.print_exc(file=sys.stdout)
print("Exception encountered:", e)
except Exception as e:
print("Schema doesn't exist:", e)
list_projections()
Found 0 projections.