RAG with Row-Level Security on VAST DB

RAG with Row-Level Security on VAST DB#

This tutorial demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline that respects fine-grained data access controls.

We will:

Load two documents — an open-access PDF (the Greek Constitution) and a restricted CSV (salary data)
Chunk, embed, and store both in a VAST DB vector table
Build a simple RAG agent using an LLM via HuggingFace
Show that without row-level security, sensitive data leaks through retrieval
Show that with row-level security enabled, restricted data is automatically filtered out

Architecture Overview#

  Documents (PDF + CSV)
        │
        ▼
  Chunk + Embed (nomic-embed-text)
        │
        ▼
  VAST DB Vector Table  ◄──────── S3 Bucket IAM Policy
  (embeddings + S3 path)           (nireny/open/ = allowed)
        │                          (nireny/restricted/ = denied)
        ▼
  Cosine Similarity Search  ◄───── RowFilterByS3PathAuth
        │
        ▼
   Context → LLM → Answer

The key insight: each vector row stores a vastdb_s3_path_auth column pointing to the source file’s S3 path. When RowFilterByS3PathAuth is enabled in the DB IAM policy, VAST DB checks the caller’s S3 permissions and silently drops rows whose paths are inaccessible — before results are returned.

Prerequisites#

pip install vastdb adbc-driver-vastdb sentence-transformers langchain-community openai huggingface_hub

Step 1 — Authenticate with HuggingFace#

We use HuggingFace both to download the embedding model and to access the LLM inference router.

from huggingface_hub import login
login()

Step 2 — Configuration#

Set your VAST DB and S3 endpoints, credentials, and file paths.

Variable	Description
`VASTDB_ENDPOINT`	VAST DB native API endpoint
`QE_ENDPOINT`	VAST Query Engine (SQL over ADBC) endpoint
`DB_BUCKET_NAME`	S3 bucket holding the vector table
`BUCKET_NAME`	S3 bucket holding source files
`O_FILE`	Open-access PDF document
`R_FILE`	Restricted CSV document

import vastdb
import pandas as pd
import pyarrow as pa
import adbc_driver_vastdb
from adbc_driver_manager import dbapi
from sentence_transformers import SentenceTransformer

VASTDB_ENDPOINT = 'http://172.200.204.2'
QE_ENDPOINT     = 'http://172.200.204.8'

AWS_ACCESS_KEY_ID     = 'XXXXX'
AWS_SECRET_ACCESS_KEY = 'XXXXX'

DB_BUCKET_NAME = 'nireny-db'
BUCKET_NAME    = 'nireny'
SCHEMA_NAME    = 'db'
TABLE_NAME     = 'embeddings'

O_FILE = "THE CONSTITUTION OF GREECE.pdf"  # stored at nireny/open/
R_FILE = "salary.csv"                       # stored at nireny/restricted/

Step 3 — Connect to VAST DB#

session = vastdb.connect(
    endpoint=VASTDB_ENDPOINT,
    access=AWS_ACCESS_KEY_ID,
    secret=AWS_SECRET_ACCESS_KEY
)

Step 4 — Define the Embedding Function#

We use nomic-ai/nomic-embed-text-v1.5, a high-quality open embedding model.

def embed_text(chunks):
    model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
    embeddings = model.encode(chunks)
    return embeddings

Step 5 — Chunk the Documents#

Good chunking is critical for retrieval quality. We use two different strategies:

PDF: section-aware chunking via UnstructuredPDFLoader — splits on titles, respects heading hierarchy
CSV: one chunk per record row via CSVLoader — each row becomes its own retrievable unit

5a — Chunk the Open PDF#

from langchain_community.document_loaders import UnstructuredPDFLoader

def load_pdf_by_sections(pdf_path):
    loader = UnstructuredPDFLoader(
        pdf_path,
        mode="elements",
        strategy="fast",
        chunking_strategy="by_title",
        max_characters=8000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
    )
    return loader.load()

try:
    section_chunks = load_pdf_by_sections(O_FILE)
    print(f"Successfully created {len(section_chunks)} section-aware chunks.")
except Exception as e:
    print(f"Error: {e}")

5b — Chunk the Restricted CSV#

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path=R_FILE)
record_chunks = loader.load()
print(f"Successfully created {len(record_chunks)} record chunks.")

Step 6 — Prepare Chunks with S3 Path Labels#

This is the key security step: each chunk is tagged with its S3 path via the vastdb_s3_path_auth column.

Open chunks → nireny/open/<filename> (accessible to all)
Restricted chunks → nireny/restricted/<filename> (blocked by S3 bucket policy)

VAST DB will use these paths when RowFilterByS3PathAuth is enabled.

o_chunks = [(f'"{BUCKET_NAME}/open/{O_FILE}"',       chunk.page_content) for chunk in section_chunks]
r_chunks = [(f'"{BUCKET_NAME}/restricted/{R_FILE}"', chunk.page_content) for chunk in record_chunks]

chunks = o_chunks + r_chunks
df = pd.DataFrame(chunks, columns=['vastdb_s3_path_auth', 'chunk'])

print(f"Total chunks: {len(df)}  ({len(o_chunks)} open, {len(r_chunks)} restricted)")
df.head(3)

Step 7 — Generate Embeddings#

embeddings = embed_text(df['chunk'].to_list())
df['embedding'] = embeddings.tolist()

dimensions = len(embeddings[0])
print(f"Embedding dimensions: {dimensions}")

Step 8 — Load into VAST DB#

First, clear any existing rows (useful when re-running the notebook). Then insert all chunks as a typed PyArrow table.

8a — Clear existing rows#

with session.transaction() as tx:
    table = tx.bucket(DB_BUCKET_NAME).schema(SCHEMA_NAME).table(TABLE_NAME)
    reader = table.select(internal_row_id=True)
    pyarrow_table = reader.read_all()
    table.delete(pyarrow_table)

print("Existing rows cleared.")

8b — Insert all chunks#

arrow_table = pa.Table.from_pandas(df, schema=pa.schema([
    ("vastdb_s3_path_auth", pa.string()),
    ("chunk",              pa.string()),
    ("embedding",          pa.list_(pa.field(name="item", type=pa.float32(), nullable=False), dimensions))
]))

with session.transaction() as tx:
    table = (
        tx.bucket(DB_BUCKET_NAME).schema(SCHEMA_NAME).table(TABLE_NAME, fail_if_missing=False)
        or tx.bucket(DB_BUCKET_NAME).schema(SCHEMA_NAME).create_table(TABLE_NAME, arrow_table.schema)
    )
    table.insert(arrow_table)

print(f"Inserted {len(arrow_table)} rows.")

Step 9 — Set Up the Query Engine#

We use VAST’s ADBC SQL driver to run cosine similarity searches directly in SQL.

# Verify the table contents
with session.transaction() as tx:
    table = tx.bucket(DB_BUCKET_NAME).schema(SCHEMA_NAME).table(TABLE_NAME)
    cursor = table.select()
    results = cursor.read_all()

results.to_pandas()[['vastdb_s3_path_auth', 'chunk']].head(5)

full_table_name = f'"{DB_BUCKET_NAME}/{SCHEMA_NAME}".{TABLE_NAME}'

def run_query(query):
    with dbapi.connect(
        driver=adbc_driver_vastdb.get_driver_path(),
        db_kwargs={
            "vast.db.endpoint":   QE_ENDPOINT,
            "vast.db.access_key": AWS_ACCESS_KEY_ID,
            "vast.db.secret_key": AWS_SECRET_ACCESS_KEY,
        },
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetch_arrow_table().to_pandas()

# Inspect the table schema
run_query(f"DESCRIBE TABLE {full_table_name}")

Step 10 — Build the RAG Agent#

The agent takes a question, searches for relevant chunks using cosine distance, and passes them as context to the LLM.

from env import HF_API_KEY
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=HF_API_KEY,
)

def my_agent(query, context=""):
    """Ask the LLM a question, optionally with retrieved context."""
    answer = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct:cheapest",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Use the provided Context to answer questions. "
                           "If you don't have the answer just say you don't know."
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {query}"
            }
        ],
    )
    print(answer.choices[0].message.content)

Step 11 — RAG Demo: Open Question#

First, let’s confirm RAG works correctly for public data. We’ll ask about the Greek Constitution.

Without RAG, the LLM has no context:

question = "According to the greek constitution, what does part two article 4 say?"

print("=== Without RAG ===")
my_agent(question)

With RAG, we retrieve the relevant chunks first:

question_embedding = embed_text([question])
dimensions = len(question_embedding.tolist()[0])

context = run_query(f"""
    SELECT chunk FROM {full_table_name}
    ORDER BY
        array_cosine_distance(embedding, {question_embedding.tolist()[0]}::FLOAT[{dimensions}])
    LIMIT 15
""")

contexts = context['chunk'].tolist()
print(f"Retrieved {len(contexts)} chunks.")
display(contexts[:2])  # preview

print("=== With RAG ===")
my_agent(question, context="\n".join(contexts))

Step 12 — Security Demo: Restricted Data#

Now we test with a question that requires salary data from salary.csv, which is stored under nireny/restricted/.

The S3 Bucket Policy#

{
  "Statement": [
    { "Sid": "DefaultAccess",  "Effect": "Allow", "Action": "s3:*",          "Resource": ["nireny/*", "nireny"] },
    { "Sid": "DenyAccess",     "Effect": "Deny",  "Action": "s3:GetObject", "Resource": "nireny/restricted/*" }
  ]
}

The DB IAM Policy (key setting)#

{ "Sid": "NirenyDbRowFilterByS3PathAuth",
  "Effect": "RowColumnSecurity",
  "Resource": "arn:aws:s3:::nireny-db/*",
  "RowFilterByS3PathAuth": "True"    ← Toggle this to see the difference
}

12a — `RowFilterByS3PathAuth: False` (security disabled)#

With security off, retrieval returns chunks from both open and restricted files. The LLM can answer with salary data:

question = "What is the salary of the CEO?"

question_embedding = embed_text([question])
dimensions = len(question_embedding.tolist()[0])

context = run_query(f"""
    SELECT chunk FROM {full_table_name}
    ORDER BY
        array_cosine_distance(embedding, {question_embedding.tolist()[0]}::FLOAT[{dimensions}])
    LIMIT 5
""")

contexts = context['chunk'].tolist()
print("Retrieved chunks (includes restricted data):")
display(contexts)

print("\n=== LLM Answer (security OFF) ===")
my_agent(question, context="\n".join(contexts))

12b — `RowFilterByS3PathAuth: True` (security enabled)#

Now enable the setting in your DB IAM policy and re-run.

VAST DB checks S3 GetObject permissions for each row’s path. Rows under nireny/restricted/ are denied by the bucket policy, so they are filtered out before being returned. The same query now returns only open-access chunks, and the LLM correctly says it doesn’t know the salary:

# Run the identical query — now with RowFilterByS3PathAuth: True in the DB policy
context = run_query(f"""
    SELECT chunk FROM {full_table_name}
    ORDER BY
        array_cosine_distance(embedding, {question_embedding.tolist()[0]}::FLOAT[{dimensions}])
    LIMIT 5
""")

contexts = context['chunk'].tolist()
print("Retrieved chunks (restricted rows filtered out):")
display(contexts)

print("\n=== LLM Answer (security ON) ===")
my_agent(question, context="\n".join(contexts))

Summary#

Setting	Salary in retrieved chunks?	LLM leaks salary?
`RowFilterByS3PathAuth: False`	✅ Yes	✅ Yes
`RowFilterByS3PathAuth: True`	❌ No	❌ No

By tagging each vector row with its source S3 path and enabling RowFilterByS3PathAuth, VAST DB transparently enforces your existing S3 access policies on vector search results — no changes required in your RAG application code.

This pattern composes well: different users or roles can have different S3 permissions, and the same vector table serves all of them securely.