Querying VAST NDB with Spark Connect

Spark Connect (introduced in Spark 3.4) provides a lightweight gRPC-based client that lets you run Spark queries remotely. The client machine needs only pyspark[connect]; no local Spark installation is required.

How it works

  Your notebook / script
  (pyspark[connect] only)
         │  sc://  (gRPC)
         ▼
  Spark Connect Server  ──────►  Spark Cluster
  (172.200.13.12:15002)               │
                                       ▼
                               VAST NDB Catalog
                               (172.200.204.12)

All Spark and VAST NDB configuration lives on the server side. The client just sends SQL or DataFrame operations over gRPC.

Key caveats

Direct connection required. Spark Connect works only when the client connects directly to the server endpoint, with no proxy or gateway in between.

Use spark.executor.instances for parallelism, not spark.ndb.data_endpoints.

Scala version: the server must run from the system Spark installation at /opt/spark (Scala 2.13). The Spark distribution that pip install pyspark bundles is built for Scala 2.12 and will not work on the server side.
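The direct-connection caveat above can be partly checked before opening a session. The following is a minimal sketch, assuming a plain TCP probe is an acceptable first test; `can_reach` is a hypothetical helper name, not part of PySpark. A successful probe only shows that the port accepts connections from the client machine. It does not prove that no proxy sits in between.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it performs no gRPC handshake.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For the environment on this page, the call would be `can_reach("172.200.13.12", 15002)`.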

Environment

Component	Value
Spark master	`spark://172.200.204.12:2424`
NDB endpoint	`http://172.200.204.12`
Driver host	`172.200.13.12`
Connect port	`15002` (default)
VAST JARs	`/opt/spark/vast/*.jar`
Example dataset	`nba-db` / `games_schema` / `games_tbl`

Step 1: Start the Connect Server

Run this once on the Spark driver node. All VAST NDB configuration is passed here, so the client needs none of it.

# Run in a terminal on the Spark driver node
/opt/spark/sbin/start-connect-server.sh \
  --master spark://172.200.204.12:2424 \
  --conf spark.executor.memory=10g \
  --conf spark.executor.instances=3 \
  --conf spark.executor.cores=3 \
  --driver-class-path $(echo /opt/spark/vast/*.jar | tr ' ' ':') \
  --conf spark.executor.extraClassPath=$(echo /opt/spark/vast/*.jar | tr ' ' ':') \
  --jars $(echo /opt/spark/vast/*.jar | tr ' ' ',') \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.driver.host=172.200.13.12 \
  --conf spark.ndb.endpoint=http://172.200.204.12 \
  --conf spark.ndb.access_key_id=G7H5I7QSQ7CSSZYGO3K8 \
  --conf spark.ndb.secret_access_key=R2C/CmJFAMHfYirPpxbbG+2g3hI8UhH6p/UUWo6u \
  --conf spark.sql.catalog.ndb=spark.sql.catalog.ndb.VastCatalog \
  --conf spark.sql.extensions=ndb.NDBSparkSessionExtension \
  --conf spark.sql.catalogImplementation=in-memory \
  --driver-memory 32g
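The `$(echo /opt/spark/vast/*.jar | tr ' ' ':')` expressions in the command above expand the JAR glob and join the paths with `:` (for the class path options) or `,` (for `--jars`). As a rough illustration of what those expansions produce, here is a Python equivalent; `jar_lists` is a hypothetical helper, and sorting the paths is an assumption made for reproducible output.

```python
import glob
import os


def jar_lists(jar_dir: str) -> tuple[str, str]:
    """Return (classpath, jars) strings for every *.jar file in `jar_dir`.

    classpath: paths joined with ':' (as for --driver-class-path and
               spark.executor.extraClassPath)
    jars:      paths joined with ',' (as for --jars)
    """
    jars = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))
    return ":".join(jars), ",".join(jars)
```

Calling `jar_lists("/opt/spark/vast")` on the driver node would yield the two strings that the shell substitutions pass to the start script.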

Step 2: Install the client

On the client machine (your notebook or script), install only the Connect extras; a full Spark installation is not needed:

pip install "pyspark[connect]"
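To confirm that the install pulled in what the Connect client needs, a quick import check can help. This is a minimal sketch: `missing_modules` is a hypothetical helper, and the module list is an assumption about what the `connect` extra installs, so adjust it for your PySpark version.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names of packages installed by pyspark[connect].
CONNECT_MODULES = ["pyspark", "grpc", "pandas", "pyarrow"]
```

An empty result from `missing_modules(CONNECT_MODULES)` suggests the client environment is ready.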

Step 3: Connect

Use an sc:// URI pointing directly at the Connect server. No Spark configuration is required on the client.

from pyspark.sql import SparkSession

CONNECT_HOST = "172.200.13.12"
CONNECT_PORT = 15002            # default Spark Connect gRPC port
TABLE_FQN    = "ndb.`nba-db`.games_schema.games_tbl"

spark = SparkSession.builder \
    .remote(f"sc://{CONNECT_HOST}:{CONNECT_PORT}") \
    .getOrCreate()

print(f"Connected. Spark version: {spark.version}")
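The Spark Connect connection string can also carry optional parameters after the host and port, in the form `sc://host:port/;key=value;key=value`. The sketch below builds such a string; `connect_url` is a hypothetical helper, and the parameters in the usage note (`use_ssl`, `token`) are shown as assumptions because the setup on this page uses neither. Values are inserted as-is, without URL encoding.

```python
def connect_url(host: str, port: int = 15002, **params: str) -> str:
    """Build an sc:// connection string, appending optional ;key=value parameters."""
    url = f"sc://{host}:{port}"
    if params:
        # Parameters follow "/;" and are separated by ";".
        url += "/;" + ";".join(f"{key}={value}" for key, value in params.items())
    return url
```

With no extra parameters, `connect_url("172.200.13.12")` returns the same URI used above. A call such as `connect_url("172.200.13.12", use_ssl="true", token="...")` would apply only if the server were configured for TLS and token authentication.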

Step 4: Query VAST NDB

Once connected, use spark.sql() or the DataFrame API exactly as you would in a local session. The catalog reference format is:

ndb.`<bucket>`.<schema>.<table>

Backticks are required around bucket names that contain hyphens.
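If table names are assembled in code, the quoting rule can be applied automatically. This is a minimal sketch with hypothetical helper names (`quote_ident`, `ndb_table`). It quotes any part that is not a plain alphanumeric/underscore identifier and doubles embedded backticks, which is how Spark SQL escapes them. It does not check for reserved words.

```python
import re

# Identifiers made only of letters, digits, and underscores need no quoting.
_PLAIN_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_ident(name: str) -> str:
    """Backtick-quote `name` unless it is a plain identifier."""
    if _PLAIN_IDENT.fullmatch(name):
        return name
    return "`" + name.replace("`", "``") + "`"


def ndb_table(bucket: str, schema: str, table: str, catalog: str = "ndb") -> str:
    """Return a fully qualified <catalog>.<bucket>.<schema>.<table> reference."""
    return ".".join(quote_ident(part) for part in (catalog, bucket, schema, table))
```

For the example dataset, `ndb_table("nba-db", "games_schema", "games_tbl")` yields the same string as `TABLE_FQN`.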

# Inspect the schema
spark.sql(f"DESCRIBE TABLE {TABLE_FQN}").show(truncate=False)

# Row count
total = spark.sql(f"SELECT COUNT(*) FROM {TABLE_FQN}").collect()[0][0]
print(f"Total games: {total}")

# Sample rows
spark.sql(
    f"""
    SELECT gameId, gameDate, hometeamName, awayteamName, homeScore, awayScore, gameType
    FROM {TABLE_FQN}
    LIMIT 10
    """
).show(truncate=False)

# Aggregation
spark.sql(
    f"""
    SELECT gameType, COUNT(*) AS games
    FROM {TABLE_FQN}
    GROUP BY gameType
    ORDER BY games DESC
    """
).show(truncate=False)
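The same aggregation can be written with the DataFrame API instead of SQL. The sketch below assumes an active `spark` session and the `TABLE_FQN` value from Step 3; `games_by_type` is a hypothetical helper name.

```python
def games_by_type(spark, table_fqn):
    """Count games per gameType, most frequent first, using the DataFrame API."""
    return (
        spark.table(table_fqn)
        .groupBy("gameType")
        .count()                              # adds a column named "count"
        .withColumnRenamed("count", "games")  # match the alias in the SQL version
        .orderBy("games", ascending=False)
    )
```

Calling `games_by_type(spark, TABLE_FQN).show(truncate=False)` should print the same result as the SQL query above.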

spark.stop()