Querying VAST NDB with Spark Connect
Spark Connect (introduced in Spark 3.4) provides a lightweight gRPC-based client that lets you run Spark queries remotely. The client machine needs no Spark installation beyond the pyspark[connect] package.
How it works
Your notebook / script
(pyspark[connect] only)
        │  sc:// (gRPC)
        ▼
Spark Connect Server ──────► Spark Cluster
(172.200.13.12:15002)              │
                                   ▼
                           VAST NDB Catalog
                           (172.200.204.12)
All Spark and VAST NDB configuration lives on the server side. The client just sends SQL or DataFrame operations over gRPC.
Key caveats
Direct connection required. Spark Connect only works when the client connects directly to the server endpoint, with no proxy or gateway in between.
Use spark.executor.instances for parallelism, not spark.ndb.data_endpoints.
Scala version: the server must run against the system Spark at /opt/spark (Scala 2.13). The Spark installed by pip install pyspark bundles Scala 2.12 and will not work on the server side.
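Because the client must reach the server endpoint directly, a quick TCP check from the client machine can rule out basic network problems before opening a session. The sketch below is a minimal example using only the standard library; the function name `can_reach` is hypothetical, and a successful TCP connection does not prove that no proxy sits in the path, only that the port is reachable.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example (host and port taken from the diagram in this document):
# can_reach("172.200.13.12", 15002)
```

If this returns False, check firewalls and routing between the client and the driver host before debugging anything on the Spark side.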
Environment
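| Component | Value |
|---|---|
| Spark master | spark://172.200.204.12:2424 |
| NDB endpoint | http://172.200.204.12 |
| Driver host | 172.200.13.12 |
| Connect port | 15002 |
| VAST JARs | /opt/spark/vast/*.jar |
| Example dataset | ndb.`nba-db`.games_schema.games_tbl |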
Step 1: Start the Connect Server
Run this once on the Spark driver node. All VAST NDB configuration is passed here — the client needs none of it.
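The start command for this step expands /opt/spark/vast/*.jar with shell globbing into a colon-separated list (for the driver and executor classpaths) and a comma-separated list (for --jars). As a rough illustration of what that expansion produces, here is a small Python sketch; the helper name `jar_lists` is hypothetical and not part of Spark or the VAST tooling.

```python
from pathlib import Path


def jar_lists(jar_dir: str) -> tuple[str, str]:
    """Return (colon-joined classpath, comma-joined --jars value) for *.jar in jar_dir."""
    # Sort for a stable order, as shell glob expansion also does.
    jars = sorted(str(p) for p in Path(jar_dir).glob("*.jar"))
    if not jars:
        # An unmatched shell glob would be passed through literally; fail loudly here.
        raise FileNotFoundError(f"no JAR files found in {jar_dir}")
    return ":".join(jars), ",".join(jars)


# Example (directory taken from the command in this step):
# classpath, jars_arg = jar_lists("/opt/spark/vast")
```

Printing both values on the driver node is a quick way to confirm that the VAST JARs are where the command expects them.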
# Prints the command to run in a terminal on the Spark driver node.
# Export NDB_ACCESS_KEY_ID and NDB_SECRET_ACCESS_KEY (placeholder variable names)
# in that shell first, so the credentials are not hard-coded here.
print("""
/opt/spark/sbin/start-connect-server.sh \\
  --master spark://172.200.204.12:2424 \\
  --conf spark.executor.memory=10g \\
  --conf spark.executor.instances=3 \\
  --conf spark.executor.cores=3 \\
  --driver-class-path $(echo /opt/spark/vast/*.jar | tr ' ' ':') \\
  --conf spark.executor.extraClassPath=$(echo /opt/spark/vast/*.jar | tr ' ' ':') \\
  --jars $(echo /opt/spark/vast/*.jar | tr ' ' ',') \\
  --conf spark.executor.userClassPathFirst=true \\
  --conf spark.driver.userClassPathFirst=true \\
  --conf spark.driver.host=172.200.13.12 \\
  --conf spark.ndb.endpoint=http://172.200.204.12 \\
  --conf spark.ndb.access_key_id="$NDB_ACCESS_KEY_ID" \\
  --conf spark.ndb.secret_access_key="$NDB_SECRET_ACCESS_KEY" \\
  --conf spark.sql.catalog.ndb=spark.sql.catalog.ndb.VastCatalog \\
  --conf spark.sql.extensions=ndb.NDBSparkSessionExtension \\
  --conf spark.sql.catalogImplementation=in-memory \\
  --driver-memory 32g
""")
Step 2: Install the client
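On the client machine (your notebook or script), install only the Connect extras; a full Spark installation is not needed. Quote the package name so that shells such as zsh do not treat the brackets as a glob pattern:
pip install "pyspark[connect]"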
Step 3: Connect
Use a sc:// URI pointing directly at the Connect server. No Spark configuration required on the client.
from pyspark.sql import SparkSession

CONNECT_HOST = "172.200.13.12"
CONNECT_PORT = 15002  # default Spark Connect gRPC port
TABLE_FQN = "ndb.`nba-db`.games_schema.games_tbl"

# Open a remote session; no Spark configuration is set on the client.
spark = (
    SparkSession.builder
    .remote(f"sc://{CONNECT_HOST}:{CONNECT_PORT}")
    .getOrCreate()
)
print(f"Connected. Spark version: {spark.version}")
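The sc:// connection string can also carry optional parameters after a "/;" separator, for example sc://host:15002/;user_id=alice. The helper below is a small sketch for assembling such a string; `connect_uri` is a hypothetical name, and which parameters the server honors depends on the Spark version, so treat the parameter names as assumptions to verify against your deployment.

```python
def connect_uri(host: str, port: int = 15002, **params: str) -> str:
    """Build a Spark Connect URI such as sc://host:15002/;key=value."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    uri = f"sc://{host}:{port}"
    if params:
        # Optional parameters follow "/;" and are separated by ";".
        uri += "/;" + ";".join(f"{k}={v}" for k, v in params.items())
    return uri


# Example (host taken from this document):
# SparkSession.builder.remote(connect_uri("172.200.13.12")).getOrCreate()
```

Keeping the URI construction in one place makes it easy to switch hosts between environments without touching the query code.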
Step 4: Query VAST NDB
Once connected, use spark.sql() or the DataFrame API exactly as you would in a local session. The catalog reference format is:
ndb.`<bucket>`.<schema>.<table>
Backticks are required around bucket names that contain hyphens.
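To avoid forgetting the backticks, the table reference can be assembled programmatically. This is a minimal sketch, assuming the four-part ndb.<bucket>.<schema>.<table> format described above; the helper names `quote_ident` and `ndb_table` are hypothetical. It quotes any part that is not a plain identifier and doubles embedded backticks, which is how Spark SQL escapes them.

```python
import re

# A "plain" identifier: letters, digits, and underscores, not starting with a digit.
_PLAIN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_ident(name: str) -> str:
    """Backtick-quote a name unless it is a plain identifier."""
    if _PLAIN.match(name):
        return name
    return "`" + name.replace("`", "``") + "`"


def ndb_table(bucket: str, schema: str, table: str, catalog: str = "ndb") -> str:
    """Join catalog, bucket, schema, and table into a fully qualified reference."""
    return ".".join(quote_ident(p) for p in (catalog, bucket, schema, table))


print(ndb_table("nba-db", "games_schema", "games_tbl"))
# prints: ndb.`nba-db`.games_schema.games_tbl
```

Quoting a name that did not strictly need it is harmless, so the helper errs on the side of adding backticks.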
# Inspect the schema
spark.sql(f"DESCRIBE TABLE {TABLE_FQN}").show(truncate=False)

# Row count
total = spark.sql(f"SELECT COUNT(*) FROM {TABLE_FQN}").collect()[0][0]
print(f"Total games: {total}")

# Sample rows
spark.sql(
    f"""
    SELECT gameId, gameDate, hometeamName, awayteamName, homeScore, awayScore, gameType
    FROM {TABLE_FQN}
    LIMIT 10
    """
).show(truncate=False)

# Aggregation
spark.sql(
    f"""
    SELECT gameType, COUNT(*) AS games
    FROM {TABLE_FQN}
    GROUP BY gameType
    ORDER BY games DESC
    """
).show(truncate=False)
spark.stop()
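The same aggregation can be written with the DataFrame API instead of SQL. The sketch below is one possible equivalent of the GROUP BY gameType query; `games_by_type` is a hypothetical helper name, and it needs a live session, so call it before spark.stop().

```python
def games_by_type(spark, table_fqn: str):
    """DataFrame-API equivalent of: SELECT gameType, COUNT(*) AS games ... ORDER BY games DESC."""
    return (
        spark.table(table_fqn)                 # resolve the table through the ndb catalog
        .groupBy("gameType")
        .count()                               # adds a column named "count"
        .withColumnRenamed("count", "games")
        .orderBy("games", ascending=False)
    )


# Example, with `spark` and TABLE_FQN defined as in Step 3:
# games_by_type(spark, TABLE_FQN).show(truncate=False)
```

Nothing is executed until an action such as show() or collect() runs; only then is the plan sent to the Connect server over gRPC.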