Tutorials & Use Cases·March 21, 2026·9 min read

Deploying an Agent Platform on Cloud Run + Neo4j

A real-world topology for running an agent platform on Google Cloud — three Cloud Run services, a self-hosted Neo4j VM, and the gotchas that bite in production.

By Matrix Team

Most "deploy your AI app" guides stop at docker run. That's fine for a demo. It is not fine for a platform that holds open WebSockets for the length of a phone call, serves a Next.js dashboard, runs a Python tool service, and persists every tenant's data in a graph database that other parts of the system query on the hot path.

This post describes the topology we actually run for an AI agent platform on GCP (three Cloud Run services plus a self-hosted Neo4j VM) and, more importantly, the production gotchas that took real calls to discover. If you're standing up something similar, read the gotchas section before you write a deploy script. Several of them fail silently.

Tenancy underpins all of this. If you haven't read Multi-Tenancy Is Not a Feature You Bolt On, start there — the deployment story assumes every row is already tenant-scoped.

The topology at a glance

Four moving parts, each chosen for a specific runtime characteristic:

  Telephony ──HTTP/wss──▶  backend (Cloud Run)
                              ├── Gemini Live / Vertex AI
                              └── bolt://  ──▶  Neo4j VM (Compute Engine, static IP)

  Browser   ──HTTPS────▶  web (Cloud Run, Next.js)  ──/api/* /mcp/* rewrites──▶  backend

  Tool svc  (Cloud Run, Python)  ◀── backend HTTP skills

  Secret Manager:  JWT secret · master key · master salt · webhook secret · Neo4j password

The split is deliberate. Three services with three different scaling profiles, and one stateful VM that Cloud Run is wrong for.

Service 1 — the backend (Spring, always warm)

This is the agent runtime: a Spring Boot app holding the WebSocket bridge for voice, the SSE stream for chat, the MCP server, and every query into Neo4j. It runs with a specific set of Cloud Run flags, and each one earns its place:

gcloud run deploy backend \
  --image "$REGION-docker.pkg.dev/<project>/matrix/backend:$TAG" \
  --region "$REGION" \
  --min-instances=1 \
  --timeout=3600 \
  --no-cpu-throttling \
  --cpu-boost \
  --allow-unauthenticated \
  --set-secrets=NEO4J_PASSWORD=NEO4J_PASSWORD:latest,... \
  --set-env-vars=NEO4J_URI=bolt://<neo4j-ip>:7687

--min-instances=1: no cold start on the first inbound call. A caller will not wait out a JVM boot. This is the dominant idle cost, and it is worth it.
--timeout=3600: the request timeout is the maximum WebSocket lifetime. A voice call is one long-lived request; the default 5-minute timeout would guillotine every call. 3600 seconds is also the most Cloud Run allows, so no single call can outlive an hour. (The gcloud flag is --timeout, not --request-timeout.)
--no-cpu-throttling: by default Cloud Run throttles CPU to near zero outside request processing. A voice bridge does real work between request frames (audio buffering, the post-call memory extractor firing). Throttling would stall it.
--cpu-boost: extra CPU during the ~3-second Spring startup so a scaled-up instance is ready fast.

Service 2 — the web dashboard (Next.js, scales to zero)

The operator dashboard has the opposite profile: bursty human traffic, no long-lived connections, fine to cold-start. So it scales to zero and costs almost nothing at idle.

The interesting part is how the browser reaches the backend. The dashboard does not call the backend's URL directly from the browser — that would trip CORS preflight on every request. Instead, /api/* and /mcp/* are Next.js rewrites() that proxy to the backend server-side:

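// next.config.js
module.exports = {
  // Proxy API and MCP traffic to the backend server-side, so the browser stays same-origin.
  async rewrites() {
    return [
      { source: "/api/:path*", destination: `${process.env.MATRIX_BACKEND_URL}/api/:path*` },
      { source: "/mcp/:path*", destination: `${process.env.MATRIX_BACKEND_URL}/mcp/:path*` },
    ];
  },
};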

Same-origin from the browser's perspective, no CORS dance. (There's a sharp edge here about when MATRIX_BACKEND_URL is read — covered below.)

Service 3 — the Python tool service (always warm)

Domain tools that aren't a good fit for the JVM live in a separate Python service. It runs with --min-instances=1 too, for one reason: an agent that calls a tool mid-conversation cannot afford a cold start. Dead air during a voice call is the worst UX you can ship. One warm instance costs far less than the conversation it saves.

Service 4 — Neo4j on a Compute Engine VM (not Cloud Run)

Cloud Run is stateless and ephemeral by design — exactly wrong for a graph database. So Neo4j 5 + APOC runs on a single Compute Engine VM. The backend talks to it over bolt:// using a strong password from Secret Manager.

The one thing you must get right: reserve a static external IP.

gcloud compute addresses create matrix-neo4j-ip --region=<region>
# attach the reserved address when creating the instance:
gcloud compute instances create matrix-neo4j \
  --address=<neo4j-ip> --zone=<zone> ...

Without a reserved address, the VM's external IP changes on every stop/start cycle — and the moment it does, NEO4J_URI on the backend points at nothing. Every query hangs until the connect timeout, the dashboard looks dead, and you're debugging a "Neo4j is down" incident that is really just a changed IP. Reserve the address once; stop/start cycles no longer break the wire.
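A startup preflight can turn the changed-IP failure into a fast, explicit error instead of a hang. The following is a minimal sketch, assuming the host and port come from NEO4J_URI; the function name and the 3-second timeout are illustrative and not part of the original deploy script.

```python
import socket
from urllib.parse import urlparse


def neo4j_reachable(uri: str, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to the host in `uri` opens in time.

    Illustrative helper: it only proves that something is listening at the
    address. It does not prove that Neo4j itself is healthy.
    """
    parsed = urlparse(uri)
    host = parsed.hostname
    port = parsed.port or 7687  # 7687 is the default Bolt port
    if host is None:
        raise ValueError(f"no host in URI: {uri!r}")
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unroutable hosts, and timeouts.
        return False
```

Calling this once at boot with the configured URI gives a clear "address is wrong" signal within seconds. An open port is a weak health signal, as gotcha #4 shows, so treat a True result as necessary rather than sufficient.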

Secrets live in Secret Manager, never in the image

The JWT signing secret, the master encryption key + salt (for BYOK provider keys encrypted at rest), the telephony webhook secret, and the Neo4j password are all in Secret Manager, injected into the backend as environment variables via --set-secrets. None of them is ever in the container image, in --set-env-vars, or in the repo. The runtime service account gets roles/secretmanager.secretAccessor, and nothing reads the values off disk.
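Because --set-secrets in the NAME=SECRET:version form surfaces each secret as an environment variable, a service can check at startup that every expected one arrived. This is a minimal sketch in Python (the tool service's language), not the platform's actual code. NEO4J_PASSWORD comes from the deploy command above; the other variable names are hypothetical stand-ins for the secrets listed.

```python
import os
from typing import Dict, Mapping

# NEO4J_PASSWORD appears in the deploy command; the remaining names are
# hypothetical placeholders for the secrets the post lists.
REQUIRED_SECRETS = (
    "NEO4J_PASSWORD",
    "JWT_SECRET",
    "MASTER_KEY",
    "MASTER_SALT",
    "WEBHOOK_SECRET",
)


def load_secrets(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the required secrets from the environment, failing fast if any are absent."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        # Report names only; secret values must never reach the logs.
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SECRETS}
```

Failing at boot keeps a misconfigured revision from ever taking traffic, which is easier to diagnose than an authentication error in the middle of a call.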

Deploy in one shot

The whole thing is a single script. The ordering matters because of the Next.js build-time quirk (next section): deploy the backend first, capture its stable URL, then build the web image with that URL baked in.

PROJECT=<project> NEO4J_HOST=<neo4j-ip> ./scripts/deploy.sh

It builds and pushes the backend image, deploys the backend, reads back its URL, builds and pushes the web image with --build-arg MATRIX_BACKEND_URL=<backend-url>, deploys web, and prints the webhook URLs you paste into your telephony provider.
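The ordering constraint can be sketched as follows. This is a Python illustration of the backend-first sequence, not the contents of scripts/deploy.sh. The `backend` and `web` build directories and the `web` image name are assumptions, and the backend's Cloud Run flags are omitted for brevity.

```python
import subprocess
from typing import Callable, Sequence


def sh(cmd: Sequence[str]) -> str:
    """Run a command, raise on failure, and return its trimmed stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def deploy(project: str, region: str, tag: str,
           run: Callable[[Sequence[str]], str] = sh) -> str:
    """Deploy backend first, then build web with the backend URL baked in."""
    repo = f"{region}-docker.pkg.dev/{project}/matrix"
    backend_img = f"{repo}/backend:{tag}"
    web_img = f"{repo}/web:{tag}"  # assumed image name

    # 1. Backend first: its URL must exist before the web image is built.
    run(["docker", "build", "-t", backend_img, "backend"])
    run(["docker", "push", backend_img])
    run(["gcloud", "run", "deploy", "backend", "--image", backend_img,
         "--region", region, "--project", project])

    # 2. Read back the service URL that Cloud Run assigned.
    backend_url = run(["gcloud", "run", "services", "describe", "backend",
                       "--region", region, "--project", project,
                       "--format", "value(status.url)"])

    # 3. Bake the URL into the web build, then deploy web.
    run(["docker", "build", "-t", web_img,
         "--build-arg", f"MATRIX_BACKEND_URL={backend_url}", "web"])
    run(["docker", "push", web_img])
    run(["gcloud", "run", "deploy", "web", "--image", web_img,
         "--region", region, "--project", project])
    return backend_url
```

Passing the command runner in as a parameter keeps the sequence testable without touching a real project: a fake runner can record the calls and confirm that the describe step precedes the web build.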

The gotchas that actually bite

Everything above is the happy path. Here is what we learned the hard way. The fixes are in the deploy script; this list exists so you don't re-discover them on a live call.

1. Cloud Run WebSockets are HTTP/1.1 only

This one cost the most. Google Frontend negotiates HTTP/2 via ALPN with any client that offers h2, but the WebSocket Upgrade handshake is an HTTP/1.1 mechanism. So a WS client that advertises h2,http/1.1 in its ALPN list gets h2 selected, and the upgrade is rejected with Can "Upgrade" only to "WebSocket".

Cloud Run has no flag to disable HTTP/2 negotiation. The fix is on the client: force the ALPN list to http/1.1 only.

WS client ALPN: ["http/1.1"]   // NOT ["h2", "http/1.1"]

Pure WS libraries usually default to this and work fine. The ones that bite share an HTTPS connection pool with your REST calls — then the pool offers h2 and your WS upgrade breaks unpredictably. If voice works locally and dies on Cloud Run, this is your first suspect.
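For a Python client, pinning the ALPN list looks roughly like this. It is a sketch using only the standard library's ssl module; the helper names are illustrative, and how the context is handed to a given WebSocket library depends on that library.

```python
import socket
import ssl
from typing import Optional


def http1_only_context() -> ssl.SSLContext:
    """Build a TLS context that offers only http/1.1 during ALPN negotiation."""
    ctx = ssl.create_default_context()  # certificate and hostname checks stay on
    ctx.set_alpn_protocols(["http/1.1"])  # never offer "h2"
    return ctx


def negotiated_protocol(host: str, port: int = 443,
                        timeout_s: float = 5.0) -> Optional[str]:
    """Open a TLS connection and report which ALPN protocol the server selected."""
    with socket.create_connection((host, port), timeout=timeout_s) as raw:
        with http1_only_context().wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Keeping a dedicated context for WebSocket traffic, separate from whatever pool serves REST calls, avoids the shared-pool case described above. Running negotiated_protocol against the service hostname is a quick way to confirm what the edge selects for this client.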

2. Next.js bakes `rewrites()` at BUILD time, not runtime

next.config.js rewrites() is evaluated during next build and frozen into routes-manifest.json. If you set MATRIX_BACKEND_URL only as a Cloud Run runtime env var, the rewrites array was already locked — to an empty list — when the image was built. The dashboard then 404s every /api/* call and you spend an hour staring at a config that looks correct.

The fix: pass the backend URL as a --build-arg, not just a runtime env.

# web/Dockerfile
ARG MATRIX_BACKEND_URL=""
ENV MATRIX_BACKEND_URL=$MATRIX_BACKEND_URL
RUN npm run build

This is why the deploy script deploys the backend first — it needs the URL at build time, not at deploy time.
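A post-build check can catch an empty rewrite list before the image ships. The sketch below assumes the manifest is written to .next/routes-manifest.json and that its rewrites entry is either a list or an object with beforeFiles, afterFiles, and fallback lists; both shapes are handled because the layout can differ between Next.js versions. The function names are illustrative.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def baked_rewrites(manifest: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the rewrites recorded in a parsed routes-manifest.json.

    Assumption: `rewrites` is a list, or an object holding
    beforeFiles / afterFiles / fallback lists.
    """
    rewrites = manifest.get("rewrites", [])
    if isinstance(rewrites, dict):
        return [rule
                for key in ("beforeFiles", "afterFiles", "fallback")
                for rule in rewrites.get(key, [])]
    return list(rewrites)


def assert_backend_baked(manifest_path: Path, backend_url: str) -> None:
    """Abort if no baked rewrite points at the backend URL."""
    manifest = json.loads(manifest_path.read_text())
    destinations = [r.get("destination", "") for r in baked_rewrites(manifest)]
    if not any(d.startswith(backend_url) for d in destinations):
        raise SystemExit(
            f"no rewrite targets {backend_url}; was the build arg set?"
        )
```

Run as a build step after `npm run build`, this fails the image build loudly instead of shipping a dashboard that 404s.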

3. Cloud Run `/tmp` is per-container and ephemeral

Anything written to /tmp lives only as long as that container instance, and is invisible to every other instance. The per-agent tool sandbox and knowledge-file processing live there intentionally as a cache — files are re-materialized from Neo4j on demand. Never put durable state in /tmp. If you assume a file you wrote on one request is there on the next, you'll be right just often enough to be dangerous, then wrong when Cloud Run scales or recycles the instance.
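Treating /tmp as a cache means every read goes through a helper that can rebuild the file when it is missing. A minimal sketch of that pattern, assuming a hypothetical `fetch` callback that loads the bytes from durable storage (Neo4j in this post) and a hypothetical cache directory name:

```python
import tempfile
from pathlib import Path
from typing import Callable

CACHE_DIR = Path(tempfile.gettempdir()) / "agent-cache"  # hypothetical location


def materialize(name: str, fetch: Callable[[str], bytes],
                cache_dir: Path = CACHE_DIR) -> Path:
    """Return a local path for `name`, rebuilding the file if it is absent.

    `fetch` stands in for whatever loads the bytes from the source of
    truth; it is a hypothetical callback, not a real API.
    """
    path = cache_dir / name
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        partial = path.with_name(path.name + ".part")
        partial.write_bytes(fetch(name))
        partial.replace(path)  # rename into place so readers never see a partial file
    return path
```

Callers never assume the file survived from an earlier request. A new or recycled instance simply pays one fetch and carries on.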

4. Size the Neo4j VM correctly — or it OOM-locks

A too-small VM is worse than a slow one. Under memory pressure Neo4j can OOM-lock in a way that keeps the TCP port open but hangs every bolt session. From the backend's side this is the cruelest failure mode: the connection appears reachable, so every query sits and waits out the full connect timeout before failing. The whole platform feels frozen, and a port check tells you everything is fine.

Size the VM with headroom (we run a 4 GB instance after a 2 GB one locked up). And keep the recovery in your back pocket — if bolt hangs, reset the VM:

gcloud compute instances reset matrix-neo4j --zone=<zone>

A reset clears the lock far faster than diagnosing it under pressure.
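Since a plain port check passes in this state, a more useful probe goes one step further and waits for the server to answer the Bolt handshake. The sketch below sends the Bolt magic preamble followed by four version proposals (5.0 and 4.4 here, two slots empty) and expects a 4-byte reply. Whether a memory-starved Neo4j stalls at exactly this step is an assumption; the probe only distinguishes "port open" from "port open and answering".

```python
import socket

BOLT_MAGIC = bytes.fromhex("6060b017")
# Offer Bolt 5.0 and 4.4; the remaining two proposal slots are left empty.
VERSIONS = bytes([0, 0, 0, 5]) + bytes([0, 0, 4, 4]) + bytes(8)


def bolt_probe(host: str, port: int = 7687, timeout_s: float = 3.0) -> str:
    """Classify an endpoint as 'ok', 'hung', 'rejected', or 'unreachable'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "unreachable"  # nothing accepted the TCP connection
    with sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendall(BOLT_MAGIC + VERSIONS)
            reply = b""
            while len(reply) < 4:
                chunk = sock.recv(4 - len(reply))
                if not chunk:
                    return "rejected"  # server closed without answering
                reply += chunk
        except socket.timeout:
            return "hung"  # port is open, but no handshake reply arrived in time
        except OSError:
            return "rejected"
    # Four zero bytes mean the server accepted none of the offered versions.
    return "ok" if reply != bytes(4) else "rejected"
```

A monitor that alerts on "hung" points straight at the VM reset above, instead of leaving operators to infer it from a frozen dashboard.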

5. `--set-env-vars` on redeploy can wipe your flags

A real footgun. gcloud run deploy --set-env-vars=... replaces the entire env var set — it does not merge. If you ever set platform flags out-of-band (for example via SPRING_APPLICATION_JSON to flip a feature on), the next deploy that uses --set-env-vars silently wipes them, and the feature quietly reverts to its default. Use --update-env-vars for incremental changes, and keep every flag your deployment depends on in the deploy script itself so it's reapplied every time. A flag that lives only in the live service and not in source control is a flag you will lose on the next deploy.
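The difference between the two flags, and the idea of rendering the flag from a single source-controlled map, can be sketched in a few lines. This is a model of the semantics rather than gcloud's implementation. The helper names are illustrative, and the alternate-delimiter form (^DELIM^, described in `gcloud topic escaping`) is used when a value such as JSON contains commas.

```python
from typing import Dict


def set_env(current: Dict[str, str], given: Dict[str, str]) -> Dict[str, str]:
    """Model of --set-env-vars: the given map replaces everything."""
    return dict(given)


def update_env(current: Dict[str, str], given: Dict[str, str]) -> Dict[str, str]:
    """Model of --update-env-vars: the given map is merged over the current one."""
    return {**current, **given}


def render_flag(env: Dict[str, str], flag: str = "--set-env-vars") -> str:
    """Render one flag from the source-controlled map of variables."""
    pairs = [f"{key}={value}" for key, value in sorted(env.items())]
    if any("," in pair for pair in pairs):
        # Pick a delimiter that appears in none of the pairs.
        delim = next((d for d in "@#|;" if not any(d in p for p in pairs)), None)
        if delim is None:
            raise ValueError("no safe delimiter available for these values")
        return f"{flag}=^{delim}^" + delim.join(pairs)
    return f"{flag}=" + ",".join(pairs)
```

With every variable kept in one map under source control, using --set-env-vars on each deploy becomes safe, because the replacement set is always the complete set.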

A note on voice paths

The WebSocket gotcha (#1) only matters because the backend bridges audio server-side for telephony. The browser-direct voice path holds its own WebSocket straight to the model and never crosses Cloud Run's WS edge — a meaningfully different deployment surface. If you're weighing the two, One Agent, Two Voice Paths: Telephony Bridge vs. Browser-Direct walks through the trade-offs.

Takeaway

The shape that works for a production agent platform on GCP: one always-warm backend with a long request timeout and no CPU throttling for the long-lived connections, one scale-to-zero dashboard that proxies the API through Next.js rewrites to dodge CORS, one always-warm tool service so agents never hit a cold start mid-call, and one stateful Neo4j VM on a reserved static IP so a routine restart never silently breaks the wire. Secrets in Secret Manager, deploy in one ordered script, and respect the five silent failure modes: HTTP/1.1-only WebSockets, build-time Next.js rewrites, ephemeral /tmp, an OOM-locking graph DB, and --set-env-vars replacing the whole env var set on redeploy.

None of these are exotic. They're just the difference between a deploy that demos and a deploy that takes calls.

Ready to ship one? Spin up a workspace, point your own GCP project at it, and adapt scripts/deploy.sh to your tenant — the topology here is exactly what runs in production. The full runbook lives in docs/GCP_DEPLOY.md.
