Production deployment (Helm)
This page describes how to deploy Matyan to Kubernetes in production using the Helm chart. The chart deploys the application services (backend, frontier, UI, ingestion workers, control workers) and optionally in-cluster infrastructure (Kafka, S3-compatible store, FoundationDB). For production you typically use external FDB, Kafka, and S3 and deploy only the application layer.
Overview
The Helm chart lives at deploy/helm/matyan/. It:
- Always deploys (when enabled): backend (REST API), frontier (ingestion gateway), UI (React app), ingestion workers (Kafka → FDB), control workers (control-events → S3 cleanup).
- Optionally deploys (via subcharts): Kafka (Bitnami), S3-compatible store (RustFS), FoundationDB operator and cluster. These are enabled by default for a self-contained install. For production with external services, disable each subchart by setting the corresponding `*.install: false`.
All application services are stateless and can be scaled by increasing replicaCount. The chart creates Deployments, Services, optional Ingress, optional ServiceMonitors, and optional CronJobs for periodic cleanup (orphan S3 deletion, tombstone removal). See Periodic cleanup jobs (CronJobs) for how to enable and configure them.
Prerequisites
- Kubernetes 1.24+
- Helm 3.10+
- A FoundationDB cluster reachable from the cluster (or deploy via the chart with `fdb-cluster.install: true` and the FDB operator).
- A Kafka broker (or deploy via the chart with `kafka.install: true`).
- An S3-compatible object store for blob artifacts (or deploy RustFS with `rustfs.install: true`).
- (Recommended) An ingress controller (e.g. ingress-nginx, Traefik) and TLS certificates (e.g. cert-manager) for exposing the UI and backend.
Before first install, fetch chart dependencies:
cd deploy/helm/matyan
helm dependency build
This pulls the Bitnami Kafka and RustFS subcharts (and the bundled FDB operator) according to Chart.yaml; they are only deployed when the corresponding *.install condition is true.
Published chart (Kustomize / Helm repo)
The chart is published at https://4gt-104.github.io/matyan-core/helm (version 0.1.0, app version 0.1.0). You can use it from Kustomize:
helmCharts:
- name: matyan
repo: https://4gt-104.github.io/matyan-core/helm
version: "0.1.0"
namespace: matyan
releaseName: matyan
valuesFile: values.yaml
Or add the repo and install with Helm:
helm repo add matyan https://4gt-104.github.io/matyan-core/helm
helm install my-matyan matyan/matyan --version 0.1.0 -f values.yaml
To publish a new chart version: run ./scripts/publish-helm-to-gh-pages.sh from the repo root, then commit and push the gh-pages submodule (git add helm, git commit, git push).
Chart structure and values
Default values
values.yaml defines all options with inline comments. Important defaults:
- `kafka.install: true` — deploys an in-cluster Kafka broker. Set `false` and configure `kafkaClient.bootstrapServers` to use an external cluster.
- `rustfs.install: true` — deploys an in-cluster RustFS (S3-compatible) store. Set `false` and configure `s3.*` to use an external S3.
- `fdb-operator.install: true` — deploys the FDB operator. Set `false` if the operator is already installed cluster-wide.
- `fdb-cluster.install: true` — creates a FoundationDBCluster CR. Set `false` to supply the FDB cluster file yourself via `existingConfigMap`, `existingSecret`, or `clusterFileContent`.
- `periodicJobs.cleanupOrphanS3.enabled` / `periodicJobs.cleanupTombstones.enabled: true` — periodic cleanup CronJobs are on by default. Set `enabled: false` or `schedule: ""` to disable.
- `backend.hostBase`, `ui.hostBase: ""` — required; you must set these to the public URLs (e.g. `https://api.matyan.example.com`, `https://matyan.example.com`).
- `ingress.enabled: false` — enable and configure when you want the chart to create the Ingress.
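Putting these defaults together, a minimal values override for a production-style install with external infrastructure might look like the following sketch (all hostnames, endpoints, and the ConfigMap name are placeholders you must replace):

```yaml
# Disable in-cluster infrastructure; point at external services instead.
kafka:
  install: false
rustfs:
  install: false
fdb-operator:
  install: false
fdb-cluster:
  install: false
  existingConfigMap: my-fdb-cluster-file   # ConfigMap you provide (placeholder name)

kafkaClient:
  bootstrapServers: "kafka-broker-1:9092,kafka-broker-2:9092"

s3:
  endpoint: "https://s3.amazonaws.com"
  bucket: "matyan-artifacts"

backend:
  hostBase: "https://api.matyan.example.com"
ui:
  hostBase: "https://matyan.example.com"

ingress:
  enabled: true
```

Credentials are intentionally absent here; supply them via `existingSecret` references as described under Credentials and secrets below.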
Values overlays
- `values-production.yaml` — Intended for production: deploys the FDB cluster (operator assumed pre-installed), disables in-cluster Kafka and RustFS in favour of external services, credentials via `existingSecret`, Ingress enabled. Use it as a base and override with `--set` or a custom file.
- `values-dev.yaml` — Deploys all subcharts in-cluster (single-node FDB, Kafka, RustFS) for local/dev; useful with a single-node cluster or Minikube. Overrides only what differs from the defaults (process counts, storage class, storage size).
You typically run:
helm upgrade --install matyan ./deploy/helm/matyan \
-f deploy/helm/matyan/values-production.yaml \
--set ui.hostBase=https://matyan.example.com \
--set backend.hostBase=https://api.matyan.example.com \
--namespace matyan --create-namespace
and supply the rest (Kafka bootstrap, S3 endpoint, secrets) via --set, -f, or a sealed/encrypted values file.
Required configuration for production
These values must be set (in a values file or via --set) for a production-style deploy with external services.
1. Public URLs (required)
- `backend.hostBase` — Public URL of the backend API (e.g. `https://api.matyan.example.com`). Used for CORS and by the UI as the default API base; Helm will fail if empty.
- `ui.hostBase` — Public URL of the frontend (e.g. `https://matyan.example.com`). Used for CORS and Ingress; required.
You can use the same host for both (path-based routing: / → UI, /api/ → backend). The chart merges Ingress rules when both URLs use the same hostname.
2. FoundationDB
The backend and workers need the FDB cluster file (connection string). You must provide it in one of these ways:
- `fdb-cluster.existingConfigMap` — Name of a ConfigMap that contains the cluster file. The key inside the ConfigMap is `fdbClient.clusterFileKey` (default `fdb.cluster`). The chart mounts this into backend and worker pods.
- `fdb-cluster.existingSecret` — Same idea with a Secret (e.g. for sensitive cluster files).
- `fdb-cluster.clusterFileContent` — Inline string; the chart creates a ConfigMap from it. Prefer a ConfigMap/Secret in production.
Set fdbClient.clusterFileDir and fdbClient.clusterFileName if your ConfigMap/Secret uses a different key or you want a different mount path. The backend and workers read FDB_CLUSTER_FILE (path = clusterFileDir/clusterFileName).
If you deploy FDB via the chart (fdb-operator.install and fdb-cluster.install), the operator writes the cluster file to a ConfigMap and the chart wires it automatically; you do not set existingConfigMap in that case.
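For an external FDB cluster, the wiring described above can be sketched as a values fragment (the ConfigMap name and mount directory are illustrative; the key names follow the defaults stated in this section):

```yaml
fdb-cluster:
  install: false
  existingConfigMap: matyan-fdb-cluster-file  # ConfigMap holding the cluster file (example name)

fdbClient:
  clusterFileKey: fdb.cluster       # key inside the ConfigMap (default)
  clusterFileDir: /var/fdb/config   # mount path inside pods (example)
  clusterFileName: fdb.cluster      # FDB_CLUSTER_FILE = clusterFileDir/clusterFileName
```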
3. Kafka
- `kafkaClient.bootstrapServers` — Comma-separated list of broker addresses (e.g. `kafka-broker-1:9092,kafka-broker-2:9092`). Required when `kafka.install` is false.
- `kafkaClient.securityProtocol` — Optional; set to `SASL_PLAINTEXT` or `SASL_SSL` if your broker uses SASL.
- `kafkaClient.saslMechanism`, `kafkaClient.saslUsername`, `kafkaClient.saslPassword` — Or use `kafkaClient.existingSecret` with keys `saslUsernameKey` / `saslPasswordKey` (default `sasl-username`, `sasl-password`) to avoid plaintext in values.
The chart creates no Kafka topics when using an external broker; ensure the topics data-ingestion and control-events exist (and partition counts match your expectations). The kafka-init-job only runs when kafka.install is true and creates these topics in-cluster.
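With an external broker, the two topics can be created out of band, for example with the standard `kafka-topics.sh` tool shipped with Apache Kafka (broker address, partition counts, and replication factor are placeholders; match `--partitions` for `data-ingestion` to `kafka.dataIngestionPartitions` and your scaling plan):

```shell
# Create the two topics the chart expects, idempotently.
kafka-topics.sh --bootstrap-server kafka-broker-1:9092 \
  --create --if-not-exists --topic data-ingestion \
  --partitions 6 --replication-factor 3

kafka-topics.sh --bootstrap-server kafka-broker-1:9092 \
  --create --if-not-exists --topic control-events \
  --partitions 1 --replication-factor 3
```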
4. S3 (blob storage)
- `s3.endpoint` — Internal endpoint URL used by backend and frontier pods (e.g. `https://s3.amazonaws.com` or `http://minio.example.svc:9000`). Required when `rustfs.install` is false.
- `s3.bucket` — Bucket name (default `matyan-artifacts`). The bucket must exist when using an external S3; the s3-init-job only runs when `rustfs.install` is true.
- `s3.publicEndpoint` — Optional; public URL used in presigned upload URLs returned to training clients. Must be reachable from outside the cluster. If unset, the chart may derive it from the RustFS S3 Ingress when RustFS is in use; otherwise presigned URLs use `s3.endpoint` (which may not be reachable from clients).
- `s3.existingSecret` — Recommended. Name of a Secret with S3 credentials. Keys default to `s3.accessKeyKey` / `s3.secretKeyKey` (e.g. `s3-access-key`, `s3-secret-key`). If not set, you must set `s3.accessKey` and `s3.secretKey` (plaintext; avoid in production).
5. Blob URI secret (required for stable blob URLs)
The backend encrypts blob references (for images, audio, etc.) with a Fernet key. If this key is not fixed, blob URIs break across pod restarts and across replicas.
- `blobUriSecret.existingSecret` — Recommended. Name of a Secret containing the Fernet key. The key inside the Secret is `blobUriSecret.key` (default `blob-uri-secret`).
- `blobUriSecret.value` — Alternative: inline base64-encoded Fernet key. Prefer `existingSecret` in production.
Generate a key once and store it in a Secret:
python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
kubectl create secret generic matyan-blob-uri --from-literal=blob-uri-secret="<paste-key>" -n matyan
Then set blobUriSecret.existingSecret: matyan-blob-uri.
6. Ingress and TLS
- `ingress.enabled: true` — Create the Ingress resource.
- `ingress.className` — Ingress class (e.g. `nginx`, `traefik`). Omit to use the cluster default.
- `ingress.annotations` — Controller-specific annotations (e.g. cert-manager issuer, body size).
- `ingress.uiTlsSecretName` / `ingress.backendTlsSecretName` — Optional; TLS secret names for the UI and backend hostnames. If not set, the chart still generates TLS entries (when `hostBase` uses `https://`) without a `secretName`, so the controller or cert-manager can attach certificates.
The chart builds Ingress rules from ui.hostBase and backend.hostBase: same host → one host rule with path-based routing (/ → UI, /api/ → backend, /api/v1/ws/ and artifact paths → frontier); different hosts → two host rules.
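A sketch of an Ingress configuration using cert-manager with ingress-nginx (the issuer name and TLS secret names are assumptions, not chart defaults):

```yaml
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # example issuer name
  uiTlsSecretName: matyan-ui-tls        # example secret names; cert-manager
  backendTlsSecretName: matyan-api-tls  # will populate them with certificates
```

With `ui.hostBase` and `backend.hostBase` set to the same hostname, this yields one host rule with path-based routing as described above; with different hostnames, two host rules.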
What the chart deploys (resources)
| Resource type | What |
|---|---|
| Deployments | backend, frontier, ui, ingestion-worker, control-worker (each with replicas, image, env, probes). |
| Services | One ClusterIP (or override) per Deployment; workers have an extra Service for the metrics port when metricsPort is set to a non-zero value. |
| ConfigMaps | FDB cluster file (when using clusterFileContent or operator-managed); optional per-service env. |
| Secrets | Not created for credentials; you create S3, Kafka SASL, blob-uri, and optionally UI API token secrets and reference them via existingSecret. |
| Ingress | One Ingress with rules for UI and backend (and frontier paths) when ingress.enabled is true. |
| Jobs | kafka-init-job (post-install/upgrade hook) when kafka.install is true; s3-init-job when rustfs.install is true. |
| CronJobs | cleanup-orphan-s3, cleanup-tombstones when `periodicJobs.*.enabled` is true and `schedule` is set. |
| ServiceMonitors | One per service when metrics.serviceMonitor.enabled is true (requires Prometheus Operator CRDs). |
| Subcharts | FDB operator, Kafka (Bitnami), RustFS — only when the corresponding *.install is true. |
Backend, ingestion worker, and control worker share the same image (matyan-backend); workers override the container command to matyan-backend ingest-worker and matyan-backend control-worker. Frontier and UI use their own images.
Verifying the deployment
After installing or upgrading the chart, use these checks to confirm everything is deployed correctly.
1. Helm release and pods
# Replace RELEASE_NAME and NAMESPACE with your install (e.g. matyan / matyan)
helm status RELEASE_NAME -n NAMESPACE
helm list -n NAMESPACE
kubectl get pods -n NAMESPACE -l app.kubernetes.io/instance=RELEASE_NAME
All application pods (backend, frontier, ui, ingestion-worker, control-worker) should be Running and Ready (e.g. 1/1). If any pod is CrashLoopBackOff or not ready, use kubectl describe pod <name> -n NAMESPACE and kubectl logs <name> -n NAMESPACE to inspect events and logs.
2. Services and (if enabled) Ingress
kubectl get svc,ingress -n NAMESPACE -l app.kubernetes.io/instance=RELEASE_NAME
Ensure Services exist and, if you use Ingress, that the Ingress has an address or is admitted by your controller.
3. Health endpoints
Backend and frontier expose readiness probes. From inside the cluster (or via port-forward):
# Backend: readiness checks FDB + Kafka
kubectl exec -n NAMESPACE deploy/RELEASE_NAME-backend -- curl -s -o /dev/null -w "%{http_code}" http://localhost:53800/health/ready/
# Frontier: readiness checks Kafka + S3
kubectl exec -n NAMESPACE deploy/RELEASE_NAME-frontier -- curl -s -o /dev/null -w "%{http_code}" http://localhost:53801/health/ready/
A 200 response means the component considers its dependencies healthy. If you don’t have curl in the image, use kubectl port-forward and hit the same paths from your machine.
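As a concrete port-forward alternative (the backend port follows the probe command above; the release and namespace names are placeholders):

```shell
# Forward the backend port locally, then hit the readiness path from your machine.
kubectl port-forward -n NAMESPACE deploy/RELEASE_NAME-backend 53800:53800 &
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:53800/health/ready/
```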
4. UI and API
- Open the UI in a browser at ui.hostBase (or the URL from your Ingress). You should see the Matyan dashboard.
- Create a project or open an existing one; the UI talks to the backend. If the UI loads but projects fail, check backend logs and CORS (backend and UI hostBase must be correct).
- Optionally create a run (e.g. with the Python client) and confirm it appears in the UI and that ingestion workers are consuming (check worker logs or Kafka consumer lag).
5. Optional: Kafka and FDB
- If you deployed Kafka or FDB via the chart, check their pods and CRs: `kubectl get pods -n NAMESPACE`, `kubectl get foundationdbclusters -n NAMESPACE` (when using the FDB operator).
- For external Kafka, ensure topics data-ingestion and control-events exist; backend and workers expect them.
Init containers and startup order
Pods wait for dependencies via init containers so they do not start until services are ready:
- Backend and workers: wait-for-fdb (if a cluster file is configured) then wait-for-kafka (if kafka.waitForReady.enabled). Both can be disabled if FDB and Kafka are guaranteed ready.
- Frontier: wait-for-kafka only.
Timeout and image for these are configured under fdbClient.waitForReady and kafka.waitForReady. The kafka-init-job runs only when Kafka is deployed by the chart and creates the two topics; with external Kafka you create topics out of band.
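If FDB and Kafka are guaranteed ready before the pods start, both init containers can be disabled; a sketch (the `fdbClient.waitForReady.enabled` key is an assumption mirroring the documented `kafka.waitForReady.enabled` — check values.yaml for the exact key):

```yaml
fdbClient:
  waitForReady:
    enabled: false   # skip wait-for-fdb init container (assumed key)
kafka:
  waitForReady:
    enabled: false   # skip wait-for-kafka init container
```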
Scaling and tuning
- backend.replicaCount, frontier.replicaCount — Scale for API and ingestion load; backend is read-heavy, frontier is WebSocket + presign.
- ingestionWorker.replicaCount — Scale with the number of data-ingestion topic partitions; Kafka assigns partitions to consumers. A common starting point is one replica per partition (e.g. 6 if kafka.dataIngestionPartitions is 6).
- controlWorker.replicaCount — control-events has one partition by default; one replica is usually enough. Add more only if you add partitions and need throughput.
- ui.replicaCount — Scale for frontend traffic; stateless.
Set resources (requests/limits) for each component to avoid overcommit and to improve scheduling. Use nodeSelector, tolerations, and affinity to place pods where appropriate (e.g. workers on dedicated nodes).
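For example, per-component resources and placement can be set like this; the figures and the node label are illustrative starting points, not tested recommendations:

```yaml
ingestionWorker:
  replicaCount: 6        # one replica per data-ingestion partition (example)
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
  nodeSelector:
    workload: matyan-workers   # example node label for dedicated worker nodes
```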
Credentials and secrets (production)
Use Kubernetes Secrets for all sensitive values; reference them with existingSecret (and key names) so nothing is stored in Helm values or in Git:
| What | Values | Secret keys (default) |
|---|---|---|
| S3 | s3.existingSecret | s3.accessKeyKey, s3.secretKeyKey |
| Kafka SASL | kafkaClient.existingSecret | kafkaClient.saslUsernameKey, kafkaClient.saslPasswordKey |
| Blob URI Fernet | blobUriSecret.existingSecret | blobUriSecret.key |
| UI → backend auth token | ui.apiAuthTokenSecret | ui.apiAuthTokenKey |
| FDB cluster file | fdb-cluster.existingSecret (or existingConfigMap) | fdbClient.clusterFileKey |
Example: create S3 and blob-uri secrets, then in values or --set:
s3:
endpoint: "https://s3.amazonaws.com"
bucket: "matyan-artifacts"
existingSecret: "matyan-s3"
blobUriSecret:
existingSecret: "matyan-blob-uri"
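The referenced S3 secret can be created out of band with the default key names from the table above (the secret name matches the values fragment; the credentials are placeholders):

```shell
kubectl create secret generic matyan-s3 -n matyan \
  --from-literal=s3-access-key="<access-key>" \
  --from-literal=s3-secret-key="<secret-key>"
```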
Periodic maintenance (CronJobs)
The chart creates two CronJobs that run the backend CLI. Both are enabled by default:
- `cleanupOrphanS3` — Deletes S3 objects for runs that have a deletion tombstone in FDB (e.g. if control-worker never processed the event). Default schedule: daily at 03:00 (`"0 3 * * *"`).
- `cleanupTombstones` — Removes old deletion tombstones from FDB so the `_deleted` index does not grow. Default schedule: weekly on Sunday at 04:00 (`"0 4 * * 0"`).
To disable a job set periodicJobs.<name>.enabled: false (or schedule: ""). Optional settings (lock TTL, limits) are documented in values.yaml under periodicJobs.*.
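For example, to move orphan cleanup off-peak and disable tombstone removal entirely (a sketch using the keys described above; the schedule is arbitrary):

```yaml
periodicJobs:
  cleanupOrphanS3:
    enabled: true
    schedule: "30 1 * * *"   # daily at 01:30 instead of the 03:00 default
  cleanupTombstones:
    enabled: false
```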
Prometheus metrics
Backend and frontier expose metrics on their HTTP port (/metrics/). Workers expose a separate metrics port (default 9090) when ingestionWorker.metricsPort / controlWorker.metricsPort are non-zero. Enable metrics.serviceMonitor.enabled: true and set metrics.serviceMonitor.labels (e.g. release: kube-prometheus-stack) so your Prometheus instance scrapes them. See the chart README for metric names and labels.
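Enabling scraping alongside a kube-prometheus-stack install might look like the following (the `release` label value is an example and must match your Prometheus instance's ServiceMonitor selector):

```yaml
metrics:
  serviceMonitor:
    enabled: true
    labels:
      release: kube-prometheus-stack   # example; match your Prometheus selector
```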
Full configuration reference
Every option is documented in values.yaml with comments. The chart README at deploy/helm/matyan/README.md contains:
- Architecture diagram and service roles
- Prerequisites and `helm dependency build`
- Quick start for development (`values-dev.yaml`) and production (`values-production.yaml`)
- Detailed tables for every section (Kafka, Kafka client, RustFS, FDB operator, FDB cluster, FDB client, S3, blobUriSecret, CORS, backend, frontier, UI, workers, Ingress)
- Credential management (existingSecret for each credential)
- Fallbacks and automations (bootstrap servers, S3 endpoint/public endpoint, CORS, FDB ConfigMap name, Ingress same-host vs split-host, TLS)
- FoundationDB operating modes (operator + CR, CR only, external cluster file)
- Init jobs (kafka-init, s3-init) and pod startup ordering
- Prometheus metrics and ServiceMonitor usage
- Upgrading and dependency updates
Use that README as the authoritative reference; this page summarizes production deployment and the main concepts so you can get a deployment running and then tune via values.