Troubleshooting

Common failure modes and recovery procedures for the Virtufin WorkManager.

See also: Development guide → Troubleshooting for service-startup, port-conflict, Python-engine, and protoc-generation issues. Deployment guide → Troubleshooting for Kubernetes pod, image-pull, and worker-recovery issues.

Worker lifecycle failures

Worker stuck in "Created" state (never started)

Symptom: A worker is created via CreateWorker but never transitions to "Running" after StartWorker is called.

Likely causes:

  • The Python engine subprocess crashed during startup
  • The C# engine failed to compile the source (Roslyn error)
  • The .NET DLL engine failed to load the assembly
  • The worker code is unreachable (URL fetch failed)

Diagnose:

# Check the engine startup logs
# (--tail=-1 is needed because kubectl returns only the last 10 lines per pod when a label selector is used)
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "engine\|start\|load"

# For Python workers, check the Python subprocess stderr
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "python\|subprocess"

# For C# workers, check the Roslyn compilation errors
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "roslyn\|compile"

"Worker not found" after restart

Symptom: After a WorkManager restart, workers that were Running are no longer found.

Likely cause: Worker recovery is asynchronous and runs in the WorkManagerRecoveryHostedService. The recovery sweeps the state store and re-creates worker instances, but this can take a few seconds depending on the worker count.

Diagnose:

# Check the recovery logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "recovery"

# Verify the state store has the worker records
docker exec -it dapr_redis_1 redis-cli KEYS 'worker|*'

Fix: Wait for the recovery to complete (typically < 30s for 100 workers). The readiness probe gates traffic until recovery finishes.
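
A client that may reach the WorkManager during the recovery window can poll for the worker instead of failing on the first miss. The following is a minimal sketch, not part of the WorkManager API: `lookup` is a hypothetical callable, supplied by you, that returns the worker's state or `None` when the worker is not found, and the 30-second default only mirrors the typical recovery time quoted above.

```python
import time


def wait_for_worker(lookup, worker_id, timeout_s=30.0, interval_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll lookup(worker_id) until it returns a non-None value.

    lookup is a hypothetical stand-in for whatever call your client uses
    to query a worker; it must return None while the worker is not found.
    Raises TimeoutError if the worker does not appear within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        result = lookup(worker_id)
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"worker {worker_id!r} not found after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the helper can be exercised in tests without real waiting.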

Python engine failures

"Python worker process exited with code N"

Common causes:

Exit code   Meaning                                     Fix
127         Python not found                            Install Python 3.9+ on the WorkManager host
1           Worker code raised an unhandled exception   Check the worker code for runtime errors
-1          Killed by SIGTERM                           The engine called StopProcessAsync (expected during shutdown)
137         Killed by OOM                               Increase memory limits or reduce worker memory footprint
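
When triaging many log lines, it can help to translate exit codes mechanically. The sketch below encodes the table above and, as an addition not taken from the WorkManager itself, falls back to the common POSIX shell convention that `128 + N` means "terminated by signal N" (137 is 128 + 9, SIGKILL). The function name `describe_exit_code` is hypothetical.

```python
import signal

# Meanings copied from the exit-code table in this guide.
KNOWN_EXIT_CODES = {
    127: "Python not found",
    1: "Worker code raised an unhandled exception",
    -1: "Killed by SIGTERM (expected during shutdown)",
    137: "Killed by OOM",
}


def describe_exit_code(code: int) -> str:
    """Return a short human-readable meaning for a worker exit code."""
    if code == 0:
        return "Exited normally"
    if code in KNOWN_EXIT_CODES:
        return KNOWN_EXIT_CODES[code]
    # Assumption: the shell convention 128 + N for "killed by signal N".
    if 128 < code < 128 + signal.NSIG:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"
        return f"Terminated by {name}"
    return "Unknown exit code"
```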

Diagnose:

# Check the Python engine logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "python\|sandbox"

# Check the blocked module list
kubectl get configmap -n virtufin workmanager-config -o yaml | grep PYTHON_BLOCKED

Sandbox module import errors

Symptom: Worker logs show ImportError: <module> is not allowed.

Likely cause: The worker code imports a module that's in the PYTHON_BLOCKED_MODULES list (default: subprocess, os, socket, requests, etc.).

Fix: Either remove the import from the worker code, or add the module to PYTHON_ALLOWED_PACKAGES (note: this only allows pip packages, not stdlib modules).
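
Blocked imports can be caught before the code is uploaded. The sketch below statically scans worker source with Python's `ast` module and reports top-level modules that appear in a blocked set. It is an illustration only: the `BLOCKED` set contains just the four defaults named above (the real list comes from PYTHON_BLOCKED_MODULES), the function name is hypothetical, and how the sandbox itself enforces the list is not shown here. Dynamic imports via `importlib` or `__import__` are not detected.

```python
import ast

# Assumed subset of PYTHON_BLOCKED_MODULES; read the real list from your config.
BLOCKED = {"subprocess", "os", "socket", "requests"}


def blocked_imports(source: str, blocked=BLOCKED) -> list[str]:
    """Return the sorted top-level module names imported by source
    that appear in blocked. Relative imports are ignored."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif (isinstance(node, ast.ImportFrom)
              and node.module and node.level == 0):
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]  # "os.path" counts as "os"
            if top_level in blocked:
                found.add(top_level)
    return sorted(found)
```

An empty result means no statically visible import is in the blocked set; it does not guarantee that the worker will pass the sandbox.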

Code fetched from URL fails

Symptom: LoadCodeFromUrl returns success but the worker fails to start.

Likely cause: The fetched content doesn't match the expected MIME type (loaded but unparseable), or the SHA-256 hash doesn't match (if ContentSha256 is set).

Diagnose:

# Check the code fetcher logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "fetcher\|url\|sha256"

# Verify the URL is in the allow-list
kubectl get configmap -n virtufin workmanager-config -o yaml | grep ALLOWED_CODE_SOURCE_HOSTS
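
To rule out a hash mismatch, compute the SHA-256 of the exact bytes served at the URL and compare it with the value passed as ContentSha256. A minimal sketch, assuming the expected value is a hex-encoded digest (the encoding the WorkManager expects is not stated here, so verify it); both function names are hypothetical.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of content."""
    return hashlib.sha256(content).hexdigest()


def matches_expected(content: bytes, expected_hex: str) -> bool:
    """Compare content's digest with an expected hex digest,
    ignoring case and surrounding whitespace in the expected value."""
    return sha256_hex(content) == expected_hex.strip().lower()
```

Hash the bytes exactly as downloaded: a trailing newline or a change in line endings added by an editor produces a different digest.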

C# / .NET engine failures

C# source compilation errors

Symptom: Worker is created but the C# engine returns a compilation error when loading the source.

Likely cause: The C# source has syntax errors or references types not available in the runtime assemblies.

Diagnose:

# Check the Roslyn compilation error in the logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "error CS\|roslyn"

Fix: Fix the worker source code and call LoadCodeFromContent with the updated bytes.

.NET DLL load errors

Symptom: LoadCodeFromUrl returns success but the worker fails with InvalidOperationException: Assembly load failed.

Likely cause: A dependency DLL is missing from the ZIP, conflicts in version with the WorkManager's own assemblies, or targets the wrong platform (AnyCPU vs. x64 vs. ARM).

Diagnose:

# Check the assembly load logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "assembly\|dependency\|resolving"
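
A missing dependency can also be checked offline by listing the DLLs inside the ZIP before uploading it. The sketch below uses only the Python standard library; the function names and the idea of passing a list of required file names are illustrative assumptions, and the check covers presence only, not version or platform conflicts.

```python
import io
import zipfile


def dlls_in_zip(data: bytes) -> list[str]:
    """Return the sorted archive paths of all .dll entries in a ZIP."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return sorted(name for name in archive.namelist()
                      if name.lower().endswith(".dll"))


def missing_dlls(data: bytes, required: list[str]) -> list[str]:
    """Return the required DLL file names that are absent from the ZIP.

    Matching is by file name only (case-insensitive), regardless of the
    directory the DLL sits in inside the archive.
    """
    present = {name.rsplit("/", 1)[-1].lower() for name in dlls_in_zip(data)}
    return sorted(name for name in required if name.lower() not in present)
```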

State store failures

Worker state not persisting across restarts

Symptom: Workers created before a restart are not recovered.

Likely cause: The state store (Redis/Valkey) is unavailable during restart, or the recovery hosted service is misconfigured.

Diagnose:

# Check the state store contents
docker exec -it dapr_redis_1 redis-cli KEYS 'worker|*'

# Check the recovery service logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "recovery"

Memory and resource issues

Memory leak in long-running workers

Symptom: The WorkManager process's memory usage grows over time without bound.

Likely cause: Each worker holds an engine instance with a Python subprocess. If workers are not being deleted when no longer needed, the subprocess count grows.

Diagnose:

# Count Python subprocesses
ps aux | grep -c "[p]ython3.*worker"

# Compare with the worker count
docker exec dapr_redis_1 redis-cli --scan --pattern 'worker|*' | wc -l

Fix:

  1. Ensure your application calls DeleteWorker when workers are no longer needed
  2. Configure a state store TTL on worker records (recommended: 24h)
  3. Increase the WorkManager's memory limits if the growth reflects legitimate load
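
One way to make the first fix hard to forget is to tie the worker's lifetime to a scope in the calling application. The sketch below is illustrative only: `client`, `create_worker`, and `delete_worker` are hypothetical stand-ins for however your application issues the CreateWorker and DeleteWorker calls, not a real WorkManager client API.

```python
from contextlib import contextmanager


@contextmanager
def managed_worker(client, spec):
    """Create a worker and guarantee it is deleted when the block exits.

    client is a hypothetical object exposing create_worker(spec) -> id
    and delete_worker(id), wrapping the CreateWorker/DeleteWorker calls.
    """
    worker_id = client.create_worker(spec)
    try:
        yield worker_id
    finally:
        # Runs on normal exit and on exceptions, so the worker (and its
        # engine subprocess) is not left behind.
        client.delete_worker(worker_id)
```

Used as `with managed_worker(client, spec) as worker_id: ...`, the worker is deleted even if the body raises.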

Recovery procedures

Restart all workers

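# Delete all worker records from the state store. Recovery re-creates workers from
# these records, so verify that your application creates its workers again afterwards.
# (No -t on the first command: a TTY appends carriage returns to the key names, so DEL would not match.)
docker exec dapr_redis_1 redis-cli --scan --pattern 'worker|*' | xargs -L 100 docker exec -i dapr_redis_1 redis-cli DEL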

# Restart the WorkManager
kubectl rollout restart deployment/virtufin-workmanager -n virtufin
kubectl rollout status deployment/virtufin-workmanager -n virtufin

Force recovery re-run

# Restart the WorkManager — recovery runs on startup
kubectl rollout restart deployment/virtufin-workmanager -n virtufin
kubectl rollout status deployment/virtufin-workmanager -n virtufin

# Watch the recovery logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager -f | grep -i recovery

Reporting issues

If none of the above resolves your issue, gather the following and file a Gitea issue:

  • WorkManager version (LIBRARY_VERSION from the build info endpoint)
  • Worker ID(s) affected
  • Engine type (Python / C# source / .NET DLL)
  • Worker code (or a minimal reproduction)
  • Full error message and gRPC status code
  • Relevant log lines from the WorkManager