Troubleshooting
Common failure modes and recovery procedures for the Virtufin WorkManager.
See also: Development guide → Troubleshooting for service-startup, port-conflict, Python-engine, and protoc-generation issues. Deployment guide → Troubleshooting for Kubernetes pod, image-pull, and worker-recovery issues.
Worker lifecycle failures
Worker stuck in "Created" state (never started)
Symptom: A worker is created via CreateWorker but never transitions to
"Running" after StartWorker is called.
Likely causes:
- The Python engine subprocess crashed during startup
- The C# engine failed to compile the source (Roslyn error)
- The .NET DLL engine failed to load the assembly
- The worker code is unreachable (URL fetch failed)
Diagnose:
# Check the engine startup logs
# (--tail=-1 fetches the full log; with a label selector, kubectl logs returns only the last 10 lines per pod by default)
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "engine\|start\|load"
# For Python workers, check the Python subprocess stderr
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "python\|subprocess"
# For C# workers, check the Roslyn compilation errors
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "roslyn\|compile"
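The log queries in this guide all share the same namespace and label selector, so they can be wrapped in a small helper. The following is a minimal sketch, not part of the WorkManager tooling: the function name `wm_logs` and the `WM_NAMESPACE` / `WM_SELECTOR` variables are hypothetical. It passes `--tail=-1` because `kubectl logs` returns only the last 10 lines per pod when a label selector is used, and `--prefix` so each line shows which pod produced it.

```shell
# Hypothetical helper: namespace and selector default to the values used in this guide.
WM_NAMESPACE=${WM_NAMESPACE:-virtufin}
WM_SELECTOR=${WM_SELECTOR:-app.kubernetes.io/name=virtufin-workmanager}

# wm_logs PATTERN
# Prints every WorkManager log line matching PATTERN (extended regex, case-insensitive).
wm_logs() {
    pattern=$1
    kubectl logs -n "$WM_NAMESPACE" -l "$WM_SELECTOR" --tail=-1 --prefix \
        | grep -iE "$pattern"
}
```

For example, `wm_logs 'roslyn|compile'` would correspond to the C# query above. Note that `-E` uses `|` for alternation rather than `\|`.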
"Worker not found" after restart
Symptom: After a WorkManager restart, workers that were Running are no longer found.
Likely cause: Worker recovery is asynchronous and runs in the
WorkManagerRecoveryHostedService. The recovery sweeps the state store and
re-creates worker instances, but this can take a few seconds depending on the
worker count.
Diagnose:
# Check the recovery logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "recovery"
# Verify the state store has the worker records
docker exec -it dapr_redis_1 redis-cli KEYS 'worker|*'
Fix: Wait for the recovery to complete (typically < 30s for 100 workers). The readiness probe gates traffic until recovery finishes.
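Because the readiness probe gates traffic until recovery finishes, one way to wait for recovery from a script is to block until the WorkManager pods report Ready. This is a sketch under that assumption; the function name `wait_for_workmanager_ready` and the default timeout of 60 seconds are illustrative choices, not values taken from the WorkManager.

```shell
# wait_for_workmanager_ready [TIMEOUT]
# Blocks until every WorkManager pod is Ready, or fails once TIMEOUT (default 60s) elapses.
wait_for_workmanager_ready() {
    timeout=${1:-60s}
    kubectl wait --for=condition=Ready pod \
        -n virtufin -l app.kubernetes.io/name=virtufin-workmanager \
        --timeout="$timeout"
}
```

A non-zero exit status from `wait_for_workmanager_ready 90s` means the pods did not become Ready in time, which is a reasonable point to start reading the recovery logs.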
Python engine failures
"Python worker process exited with code N"
Common causes:
| Exit code | Meaning | Fix |
|---|---|---|
| 127 | Python not found | Install Python 3.9+ on the WorkManager host |
| 1 | Worker code raised an unhandled exception | Check the worker code for runtime errors |
| -1 | Killed by SIGTERM | The engine called StopProcessAsync (expected during shutdown) |
| 137 | Killed by SIGKILL (128 + 9), usually the kernel OOM killer | Increase memory limits or reduce worker memory footprint |
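The table can be turned into a quick lookup when scanning logs. The sketch below is illustrative only: the function name `explain_exit_code` and its messages are hypothetical, and the fallback relies on the common Unix convention that an exit code above 128 means the process was killed by signal `code - 128`.

```shell
# explain_exit_code CODE
# Maps a Python worker exit code to a short explanation, following the table above.
explain_exit_code() {
    code=$1
    case "$code" in
        1)   echo "unhandled exception in worker code" ;;
        127) echo "python executable not found" ;;
        -1)  echo "stopped by the engine (StopProcessAsync)" ;;
        137) echo "killed by SIGKILL (signal 9), often the OOM killer" ;;
        *)
            # Codes above 128 conventionally encode 128 + signal number.
            if [ "$code" -gt 128 ] 2>/dev/null; then
                echo "killed by signal $((code - 128))"
            else
                echo "unknown exit code $code"
            fi
            ;;
    esac
}
```

Calling `explain_exit_code 143` reports signal 15 (SIGTERM), which is worth checking if a worker stops outside of a planned shutdown.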
Diagnose:
# Check the Python engine logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "python\|sandbox"
# Check the blocked module list
kubectl get configmap -n virtufin workmanager-config -o yaml | grep PYTHON_BLOCKED
Sandbox module import errors
Symptom: Worker logs show ImportError: <module> is not allowed.
Likely cause: The worker code imports a module that's in the
PYTHON_BLOCKED_MODULES list (default: subprocess, os, socket,
requests, etc.).
Fix: Either remove the import from the worker code, or add the module
to PYTHON_ALLOWED_PACKAGES (note: this only allows pip packages, not stdlib
modules).
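Blocked imports can be caught before the code is uploaded. The following is a rough pre-flight check, not the sandbox's own logic: the function name `find_blocked_imports` is hypothetical, and the module names passed to it should come from your actual PYTHON_BLOCKED_MODULES value. It only matches the first module on an `import` or `from` line, so `import json, os` would slip through.

```shell
# find_blocked_imports FILE MODULE...
# Prints "line:text" for each import of a listed module; returns 0 if any were found.
find_blocked_imports() {
    file=$1
    shift
    found=1
    for module in "$@"; do
        # Match "import <module>" or "from <module>", followed by whitespace, ".", ",", or end of line.
        if grep -nE "^[[:space:]]*(import|from)[[:space:]]+${module}([[:space:].,]|\$)" "$file"; then
            found=0
        fi
    done
    return "$found"
}
```

For example, `find_blocked_imports worker.py subprocess os socket requests` lists the offending lines in a hypothetical `worker.py`, grouped by module in the order given.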
Code fetched from URL fails
Symptom: LoadCodeFromUrl returns success but the worker fails to start.
Likely cause: The fetched content doesn't match the expected MIME type
(loaded but unparseable), or the SHA-256 hash doesn't match (if
ContentSha256 is set).
Diagnose:
# Check the code fetcher logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "fetcher\|url\|sha256"
# Verify the URL is in the allow-list
kubectl get configmap -n virtufin workmanager-config -o yaml | grep ALLOWED_CODE_SOURCE_HOSTS
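A hash mismatch can be confirmed outside the WorkManager by downloading the same URL and hashing it locally. This sketch assumes ContentSha256 is a hex-encoded digest (the encoding is not stated here) and that `sha256sum` from GNU coreutils is available; the function name `sha256_matches` is hypothetical.

```shell
# sha256_matches FILE EXPECTED_HEX
# Returns 0 if the SHA-256 of FILE equals EXPECTED_HEX (compared case-insensitively).
sha256_matches() {
    file=$1
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # sha256sum prints "<hash>  <filename>"; keep only the hash.
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}
```

A typical use would be to fetch the code with `curl -fsSL "$CODE_URL" -o worker.bin` and then run `sha256_matches worker.bin "$EXPECTED_SHA256"`, where both variables are placeholders for your own values.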
C# / .NET engine failures
C# source compilation errors
Symptom: Worker is created but the C# engine returns a compilation error when loading the source.
Likely cause: The C# source has syntax errors or references types not available in the runtime assemblies.
Diagnose:
# Check the Roslyn compilation error in the logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "error CS\|roslyn"
Fix: Fix the worker source code and call LoadCodeFromContent with the
updated bytes.
.NET DLL load errors
Symptom: LoadCodeFromUrl returns success but the worker fails with
InvalidOperationException: Assembly load failed.
Likely cause: A dependency DLL in the ZIP is missing, has a version conflict with the WorkManager's assemblies, or targets the wrong platform (AnyCPU vs x64 vs ARM).
Diagnose:
# Check the assembly load logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "assembly\|dependency\|resolving"
State store failures
Worker state not persisting across restarts
Symptom: Workers created before a restart are not recovered.
Likely cause: The state store (Redis/Valkey) is unavailable during restart, or the recovery hosted service is misconfigured.
Diagnose:
# Check the state store contents
docker exec -it dapr_redis_1 redis-cli KEYS 'worker|*'
# Check the recovery service logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 | grep -i "recovery"
Memory and resource issues
Memory leak in long-running workers
Symptom: The WorkManager process's memory usage grows over time without bound.
Likely cause: Each worker holds an engine instance with a Python subprocess. If workers are not being deleted when no longer needed, the subprocess count grows.
Diagnose:
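# Count Python subprocesses (the [p] keeps grep from counting its own process)
ps aux | grep -c "[p]ython3.*worker"
# Compare with the worker count (no -t, so the piped output carries no TTY control characters)
docker exec dapr_redis_1 redis-cli --scan --pattern 'worker|*' | wc -l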
Fix:
- Ensure your application calls DeleteWorker when workers are no longer needed
- Configure a state store TTL on worker records (recommended: 24h)
- Increase the WorkManager's memory limits if the growth reflects legitimate load
Recovery procedures
Restart all workers
# Stop all workers (they will be re-created on next start)
docker exec dapr_redis_1 redis-cli --scan --pattern 'worker|*' | xargs -L 100 docker exec -i dapr_redis_1 redis-cli DEL
# Restart the WorkManager
kubectl rollout restart deployment/virtufin-workmanager -n virtufin
kubectl rollout status deployment/virtufin-workmanager -n virtufin
Force recovery re-run
# Restart the WorkManager — recovery runs on startup
kubectl rollout restart deployment/virtufin-workmanager -n virtufin
kubectl rollout status deployment/virtufin-workmanager -n virtufin
# Watch the recovery logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-workmanager --tail=-1 -f | grep -i recovery
Reporting issues
If none of the above resolves your issue, gather the following and file a Gitea issue:
- WorkManager version (LIBRARY_VERSION from the build info endpoint)
- Worker ID(s) affected
- Engine type (Python / C# source / .NET DLL)
- Worker code (or a minimal reproduction)
- Full error message and gRPC status code
- Relevant log lines from the WorkManager
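The cluster-side part of this checklist can be captured in one step. The sketch below is a convenience, not an official support tool: the function name `collect_wm_diagnostics` and the output file names are hypothetical. It covers only pod status, logs, and namespace events. The version, worker IDs, engine type, and worker code still need to be added by hand.

```shell
# collect_wm_diagnostics [OUT_DIR]
# Writes pod status, full WorkManager logs, and namespace events into OUT_DIR and prints its path.
collect_wm_diagnostics() {
    out_dir=${1:-wm-diagnostics}
    selector=app.kubernetes.io/name=virtufin-workmanager
    mkdir -p "$out_dir"
    kubectl get pods -n virtufin -l "$selector" -o wide > "$out_dir/pods.txt"
    # --tail=-1 avoids the 10-line default that applies when a selector is used.
    kubectl logs -n virtufin -l "$selector" --tail=-1 --prefix --timestamps > "$out_dir/workmanager.log"
    kubectl get events -n virtufin --sort-by=.lastTimestamp > "$out_dir/events.txt"
    echo "$out_dir"
}
```

Review the collected logs for secrets or customer data before attaching them to an issue.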