Chapter 10: Deployment
Taking PCCL from Lab to Production
🚀 The Restaurant Analogy
A recipe that works in your kitchen might fail in a restaurant. You need consistent ingredients, trained staff, backup plans, and health inspections.
Deploying PCCL is similar: what works on your laptop needs hardening for production!
Deployment Checklist
💡 Before You Deploy
- ☐ Network connectivity verified between all peers
- ☐ Firewall rules configured (ports open)
- ☐ Master node has stable IP/DNS
- ☐ Monitoring and alerting set up
- ☐ Checkpoint storage configured
- ☐ Rollback plan documented
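The first checklist item can be automated. Here is a minimal pre-flight sketch that verifies TCP reachability of the master's ports before starting any workers; the host name and port list are illustrative, substitute your own deployment's values.

```python
# Pre-deployment connectivity check: confirm each required TCP port on the
# master is reachable before launching workers. Illustrative sketch only.
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(host: str, ports: list[int]) -> dict[int, bool]:
    """Check every required port; returns a port -> reachable map."""
    return {p: port_reachable(host, p) for p in ports}
```

Run this from each worker node against the master (e.g. `preflight("master.example.com", [7117, 7118])`) and fail the deployment early if any port reports blocked.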
Network Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Bandwidth | 100 Mbps | 1 Gbps+ |
| Latency (RTT) | <500ms | <100ms |
| Packet Loss | <5% | <0.1% |
| Jitter | <50ms | <10ms |
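To check your link against the table, you can sample TCP connect times and summarize them. This is a rough sketch, not a replacement for proper tools like `ping` or `iperf`; the jitter estimate here is simply the standard deviation of the RTT samples.

```python
# Rough RTT/jitter probe: measures TCP connect time to the master port
# and summarizes it against the thresholds in the requirements table.
import socket
import statistics
import time

def probe_rtt_ms(host: str, port: int, samples: int = 10) -> list[float]:
    """Collect TCP connect round-trip times in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=2.0):
                times.append((time.perf_counter() - start) * 1000)
        except OSError:
            pass  # a dropped sample counts toward packet loss, not latency
    return times

def summarize(rtts_ms: list[float]) -> dict:
    """Mean RTT and jitter (stdev), comparable to the table's limits."""
    return {
        "rtt_ms": statistics.mean(rtts_ms),
        "jitter_ms": statistics.stdev(rtts_ms) if len(rtts_ms) > 1 else 0.0,
    }
```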
Port Configuration
```
# Required ports (default)
# Master coordination: TCP 7117
# P2P data transfer:   TCP 7118-7150 (range for multiple peers)
# Heartbeat:           UDP 7117

# Firewall rules (iptables example)
iptables -A INPUT -p tcp --dport 7117:7150 -j ACCEPT
iptables -A INPUT -p udp --dport 7117 -j ACCEPT
```
```
# Cloud security groups (AWS example)
# Note: 0.0.0.0/0 opens these ports to the entire internet;
# restrict the CIDR to your peers' address ranges in production.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --protocol tcp \
  --port 7117-7150 \
  --cidr 0.0.0.0/0
```
Deployment Architectures
Option 1: Single Master (Simple)
```
Single Master Deployment
─────────────────────────────────────────────────
                ┌─────────────┐
                │   MASTER    │
                │  (stable)   │
                └──────┬──────┘
                       │
     ┌─────────────────┼─────────────────┐
     │                 │                 │
     ▼                 ▼                 ▼
┌─────────┐       ┌─────────┐       ┌─────────┐
│ Worker1 │◄─────►│ Worker2 │◄─────►│ Worker3 │
└─────────┘       └─────────┘       └─────────┘

            P2P mesh for data transfer
```
Pros: Simple, easy to debug
Cons: Master is single point of failure
Option 2: Master with Failover
```
Master Failover Deployment
─────────────────────────────────────────────────
┌─────────────┐         ┌─────────────┐
│   MASTER    │◄───────►│   STANDBY   │
│  (active)   │  sync   │  (passive)  │
└──────┬──────┘         └──────┬──────┘
       │                       │
       └───────────┬───────────┘
                   │  VIP/DNS failover
                   ▼
  ┌─────────────────────────────────┐
  │             Workers             │
  └─────────────────────────────────┘
```
Pros: High availability
Cons: More complex, needs shared state
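The failover decision itself is simple to state: promote the standby only after several consecutive missed heartbeats, so one lost packet does not cause flapping. A toy sketch of that logic (the threshold and the promotion mechanism are illustrative; in a real deployment promotion repoints the VIP or DNS record):

```python
class FailoverMonitor:
    """Tracks consecutive missed heartbeats from the active master and
    promotes the standby at a threshold, to avoid failover flapping."""

    def __init__(self, active: str, standby: str, threshold: int = 3):
        self.active = active
        self.standby = standby
        self.threshold = threshold
        self.missed = 0

    def heartbeat(self, ok: bool) -> str:
        """Record one heartbeat round; return the current active master."""
        if ok:
            self.missed = 0  # any success resets the counter
        else:
            self.missed += 1
            if self.missed >= self.threshold:
                # promote: a real deployment would repoint VIP/DNS here
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active
```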
Option 3: Kubernetes Deployment
```yaml
# pccl-master.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pccl-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pccl-master
  template:
    metadata:
      labels:
        app: pccl-master        # must match the selector above
    spec:
      containers:
        - name: master
          image: pccl/master:latest   # pin a specific tag in production
          ports:
            - containerPort: 7117
          env:
            - name: PCCL_ROLE
              value: "master"
---
apiVersion: v1
kind: Service
metadata:
  name: pccl-master
spec:
  selector:
    app: pccl-master
  ports:
    - port: 7117
      targetPort: 7117
  type: LoadBalancer
```
Configuration Best Practices
```yaml
# pccl_config.yaml
network:
  master_host: "master.example.com"
  master_port: 7117
  heartbeat_interval_ms: 100
  heartbeat_timeout_ms: 500

fault_tolerance:
  enable_buffer_backup: true
  checkpoint_interval_steps: 100
  max_recovery_attempts: 3

performance:
  ring_chunk_size: 4194304    # 4 MB chunks
  num_io_threads: 4
  enable_compression: false   # disable for LAN

logging:
  level: "INFO"
  file: "/var/log/pccl/worker.log"
  max_size_mb: 100
```
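Misconfigured values (for example, a heartbeat timeout no larger than the interval) cause confusing runtime failures, so it pays to validate the loaded config before starting. A minimal sketch over a dict shaped like `pccl_config.yaml`; the specific rules are illustrative assumptions, not ones PCCL itself is known to enforce:

```python
# Sanity checks for a loaded config dict shaped like pccl_config.yaml.
# The rules below are illustrative examples of useful invariants.
def validate(cfg: dict) -> list[str]:
    """Return a list of human-readable config errors (empty if valid)."""
    errors = []
    net = cfg.get("network", {})
    if net.get("heartbeat_timeout_ms", 0) <= net.get("heartbeat_interval_ms", 0):
        errors.append("heartbeat_timeout_ms must exceed heartbeat_interval_ms")
    ft = cfg.get("fault_tolerance", {})
    if ft.get("checkpoint_interval_steps", 1) <= 0:
        errors.append("checkpoint_interval_steps must be positive")
    perf = cfg.get("performance", {})
    if perf.get("ring_chunk_size", 1) <= 0:
        errors.append("ring_chunk_size must be positive")
    return errors
```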
Monitoring Setup
⚠️ Key Metrics to Monitor
- peer_count: Number of active peers (alert if drops)
- collective_latency_p99: Operation latency (alert if spikes)
- recovery_count: Number of recoveries (alert if frequent)
- throughput_gbps: Data transfer rate
- buffer_backup_size: Memory used for fault tolerance
```yaml
# prometheus.yml
# PCCL exposes metrics at :9090/metrics
scrape_configs:
  - job_name: 'pccl'
    static_configs:
      - targets: ['worker1:9090', 'worker2:9090', 'worker3:9090']
```

```
# Grafana dashboard query examples

# Peer count over time
pccl_peer_count

# P99 latency (rate over a window, then quantile over the buckets)
histogram_quantile(0.99, rate(pccl_collective_latency_bucket[5m]))

# Recovery rate
rate(pccl_recovery_total[5m])
```
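It helps to know what `histogram_quantile` actually computes: Prometheus histograms are cumulative bucket counts with `le` upper bounds, and the quantile is linearly interpolated inside the bucket where the target rank falls. A miniature reimplementation of that logic:

```python
# What histogram_quantile(q, ...) does, in miniature. Buckets are
# (le_upper_bound, cumulative_count) pairs sorted by bound; the last
# bucket's bound is +Inf.
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower_bound  # rank fell in the open-ended bucket
            # linear interpolation within this bucket
            frac = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * frac
        lower_bound, lower_count = le, count
    return lower_bound
```

The practical consequence: the estimate's accuracy depends entirely on bucket boundaries, so configure latency buckets around the thresholds you actually alert on.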
Troubleshooting Guide
| Symptom | Likely Cause | Solution |
|---|---|---|
| Peers can't connect | Firewall blocking | Check ports 7117-7150 |
| Frequent recoveries | Network instability | Increase heartbeat timeout |
| Low throughput | Small chunk size | Increase ring_chunk_size |
| High latency | Too many peers | Use hierarchical topology |
| OOM errors | Buffer backup too large | Reduce model size or disable backup |
| Master unreachable | DNS/IP changed | Use stable DNS or static IP |
Security Considerations
⚠️ Production Security
- TLS: Enable encryption for all connections
- Authentication: Use shared secrets or certificates
- Network isolation: Use VPN or private network
- Access control: Limit who can join the cluster
```yaml
# Enable TLS
network:
  tls:
    enabled: true
    cert_file: "/etc/pccl/server.crt"
    key_file: "/etc/pccl/server.key"
    ca_file: "/etc/pccl/ca.crt"

# Shared secret authentication
auth:
  enabled: true
  secret: "${PCCL_AUTH_SECRET}"  # from environment
```
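To make the shared-secret idea concrete, here is one common way such authentication works: a challenge-response handshake where the joining peer proves knowledge of the secret without sending it. This is an illustrative sketch, not PCCL's actual wire protocol.

```python
# Illustrative challenge-response auth with a shared secret: the master
# sends a random challenge, the peer replies with HMAC(secret, challenge),
# and the master verifies in constant time. Not PCCL's actual protocol.
import hashlib
import hmac
import os

def make_challenge() -> bytes:
    """Fresh random challenge, so responses cannot be replayed."""
    return os.urandom(32)

def answer(secret: bytes, challenge: bytes) -> bytes:
    """Peer's proof of knowing the secret."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()

def verify(secret: bytes, challenge: bytes, response: bytes) -> bool:
    """Constant-time comparison avoids timing side channels."""
    return hmac.compare_digest(answer(secret, challenge), response)
```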
Scaling Guidelines
Recommended Configurations by Scale

| Peers | Topology | Master Resources | Notes |
|---|---|---|---|
| 2-8 | Flat ring | 1 CPU, 1 GB RAM | Simple setup |
| 8-32 | Flat ring | 2 CPU, 4 GB RAM | Monitor latency |
| 32-128 | Hierarchical | 4 CPU, 8 GB RAM | Group by region |
| 128+ | Multi-level | 8 CPU, 16 GB RAM | Expert tuning needed |
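When choosing between these rows, a back-of-envelope cost model helps. The standard ring all-reduce model says each peer transfers 2(N-1)/N of the buffer and pays 2(N-1) latency hops; real numbers vary with overlap and implementation, so treat this as a rough estimate:

```python
# Back-of-envelope ring all-reduce time: transfer term plus latency term.
# Standard ring model; real systems overlap communication and compute.
def ring_allreduce_seconds(n_peers: int, buffer_bytes: int,
                           bandwidth_bps: float, rtt_s: float) -> float:
    """Estimate one all-reduce: 2(N-1)/N of the buffer over the link,
    plus 2(N-1) sequential latency hops around the ring."""
    if n_peers < 2:
        return 0.0
    bytes_per_sec = bandwidth_bps / 8
    transfer = 2 * (n_peers - 1) / n_peers * buffer_bytes / bytes_per_sec
    latency = 2 * (n_peers - 1) * rtt_s
    return transfer + latency
```

Note how the latency term grows linearly with peer count while the transfer term saturates near 2x the buffer size; this is why large flat rings on high-RTT links motivate the hierarchical topologies above.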
✏️ Deployment Planning
You need to deploy PCCL for 16 workers across 2 data centers (8 each).
- What topology would you use?
- Where should the master be located?
- What's your failover strategy?
Suggested answers:
1. Hierarchical: local rings within each DC, then cross-DC sync
2. In the DC with more stable network, or use DNS failover
3. Standby master in other DC, shared state via database