Chapter 10: Deployment

Taking PCCL from Lab to Production

🚀 The Restaurant Analogy

A recipe that works in your kitchen might fail in a restaurant. You need: consistent ingredients, trained staff, backup plans, health inspections.

Deploying PCCL is similar: what works on your laptop needs hardening for production!

Deployment Checklist

💡 Before You Deploy

Network Requirements

Requirement      Minimum     Recommended
───────────      ───────     ───────────
Bandwidth        100 Mbps    1 Gbps+
Latency (RTT)    <500 ms     <100 ms
Packet Loss      <5%         <0.1%
Jitter           <50 ms      <10 ms

Port Configuration

# Required ports (default)
Master coordination:  TCP 7117
P2P data transfer:    TCP 7118-7150 (range for multiple peers)
Heartbeat:            UDP 7117

# Firewall rules (iptables example)
iptables -A INPUT -p tcp --dport 7117:7150 -j ACCEPT
iptables -A INPUT -p udp --dport 7117 -j ACCEPT

# Cloud security groups (AWS example)
# Note: --cidr 0.0.0.0/0 opens these ports to the entire internet.
# Restrict the CIDR to your workers' subnet in production.
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxx \
    --protocol tcp \
    --port 7117-7150 \
    --cidr 10.0.0.0/16   # example worker subnet; avoid 0.0.0.0/0
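Before blaming PCCL for connection failures, it helps to confirm the ports are actually reachable from a worker. A minimal sketch using bash's built-in /dev/tcp (the hostname is a placeholder; substitute your master's real address):

```shell
check_port() {
    # Succeeds (exit 0) if a TCP connection to host $1, port $2 opens within 3s.
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# master.example.com is a placeholder for your master's address.
if check_port master.example.com 7117; then
    echo "master port 7117 reachable"
else
    echo "master port 7117 blocked or host down"
fi
```

Run the same check against the P2P range (7118-7150) between worker pairs; a firewall that passes the master port but blocks the P2P range produces confusing partial failures.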

Deployment Architectures

Option 1: Single Master (Simple)

Single Master Deployment
─────────────────────────────────────────────────────────────────────────

                        ┌─────────────┐
                        │   MASTER    │
                        │  (stable)   │
                        └──────┬──────┘
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
             ▼                 ▼                 ▼
        ┌─────────┐       ┌─────────┐       ┌─────────┐
        │ Worker1 │◄─────►│ Worker2 │◄─────►│ Worker3 │
        └─────────┘       └─────────┘       └─────────┘
                  P2P mesh for data transfer

Pros: Simple, easy to debug
Cons: Master is single point of failure

Option 2: Master with Failover

Master Failover Deployment
─────────────────────────────────────────────────────────────────────────

        ┌─────────────┐         ┌─────────────┐
        │   MASTER    │◄───────►│   STANDBY   │
        │  (active)   │  sync   │  (passive)  │
        └──────┬──────┘         └──────┬──────┘
               │                       │
               └───────────┬───────────┘
                           │ VIP/DNS failover
                           ▼
        ┌─────────────────────────────────┐
        │             Workers             │
        └─────────────────────────────────┘

Pros: High availability
Cons: More complex, needs shared state

Option 3: Kubernetes Deployment

# pccl-master.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pccl-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pccl-master
  template:
    metadata:
      labels:
        app: pccl-master   # must match the selector above
    spec:
      containers:
      - name: master
        image: pccl/master:latest
        ports:
        - containerPort: 7117
        env:
        - name: PCCL_ROLE
          value: "master"
---
apiVersion: v1
kind: Service
metadata:
  name: pccl-master
spec:
  selector:
    app: pccl-master
  ports:
  - port: 7117
    targetPort: 7117
  type: LoadBalancer

Configuration Best Practices

# pccl_config.yaml
network:
  master_host: "master.example.com"
  master_port: 7117
  heartbeat_interval_ms: 100
  heartbeat_timeout_ms: 500

fault_tolerance:
  enable_buffer_backup: true
  checkpoint_interval_steps: 100
  max_recovery_attempts: 3

performance:
  ring_chunk_size: 4194304  # 4MB chunks
  num_io_threads: 4
  enable_compression: false  # disable for LAN

logging:
  level: "INFO"
  file: "/var/log/pccl/worker.log"
  max_size_mb: 100
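The heartbeat values above are related: the timeout should span several intervals so a single delayed packet does not trigger a full recovery. A heuristic sketch (the tolerance and jitter margin are illustrative assumptions, not official PCCL tuning rules):

```shell
# Derive a heartbeat timeout from the interval plus headroom.
INTERVAL_MS=100          # matches heartbeat_interval_ms above
MISSED_BEATS_ALLOWED=4   # assumed tolerance before declaring a peer dead
JITTER_MARGIN_MS=100     # headroom for network jitter

TIMEOUT_MS=$(( INTERVAL_MS * MISSED_BEATS_ALLOWED + JITTER_MARGIN_MS ))
echo "heartbeat_timeout_ms: ${TIMEOUT_MS}"
```

With the defaults this yields 500 ms, matching the sample config; on WAN links with higher jitter, scale the margin up rather than shrinking the interval.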

Monitoring Setup

⚠️ Key Metrics to Monitor

# Prometheus metrics endpoint
# PCCL exposes metrics at :9090/metrics

# prometheus.yml
scrape_configs:
  - job_name: 'pccl'
    static_configs:
      - targets: ['worker1:9090', 'worker2:9090', 'worker3:9090']

# Grafana dashboard query examples
# Peer count over time
pccl_peer_count

# P99 latency (aggregate bucket rates across labels before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(pccl_collective_latency_bucket[5m])))

# Recovery rate
rate(pccl_recovery_total[5m])
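For a quick sanity check on a single worker, you can scrape the metrics endpoint directly and pull out the peer count without going through Prometheus. A sketch (the metric name comes from the queries above; the sample scrape text is illustrative):

```shell
peer_count() {
    # Extract the pccl_peer_count value from Prometheus text format on stdin;
    # print 0 if the metric is absent.
    awk '/^pccl_peer_count/ {print $2; found=1} END {if (!found) print 0}'
}

# Against a live worker you would scrape the endpoint directly:
#   curl -sf http://worker1:9090/metrics | peer_count
# Here we feed a sample scrape to show the expected format:
printf 'pccl_peer_count 3\npccl_recovery_total 1\n' | peer_count
```

A peer count below the expected cluster size is usually the first visible symptom of the firewall and heartbeat problems listed in the troubleshooting table below.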

Troubleshooting Guide

Symptom               Likely Cause              Solution
───────               ────────────              ────────
Peers can't connect   Firewall blocking         Check ports 7117-7150
Frequent recoveries   Network instability       Increase heartbeat timeout
Low throughput        Small chunk size          Increase ring_chunk_size
High latency          Too many peers            Use hierarchical topology
OOM errors            Buffer backup too large   Reduce model size or disable backup
Master unreachable    DNS/IP changed            Use stable DNS or static IP

Security Considerations

⚠️ Production Security

# Enable TLS
network:
  tls:
    enabled: true
    cert_file: "/etc/pccl/server.crt"
    key_file: "/etc/pccl/server.key"
    ca_file: "/etc/pccl/ca.crt"
  
  # Shared secret authentication
  auth:
    enabled: true
    secret: "${PCCL_AUTH_SECRET}"  # from environment
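The cert, key, and CA files referenced above have to come from somewhere. For a test environment you can mint a private CA and server certificate yourself; a sketch with openssl (the CN values and 365-day validity are placeholders, and production deployments should use certificates from your organization's real CA):

```shell
# Private test CA (self-signed)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -keyout ca.key -out ca.crt -subj "/CN=pccl-test-ca"

# Server key and certificate signing request
openssl req -newkey rsa:2048 -nodes \
    -keyout server.key -out server.csr -subj "/CN=master.example.com"

# Sign the server cert with the test CA
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -days 365 -out server.crt

# Sanity check: the chain should verify
openssl verify -CAfile ca.crt server.crt
```

Copy the resulting files to the paths in the config above (/etc/pccl/) and restrict permissions on the private key (chmod 600 server.key).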

Scaling Guidelines

Recommended Configurations by Scale
─────────────────────────────────────────────────────────────────────────
Peers     Topology        Master Resources    Notes
─────     ────────        ────────────────    ─────
2-8       Flat ring       1 CPU, 1GB RAM      Simple setup
8-32      Flat ring       2 CPU, 4GB RAM      Monitor latency
32-128    Hierarchical    4 CPU, 8GB RAM      Group by region
128+      Multi-level     8 CPU, 16GB RAM     Expert tuning needed

✏️ Deployment Planning

You need to deploy PCCL for 16 workers across 2 data centers (8 each).

  1. What topology would you use?
  2. Where should the master be located?
  3. What's your failover strategy?
Suggested answers:
1. Hierarchical: local rings within each DC, then cross-DC sync
2. In the DC with more stable network, or use DNS failover
3. Standby master in other DC, shared state via database