Chapter 10: Deployment
Taking PCCL from Lab to Production
🚀 The Restaurant Analogy
A recipe that works in your kitchen might fail in a restaurant. You need consistent ingredients, trained staff, backup plans, and health inspections.
Deploying PCCL is similar: what works on your laptop needs hardening for production!
Deployment Checklist
💡 Before You Deploy
- ☐ Network connectivity verified between all peers
- ☐ Firewall rules configured (ports open)
- ☐ Master node has stable IP/DNS
- ☐ Monitoring and alerting set up
- ☐ Checkpoint storage configured
- ☐ Rollback plan documented
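The first checklist item can be automated. Here is a minimal pre-flight sketch that verifies TCP reachability of the master's ports before starting any workers; the host name and port list are illustrative, substitute your own deployment's values.

```python
# Pre-deployment connectivity check: confirm each required TCP port on the
# master is reachable before launching workers. Illustrative sketch only.
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(host: str, ports: list[int]) -> dict[int, bool]:
    """Check every required port; returns a port -> reachable map."""
    return {p: port_reachable(host, p) for p in ports}
```

Run this from each worker node against the master (e.g. `preflight("master.example.com", [7117, 7118])`) and fail the deployment early if any port reports blocked.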
Network Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Bandwidth | 100 Mbps | 1 Gbps+ |
| Latency (RTT) | <500ms | <100ms |
| Packet Loss | <5% | <0.1% |
| Jitter | <50ms | <10ms |
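To check your link against the table, you can sample TCP connect times and summarize them. This is a rough sketch, not a replacement for proper tools like `ping` or `iperf`; the jitter estimate here is simply the standard deviation of the RTT samples.

```python
# Rough RTT/jitter probe: measures TCP connect time to the master port
# and summarizes it against the thresholds in the requirements table.
import socket
import statistics
import time

def probe_rtt_ms(host: str, port: int, samples: int = 10) -> list[float]:
    """Collect TCP connect round-trip times in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=2.0):
                times.append((time.perf_counter() - start) * 1000)
        except OSError:
            pass  # a dropped sample counts toward packet loss, not latency
    return times

def summarize(rtts_ms: list[float]) -> dict:
    """Mean RTT and jitter (stdev), comparable to the table's limits."""
    return {
        "rtt_ms": statistics.mean(rtts_ms),
        "jitter_ms": statistics.stdev(rtts_ms) if len(rtts_ms) > 1 else 0.0,
    }
```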
Port Configuration
```
# Required ports (default)
# Master coordination: TCP 7117
# P2P data transfer:   TCP 7118-7150 (range for multiple peers)
# Heartbeat:           UDP 7117

# Firewall rules (iptables example)
iptables -A INPUT -p tcp --dport 7117:7150 -j ACCEPT
iptables -A INPUT -p udp --dport 7117 -j ACCEPT
```
```
# Cloud security groups (AWS example)
# Note: 0.0.0.0/0 opens these ports to the entire internet;
# restrict the CIDR to your peers' address ranges in production.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --protocol tcp \
  --port 7117-7150 \
  --cidr 0.0.0.0/0
```
Deployment Architectures
Option 1: Single Master (Simple)
```
Single Master Deployment
─────────────────────────────────────────────────
                ┌─────────────┐
                │   MASTER    │
                │  (stable)   │
                └──────┬──────┘
                       │
     ┌─────────────────┼─────────────────┐
     │                 │                 │
     ▼                 ▼                 ▼
┌─────────┐       ┌─────────┐       ┌─────────┐
│ Worker1 │◄─────►│ Worker2 │◄─────►│ Worker3 │
└─────────┘       └─────────┘       └─────────┘

            P2P mesh for data transfer
```
Pros: Simple, easy to debug
Cons: Master is single point of failure
Option 2: Master with Failover
```
Master Failover Deployment
─────────────────────────────────────────────────
┌─────────────┐         ┌─────────────┐
│   MASTER    │◄───────►│   STANDBY   │
│  (active)   │  sync   │  (passive)  │
└──────┬──────┘         └──────┬──────┘
       │                       │
       └───────────┬───────────┘
                   │  VIP/DNS failover
                   ▼
  ┌─────────────────────────────────┐
  │             Workers             │
  └─────────────────────────────────┘
```
Pros: High availability
Cons: More complex, needs shared state
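The failover decision itself is simple to state: promote the standby only after several consecutive missed heartbeats, so one lost packet does not cause flapping. A toy sketch of that logic (the threshold and the promotion mechanism are illustrative; in a real deployment promotion repoints the VIP or DNS record):

```python
class FailoverMonitor:
    """Tracks consecutive missed heartbeats from the active master and
    promotes the standby at a threshold, to avoid failover flapping."""

    def __init__(self, active: str, standby: str, threshold: int = 3):
        self.active = active
        self.standby = standby
        self.threshold = threshold
        self.missed = 0

    def heartbeat(self, ok: bool) -> str:
        """Record one heartbeat round; return the current active master."""
        if ok:
            self.missed = 0  # any success resets the counter
        else:
            self.missed += 1
            if self.missed >= self.threshold:
                # promote: a real deployment would repoint VIP/DNS here
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active
```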
Option 3: Kubernetes Deployment
```yaml
# pccl-master.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pccl-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pccl-master
  template:
    metadata:
      labels:
        app: pccl-master        # must match the selector above
    spec:
      containers:
        - name: master
          image: pccl/master:latest   # pin a specific tag in production
          ports:
            - containerPort: 7117
          env:
            - name: PCCL_ROLE
              value: "master"
---
apiVersion: v1
kind: Service
metadata:
  name: pccl-master
spec:
  selector:
    app: pccl-master
  ports:
    - port: 7117
      targetPort: 7117
  type: LoadBalancer
```
Configuration Best Practices
```yaml
# pccl_config.yaml
network:
  master_host: "master.example.com"
  master_port: 7117
  heartbeat_interval_ms: 100
  heartbeat_timeout_ms: 500

fault_tolerance:
  enable_buffer_backup: true
  checkpoint_interval_steps: 100
  max_recovery_attempts: 3

performance:
  ring_chunk_size: 4194304    # 4 MB chunks
  num_io_threads: 4
  enable_compression: false   # disable for LAN

logging:
  level: "INFO"
  file: "/var/log/pccl/worker.log"
  max_size_mb: 100
```
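Misconfigured values (for example, a heartbeat timeout no larger than the interval) cause confusing runtime failures, so it pays to validate the loaded config before starting. A minimal sketch over a dict shaped like `pccl_config.yaml`; the specific rules are illustrative assumptions, not ones PCCL itself is known to enforce:

```python
# Sanity checks for a loaded config dict shaped like pccl_config.yaml.
# The rules below are illustrative examples of useful invariants.
def validate(cfg: dict) -> list[str]:
    """Return a list of human-readable config errors (empty if valid)."""
    errors = []
    net = cfg.get("network", {})
    if net.get("heartbeat_timeout_ms", 0) <= net.get("heartbeat_interval_ms", 0):
        errors.append("heartbeat_timeout_ms must exceed heartbeat_interval_ms")
    ft = cfg.get("fault_tolerance", {})
    if ft.get("checkpoint_interval_steps", 1) <= 0:
        errors.append("checkpoint_interval_steps must be positive")
    perf = cfg.get("performance", {})
    if perf.get("ring_chunk_size", 1) <= 0:
        errors.append("ring_chunk_size must be positive")
    return errors
```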
Monitoring Setup
⚠️ Key Metrics to Monitor
- peer_count: Number of active peers (alert if drops)
- collective_latency_p99: Operation latency (alert if spikes)
- recovery_count: Number of recoveries (alert if frequent)
- throughput_gbps: Data transfer rate
- buffer_backup_size: Memory used for fault tolerance
```yaml
# prometheus.yml
# PCCL exposes metrics at :9090/metrics
scrape_configs:
  - job_name: 'pccl'
    static_configs:
      - targets: ['worker1:9090', 'worker2:9090', 'worker3:9090']
```

```
# Grafana dashboard query examples

# Peer count over time
pccl_peer_count

# P99 latency (rate over a window, then quantile over the buckets)
histogram_quantile(0.99, rate(pccl_collective_latency_bucket[5m]))

# Recovery rate
rate(pccl_recovery_total[5m])
```
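It helps to know what `histogram_quantile` actually computes: Prometheus histograms are cumulative bucket counts with `le` upper bounds, and the quantile is linearly interpolated inside the bucket where the target rank falls. A miniature reimplementation of that logic:

```python
# What histogram_quantile(q, ...) does, in miniature. Buckets are
# (le_upper_bound, cumulative_count) pairs sorted by bound; the last
# bucket's bound is +Inf.
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower_bound  # rank fell in the open-ended bucket
            # linear interpolation within this bucket
            frac = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * frac
        lower_bound, lower_count = le, count
    return lower_bound
```

The practical consequence: the estimate's accuracy depends entirely on bucket boundaries, so configure latency buckets around the thresholds you actually alert on.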
Troubleshooting Guide
| Symptom | Likely Cause | Solution |
|---|---|---|
| Peers can't connect | Firewall blocking | Check ports 7117-7150 |
| Frequent recoveries | Network instability | Increase heartbeat timeout |
| Low throughput | Small chunk size | Increase ring_chunk_size |
| High latency | Too many peers | Use hierarchical topology |
| OOM errors | Buffer backup too large | Reduce model size or disable backup |
| Master unreachable | DNS/IP changed | Use stable DNS or static IP |
Security Considerations
⚠️ Production Security
- TLS: Enable encryption for all connections
- Authentication: Use shared secrets or certificates
- Network isolation: Use VPN or private network
- Access control: Limit who can join the cluster
```yaml
# Enable TLS
network:
  tls:
    enabled: true
    cert_file: "/etc/pccl/server.crt"
    key_file: "/etc/pccl/server.key"
    ca_file: "/etc/pccl/ca.crt"

# Shared secret authentication
auth:
  enabled: true
  secret: "${PCCL_AUTH_SECRET}"  # from environment
```
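To make the shared-secret idea concrete, here is one common way such authentication works: a challenge-response handshake where the joining peer proves knowledge of the secret without sending it. This is an illustrative sketch, not PCCL's actual wire protocol.

```python
# Illustrative challenge-response auth with a shared secret: the master
# sends a random challenge, the peer replies with HMAC(secret, challenge),
# and the master verifies in constant time. Not PCCL's actual protocol.
import hashlib
import hmac
import os

def make_challenge() -> bytes:
    """Fresh random challenge, so responses cannot be replayed."""
    return os.urandom(32)

def answer(secret: bytes, challenge: bytes) -> bytes:
    """Peer's proof of knowing the secret."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()

def verify(secret: bytes, challenge: bytes, response: bytes) -> bool:
    """Constant-time comparison avoids timing side channels."""
    return hmac.compare_digest(answer(secret, challenge), response)
```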
Scaling Guidelines
Recommended Configurations by Scale

| Peers | Topology | Master Resources | Notes |
|---|---|---|---|
| 2-8 | Flat ring | 1 CPU, 1 GB RAM | Simple setup |
| 8-32 | Flat ring | 2 CPU, 4 GB RAM | Monitor latency |
| 32-128 | Hierarchical | 4 CPU, 8 GB RAM | Group by region |
| 128+ | Multi-level | 8 CPU, 16 GB RAM | Expert tuning needed |
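When choosing between these rows, a back-of-envelope cost model helps. The standard ring all-reduce model says each peer transfers 2(N-1)/N of the buffer and pays 2(N-1) latency hops; real numbers vary with overlap and implementation, so treat this as a rough estimate:

```python
# Back-of-envelope ring all-reduce time: transfer term plus latency term.
# Standard ring model; real systems overlap communication and compute.
def ring_allreduce_seconds(n_peers: int, buffer_bytes: int,
                           bandwidth_bps: float, rtt_s: float) -> float:
    """Estimate one all-reduce: 2(N-1)/N of the buffer over the link,
    plus 2(N-1) sequential latency hops around the ring."""
    if n_peers < 2:
        return 0.0
    bytes_per_sec = bandwidth_bps / 8
    transfer = 2 * (n_peers - 1) / n_peers * buffer_bytes / bytes_per_sec
    latency = 2 * (n_peers - 1) * rtt_s
    return transfer + latency
```

Note how the latency term grows linearly with peer count while the transfer term saturates near 2x the buffer size; this is why large flat rings on high-RTT links motivate the hierarchical topologies above.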
✏️ Deployment Planning
You need to deploy PCCL for 16 workers across 2 data centers (8 each).
- What topology would you use?
- Where should the master be located?
- What's your failover strategy?
Suggested answers:
1. Hierarchical: local rings within each DC, then cross-DC sync
2. In the DC with more stable network, or use DNS failover
3. Standby master in other DC, shared state via database