# System Architecture Documentation

## System Design Overview

The 8K Motion Tracking and Voxel Processing System is designed as a distributed, multi-layer architecture optimized for real-time processing of high-resolution multi-modal sensor data.

### Design Principles

1. **Modularity**: Each component is independently testable and replaceable
2. **Scalability**: Horizontal scaling across multiple GPU nodes
3. **Fault Tolerance**: Automatic failover and recovery mechanisms
4. **Performance**: CUDA acceleration and zero-copy data transfers
5. **Extensibility**: Plugin architecture for new sensor types and algorithms

---

## Component Interactions

### System Layers

```
┌────────────────────────────────────────────────────────┐
│                   Application Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Tracking   │  │  Detection   │  │      3D      │  │
│  │   Service    │  │   Service    │  │  Rendering   │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                    Processing Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │    Fusion    │  │    Voxel     │  │  Detection   │  │
│  │   Manager    │  │     Grid     │  │   Tracker    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                   Distributed Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │     Task     │  │     Load     │  │    Fault     │  │
│  │  Scheduler   │  │   Balancer   │  │  Tolerance   │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                  Data Pipeline Layer                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Ring Buffers │  │Shared Memory │  │   Network    │  │
│  │ (Lock-free)  │  │ (Zero-copy)  │  │    (RDMA)    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                     Hardware Layer                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Cameras    │  │     GPUs     │  │   Network    │  │
│  │ (GigE/USB3)  │  │    (CUDA)    │  │  (10GbE/IB)  │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────┘
```

---

## Detailed Component Architecture

### 1. Camera Management System

**Purpose**: Manages 10 camera pairs (20 cameras total) with synchronized acquisition.

**Components**:

```
CameraManager
├── CameraInterface (x20)
│   ├── Connection Management (GigE Vision)
│   ├── Configuration (Resolution, FPS, Exposure)
│   ├── Frame Acquisition
│   └── Health Monitoring
├── CameraPair (x10)
│   ├── Stereo Calibration
│   ├── Frame Synchronization
│   └── Registration Parameters
└── Health Monitor
    ├── FPS Tracking
    ├── Temperature Monitoring
    ├── Packet Loss Detection
    └── Error Recovery
```

**Interaction Flow**:

1. **Initialization**: Connect to cameras via the GigE Vision protocol
2. **Configuration**: Set resolution (7680x4320), frame rate (30 FPS), and trigger mode
3. **Acquisition**: Hardware-triggered synchronized frame capture
4. **Monitoring**: Continuous health checks (FPS, temperature, packet loss)
5. **Recovery**: Automatic reconnection on failure (see the sketch below)
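The monitoring and recovery loop can be sketched as follows. This is a minimal illustration, not the production implementation: the `CameraInterface` method names (`fps()`, `packet_loss()`, `temperature_ok()`, `reconnect()`) are hypothetical stand-ins, while the thresholds mirror the documented limits.

```python
import time

# Thresholds mirroring the documented limits.
MIN_FPS = 29.0           # target frame rate is 30 FPS
MAX_PACKET_LOSS = 0.001  # documented tolerance: 0.1%
CHECK_INTERVAL_S = 1.0   # documented health check frequency: 1 Hz

def monitor_cameras(cameras):
    """Poll every camera at 1 Hz and reconnect unhealthy ones.

    `cameras` is a list of objects exposing fps(), packet_loss(),
    temperature_ok(), and reconnect(); these names are hypothetical
    and used only for illustration.
    """
    while True:
        for cam in cameras:
            healthy = (
                cam.fps() >= MIN_FPS
                and cam.packet_loss() <= MAX_PACKET_LOSS
                and cam.temperature_ok()
            )
            if not healthy:
                # Recovery step: drop and re-establish the GigE link.
                cam.reconnect()
        time.sleep(CHECK_INTERVAL_S)
```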
**Performance Characteristics**:

- Connection time: <2 seconds per camera
- Synchronization accuracy: <1ms between camera pairs
- Health check frequency: 1 Hz
- Maximum packet loss tolerance: 0.1%

---

### 2. Video Processing Pipeline

**Purpose**: Decode and extract motion from 8K video streams in real time.

**Architecture**:

```
VideoProcessor
├── Decoder Thread
│   ├── Hardware Decoder (NVDEC/QSV)
│   ├── Codec Handler (HEVC, H.264)
│   └── Frame Buffer (Ring Buffer)
├── Motion Extractor (C++)
│   ├── Background Subtraction
│   ├── Connected Components
│   ├── Centroid Calculation
│   └── Velocity Estimation
└── Synchronization Manager
    ├── Multi-stream Sync
    ├── Timestamp Alignment
    └── Frame Dropping (if needed)
```

**Data Flow**:

```
[Video File/Stream]
         │
         ▼
[Hardware Decoder] ──────────> [Decoded Frame Buffer]
│ (HEVC/H.264)                 │
│ 5-8ms                        │
         ▼                     ▼
[Preprocessing]    ──────────> [Motion Extractor (C++)]
│ (Resize/Convert)             │ (OpenMP Parallel)
│ 2-3ms                        │ 12-18ms
         ▼                     ▼
[Frame Metadata]   <────────── [Motion Data Output]
                               ├── Coordinates
                               ├── Bounding Boxes
                               ├── Velocities
                               └── Confidence
```

**Optimization Techniques**:

- Hardware-accelerated decoding (NVDEC)
- Multi-threaded motion extraction (OpenMP)
- SIMD instructions for pixel operations
- Lock-free ring buffers for thread communication

**Performance**:

- Decode throughput: 60+ FPS (hardware) vs 15-20 FPS (software)
- Motion extraction: 35+ FPS for 8K frames
- Memory usage: ~500MB per stream

---

### 3. Fusion System

**Purpose**: Combine thermal and monochrome data for enhanced target detection.

**Architecture**:

```
FusionManager
├── Registration Engine
│   ├── Feature Detection (SIFT/ORB)
│   ├── Homography Estimation (RANSAC)
│   ├── Image Warping (OpenCV/CUDA)
│   └── Quality Metrics
├── Multi-Spectral Detector
│   ├── Thermal Detection
│   ├── Monochrome Detection
│   ├── Confidence Fusion
│   └── Cross-Validation
├── False Positive Reducer
│   ├── Signature Verification
│   ├── Spatial Consistency
│   └── Temporal Tracking
└── Worker Thread Pool
    ├── Task Queue
    ├── Result Queue
    └── Load Balancing
```

**Fusion Algorithm**:

```python
# Pseudo-code for the fusion process
reg_params = None  # cached homography from the last registration update

def fuse_frame_pair(thermal_frame, mono_frame):
    global reg_params

    # Step 1: Update registration if needed
    if reg_params is None or needs_registration_update():
        reg_params = estimate_homography(thermal_frame, mono_frame)

    # Step 2: Align images
    aligned_thermal = warp_image(thermal_frame, reg_params)

    # Step 3: Detect in both modalities
    thermal_detections = detect_thermal(aligned_thermal)
    mono_detections = detect_mono(mono_frame)

    # Step 4: Fuse detections
    fused_detections = []
    for t_det in thermal_detections:
        for m_det in mono_detections:
            if spatial_overlap(t_det, m_det) > threshold:
                confidence = fusion_confidence(t_det, m_det)
                if confidence > min_confidence:
                    fused_detections.append(
                        FusedDetection(t_det, m_det, confidence)
                    )

    # Step 5: Cross-validate to remove false positives
    validated = cross_validate(fused_detections, thermal_frame, mono_frame)

    # Step 6: Update tracks
    tracked = update_tracks(validated)

    return tracked
```

**Performance Characteristics**:

- Registration update: 1 Hz (or when quality degrades)
- Registration accuracy: <2 pixels RMSE
- False positive reduction: 40-60%
- Processing time: 8-12ms per frame pair
- Target confirmation rate: 85-95%
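The pseudo-code above leaves `spatial_overlap()` and `fusion_confidence()` abstract. One plausible realization is sketched below, assuming axis-aligned `(x, y, w, h)` bounding boxes and illustrative weights; the actual metrics and weights are not documented here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes.

    One plausible definition of spatial_overlap(); the real system
    may use a different overlap metric.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def fusion_confidence(t_det, m_det, w_thermal=0.6, w_mono=0.4):
    """Blend per-modality confidences, boosted by spatial agreement.

    The weights and the .bbox/.confidence attributes are illustrative
    assumptions, not documented values.
    """
    agreement = iou(t_det.bbox, m_det.bbox)
    blended = w_thermal * t_det.confidence + w_mono * m_det.confidence
    return blended * (0.5 + 0.5 * agreement)
```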
---

### 4. Distributed Processing System

**Purpose**: Coordinate task distribution across multiple GPU nodes.

**Architecture**:

```
DistributedProcessor
├── Cluster Manager
│   ├── Node Discovery (UDP Broadcast)
│   ├── Resource Tracking (GPU, CPU, Memory)
│   ├── Topology Optimization (Floyd-Warshall)
│   └── Heartbeat System (1 Hz)
├── Task Scheduler
│   ├── Priority Queue
│   ├── Dependency Resolution
│   ├── Task Registry
│   └── Completion Tracking
├── Load Balancer
│   ├── Worker Selection (Weighted)
│   ├── Load Monitoring
│   ├── Performance Tracking
│   └── Rebalancing Logic
├── Worker Manager
│   ├── Worker Thread Pool
│   ├── GPU Assignment
│   ├── Task Execution
│   └── Result Collection
└── Fault Tolerance
    ├── Failure Detection (Heartbeat Timeout)
    ├── Task Reassignment
    ├── Worker Recovery
    └── Failover Metrics
```

**Task Scheduling Algorithm**:

```python
# Weighted load balancing
# worker_loads and avg_execution_time are maintained by the Load Balancer.
def select_worker(available_workers, task):
    scores = []
    for worker in available_workers:
        # Current load factor (0.0 = idle, 1.0 = fully busy)
        load = worker_loads[worker.id]

        # Performance factor (based on historical execution time)
        perf = 1.0 / max(avg_execution_time[worker.id], 0.1)

        # Task priority factor
        priority = task.priority / 10.0

        # Combined score (lower is better)
        score = load - perf + priority
        scores.append((score, worker))

    # Select the worker with the lowest score
    return min(scores, key=lambda x: x[0])[1]
```

**Communication Patterns**:

1. **Master-Worker**: Task assignment and result collection
2. **Peer-to-Peer**: Direct data transfer between nodes (RDMA)
3. **Broadcast**: Cluster-wide status updates
4. **Heartbeat**: Node health monitoring

**Performance**:

- Node discovery: <2 seconds
- Task assignment latency: <1ms
- Failover time: <5 seconds
- Load imbalance detection: 5-second intervals
- Supports 4-16 GPU nodes

---

### 5. Data Pipeline

**Purpose**: High-throughput, low-latency data transfer with zero-copy optimizations.

**Architecture**:

```
DataPipeline
├── Ring Buffers (per camera)
│   ├── Lock-free Implementation
│   ├── Multi-producer Support
│   ├── Multi-consumer Support
│   └── Configurable Size (default: 60 frames)
├── Shared Memory Manager
│   ├── mmap-based Allocation
│   ├── IPC Support (POSIX)
│   ├── Zero-copy Transfers
│   └── Memory Pool
└── Network Transport
    ├── RDMA Support (InfiniBand)
    ├── Zero-copy Send/Receive
    ├── Scatter-Gather I/O
    └── Fallback to TCP/IP
```

**Memory Layout**:

```
Shared Memory Segment (per camera)
┌────────────────────────────────────────────────────────────┐
│ Header (64 bytes)                                          │
│ ├── Version                                                │
│ ├── Buffer Size                                            │
│ ├── Frame Width/Height                                     │
│ └── Metadata Offset                                        │
├────────────────────────────────────────────────────────────┤
│ Frame Buffer 0 (7680 x 4320 = 33.2 MB)                     │
├────────────────────────────────────────────────────────────┤
│ Frame Buffer 1 (33.2 MB)                                   │
├────────────────────────────────────────────────────────────┤
│ ...                                                        │
├────────────────────────────────────────────────────────────┤
│ Frame Buffer N (33.2 MB)                                   │
├────────────────────────────────────────────────────────────┤
│ Metadata Array                                             │
│ ├── Frame 0 Metadata (timestamp, frame_id, etc.)           │
│ ├── Frame 1 Metadata                                       │
│ └── ...                                                    │
└────────────────────────────────────────────────────────────┘
```
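A minimal sketch of how a consumer might obtain a zero-copy view of one frame from this segment, using `mmap` and NumPy. The segment path and the offsets are assumptions derived from the layout above (64-byte header, 8-bit monochrome frames), not the system's actual API:

```python
import mmap
import numpy as np

# Layout constants assumed from the diagram above.
HEADER_BYTES = 64
FRAME_W, FRAME_H = 7680, 4320
FRAME_BYTES = FRAME_W * FRAME_H  # 8-bit mono: ~33.2 MB per frame

def map_frame(shm_path, frame_index):
    """Return a zero-copy NumPy view of one frame in the segment.

    `shm_path` is a POSIX shared-memory file (e.g. under /dev/shm);
    the naming scheme is hypothetical. np.frombuffer wraps the mapped
    bytes directly, so no pixel data is copied.
    """
    with open(shm_path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)  # map the whole segment
    offset = HEADER_BYTES + frame_index * FRAME_BYTES
    frame = np.frombuffer(mm, dtype=np.uint8,
                          count=FRAME_BYTES, offset=offset)
    return frame.reshape(FRAME_H, FRAME_W)
```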
**Lock-free Ring Buffer Algorithm**:

```cpp
// Simplified lock-free ring buffer (single-producer/single-consumer variant)
#include <atomic>
#include <cstdint>
#include <vector>

class LockFreeRingBuffer {
    std::atomic<uint64_t> write_index_{0};
    std::atomic<uint64_t> read_index_{0};
    size_t capacity_;
    std::vector<Frame> buffer_;

public:
    explicit LockFreeRingBuffer(size_t capacity)
        : capacity_(capacity), buffer_(capacity) {}

    bool push(const Frame& frame) {
        uint64_t current_write = write_index_.load(std::memory_order_relaxed);
        uint64_t next_write = (current_write + 1) % capacity_;
        uint64_t current_read = read_index_.load(std::memory_order_acquire);

        // Check if buffer is full
        if (next_write == current_read) {
            return false;  // Buffer full
        }

        // Write data
        buffer_[current_write] = frame;

        // Publish the new write index
        write_index_.store(next_write, std::memory_order_release);
        return true;
    }

    bool pop(Frame& frame) {
        uint64_t current_read = read_index_.load(std::memory_order_relaxed);
        uint64_t current_write = write_index_.load(std::memory_order_acquire);

        // Check if buffer is empty
        if (current_read == current_write) {
            return false;  // Buffer empty
        }

        // Read data
        frame = buffer_[current_read];

        // Publish the new read index
        uint64_t next_read = (current_read + 1) % capacity_;
        read_index_.store(next_read, std::memory_order_release);
        return true;
    }
};
```

**Performance Characteristics**:

- Write throughput: 2.5+ GB/s per camera
- Read throughput: 2.0+ GB/s
- Latency: <100 microseconds (local), <5ms (network with RDMA)
- Zero-copy efficiency: 95%+ (eliminates memory copies)
- Scalability: Supports 10-100 cameras per node

---

### 6. Voxel Reconstruction System

**Purpose**: Project motion coordinates into 3D voxel space for spatial tracking.

**Architecture**:

```
VoxelGrid (CUDA Accelerated)
├── Sparse Voxel Storage
│   ├── Hash Table (GPU)
│   ├── Octree Structure
│   ├── Voxel Activation
│   └── Memory Management
├── Projection Engine
│   ├── Camera Model (Pinhole)
│   ├── Ray Casting (CUDA Kernels)
│   ├── Voxel Update (Atomic Ops)
│   └── Confidence Weighting
└── Optimization
    ├── Spatial Hashing
    ├── Parallel Reduction
    ├── Coalesced Memory Access
    └── Shared Memory Caching
```

**CUDA Kernel Architecture**:

```cuda
// Simplified voxel projection kernel. CameraPose, unproject(),
// world_to_voxel(), hash(), and the float3 operators are device-side
// helpers defined elsewhere.
__global__ void project_to_voxel_kernel(
    const float* __restrict__ coords,            // 2D coordinates (x, y pairs)
    const CameraPose* __restrict__ camera_pose,  // camera position/orientation
    VoxelGrid* grid,                             // sparse voxel grid
    int num_points,
    float max_distance                           // ray-march range limit
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_points) return;

    // Load 2D coordinate
    float2 pixel = make_float2(coords[idx * 2], coords[idx * 2 + 1]);

    // Unproject to a 3D ray through the pixel
    float3 ray_dir = unproject(pixel, camera_pose);

    // Ray-march through the voxel grid, one voxel-sized step at a time
    float3 pos = camera_pose->position;
    float step = grid->voxel_size;
    for (float t = 0; t < max_distance; t += step) {
        float3 voxel_pos = pos + ray_dir * t;

        // Compute the voxel index for this sample point
        int3 voxel_idx = world_to_voxel(voxel_pos, grid);

        // Atomically accumulate evidence in the hashed voxel cell
        atomicAdd(&grid->data[hash(voxel_idx)], 1.0f);
    }
}
```

**Performance**:

- Voxel update rate: 30 FPS for 10,000 points
- Memory usage: Sparse storage (~10% of dense grid)
- GPU utilization: 30-40%
- Ray casting: 1M rays/second
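The kernel above leaves `world_to_voxel()` and `hash()` undefined. A host-side Python sketch of one common scheme, the prime-multiplication spatial hash of Teschner et al., is shown below; the voxel size, table size, and grid origin are illustrative assumptions, and the production CUDA `hash()` may differ:

```python
# Hypothetical grid parameters, chosen for illustration only.
VOXEL_SIZE = 0.05     # metres per voxel (assumed)
TABLE_SIZE = 1 << 20  # hash table slots (assumed)

def world_to_voxel(x, y, z, origin=(0.0, 0.0, 0.0)):
    """Map a world-space point to integer voxel coordinates."""
    ox, oy, oz = origin
    return (int((x - ox) // VOXEL_SIZE),
            int((y - oy) // VOXEL_SIZE),
            int((z - oz) // VOXEL_SIZE))

def spatial_hash(ix, iy, iz):
    """Prime-multiplication spatial hash (Teschner et al.).

    A common choice for sparse voxel hashing: XOR the coordinates
    scaled by large primes, then fold into the table size.
    """
    return ((ix * 73856093) ^ (iy * 19349663) ^ (iz * 83492791)) % TABLE_SIZE
```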
---

## Data Flow Diagrams

### End-to-End Pipeline

```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Camera  │────>│  Video   │────>│  Motion  │────>│  Fusion  │
│ Capture  │     │  Decode  │     │ Extract  │     │ Process  │
│  (0ms)   │     │ (5-8ms)  │     │(12-18ms) │     │ (8-12ms) │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
                                                        │
                                                        ▼
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Output  │<────│  Voxel   │<────│ Distrib  │<────│ Detection│
│ (Display)│     │   Grid   │     │ Process  │     │ Tracking │
│          │     │ (5-8ms)  │     │ (2-5ms)  │     │ (3-5ms)  │
└──────────┘     └──────────┘     └──────────┘     └──────────┘

Total Latency: ~35-56ms (excluding camera capture)
Target: <33ms for 30 FPS
```

### Distributed Processing Flow

```
Master Node                   Worker Node 1                Worker Node 2
     │                              │                            │
     │  [Task Assignment]           │                            │
     ├─────────────────────────────>│                            │
     │                              │                            │
     │                         [GPU Process]                     │
     │                              │                            │
     │  [Result Collection]         │                            │
     │<─────────────────────────────┤                            │
     │                              │                            │
     │  [Task Assignment]           │                            │
     ├──────────────────────────────────────────────────────────>│
     │                              │                            │
     │                              │                       [GPU Process]
     │                              │                            │
     │  [Result Collection]         │                            │
     │<──────────────────────────────────────────────────────────┤
     │                              │                            │
     │  [Heartbeat]                 │                            │
     │<─────────────────────────────┤                            │
     │<──────────────────────────────────────────────────────────┤
     │                              │                            │
```

---

## Performance Characteristics

### Throughput Analysis

| Component | Sequential | Parallel (4 threads) | GPU |
|-----------|------------|----------------------|-----|
| 8K Decode | 15-20 FPS | 60+ FPS (HW) | N/A |
| Motion Extract | 8-10 FPS | 35+ FPS | N/A |
| Fusion | 12-15 FPS | 30+ FPS | 50+ FPS |
| Voxel Project | 5-8 FPS | 15-20 FPS | 30+ FPS |

### Latency Breakdown

```
Frame Pipeline (Target: <33ms for 30 FPS)
─────────────────────────────────────────────────────────
Video Decode     ████░░░░░░░░░░░░░░░░░░░░░   5-8ms
Motion Extract   ████████████░░░░░░░░░░░░░  12-18ms
Fusion Process   ████████░░░░░░░░░░░░░░░░░   8-12ms
Detection Track  ███░░░░░░░░░░░░░░░░░░░░░░   3-5ms
Voxel Project    ██████░░░░░░░░░░░░░░░░░░░   5-8ms
Distributed      ██░░░░░░░░░░░░░░░░░░░░░░░   2-5ms
─────────────────────────────────────────────────────────
Total            ██████████████████████████  35-56ms

Optimization needed to meet <33ms target:
- Parallel fusion processing
- Async voxel updates
- Pipeline overlapping
```

### Scalability

**Horizontal Scaling** (adding more nodes):

- 1 Node: 2 camera pairs (4 cameras)
- 2 Nodes: 5 camera pairs (10 cameras)
- 4 Nodes: 10 camera pairs (20 cameras)
- 8 Nodes: 20 camera pairs (40 cameras)

**Vertical Scaling** (more GPUs per node):

- 1 GPU: 1-2 camera pairs
- 2 GPUs: 3-4 camera pairs
- 4 GPUs: 5-8 camera pairs

---

## Scalability Considerations

### Design for Scale

1. **Stateless Workers**: Workers don't maintain state between tasks
2. **Data Locality**: Tasks assigned to nodes with required data
3. **Load Balancing**: Dynamic task distribution based on worker load
4. **Fault Isolation**: Node failures don't affect other nodes
5. **Resource Pools**: Pre-allocated GPU memory and thread pools

### Bottlenecks and Solutions

| Bottleneck | Impact | Solution |
|------------|--------|----------|
| Network Bandwidth | Data transfer delays | RDMA, compression, local processing |
| GPU Memory | Limited camera pairs/node | Sparse data structures, streaming |
| CPU-GPU Transfer | PCIe bottleneck | Pinned memory, async transfers |
| Synchronization | Lock contention | Lock-free data structures |
| Task Scheduling | Load imbalance | Weighted scheduling, work stealing |

### Future Expansion

- **More Cameras**: Add nodes, scale horizontally
- **Higher Resolution**: Upgrade GPUs, optimize CUDA kernels
- **More Modalities**: Extend fusion system, add sensor interfaces
- **Lower Latency**: Optimize pipeline, reduce buffering
- **Cloud Deployment**: Add network optimization, edge computing

---

## Design Patterns

### 1. Producer-Consumer Pattern

- Cameras produce frames → pipeline consumes them
- Lock-free ring buffers for thread-safe communication

### 2. Pipeline Pattern

- Sequential stages with data flow
- Each stage can be parallelized independently

### 3. Master-Worker Pattern

- Master coordinates, workers execute
- Dynamic task distribution (see the sketch below)
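A minimal sketch of the master-worker pattern using ZeroMQ (listed under Libraries below). The socket addresses, message fields, and `process_on_gpu()` are illustrative assumptions, not the system's actual protocol:

```python
import zmq

def master(tasks, bind_addr="tcp://*:5557"):
    """Push tasks to any connected worker (hypothetical address/fields)."""
    ctx = zmq.Context.instance()
    sender = ctx.socket(zmq.PUSH)
    sender.bind(bind_addr)
    for task_id, payload in enumerate(tasks):
        sender.send_json({"task_id": task_id, "payload": payload})

def worker(connect_addr="tcp://localhost:5557"):
    """Pull tasks and process them; results would flow back on a
    separate socket in a full implementation."""
    ctx = zmq.Context.instance()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect(connect_addr)
    while True:
        task = receiver.recv_json()
        process_on_gpu(task)  # placeholder for the real GPU work
```

Note that PUSH/PULL sockets give simple round-robin distribution; the weighted worker selection described in Section 4 would replace this with explicit per-worker addressing.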
### 4. Observer Pattern

- Callbacks for motion detection, errors, and status updates
- Decouples components

### 5. Factory Pattern

- Camera creation based on type (Mono/Thermal, GigE/USB)
- Codec selection based on format

---

## Technology Stack

### Languages

- **Python 3.8+**: Application logic, data pipeline
- **C++17**: Performance-critical components (motion extraction, fusion)
- **CUDA**: GPU-accelerated kernels (voxel processing, detection)

### Libraries

- **OpenCV 4.5+**: Image processing, calibration
- **NumPy**: Array operations
- **PyBind11**: C++/Python bindings
- **Protocol Buffers**: Serialization
- **ZeroMQ**: Network messaging
- **RDMA**: High-speed network transfers (optional)

### Hardware Requirements

- **GPU**: NVIDIA RTX 3090/4090 with CUDA 11.0+
- **Network**: 10GbE or InfiniBand for multi-node deployments
- **Cameras**: GigE Vision compatible

---

## Security Considerations

- Camera access control (IP filtering, authentication)
- Encrypted network communication (TLS/SSL)
- Secure calibration data storage
- Input validation for all external data
- Resource limits to prevent DoS

---

## References

- [CUDA Programming Guide](https://docs.nvidia.com/cuda/)
- [GigE Vision Standard](https://www.automate.org/vision/gige-vision)
- [Lock-free Programming](https://preshing.com/20120612/an-introduction-to-lock-free-programming/)
- [RDMA Programming](https://www.rdmamojo.com/)