This project extends the previous year’s GPU-based vector clustering work by developing an asynchronous GPU acceleration framework for large-scale vector distance computation.
The Memory Load Manager maximizes GPU utilization and eliminates CPU–GPU transfer bottlenecks by overlapping computation with data movement.
Asynchronous Memory–Compute Pipeline
Dual-stream CUDA architecture allows concurrent data transfer (H2D/D2H) and computation.
Streams are synchronized via CUDA events to ensure consistency without blocking.
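A minimal sketch of this dual-stream pattern follows. The stream, event, and kernel names (copyStream, computeStream, tileReady, distanceKernel) are illustrative, not taken from the project's code; the point is that the compute stream waits only on the event recorded by the copy stream rather than synchronizing the whole device.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; stands in for the real distance computation.
__global__ void distanceKernel(const float* tile, float* out, int n);

void stageAndCompute(const float* h_tile, float* d_tile, float* d_out,
                     size_t tileBytes, int n)
{
    cudaStream_t copyStream, computeStream;
    cudaEvent_t  tileReady;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);
    cudaEventCreateWithFlags(&tileReady, cudaEventDisableTiming);

    // The H2D copy is enqueued on the copy stream and returns immediately.
    cudaMemcpyAsync(d_tile, h_tile, tileBytes,
                    cudaMemcpyHostToDevice, copyStream);
    cudaEventRecord(tileReady, copyStream);

    // The compute stream waits only on this event, not on the whole device,
    // so the transfer of the next tile can proceed concurrently.
    cudaStreamWaitEvent(computeStream, tileReady, 0);
    distanceKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(d_tile, d_out, n);
}
```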
Tile-Based Distance Computation
Uses cuBLAS GEMM operations on tiles of the input so that pairwise vector distances can be computed efficiently within a limited GPU memory budget.
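One common way to map pairwise squared Euclidean distances onto GEMM is the identity d(i,j) = ||x_i||^2 + ||y_j||^2 - 2·x_i·y_j, where the cross terms for one tile come from a single cublasSgemm call. The sketch below assumes column-major tiles and illustrative names (d_X, d_Y, d_G); the norm terms are added afterwards by a separate kernel.

```cuda
#include <cublas_v2.h>

// Computes G = -2 * X * Y^T for one tile (column-major, cuBLAS convention).
// X is tileRows x dim, Y is tileCols x dim, G is tileRows x tileCols.
void tileCrossTerms(cublasHandle_t handle,
                    const float* d_X, const float* d_Y, float* d_G,
                    int tileRows, int tileCols, int dim)
{
    const float alpha = -2.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                tileRows, tileCols, dim,
                &alpha, d_X, tileRows,
                        d_Y, tileCols,
                &beta,  d_G, tileRows);
}
```

Because each call touches only one pair of tiles, the working set stays bounded regardless of the total dataset size.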
GPU-Side Matrix Transposition
Adds a GPU kernel to convert row-major to column-major matrices for coalesced memory access.
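A standard shared-memory tiled transpose illustrates how both the read and the write can stay coalesced; the TILE size and kernel name are illustrative, and the +1 padding avoids shared-memory bank conflicts.

```cuda
#define TILE 32

__global__ void transposeKernel(const float* __restrict__ in,
                                float* __restrict__ out,
                                int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```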
Memory Transfer Optimization
Asynchronous data movement using pinned memory enables direct DMA transfers, synchronized with CUDA events.
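A sketch of the pinned-memory path, with illustrative buffer and stream names: page-locked host memory from cudaMallocHost lets cudaMemcpyAsync perform a true DMA transfer instead of falling back to a synchronous staged copy through pageable memory.

```cuda
#include <cuda_runtime.h>

void pinnedTransfer(float* d_buf, size_t tileBytes,
                    cudaStream_t copyStream, cudaEvent_t tileReady)
{
    float* h_buf = nullptr;
    cudaMallocHost(&h_buf, tileBytes);   // pinned (page-locked) allocation

    // ... fill h_buf with the next tile of vectors ...

    cudaMemcpyAsync(d_buf, h_buf, tileBytes,
                    cudaMemcpyHostToDevice, copyStream);  // true async DMA
    cudaEventRecord(tileReady, copyStream);  // consumers wait on this event

    cudaStreamSynchronize(copyStream);   // only before reusing/freeing h_buf
    cudaFreeHost(h_buf);
}
```

Consuming streams synchronize with cudaStreamWaitEvent(stream, tileReady, 0), so the rest of the pipeline is never blocked by a device-wide synchronization.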
GPU Compute Kernel Pipeline
Includes on-device transposition, cuBLAS GEMM for matrix multiplication, and a fused kernel to compute final distances.
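The final stage of such a pipeline can be sketched as a small fused epilogue kernel: after the GEMM has written G = -2·X·Yᵀ, one kernel adds the precomputed squared norms to produce the final distances. The norm buffers (xn, yn) and kernel name are illustrative; the column-major indexing matches the cuBLAS output.

```cuda
__global__ void fuseDistances(float* __restrict__ G,
                              const float* __restrict__ xn,  // ||x_i||^2
                              const float* __restrict__ yn,  // ||y_j||^2
                              int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column index
    if (i < rows && j < cols)
        G[j * rows + i] += xn[i] + yn[j];  // column-major, as cuBLAS wrote it
}
```

Fusing the epilogue into one kernel avoids extra passes over the tile and keeps the output resident on the device for the D2H copy.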
The asynchronous GPU-accelerated distance computation framework consists of several key components:
cudaMallocHost for pinned allocations enabling high-throughput DMA transfers.
cudaEventRecord and cudaStreamWaitEvent for non-blocking stream synchronization.
cublasSgemm() for tile-wise distance computation.
This modular architecture achieves high concurrency between data movement and computation, resulting in a continuous, high-throughput GPU pipeline suitable for large-scale vector similarity search and clustering.