This project extends the previous year’s GPU-based vector clustering work by developing an asynchronous GPU acceleration framework for large-scale vector distance computation.
The Memory Load Manager maximizes GPU utilization and eliminates CPU–GPU transfer bottlenecks by overlapping computation with data movement.
Asynchronous Memory–Compute Pipeline
Dual-stream CUDA architecture allows concurrent data transfer (H2D/D2H) and computation.
Streams are synchronized via CUDA events to ensure consistency without blocking.
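A minimal sketch of this dual-stream pattern follows. The stream, event, and kernel names (copyStream, computeStream, tileReady, distanceKernel) are illustrative, not taken from the project's code; the point is that the compute stream waits only on the event recorded by the copy stream rather than synchronizing the whole device.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; stands in for the real distance computation.
__global__ void distanceKernel(const float* tile, float* out, int n);

void stageAndCompute(const float* h_tile, float* d_tile, float* d_out,
                     size_t tileBytes, int n)
{
    cudaStream_t copyStream, computeStream;
    cudaEvent_t  tileReady;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);
    cudaEventCreateWithFlags(&tileReady, cudaEventDisableTiming);

    // The H2D copy is enqueued on the copy stream and returns immediately.
    cudaMemcpyAsync(d_tile, h_tile, tileBytes,
                    cudaMemcpyHostToDevice, copyStream);
    cudaEventRecord(tileReady, copyStream);

    // The compute stream waits only on this event, not on the whole device,
    // so the transfer of the next tile can proceed concurrently.
    cudaStreamWaitEvent(computeStream, tileReady, 0);
    distanceKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(d_tile, d_out, n);
}
```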
Tile-Based Distance Computation
Uses cuBLAS GEMM operations on tiles of the input so that pairwise vector distances can be computed efficiently within a limited GPU memory budget.
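One common way to map pairwise squared Euclidean distances onto GEMM is the identity d(i,j) = ||x_i||^2 + ||y_j||^2 - 2·x_i·y_j, where the cross terms for one tile come from a single cublasSgemm call. The sketch below assumes column-major tiles and illustrative names (d_X, d_Y, d_G); the norm terms are added afterwards by a separate kernel.

```cuda
#include <cublas_v2.h>

// Computes G = -2 * X * Y^T for one tile (column-major, cuBLAS convention).
// X is tileRows x dim, Y is tileCols x dim, G is tileRows x tileCols.
void tileCrossTerms(cublasHandle_t handle,
                    const float* d_X, const float* d_Y, float* d_G,
                    int tileRows, int tileCols, int dim)
{
    const float alpha = -2.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                tileRows, tileCols, dim,
                &alpha, d_X, tileRows,
                        d_Y, tileCols,
                &beta,  d_G, tileRows);
}
```

Because each call touches only one pair of tiles, the working set stays bounded regardless of the total dataset size.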
GPU-Side Matrix Transposition
Adds a GPU kernel to convert row-major to column-major matrices for coalesced memory access.
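A standard shared-memory tiled transpose illustrates how both the read and the write can stay coalesced; the TILE size and kernel name are illustrative, and the +1 padding avoids shared-memory bank conflicts.

```cuda
#define TILE 32

__global__ void transposeKernel(const float* __restrict__ in,
                                float* __restrict__ out,
                                int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```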
Memory Transfer Optimization
Asynchronous data movement using pinned memory enables direct DMA transfers, synchronized with CUDA events.
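A sketch of the pinned-memory path, with illustrative buffer and stream names: page-locked host memory from cudaMallocHost lets cudaMemcpyAsync perform a true DMA transfer instead of falling back to a synchronous staged copy through pageable memory.

```cuda
#include <cuda_runtime.h>

void pinnedTransfer(float* d_buf, size_t tileBytes,
                    cudaStream_t copyStream, cudaEvent_t tileReady)
{
    float* h_buf = nullptr;
    cudaMallocHost(&h_buf, tileBytes);   // pinned (page-locked) allocation

    // ... fill h_buf with the next tile of vectors ...

    cudaMemcpyAsync(d_buf, h_buf, tileBytes,
                    cudaMemcpyHostToDevice, copyStream);  // true async DMA
    cudaEventRecord(tileReady, copyStream);  // consumers wait on this event

    cudaStreamSynchronize(copyStream);   // only before reusing/freeing h_buf
    cudaFreeHost(h_buf);
}
```

Consuming streams synchronize with cudaStreamWaitEvent(stream, tileReady, 0), so the rest of the pipeline is never blocked by a device-wide synchronization.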
GPU Compute Kernel Pipeline
Includes on-device transposition, cuBLAS GEMM for matrix multiplication, and a fused kernel to compute final distances.
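The final stage of such a pipeline can be sketched as a small fused epilogue kernel: after the GEMM has written G = -2·X·Yᵀ, one kernel adds the precomputed squared norms to produce the final distances. The norm buffers (xn, yn) and kernel name are illustrative; the column-major indexing matches the cuBLAS output.

```cuda
__global__ void fuseDistances(float* __restrict__ G,
                              const float* __restrict__ xn,  // ||x_i||^2
                              const float* __restrict__ yn,  // ||y_j||^2
                              int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column index
    if (i < rows && j < cols)
        G[j * rows + i] += xn[i] + yn[j];  // column-major, as cuBLAS wrote it
}
```

Fusing the epilogue into one kernel avoids extra passes over the tile and keeps the output resident on the device for the D2H copy.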
The asynchronous GPU-accelerated distance computation framework consists of several key components:
cudaMallocHost for pinned allocations enabling high-throughput DMA transfers.
cudaEventRecord and cudaStreamWaitEvent for non-blocking stream synchronization.
cublasSgemm() for tile-wise distance computation.
This modular architecture achieves high concurrency between data movement and computation, resulting in a continuous, high-throughput GPU pipeline suitable for large-scale vector similarity search and clustering.