# FPGA Hardware Acceleration over PCIe

## What This Is

Multi-threaded Linux application + custom kernel driver + FPGA hardware design that accelerates Sobel edge detection on images. Demonstrates full-stack embedded systems engineering from RTL to application layer.

![System Overview][system_overview]

## Project Summary

This project implements a **full-stack hardware acceleration platform** that offloads compute-intensive image processing tasks from a multi-threaded Linux application to custom FPGA accelerators connected via PCIe. The system demonstrates advanced concepts in **computer architecture, hardware-software co-design, parallel processing, and driver development**.

### Key Achievement

Successfully designed and implemented a multi-acceleration-group architecture supporting **concurrent hardware acceleration** for up to **16 simultaneous threads**, with intelligent resource scheduling and DMA-based data transfers.

- **See [Engineering Challenges Solved](#engineering-challenges-solved)**

## Technical Overview

**Hardware (Xilinx Virtex-7 FPGA)**
- 7 parallel acceleration units supporting up to 16 concurrent threads
- Custom IP cores designed in C/C++ (Vivado HLS), synthesized to RTL
- PCIe Gen2 x4 interface with DMA engines for high-throughput data transfer
- Sobel filter accelerator processing up to 1080p images

**Software (Linux)**
- **Kernel driver**: PCIe device management, MSI interrupts, multi-thread resource scheduling
- **User application**: pthreads, memory-mapped I/O, DMA buffer management
- **MicroBlaze firmware**: FPGA system initialization

---

## Architecture Highlights

```
┌─────────────────────────────────────────┐
│  Multi-threaded Application (pthreads)  │
└──────────────┬──────────────────────────┘
               │ ioctl(), mmap()
┌──────────────▼──────────────────────────┐
│  Kernel Driver (Resource Scheduler)     │  ← Thread arbitration, DMA setup
└──────────────┬──────────────────────────┘
               │ PCIe, MSI Interrupts
┌──────────────▼──────────────────────────┐
│  FPGA Hardware (7 Accel Groups)         │  ← Parallel processing
│  • Fetch/Send Schedulers (DMA)          │
│  • Sobel Filter Accelerators            │
│  • Interrupt Manager                    │
└─────────────────────────────────────────┘
```

**Why 7 acceleration groups?**
- 2 Direct-mode (PCIe → BRAM, low latency)
- 4 Indirect-mode (PCIe → DDR3, higher throughput)
- 1 Scatter-Gather (supports fragmented user memory)

Each group can process a different image simultaneously, with driver-managed scheduling.

---

## Engineering Challenges Solved

**1. Multi-thread resource arbitration**
16 threads competing for 7 hardware units → implemented two scheduling policies (greedy, best-available) in the kernel driver with per-thread state tracking (see the arbitration sketch below)

**2. PCIe interrupt routing**
Designed a custom Interrupt Manager IP to map 7 accelerators to MSI vectors, coordinated with GPIO-triggered interrupts

**3. Zero-copy DMA from userspace**
Used `get_user_pages()` + scatter-gather tables for direct DMA to/from application buffers without memcpy overhead (see the DMA mapping sketch below)

**4. Hardware-software timing correlation**
FPGA global timer accessible via memory-mapped registers for nanosecond-precision performance analysis
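To make challenge 1 concrete, here is a minimal sketch of "best-available" arbitration over the 7 groups. It is illustrative only, not the project's actual driver code: the structure names, the mode encoding, and the fallback order are assumptions, and real kernel code would also handle signals, reference counting, and error paths.

```c
/* Sketch: arbitrate 7 acceleration groups among many threads.
 * Names and the mode table are hypothetical, not the real driver's. */
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/bitops.h>

#define NUM_GROUPS 7

enum group_mode { GROUP_DIRECT, GROUP_INDIRECT, GROUP_SG };

static const enum group_mode group_mode_of[NUM_GROUPS] = {
	GROUP_DIRECT, GROUP_DIRECT,
	GROUP_INDIRECT, GROUP_INDIRECT, GROUP_INDIRECT, GROUP_INDIRECT,
	GROUP_SG,
};

static DEFINE_SPINLOCK(group_lock);
static unsigned long busy_mask;              /* bit i set => group i in use */
static DECLARE_WAIT_QUEUE_HEAD(group_waitq);

/* Try to grab a free group, preferring one whose mode matches the request. */
static int try_claim_group(enum group_mode wanted)
{
	int i, fallback = -1;

	spin_lock(&group_lock);
	for (i = 0; i < NUM_GROUPS; i++) {
		if (test_bit(i, &busy_mask))
			continue;
		if (group_mode_of[i] == wanted) {    /* best available */
			set_bit(i, &busy_mask);
			spin_unlock(&group_lock);
			return i;
		}
		if (fallback < 0)
			fallback = i;                /* greedy fallback */
	}
	if (fallback >= 0)
		set_bit(fallback, &busy_mask);
	spin_unlock(&group_lock);
	return fallback;                             /* -1 if all busy */
}

/* Called from the ioctl path: block the thread until a group frees up. */
static int claim_group(enum group_mode wanted)
{
	int id;

	wait_event(group_waitq, (id = try_claim_group(wanted)) >= 0);
	return id;
}

static void release_group(int id)
{
	clear_bit(id, &busy_mask);
	wake_up(&group_waitq);
}
```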
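Challenge 3 follows the standard Linux pattern of pinning user pages and describing them to the hardware with a scatter-gather table. The sketch below shows that flow under simplifying assumptions: error cleanup is omitted, the DMA direction is fixed, and exact pinning APIs vary by kernel version (newer kernels use `pin_user_pages_fast()`).

```c
/* Sketch: map a user buffer for device DMA without copying it. */
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

struct user_dma_map {
	struct page **pages;
	int nr_pages;
	struct sg_table sgt;
};

static int map_user_buffer(struct device *dev, unsigned long uaddr,
			   size_t len, struct user_dma_map *m)
{
	unsigned long first = uaddr >> PAGE_SHIFT;
	unsigned long last  = (uaddr + len - 1) >> PAGE_SHIFT;
	int ret, mapped;

	m->nr_pages = last - first + 1;
	m->pages = kmalloc_array(m->nr_pages, sizeof(*m->pages), GFP_KERNEL);
	if (!m->pages)
		return -ENOMEM;

	/* Pin the user pages so they cannot be swapped or moved during DMA. */
	ret = get_user_pages_fast(uaddr, m->nr_pages, FOLL_WRITE, m->pages);
	if (ret != m->nr_pages)
		return -EFAULT;

	/* Build a scatter-gather table over the (possibly fragmented) pages. */
	ret = sg_alloc_table_from_pages(&m->sgt, m->pages, m->nr_pages,
					offset_in_page(uaddr), len, GFP_KERNEL);
	if (ret)
		return ret;

	/* Translate to bus addresses for the DMA engine; no memcpy involved. */
	mapped = dma_map_sg(dev, m->sgt.sgl, m->sgt.nents, DMA_TO_DEVICE);
	if (!mapped)
		return -EIO;

	/* Each entry's sg_dma_address()/sg_dma_len() would then be written
	 * into the fetch scheduler's descriptor registers (not shown). */
	return mapped;
}
```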
---

## Results

- **Throughput**: Supports 16 concurrent requests with linear scaling up to 7 threads
- **Latency**: ~50-100 μs for VGA images (640x480)

---

## Quick Start

```bash
# Generate custom IPs (one-time)
cd Hardware/Vivado_HLS_IPs/Sobel_Filter && vivado_hls run_hls.tcl
# ... repeat for 8 other IPs

# Build bitstream
cd Hardware && vivado -source create_project.tcl
# Flow → Generate Bitstream

# Load driver & run
cd Software/Linux_App_Driver
make
./make_device
insmod xilinx_pci_driver.ko
./ui image.bmp 100 16 1 10    # 100 iterations, 16 threads
```

(For a rough idea of what the per-thread request path looks like from user space, see the sketch at the end of this README.)

---

## Repository Structure

```
Hardware/Vivado_HLS_IPs/        9 custom IP cores (C++ → RTL)
Hardware/Vivado_Block_Design/   System integration (AXI, PCIe, DDR3)
Software/Linux_App_Driver/      Kernel driver + test application
Software/Microblaze_XSDK/       FPGA firmware
```

[system_overview]: /Images/system_overview.png "System Overview Diagram"
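Finally, a sketch of the user-space side of the architecture diagram (application → `ioctl()` → driver). Everything here is illustrative: the device node name, the request struct layout, and the `XPCIE_IOC_PROCESS` ioctl code are hypothetical and do not describe the actual interface exposed by `xilinx_pci_driver.ko`.

```c
/* Hypothetical view of one worker thread submitting an image.
 * Struct layout, ioctl code, and device path are illustrative only. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct accel_request {              /* assumed request layout */
	void    *src;               /* input image pixels  */
	void    *dst;               /* output edge map     */
	unsigned width, height;
};

#define XPCIE_IOC_PROCESS _IOWR('x', 1, struct accel_request)  /* assumed */

struct worker_arg {
	int      fd;                /* shared handle to the accelerator device */
	unsigned width, height;
};

static void *worker(void *p)
{
	struct worker_arg *a = p;
	size_t n = (size_t)a->width * a->height;
	struct accel_request req = {
		.src = malloc(n), .dst = malloc(n),   /* image load omitted */
		.width = a->width, .height = a->height,
	};

	/* Blocks until the driver claims a free acceleration group, runs the
	 * DMA + Sobel pass, and is signalled completion via MSI. */
	if (ioctl(a->fd, XPCIE_IOC_PROCESS, &req) < 0)
		perror("accel request failed");

	free(req.src);
	free(req.dst);
	return NULL;
}

int main(void)
{
	pthread_t tid[16];
	struct worker_arg a = {
		.fd = open("/dev/xilinx_pcie", O_RDWR),   /* assumed node name */
		.width = 640, .height = 480,
	};

	if (a.fd < 0) { perror("open"); return 1; }

	for (int i = 0; i < 16; i++)
		pthread_create(&tid[i], NULL, worker, &a);
	for (int i = 0; i < 16; i++)
		pthread_join(tid[i], NULL);

	close(a.fd);
	return 0;
}
```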