High-performance GPU techniques for 2D convolution operators in deep neural networks

Enjie Liu

Research Output: Contribution to journal Article Peer-review

Abstract

This paper presents an optimized implementation of Winograd non-fused convolution. Our optimizations include both application-independent and Winograd-specific software techniques, such as a specialized interface-kernel data format (tile-united CNHW layout) to enhance memory access efficiency; warp specialization and double-buffered prefetching to effectively exploit computational resources and memory bandwidth; and the use of shuffle instructions to conserve hardware resources. We propose a GPU-based Multi-Modal Parallelism Method (MMPM) for 2D Winograd non-fused convolution and provide a supplementary explanation of Winograd’s tile extraction, which reduces memory usage and computation. The proposed techniques were evaluated at the kernel level in two environments (ENV1 – GTX 980 GPU, CUDA 9.2, cuDNN 7.6.4; ENV2 – GTX 1650Ti GPU, CUDA 10.2, cuDNN 8.2.0) using a wide range of benchmark-compliant CNN layer parameters. Compared with the state-of-the-art Winograd non-fused convolution in cuDNN, our implementation achieves speedups of 1.64× and 1.28× in ENV1 and ENV2, respectively.
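For readers unfamiliar with the underlying arithmetic, the following is a minimal sketch of the textbook 1D Winograd minimal-filtering step F(2,3), which produces two outputs of a 3-tap filter from a 4-element input tile using 4 multiplications instead of 6. It is not the paper's GPU implementation: the function names `winograd_f2_3` and `direct_f2_3` are hypothetical, the tile size is assumed for illustration only, and the 2D non-fused variant applies the same three transforms (filter, input, output) as separate stages over 2D tiles.

```python
def winograd_f2_3(d, g):
    """Compute 2 outputs of a 3-tap correlation over a 4-element tile.

    d: input tile [d0, d1, d2, d3]; g: filter [g0, g1, g2].
    Illustrative sketch only (assumed F(2,3) tile size).
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Filter transform: depends only on g, so it can be computed once per filter.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2

    # Input transform: additions and subtractions only.
    v0 = d0 - d2
    v1 = d1 + d2
    v2 = d2 - d1
    v3 = d1 - d3

    # Element-wise product: the only 4 multiplications involving the input.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3

    # Output transform: recombine the products into the 2 outputs.
    return [m0 + m1 + m2, m1 - m2 - m3]


def direct_f2_3(d, g):
    """Reference: the same 2 outputs computed directly with 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

Calling `winograd_f2_3([1, 2, 3, 4], [1, 0, -1])` gives the same two values as `direct_f2_3` on the same arguments; in a non-fused design each of the stages above would run as its own kernel, with the element-wise products batched across tiles and channels.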

Publication Information

Output type

Research Output: Contribution to journal Article Peer-review

Original language

English

Article number

CPE-25-1771

Journal (Volume, Issue Number)

Concurrency and Computation: Practice and Experience

Publication milestones

Accepted/In press - 18/06/2026

Publication status

Accepted/In press - 18/06/2026

ISSN

1532-0626

Access to documents

High-Performance GPU Techniques for 2D Convolution Operators in Deep Neural Networks

Accepted author manuscript, 2.9 MB

License: CC BY-NC-ND

Access to file: Embargo ends 29/06/2027