BabelTower: Learning to Auto-parallelized Program Translation

1University of Science and Technology of China
2State Key Lab of Processors, Institute of Computing Technology, CAS
3Cambricon Technologies
4University of Chinese Academy of Sciences 5Institute of Software, CAS

Overview

GPUs have become the dominant computing platform for many applications, yet programming GPUs with the widely used CUDA parallel programming model remains difficult. Since sequential C code is relatively easy to obtain, either from legacy repositories or by manual implementation, automatically translating C to its parallel CUDA counterpart is a promising way to relieve the burden of GPU programming. However, because of the huge differences between the sequential C and the parallel CUDA programming models, existing approaches fail at this challenging auto-parallelized program translation. In this paper, we propose a learning-based framework, BabelTower, to address this problem. We first create a large-scale dataset consisting of compute-intensive, function-level monolingual corpora. We further propose using back-translation with a discriminative reranker to cope with the unpaired corpora and the parallel semantic conversion. Experimental results show that BabelTower outperforms the state-of-the-art by 1.79, 6.09, and 9.39 in terms of BLEU, CodeBLEU, and our specifically designed ParaBLEU, respectively. The CUDA code generated by BabelTower attains a speedup of up to 347× over the sequential C code, and developer productivity is improved by up to 3.8×.

Overview of the BabelTower learning framework. We train the discriminative ranking model in the back-translation step, i.e., CUDA-C-CUDA, to synthesize paired data. Further, we specially design the ParaBLEU metric for CUDA and learn to predict the ParaBLEU score by minimizing the KL-divergence between the model's output distribution and the target distribution.
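As a rough sketch of this ranking objective (the softmax form of the target distribution and the temperature τ below are our assumptions, not the paper's stated parameterization): for a C function x with K candidate CUDA translations y_1, ..., y_K scored by the ranker as s_θ(x, y_i),

q(y_i \mid x) \propto \exp\!\big(\mathrm{ParaBLEU}(y_i)/\tau\big), \qquad
p_\theta(y_i \mid x) \propto \exp\!\big(s_\theta(x, y_i)\big), \qquad
\mathcal{L}(\theta) = D_{\mathrm{KL}}\big(q(\cdot \mid x)\,\|\,p_\theta(\cdot \mid x)\big)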

Main Results

We evaluate auto-parallelization approaches (i.e., Bones and PPCG) and statistical program translation approaches (i.e., TransCoder and BabelTower) on four metrics: BLEU, CodeBLEU, ParaBLEU, and compilation accuracy. For the statistical program translation approaches, we perform beam decoding with a beam size of 50.

Translation Cases

These results are produced by BabelTower.

Source C

void transpositionCPU(int * vector, int * transposed, int size) {
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) transposed[i + j * size] = vector[j + i * size];
}
                      
Translated CUDA

__global__ void transpositionCPU(int * vector, int * transposed, int size) {
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < size; i += blockDim.x * gridDim.x)
        for (int j = threadIdx.y + blockIdx.y * blockDim.y; j < size; j += blockDim.y * gridDim.y) transposed[i + j * size] = vector[j + i * size];
}
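The translated kernel replaces the sequential loops with grid-stride loops, so any 2D launch configuration covers the full iteration space. As an illustration only (the host-side setup below, including the block and grid dimensions, is our own sketch and not part of BabelTower's output), the kernel could be launched as follows:

// Hypothetical host-side launch for the generated kernel (our sketch, not BabelTower output).
// Thanks to the grid-stride loops, the kernel is correct for any grid/block shape.
int size = 1024;
int *d_vector, *d_transposed;
cudaMalloc(&d_vector, size * size * sizeof(int));
cudaMalloc(&d_transposed, size * size * sizeof(int));
// ... copy the input matrix into d_vector with cudaMemcpy ...
dim3 block(16, 16);
dim3 grid((size + block.x - 1) / block.x, (size + block.y - 1) / block.y);
transpositionCPU<<<grid, block>>>(d_vector, d_transposed, size);
cudaDeviceSynchronize();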
                      
Source C

void gemm(...) {
    int i, j, k;
    for (i = 0; i < NI; i++)
        for (j = 0; j < NJ; j++) {
            C[i * NJ + j] *= beta;
            for (k = 0; k < NK; ++k) C[i * NJ + j] += alpha * A[i * NK + k] * B[k * NJ + j];
        }
}
                      
Translated CUDA

__global__ void gemm(...) {
    int i, j, k;
    i = blockIdx.x * blockDim.x + threadIdx.x;
    j = blockIdx.y * blockDim.y + threadIdx.y;
    if ((i < NI) && (j < NJ)) {
        C[i * NJ + j] *= beta;
        for (k = 0; k < NK; ++k) C[i * NJ + j] += alpha * A[i * NK + k] * B[k * NJ + j];
    }
}
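Unlike the first case, this translation uses an if-guard rather than grid-stride loops, so the launch grid must cover the whole NI×NJ index space, one thread per output element. A hedged host-side sketch (again our own illustration; the block size is an assumption, and the elided argument list is left as in the source):

// Hypothetical launch for the guarded gemm kernel (our sketch, not BabelTower output).
// i comes from the x dimension and j from the y dimension, so the grid must
// span NI in x and NJ in y; the if-guard masks out-of-range threads.
dim3 block(16, 16);
dim3 grid((NI + block.x - 1) / block.x, (NJ + block.y - 1) / block.y);
gemm<<<grid, block>>>(/* ... device pointers and scalars from the elided signature ... */);
cudaDeviceSynchronize();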
                      

BibTeX

  @InProceedings{pmlr-v162-wen22b,
    title     = {{B}abel{T}ower: Learning to Auto-parallelized Program Translation},
    author    = {Wen, Yuanbo and Guo, Qi and Fu, Qiang and Li, Xiaqing and Xu, Jianxing and Tang, Yanlin and Zhao, Yongwei and Hu, Xing and Du, Zidong and Li, Ling and Wang, Chao and Zhou, Xuehai and Chen, Yunji},
    booktitle = {Proceedings of the 39th International Conference on Machine Learning},
    year      = {2022}
  }