QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

1University of Science and Technology of China
2Cambricon Technologies
3SKL of Processors, Institute of Computing Technology, Chinese Academy of Sciences
4Institute of Software, Chinese Academy of Sciences

Overview

Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques suffer from either tremendous manual effort or functional incorrectness, rendering “Write Once, Run Anywhere” for tensor programs an open question. We propose a novel transcompiler, i.e., QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is to leverage the powerful code-generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes via pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets at a limited scale. To attain high performance, we propose a hierarchical auto-tuning approach to systematically explore both the parameters and sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with an average accuracy of 95%, and the performance of translated programs achieves up to 2.0× over vendor-provided manually-optimized libraries. As a result, the programming productivity of DLS is improved by up to 96.0× via transcompiling legacy tensor programs.
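
As a concrete illustration of the neural-symbolic loop described above, the following sketch shows how LLM-driven transformation passes and small-scale symbolic repair might compose. It is a minimal sketch, not the released implementation: llm_apply_pass, failing_snippets, and symbolic_repair are hypothetical stand-ins for the neural and symbolic components.

from typing import Callable, List

def transcompile(
    program: str,
    meta_prompts: List[str],                       # one pre-defined prompt per pass
    llm_apply_pass: Callable[[str, str], str],     # neural: prompt-driven rewrite
    failing_snippets: Callable[[str], List[str]],  # detect incorrect code snippets
    symbolic_repair: Callable[[str], str],         # search-based synthesis, small scope
) -> str:
    """Apply each LLM-assisted pass, then repair only the snippets that fail.

    Restricting symbolic synthesis to small failing snippets is what keeps
    the search computationally tractable.
    """
    for prompt in meta_prompts:
        candidate = llm_apply_pass(program, prompt)
        for bad in failing_snippets(candidate):
            candidate = candidate.replace(bad, symbolic_repair(bad))
        program = candidate
    return program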

Overview of QiMeng-Xpiler, a novel transcompiler for automatic transcompilation of tensor programs across different programming models. QiMeng-Xpiler consists of two parts: (a) neural-symbolic program synthesis, which uses an LLM to transform code and repairs incorrect transformations through symbolic synthesis at a limited scale, and (b) hierarchical performance auto-tuning, which systematically explores both the parameters and sequences of transformation passes.
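
The hierarchical auto-tuning can be pictured as a two-level search, sketched below under assumed interfaces (apply_pass, measure_latency, and the search spaces are illustrative stand-ins, not QiMeng-Xpiler's actual API): the outer level enumerates sequences of transformation passes, and the inner level enumerates each pass's parameters, such as tile sizes.

import itertools
from typing import Callable, Dict, List, Sequence, Tuple

def auto_tune(
    program: str,
    pass_sequences: Sequence[Sequence[str]],   # outer level: candidate pass orderings
    param_space: Dict[str, List[dict]],        # inner level: parameter choices per pass
    apply_pass: Callable[[str, str, dict], str],
    measure_latency: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the fastest program found over pass sequences and parameters."""
    best, best_time = program, float("inf")
    for seq in pass_sequences:
        # Exhaustive grid for clarity; a real tuner would prune or sample.
        for params in itertools.product(*(param_space[p] for p in seq)):
            candidate = program
            for pass_name, kwargs in zip(seq, params):
                candidate = apply_pass(candidate, pass_name, kwargs)
            t = measure_latency(candidate)
            if t < best_time:
                best, best_time = candidate, t
    return best, best_time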

Main Results

We evaluate compilation/computation accuracy, comparing QiMeng-Xpiler with state-of-the-art methods across different transcompilation directions. We conclude that:

(1) QiMeng-Xpiler performs the best in all directions, with close to 100% accuracy for compilation and 86.9% to 100% accuracy for computation. This clearly indicates that QiMeng-Xpiler is capable of handling source-to-source code translation tasks on various DLS with minimal human effort, bringing revolutionary advancements to the DLS programming domain.

(2) QiMeng-Xpiler performs better than the SOTA LLM-based methods. Although LLM-based methods achieve high accuracy in certain cases, it is challenging for them to reach 100% accuracy due to the inherent uncertainty of LLMs, so they cannot serve as transcompilers, which demand extremely high accuracy. In contrast, our approach achieves 100% accuracy in most situations, demonstrating its practical applicability as a transcompiler.

(3) QiMeng-Xpiler performs better than the SOTA rule-based methods. For C → CUDA C, QiMeng-Xpiler achieves 100% compilation and 98.2% computation accuracy, ∼50% higher than PPCG. For the easier CUDA C → HIP task, QiMeng-Xpiler successfully converts and executes with 100% accuracy, outperforming HIPIFY, which achieves 85.7%. This result also demonstrates QiMeng-Xpiler's flexibility across various DLS without much adaptation cost, which rule-based methods lack.

Translation Cases

Given a CUDA program, QiMeng-Xpiler generates the translated results for MLU, HIP, and DL Boost shown below.
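
All four kernels implement the same 512×512 matrix multiplication: the CUDA, HIP, and MLU versions on fp16 inputs with fp32 accumulation, and the DL Boost version on int8 inputs with int32 accumulation. As a rough sketch of how computation accuracy can be checked, each kernel's output can be compared against a NumPy reference such as the one below; the tolerance here is an assumption for illustration, not the paper's acceptance criterion.

import numpy as np

M = K = N = 512
A = np.random.rand(M, K).astype(np.float16)
B = np.random.rand(K, N).astype(np.float16)
D_ref = A.astype(np.float32) @ B.astype(np.float32)  # fp32 reference result

def check(D_device: np.ndarray, rtol: float = 1e-2) -> bool:
    """Compare a kernel's 512x512 output against the reference."""
    return np.allclose(D_device.astype(np.float32), D_ref, rtol=rtol)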

CUDA


extern "C" __global__ void matmul(half *A, half *B, float *D)
{
	int ix = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // warp index: each warp computes one 16x16 tile
	int iy = (blockIdx.y * blockDim.y + threadIdx.y);
	
	wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
	wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
	wmma::fragment<wmma::accumulator, 16, 16, 16, float> ab_frag;
	
	wmma::fill_fragment(ab_frag, 0.0f);

	int a_row = ix * 16;
	int b_row = iy * 16;
	for (int k=0; k<512; k+=16) {
		int a_col = k;
		int b_col = k;

		if (a_row < 512 && a_col < 512 && b_row < 512 && b_col < 512) {
			// Load the inputs
			wmma::load_matrix_sync(a_frag, A + a_col + a_row * 512, 512);
			wmma::load_matrix_sync(b_frag, B + b_row + b_col * 512, 512);

			// Perform the matrix multiplication
			wmma::mma_sync(ab_frag, a_frag, b_frag, ab_frag);
		}
	}

	if (a_row < 512 && b_row < 512) {
		// Store the output
		wmma::store_matrix_sync(D + b_row + a_row * N_TOTAL, ab_frag, N_TOTAL, wmma::mem_row_major);
	}
}
                

MLU


extern "C" __mlu_global__ void matmul(half* data, half* filter, float* output) {
  __nram__ half date_block[768];
  __wram__ half filter_block[65536];
  for (int i = 0; i < 64; ++i) {
    for (int j = 0; j < 8; ++j) {
      __memcpy(date_block, data + i * 4096 + j * 512, 1024, GDRAM2NRAM);
      for (int k = 0; k < 4; ++k) {
        __memcpy(filter_block, filter + k * 128, 256, GDRAM2WRAM, 256, 1024, 511);
        __bang_mlp(date_block + 256, date_block, filter_block, 512, 128, 0);
        __memcpy(output i * 4096 + j * 512 + k * 128, date_block + 256, 512, NRAM2GDRAM);
      }
    }
  }
}
                    

HIP


extern "C" __global__ void matmul(half *A, half *B, float *C) {
    using float16x4 = __attribute__((__vector_size__(4 * sizeof(float16_t)))) float16_t;
    using floatx4 = __attribute__((__vector_size__(4 * sizeof(float)))) float;

    const int c_row_base = blockIdx.y * 16;
    const int c_col_base = blockIdx.x * 16;

    floatx4 d = {0.0f};
    for(int k_step = 0; k_step < 512; k_step += 16) {
        float16x4 a, b;
        for(int i = 0; i < 4; ++i) {

            int a_row = c_row_base + threadIdx.x;
            int a_col = k_step + threadIdx.y * 4 + i;
            a[i] = A[a_row * 512 + a_col];

            int b_row = k_step + threadIdx.y * 4 + i;
            int b_col = c_col_base + threadIdx.x;
            b[i] = B[b_row * 512 + b_col];
        }

        // Wavefront-wide MFMA: one 16x16x16 fp16 matrix multiply-accumulate.
        d = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, d, 0, 0, 0);
    }

    for(int i = 0; i < 4; ++i) {
        int c_row = c_row_base + threadIdx.x;
        int c_col = c_col_base + threadIdx.y * 4 + i;
        if(c_row < 512 && c_col < 512) {
            C[c_row * 512 + c_col] = d[i];
        }
    }
}
                    

DL Boost


extern "C"  void matmul(const int8_t* A, const int8_t* B, int32_t* C) {
    #pragma omp parallel for
    for (int i = 0; i < 512; ++i) {
        for (int j = 0; j < 512; j += 16) {
            __m512i acc = _mm512_setzero_si512();
            for (int p = 0; p < 512; ++p) {
                __m512i a_vec = _mm512_set1_epi8(A[i * 512 + p]); 

                __m128i b_vec_128 = _mm_loadu_si128((__m128i*)&B[p * 512 + j]);
                __m512i b_vec_512 = _mm512_cvtepi8_epi32(b_vec_128);

                acc = _mm512_dpbusds_epi32(acc, a_vec, b_vec_512);
            }

            _mm512_storeu_si512((__m512i*)&C[i * 512 + j], acc);
        }
    }
}
                    

BibTeX

      
@article{dong2025qimeng,
  title={QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach},
  author={Dong, Shouyang and Wen, Yuanbo and Bi, Jun and Huang, Di and Guo, Jiaming and Xu, Jianxing and Xu, Ruibai and Song, Xinkai and Hao, Yifan and Zhou, Xuehai and others},
  journal={arXiv preprint arXiv:2505.02146},
  year={2025}
}