A Toolkit for Profiling and Call Graph Analysis for RISC architectures based on Program Execution Traces
Published in 2025 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2025
This paper proposes a novel matrix multiplication optimization for Huawei Ascend NPUs that offloads narrow MatMul computations from the underutilized Cube Unit to the Vector Unit using AscendC instructions. Applied to Multi-head Latent Attention (MLA) inference in DeepSeek-V3, the method achieves a 20% mean performance gain in single-token processing by overlapping execution on the AI Vector (AIV) and AI Cube (AIC) cores.
