Zheng, Ningxin (郑宁馨)
About me
I received a B.S. from Huazhong University of Science and Technology in 2017 and an M.S. from Shanghai Jiao Tong University in 2020, advised by Professors Minyi Guo and Quan Chen. I currently work on ByteDance's AML team, improving the efficiency and scalability of Large Language Model (LLM) training. My research interests span AI systems, with an emphasis on LLM training optimization, model deployment (inference), and sparsity; cloud computing, where I aim to improve resource utilization through job co-location and data center resource management; and model compression.
Research
System Publications
Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, Yuqing Yang, Mao Yang, "Bitter: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation", OSDI24
Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, Lidong Zhou, "PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation", SOSP23
Weihao Cui, Zhenhua Han, Lingji Ouyang, Yichuan Wang, Ningxin Zheng, Lingxiao Ma, Yuqing Yang, Fan Yang, Jilong Xue, Lili Qiu, Lidong Zhou, Quan Chen, Haisheng Tan, Minyi Guo, "Optimizing Dynamic Neural Networks with Brainstorm", OSDI23
Lei Wang, Lingxiao Ma, Shijie Cao, Ningxin Zheng, Quanlu Zhang, Jilong Xue, Ziming Miao, Ting Cao, Yuqing Yang, "LADDER: Efficient Tensor Compilation on Customized Data Format", OSDI23 POSTER Session
Bin Lin, Ningxin Zheng, Shijie Cao, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang, "Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning", MLSys23, Co-first Author [code]
Ningxin Zheng, Bin Lin, Quanlu Zhang, Lingxiao Ma, Yuqing Yang, Fan Yang, Yang Wang, Mao Yang, Lidong Zhou, "SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute", OSDI22 [pdf][code]
Wei Zhang, Quan Chen, Kaihua Fu, Ningxin Zheng, Zhiyi Huang, Jingwen Leng, Minyi Guo, "Astraea: towards QoS-aware and resource-efficient multi-stage GPU services", ASPLOS22, [pdf]
Kaihua Fu, Jiuchen Shi, Quan Chen, Ningxin Zheng, Wei Zhang, Deze Zeng, Minyi Guo, "QoS-Aware Irregular Collaborative Inference for Improving Throughput of DNN Services", SC22, [pdf]
Wei Zhang, Kaihua Fu, Ningxin Zheng, Quan Chen, Chao Li, Wenli Zheng, Minyi Guo, "CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters", ICCD21, [pdf]
Weihao Cui, Han Zhao, Quan Chen, Ningxin Zheng, Jingwen Leng, Jieru Zhao, Zhuo Song, Tao Ma, Yong Yang, Chao Li, Minyi Guo, "Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction.", SC21, [pdf]
Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, Yunxin Liu, "nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices", MobiSys21, Best Paper Award & SigMobile Research Highlight [pdf][code]
Wei Zhang, Quan Chen, Ningxin Zheng, Weihao Cui, Kaihua Fu, Minyi Guo, "Towards QoS-awareness and Improved Utilization of Spatial Multitasking GPUs", TC21, [pdf]
Wei Zhang, Ningxin Zheng, Quan Chen, Yong Yang, Zhuo Song, Tao Ma, Jingwen Leng, Minyi Guo, "URSA: Precise Capacity Planning and Fair Scheduling based on Low-level Statistics for Public Clouds", ICPP20, Co-first Author [pdf]
Ningxin Zheng, Quan Chen, Yong Yang, Jin Li, Wenli Zheng, Minyi Guo, "POSTER: Precise Capacity Planning for Database Public Clouds", PACT19
Ningxin Zheng, Quan Chen, Chen Chen, Minyi Guo, "CLIBE: Precise Cluster-Level I/O Bandwidth Enforcement in Distributed File System", HPCC18
Algorithm Publications
Li Lyna Zhang, Xudong Wang, Jiahang Xu, Quanlu Zhang, Yujing Wang, Yuqing Yang, Ningxin Zheng, Ting Cao, Mao Yang, "SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference", ICCV23 [pdf]
Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan, "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR23 [pdf]
Jun Xiao, Xinyang Jiang, Ningxin Zheng, Huan Yang, Yifan Yang, Yuqing Yang, Dongsheng Li, Kin-Man Lam, "Online Video Super-Resolution with Convolutional Kernel Bypass Graft", IEEE Transactions on Multimedia, 2022 [pdf]
A full list of publications is available on Google Scholar.
Projects
NNI
NNI is a popular deep learning toolkit (over 10k GitHub stars) covering Neural Architecture Search (NAS), model compression, hyperparameter tuning, and feature engineering. As DNN models grow larger, they inevitably become sparse, and model compression is an essential step before deployment. As a core contributor, I designed and developed the automatic deployment pipeline for compressed models (the "Speedup" module in NNI). Speedup infers the sparsity of the whole model and automatically generates a correspondingly optimized, faster model, simplifying the deployment process.
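The core idea behind Speedup, that pruning one layer's outputs shrinks downstream layers too, can be illustrated with a minimal sketch. This is not NNI's actual API; `propagate_channel_pruning` and its inputs are hypothetical, and the real module performs a far more general analysis over DNN computation graphs.

```python
def propagate_channel_pruning(shapes, kept_out_channels):
    """Toy sparsity-propagation pass.

    shapes: [(in_ch, out_ch)] for each layer in a simple chain model.
    kept_out_channels: surviving output channels per layer after pruning.
    Returns the shrunken shape of every layer once sparsity is propagated:
    each layer's new input width equals the previous layer's surviving outputs.
    """
    new_shapes = []
    kept_in = shapes[0][0]  # the model's input width is untouched
    for (_, _), kept_out in zip(shapes, kept_out_channels):
        new_shapes.append((kept_in, kept_out))
        kept_in = kept_out  # downstream inputs shrink to match
    return new_shapes

# A 3-layer MLP where 4 of the first layer's 16 output channels are pruned:
shrunk = propagate_channel_pruning([(8, 16), (16, 32), (32, 10)], [12, 32, 10])
```

Here pruning layer 1 from 16 to 12 outputs automatically narrows layer 2's input from 16 to 12, which is exactly the kind of whole-model inference that lets a compressed model actually run faster rather than merely carrying zeroed weights.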
SparTA
SparTA is an extensible sparsity framework built on PyTorch that supports a wide range of sparsity scenarios. It is the open-source implementation of our OSDI paper (SparTA). It provides easy-to-use sparse modules that fit many scenarios, such as large-model training and sparse-model inference. Compared with other sparse libraries, SparTA achieves better performance and covers more application scenarios.
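The "tensor with sparsity attribute" idea can be sketched in plain Python: pair a matrix with a block mask and let the kernel skip blocks the attribute marks as zero. This is purely illustrative (not SparTA's real API, which generates specialized GPU kernels), but it shows the computation-skipping that the sparsity attribute enables.

```python
BS = 2  # block size of the sparsity attribute

def block_sparse_matvec(mat, block_mask, vec):
    """Multiply a matrix by a vector, skipping blocks marked zero.

    mat: dense n x n matrix (list of lists).
    block_mask: (n//BS) x (n//BS) grid; 0 means the block is all-zero.
    """
    n = len(mat)
    out = [0.0] * n
    for bi, row_blocks in enumerate(block_mask):
        for bj, nonzero in enumerate(row_blocks):
            if not nonzero:
                continue  # sparsity attribute: skip the zero block entirely
            for i in range(bi * BS, (bi + 1) * BS):
                for j in range(bj * BS, (bj + 1) * BS):
                    out[i] += mat[i][j] * vec[j]
    return out

# Block-diagonal example: the off-diagonal 2x2 blocks are zero and skipped.
mat = [[1, 2, 0, 0],
       [3, 4, 0, 0],
       [0, 0, 5, 6],
       [0, 0, 7, 8]]
mask = [[1, 0],
        [0, 1]]
result = block_sparse_matvec(mat, mask, [1, 1, 1, 1])
```

With half the blocks masked out, the kernel touches half the elements; SparTA's compiler exploits the same attribute to emit kernels that never load the pruned weights at all.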
Performance Optimization for a High-Frequency Trading System
The China Foreign Exchange Trade System (CFETS) receives a large number of transaction requests every second and therefore has extremely strict performance requirements. Constrained by the complex transaction logic, it is difficult to improve throughput through task parallelism. To improve performance, we profiled the system's bottlenecks with "perf", split the transaction logic into three stages, and ran them as a parallel pipeline. The end-to-end throughput improved by around 30%.
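The pipelining idea can be sketched as follows: one serial handler is split into stages connected by queues, so different transactions occupy different stages at the same time. The stage functions here are hypothetical stand-ins (the real system's stages and implementation language differ); the sketch only shows the structure.

```python
import queue
import threading

def stage(fn, q_in, q_out):
    """Run one pipeline stage: pull items, process, push downstream."""
    while True:
        item = q_in.get()
        if item is None:            # sentinel: shut down and pass it on
            if q_out is not None:
                q_out.put(None)
            break
        if q_out is not None:
            q_out.put(fn(item))

parse = lambda tx: tx.strip()       # stage 1: parse the raw request
match = lambda tx: tx.upper()       # stage 2: stand-in for order matching
q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()

threads = [
    threading.Thread(target=stage, args=(parse, q1, q2)),
    threading.Thread(target=stage, args=(match, q2, q3)),
]
for t in threads:
    t.start()
for tx in [" buy ", " sell "]:      # requests enter the pipeline in order
    q1.put(tx)
q1.put(None)                        # sentinel drains the whole pipeline
for t in threads:
    t.join()

results = []                        # stage 3: "settlement" collects outputs
while (r := q3.get()) is not None:
    results.append(r)
```

FIFO queues with one thread per stage preserve transaction order, which matters for exchange semantics; throughput improves because stage 1 can start on the next request while stage 2 is still matching the previous one.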
CLIBE
Patent
Precise Capacity Planning and Fair Scheduling based on Low-level Statistics for Public Clouds, Alibaba
SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute, Microsoft
Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
Education
Master's degree, Computer Science and Technology, Shanghai Jiao Tong University, 09.2017~03.2020
Bachelor's degree, Computer Science and Technology, Huazhong University of Science and Technology, 09.2013~06.2017
Competitions and Awards
MobiSys Best Paper Award, 2021
SigMobile Research Highlight, 2021
Outstanding Graduate of Shanghai Jiao Tong University, 2020
DongShi DongFang Scholarship, Shanghai Jiao Tong University, 2019
Bronze Medal, Intel Parallel Performance Optimization Competition, 2017
First-Class Scholarship of Shanghai Jiao Tong University, 2017
Outstanding Student of Huazhong University of Science and Technology, 2015
Work experience
ByteDance AML
Microsoft Research
Alibaba Cloud