Weili Xu

Seeking research opportunities in MLSys
Check out my CV here

I am a third-year undergraduate in Computer Engineering, currently pursuing a dual degree from the University of Illinois Urbana-Champaign and Zhejiang University.

I previously worked with Wenhao Chai and Enxin Song on efficient long video understanding. We built AuroraLong, a hybrid MLLM that efficiently handles hour-long videos on a single consumer GPU while achieving performance comparable to its Transformer counterparts on multiple video understanding benchmarks, such as MLVU, MovieChat-1k, and VDC.

I’m interested in various aspects of machine learning and computer systems:

Efficient sequence modeling algorithms with hardware awareness
Kernel and runtime optimization for agentic systems
Applications of multi-modal (text, video, audio, etc.) long-context modeling

news

Oct 20, 2025	Video-MMLU received the Outstanding Paper Award at the ICCV 2025 Workshop on Knowledge-Intensive Multimodal Reasoning!
Jul 11, 2025	One paper accepted by ICCV 2025 Findings
Jun 25, 2025	One paper accepted by ICCV 2025, see you in Hawaii!
Mar 31, 2025	One paper accepted by the second CVPR workshop on Efficient Large Vision Models

selected publications

AuroraLong
Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu, Enxin Song, Wenhao Chai, and 3 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2025

The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merging with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use a linear-RNN-based LLM backbone in a LLaVA-like model for open-ended video understanding.
@inproceedings{xu2025auroralong,
  title = {Bringing RNNs Back to Efficient Open-Ended Video Understanding},
  author = {Xu, Weili and Song, Enxin and Chai, Wenhao and Wen, Xuexiang and Ye, Tian and Wang, Gaoang},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = oct,
  year = {2025},
  pages = {23453-23465},
}

Video-MMLU
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

Enxin Song, Wenhao Chai, Weili Xu, and 3 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2025

Recent advancements in language multimodal models (LMMs) for video have demonstrated their potential for understanding video content, yet the task of comprehending multi-discipline lectures remains largely unexplored. We introduce Video-MMLU, a massive benchmark designed to evaluate the capabilities of LMMs in understanding multi-discipline lectures. We evaluate over 90 open-source and proprietary models, ranging from 0.5B to 40B parameters. Our results highlight the limitations of current models in addressing the cognitive challenges presented by these lectures, especially in tasks requiring both perception and reasoning. Additionally, we explore how the number of visual tokens and the choice of large language model influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.
@inproceedings{song2025videommlu,
  author = {Song, Enxin and Chai, Wenhao and Xu, Weili and Xie, Jianwen and Liu, Yuxuan and Wang, Gaoang},
  title = {Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month = oct,
  year = {2025},
  pages = {6099-6113},
}