About Zhiheng
Hi, I’m Zhiheng. Welcome to my personal website! I’m currently a second-year Master’s student at the University of Waterloo, supervised by Professor Wenhu Chen at the TIGER Lab. My research focuses on AI for Software Engineering, particularly in agentic post-training, benchmarks, and causal reasoning.
I did my undergrad in Computer Science at the University of Hong Kong, where I was active in algorithm competitions: I competed at the ICPC World Finals and won two regional gold medals. I’ve had the privilege to work with research groups at Berkeley, ETH Zürich, and the University of Michigan.
Currently, I’m focused on post-training for large models using RL-based methods. I’m a core contributor to the open-source framework VerlTool. I’ve also worked as a Research Scientist Intern at MiniMax on software engineering tasks.
Research Areas
Agentic Post-Training
I’m deeply involved in the full pipeline of post-training for software engineering agents. As a core contributor to VerlTool, I develop environment interaction modules and post-training setups for SWE tasks. My work on MiniMax-M2 achieved 69% Pass@1 on SWE-Verified and ranked #2 on MultiSWE and TerminalBench. I’ve designed large-scale SWE data synthesis pipelines generating over 36K verifiable tasks from 5K+ sandbox environments.
I’m also working on BrowserAgent, which focuses on information-seeking tasks through direct browser environment interaction, moving beyond traditional tool-based approaches to enable more natural web navigation and information extraction.
Benchmarks & Evaluation
I believe that as models get stronger, the definition of tasks becomes increasingly important. My benchmark work spans three approaches:
- Synthesis: Converting existing data (PixelWorld converts textual reasoning to images, Corr2Cause generates causal reasoning problems)
- Human-in-the-loop: Developing repo-level QA benchmarks with crowdsourced annotation and validation
- Structural data: Building benchmarks from web pages and GitHub repositories
I’ve contributed to StructEval for structured output evaluation and VideoScore for video generation assessment.
Causal Reasoning & Knowledge Methods
My work explores lightweight ways to enhance LLM capabilities without retraining. At Berkeley, I developed FactTrack for time-aware world state tracking in story outlines, decomposing complex narratives into atomic facts for contradiction detection.
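The atomic-fact idea can be illustrated with a toy sketch (my own simplification for this page, not FactTrack's actual algorithm): once an outline is decomposed into time-stamped (entity, attribute, value) facts, one simple notion of contradiction is two facts assigning conflicting values to the same attribute at the same story time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicFact:
    """A single (entity, attribute, value) claim at a given point in the outline."""
    entity: str
    attribute: str
    value: str
    time: int  # position in the outline where the fact is asserted

def find_contradictions(facts):
    """Flag fact pairs that assign different values to the same entity
    attribute at the same story time (a deliberately simplified check)."""
    seen = {}
    contradictions = []
    for fact in facts:
        key = (fact.entity, fact.attribute, fact.time)
        prev = seen.get(key)
        if prev is not None and prev.value != fact.value:
            contradictions.append((prev, fact))
        else:
            seen[key] = fact
    return contradictions
```

For example, asserting that a character is in London and in Paris at the same outline position is flagged, while a location that changes between positions is treated as an ordinary state update.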
I’ve investigated how large language models understand causal relations through Psychologically-Inspired Causal Prompts, exploring different psychological processes in sentiment classification. The Corr2Cause dataset tests pure causal inference skills of LLMs.
Current Focus
I’m particularly interested in AI for Software Engineering because it combines structural data that’s easy to synthesize, real-world relevance with immediate impact, and strong economic value. My research explores decomposing SWE tasks into skill-specific components: debugging, performance optimization, refactoring, test generation, repository-level QA, and security.
For detailed future research directions, see my Research Statements page. My complete background is in my CV.
Publications
PixelWorld: Towards Perceiving Everything as Pixels
Published in TMLR, 2025
Converting textual reasoning data into images to probe vision-language model reasoning capabilities
Recommended citation: Lyu, Z., Ma, X., & Chen, W. (2025). PixelWorld: Towards Perceiving Everything as Pixels. Transactions on Machine Learning Research. https://arxiv.org/abs/placeholder
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Technical report, 2025
Technical report on MiniMax-M1 model with focus on software engineering capabilities and test-time compute scaling
Recommended citation: MiniMax. (2025). MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. Technical Report. https://arxiv.org/abs/placeholder
FACTTRACK: Time-Aware World State Tracking in Story Outlines
Published at NAACL 2025 (Oral)
A novel approach to tracking dynamic world states and detecting contradictions in story narratives
Recommended citation: Lyu, Z., Yang, K., Kong, L., & Klein, D. (2025). FACTTRACK: Time-Aware World State Tracking in Story Outlines. NAACL 2025. https://arxiv.org/abs/placeholder
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
Under review, 2025
A framework for training web agents that directly interact with browser environments for information-seeking tasks
Recommended citation: Yu, T., Zhang, Z., Lyu, Z., Gong, J., Yi, H., Wang, X., Zhou, Y., Yang, J., Nie, P., Huang, Y., & Chen, W. (2025). BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions. Under Review. https://arxiv.org/abs/placeholder
Can Large Language Models Infer Causation from Correlation?
arXiv preprint, 2023
This research introduces the first benchmark dataset, Corr2Cause, to test the pure causal inference skills of large language models (LLMs).
Recommended citation: Jin, Z., Liu, J., Lyu, Z., et al. (2023). Can Large Language Models Infer Causation from Correlation? arXiv preprint arXiv:2306.05836. https://arxiv.org/abs/2306.05836
Can Large Language Models Distinguish Cause from Effect?
Published in 2023
Our paper conducts a post-hoc analysis to check whether large language models can be used to distinguish cause from effect.
Psychologically-Inspired Causal Prompts
arXiv preprint, 2023
This paper proposes prompting methods that embed causal direction, inspired by different psychological processes, and analyzes the resulting performance gaps across LLMs.
Recommended citation: Lyu, Z., Jin, Z., Mattern, J., Mihalcea, R., Sachan, M., & Schoelkopf, B. (2023). Psychologically-Inspired Causal Prompts. arXiv preprint arXiv:2305.01764. https://arxiv.org/pdf/2305.01764
Logical Fallacy Detection
arXiv preprint, 2022
This paper introduces a dataset for logical fallacy detection, together with a baseline model.
Recommended citation: Jin, Z., Lalwani, A., Vaidhya, T., Shen, X., Ding, Y., Lyu, Z., Sachan, M., Mihalcea, R., & Schölkopf, B. (2022). Logical Fallacy Detection. https://arxiv.org/abs/2202.13758
Contact
I’m currently seeking opportunities in industry related to AI for Software Engineering. If you have relevant positions or can provide recommendations, I would greatly appreciate it.
Feel free to reach out at z63lyu@uwaterloo.ca for research collaboration, open source projects, job opportunities, or mentorship.
