PixelWorld: Towards Perceiving Everything as Pixels

Published in TMLR, 2025

Recommended citation: Lyu, Z., Ma, X., & Chen, W. (2025). PixelWorld: Towards Perceiving Everything as Pixels. Transactions on Machine Learning Research. https://arxiv.org/abs/placeholder

PixelWorld investigates how vision-language models (VLMs) reason across modalities by converting textual reasoning data into visual representations. We develop benchmarks and visualizations to analyze attention patterns and to interpret VLM behavior on pixel-based reasoning tasks.

This work explores how VLMs handle structured reasoning when information is presented visually rather than textually, providing insights into cross-modal reasoning capabilities and limitations.
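As a rough illustration of what "converting textual reasoning data into visual representations" can mean, the sketch below rasterizes a short string into a binary pixel grid and serializes it as a plain-text PBM image. It is not the paper's rendering pipeline: the 3x5 bitmap font, the `render_text` and `to_pbm` helpers, and the padding scheme are all hypothetical and chosen only to keep the example dependency-free.

```python
# Hypothetical 3x5 bitmap font; it covers only the glyphs used in this demo.
FONT = {
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "5": ["111", "100", "111", "001", "111"],
    "+": ["000", "010", "111", "010", "000"],
    "=": ["000", "111", "000", "111", "000"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_W, GLYPH_H = 3, 5


def render_text(text, pad=1):
    """Return a 2D list of 0/1 pixels (1 = ink) for a single line of text."""
    width = pad + len(text) * (GLYPH_W + pad)
    height = GLYPH_H + 2 * pad
    grid = [[0] * width for _ in range(height)]
    for i, ch in enumerate(text):
        glyph = FONT[ch]  # raises KeyError for characters outside the toy font
        x0 = pad + i * (GLYPH_W + pad)
        for dy, row in enumerate(glyph):
            for dx, bit in enumerate(row):
                grid[pad + dy][x0 + dx] = int(bit)
    return grid


def to_pbm(grid):
    """Serialize the grid as a plain-text PBM (P1) image, where 1 is black."""
    header = f"P1\n{len(grid[0])} {len(grid)}\n"
    body = "\n".join(" ".join(str(p) for p in row) for row in grid)
    return header + body + "\n"


# The same "question" a text model would read as tokens, now as pixels.
image = render_text("2+3=5")
print(to_pbm(image))
```

Saving the printed output to a `.pbm` file gives an image that most viewers can open. A realistic setup would presumably use a proper font renderer and higher resolution, but the idea is the same: the model receives only the pixel grid, never the original string.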


BibTeX:

@article{lyu2025pixelworld,
  title={PixelWorld: Towards Perceiving Everything as Pixels},
  author={Lyu, Zhiheng and Ma, Xiangtao and Chen, Wenhu},
  journal={Transactions on Machine Learning Research},
  year={2025}
}