PixelWorld: Towards Perceiving Everything as Pixels
Published in TMLR, 2025
Recommended citation: Lyu, Z., Ma, X., & Chen, W. (2025). PixelWorld: Towards Perceiving Everything as Pixels. Transactions on Machine Learning Research. https://arxiv.org/abs/placeholder
PixelWorld investigates how vision-language models (VLMs) reason across modalities by converting textual reasoning data into visual representations. We develop benchmarks and visualizations to analyze VLM attention patterns and interpretability on pixel-based reasoning tasks.
This work explores how VLMs handle structured reasoning when information is presented visually rather than textually, providing insights into cross-modal reasoning capabilities and limitations.
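As a rough illustration of what "converting textual reasoning data into visual representations" can look like, the sketch below renders a text prompt onto a white canvas with the Pillow library. It is not the paper's rendering pipeline: the function name `render_text_as_image`, the canvas width, the margins, the wrapping width, and the default bitmap font are all assumptions chosen for the example.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_as_image(text, width=448, margin=8, line_height=16, chars_per_line=60):
    """Draw `text` as black characters on a white RGB canvas.

    Hypothetical helper for illustration only; all layout parameters are
    assumptions, not values taken from the PixelWorld paper.
    """
    font = ImageFont.load_default()  # Pillow's built-in fallback font
    # Wrap the text so long prompts span several lines instead of overflowing.
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    # Canvas height grows with the number of wrapped lines.
    height = 2 * margin + line_height * len(lines)
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return image


if __name__ == "__main__":
    question = "If x + 3 = 7, what is x?"
    picture = render_text_as_image(question)
    # The saved image, rather than the string, would be the model input.
    picture.save("question.png")
```

Under this setup, the same question can be given to a VLM once as a string and once as `question.png`, which is the kind of text-versus-pixel comparison the summary above describes.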
BibTeX:
@article{lyu2025pixelworld,
  title={PixelWorld: Towards Perceiving Everything as Pixels},
  author={Lyu, Zhiheng and Ma, Xiangtao and Chen, Wenhu},
  journal={Transactions on Machine Learning Research},
  year={2025}
}
