PhotoFramer: Multi-modal Image Composition Instruction

1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2Multimedia Laboratory, The Chinese University of Hong Kong
3Adobe NextCam 4Adobe Research 5INSAIT, Sofia University "St. Kliment Ohridski"
6Shanghai AI Laboratory 7CPII under InnoHK 8Shenzhen University of Advanced Technology
Abstract

Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets and train a degradation model that transforms well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos, synthesizing poorly composed counterparts that form training pairs. Using this dataset, we fine-tune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users.

Task Paradigm

Given a poorly composed image, PhotoFramer generates textual guidance describing how to improve the composition, together with an example image depicting what a well-composed result looks like. Motivated by three key photography factors (vantage point, focal choice, and subject placement), PhotoFramer comprises three tasks: (a) Shift: adjust the framing to place the subject properly and remove border distractions; (b) Zoom-in: select a tighter crop (simulating a longer focal length) that yields a stronger composition; (c) View-change: choose a new vantage point or camera pose to reframe the scene.
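Since shift and zoom-in pairs come from cropping datasets, one can illustrate how such a pair might be derived: the expert crop serves as the well-composed target, and a perturbed crop window plays the poorly composed input. The sketch below is only a hypothetical illustration of this idea; `Box`, `make_shift_pair`, and `make_zoom_pair` are invented names, not the paper's actual data pipeline.

```python
from dataclasses import dataclass
import random

@dataclass
class Box:
    x: int
    y: int
    w: int
    h: int

def make_shift_pair(expert: Box, img_w: int, img_h: int,
                    max_frac: float = 0.3, rng=None):
    """Derive a (poor framing, expert framing, instruction) triple for the
    shift task by offsetting the expert crop window inside the image."""
    rng = rng or random.Random(0)
    dx = int(expert.w * rng.uniform(0.1, max_frac)) * rng.choice([-1, 1])
    dy = int(expert.h * rng.uniform(0.1, max_frac)) * rng.choice([-1, 1])
    poor = Box(
        min(max(expert.x + dx, 0), img_w - expert.w),
        min(max(expert.y + dy, 0), img_h - expert.h),
        expert.w, expert.h,
    )
    # The instruction points from the poor framing back toward the expert crop.
    horiz = "left" if poor.x > expert.x else "right"
    vert = "up" if poor.y > expert.y else "down"
    return poor, expert, f"Shift the framing {horiz} and {vert}."

def make_zoom_pair(expert: Box, scale: float = 1.5):
    """Derive a zoom-in pair: the poor framing is a looser window around the
    expert crop, so the instruction is simply to zoom in."""
    grow_w = int(expert.w * (scale - 1))
    grow_h = int(expert.h * (scale - 1))
    poor = Box(expert.x - grow_w // 2, expert.y - grow_h // 2,
               expert.w + grow_w, expert.h + grow_h)
    return poor, expert, "Zoom in for a tighter composition."
```

In this toy setup the instruction text is templated from the geometric relation between the two boxes, mirroring how paired crops can supply both the example image and the textual guidance.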

Results

We propose PhotoFramer, a model designed for composition instruction during photo capture. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates an example image that follows the described suggestions. The photo-taker can follow the textual guidance and the example image to capture a better-composed photo.


BibTeX

@article{photoframer,
    title={PhotoFramer: Multi-modal Image Composition Instruction},
    author={You, Zhiyuan and Wang, Ke and Zhang, He and Cai, Xin and Gu, Jinjin and Xue, Tianfan and Dong, Chao and Zhang, Zhoutong},
    journal={arXiv preprint},
    year={2025}
}