Text2Earth

Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model

Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, Zhenwei Shi (Beihang University)

Overview

Generative foundation models have significantly advanced large-scale text-driven natural image generation and have become a prominent research trend across vertical domains. In remote sensing, however, large-scale text-to-image (text2image) generation remains underexplored: existing remote sensing image-text datasets are small and confined to specific geographic areas and scene types, and existing text2image methods struggle to achieve global-scale, multi-resolution controllable, and unbounded image generation. To address these challenges, this paper presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image-text dataset comprising 10 million image-text pairs, five times larger than the previous largest dataset. It contains essential resolution information and covers a wide range of geographic scenes, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3-billion-parameter generative foundation model based on the diffusion framework for modeling global-scale remote sensing scenes. Text2Earth integrates a resolution control mechanism that lets users specify the target image resolution, and a dynamic condition adaptation strategy for training and inference that improves generation quality. Text2Earth not only excels in resolution-controllable text2image generation but also generalizes robustly across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation, capabilities that significantly surpass previous models restricted to fixed-size text2image generation with limited scene types. On the previous text2image benchmark, Text2Earth outperforms prior models, improving FID by 26.23 and Zero-shot OA by 20.95%.

Git-10M: Global-Scale Image-Text Dataset

The Git-10M dataset is a global-scale remote sensing image-text pair dataset, consisting of 10 million image-text pairs with geographical locations and resolution information.

[Figure: Git-10M dataset pipeline]
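
For concreteness, each Git-10M sample pairs an image with a caption plus location and resolution metadata. Below is a minimal sketch of one plausible record layout; the field names are illustrative assumptions, not the dataset's actual schema.

```python
# A minimal sketch of what a Git-10M record might look like; the field
# names are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class GitSample:
    image_path: str   # path to the remote sensing image tile
    caption: str      # natural-language description of the scene
    latitude: float   # geographic location of the tile
    longitude: float
    gsd_m: float      # spatial resolution (ground sample distance), meters/pixel

sample = GitSample(
    image_path="tiles/000001.png",
    caption="A coastal harbor with docked cargo ships and container yards.",
    latitude=31.23,
    longitude=121.49,
    gsd_m=0.5,
)
print(f"{sample.caption} ({sample.gsd_m} m/pixel)")
```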

Text2Earth Foundation Model

Building on the Git-10M dataset, we developed Text2Earth, a 1.3 billion parameter generative foundation model. Text2Earth excels in resolution-controllable text2image generation and demonstrates robust generalization and flexibility across multiple tasks.
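
The paper's exact conditioning design is not reproduced here, but a common way to make a diffusion model resolution-aware is to embed the target ground sample distance alongside the timestep. The sketch below illustrates that idea under this assumption; it is not Text2Earth's actual code.

```python
# Sketch of one common way to make a diffusion model resolution-aware:
# embed the target ground sample distance (GSD) like a timestep and add it
# to the timestep embedding. This is an illustrative assumption, not
# Text2Earth's actual conditioning code.
import math
import torch

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal embedding of a scalar condition."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / (half - 1))
    args = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

timesteps = torch.tensor([500, 500])   # diffusion timestep per sample
gsd = torch.tensor([0.5, 8.0])         # requested resolution, meters/pixel
# Log-scale the GSD so sub-meter and coarse scales are spread evenly.
cond = sinusoidal_embedding(timesteps) + sinusoidal_embedding(100 * torch.log(gsd))
print(cond.shape)  # torch.Size([2, 256]): one joint conditioning vector each
```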

Zero-Shot Text2Image Generation

Text2Earth can generate specific image content from free-form user text, without scene-specific fine-tuning or retraining.
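
As a hypothetical usage sketch (the class and argument names below are placeholders, not the released API), zero-shot generation reduces to a single call with a prompt and a requested resolution:

```python
# Hypothetical usage sketch: `Text2EarthPipeline`, `from_pretrained`, and
# `resolution_m` are placeholder names, not the project's released API.
class Text2EarthPipeline:
    @classmethod
    def from_pretrained(cls, name: str) -> "Text2EarthPipeline":
        print(f"(stub) loading weights for {name}")
        return cls()  # the real loader would restore the 1.3B diffusion model

    def __call__(self, prompt: str, resolution_m: float) -> str:
        # The real pipeline would encode the prompt and the resolution
        # condition, then run the diffusion sampler; this stub just echoes.
        return f"generate '{prompt}' at {resolution_m} m/pixel"

pipe = Text2EarthPipeline.from_pretrained("Text2Earth")
print(pipe("An airport with two parallel runways surrounded by farmland", 1.0))
```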

On the RSICD benchmark, Text2Earth surpasses previous models, improving FID by 26.23 and Zero-shot OA by 20.95%.

[Figure: text2image results on the RSICD benchmark]

Unbounded Remote Sensing Scene Construction

Using Text2Earth, users can seamlessly generate remote sensing images of unbounded extent on a canvas, overcoming the fixed-size limitations of traditional generative models. Text2Earth's resolution controllability is key to maintaining visual coherence across the generated scene as it expands.
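
Conceptually, this can be implemented as iterative outpainting: each new tile is generated conditioned on an overlapping strip of the existing canvas, which keeps adjacent tiles coherent. The sketch below, under that assumption, shows the expansion loop in one direction; `outpaint` stands in for a mask-conditioned diffusion call and is stubbed with noise.

```python
# Sketch of unbounded scene construction as iterative outpainting; the
# `outpaint` function is a stand-in for a mask-conditioned diffusion call.
import numpy as np

TILE, OVERLAP = 512, 128
rng = np.random.default_rng(0)

def outpaint(context: np.ndarray, prompt: str, gsd_m: float) -> np.ndarray:
    """Placeholder for a diffusion call that fills a TILE x TILE patch
    whose leftmost OVERLAP columns are fixed to `context`."""
    tile = rng.standard_normal((TILE, TILE, 3))
    tile[:, :OVERLAP] = context  # the known strip is kept as-is
    return tile

def extend_right(canvas: np.ndarray, prompt: str, gsd_m: float) -> np.ndarray:
    context = canvas[:, -OVERLAP:]               # rightmost strip as context
    new_tile = outpaint(context, prompt, gsd_m)
    # Append only the newly generated columns to the canvas.
    return np.concatenate([canvas, new_tile[:, OVERLAP:]], axis=1)

canvas = rng.standard_normal((TILE, TILE, 3))
canvas = extend_right(canvas, "coastal farmland with irrigation canals", 1.0)
print(canvas.shape)  # (512, 896, 3): grew by TILE - OVERLAP columns
```

Growing the canvas in the other directions follows the same pattern, with the context strip taken from the corresponding edge.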

Remote Sensing Image Editing

Text2Earth can perform scene modifications based on user-provided text, such as replacing or removing geographic features, while ensuring that the edits blend seamlessly with the surrounding areas, maintaining continuity and coherence.
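
One standard way to realize such edits with a diffusion model is masked inpainting: the masked region is regenerated from the edit prompt while pixels outside the mask are re-injected at every denoising step, which is what keeps an edit consistent with its surroundings. The sketch below assumes this RePaint-style scheme; `denoise_step` is a no-op stand-in for the model's sampler.

```python
# Sketch of text-driven editing as masked inpainting (RePaint-style);
# `denoise_step` is a no-op stand-in for the real sampler.
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x: np.ndarray, t: int, prompt: str) -> np.ndarray:
    """Stand-in for one reverse-diffusion step conditioned on `prompt`."""
    return x  # the real model would denoise x here

def noise_to_level(img: np.ndarray, t: int, steps: int) -> np.ndarray:
    """Re-noise the known pixels to match the current noise level."""
    a = t / steps
    return (1 - a) * img + a * rng.standard_normal(img.shape)

def edit(image: np.ndarray, mask: np.ndarray, prompt: str, steps: int = 50) -> np.ndarray:
    x = rng.standard_normal(image.shape)           # start from pure noise
    for t in reversed(range(steps)):
        x = denoise_step(x, t, prompt)             # model fills the masked region
        x = np.where(mask, x, noise_to_level(image, t, steps))  # keep the rest
    return x

image = rng.standard_normal((64, 64, 3))
mask = np.zeros_like(image, dtype=bool)
mask[16:48, 16:48] = True                          # True = region to edit
edited = edit(image, mask, "remove the building and replace it with grass")
```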

Cross-Modal Image Generation

Text2Earth supports text-driven multi-modal image generation, producing RGB, SAR, NIR, and PAN images.

[Figure: text-driven multi-modal generation examples]

Text2Earth also shows potential for image-to-image translation, covering cross-modal translation and image enhancement, such as PAN to RGB (PAN2RGB), NIR to RGB (NIR2RGB), PAN to NIR (PAN2NIR), super-resolution, and image dehazing.

[Figure: cross-modal translation and enhancement examples]
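
One plausible way to cast such translations as conditional generation is SDEdit-style: partially noise the source modality so the scene layout is preserved, then denoise toward the target modality named in the prompt. The function names below are placeholders, not the released model's API.

```python
# Sketch of cross-modal translation cast as image-conditioned generation
# (SDEdit-style); all function names here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def denoise_from(x: np.ndarray, start_t: int, prompt: str) -> np.ndarray:
    """Stand-in for running the reverse diffusion chain from step `start_t`."""
    return x  # the real model would denoise x here

def pan2rgb(pan: np.ndarray, strength: float = 0.6) -> np.ndarray:
    # Mix the panchromatic input with noise; higher strength = freer output.
    noisy = (1 - strength) * pan + strength * rng.standard_normal(pan.shape)
    start_t = int(strength * 1000)  # assuming a 1000-step noise schedule
    return denoise_from(noisy, start_t, "RGB satellite image of the same scene")

rgb = pan2rgb(rng.standard_normal((256, 256)))
```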

Citation

@misc{liu2025text2earthunlockingtextdrivenremote,
      title={Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model},
      author={Chenyang Liu and Keyan Chen and Rui Zhao and Zhengxia Zou and Zhenwei Shi},
      year={2025},
      eprint={2501.00895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.00895},
}