Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16\times16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.
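To make semantic descriptors concrete, the sketch below serializes a small grid of per-patch class labels into one flat text sequence, which is what the MLLM is trained to generate. The grid size, label vocabulary, and separator token here are illustrative assumptions, not the exact format used in the paper.

```python
def to_semantic_descriptors(patch_labels):
    """Serialize a 2D grid of per-patch text labels into one row-major
    token sequence. The '|' separator is an illustrative assumption."""
    return " | ".join(label for row in patch_labels for label in row)

# Toy 4x4 grid standing in for the paper's 16x16 descriptors:
# a small "dog" region on an "others" background.
grid = [["others"] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        grid[r][c] = "dog"

print(to_semantic_descriptors(grid))
# others | others | others | others | others | dog | dog | others | ...
```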
Different paradigms of MLLM-based image segmentation: (a) the embeddings-as-mask paradigm, which relies on an additional segmentation decoder and loss (e.g., LISA); (b) polygon coordinates for instance segmentation (e.g., VisionLLM); (c) our text-as-mask paradigm, which relies on semantically consistent text sequences.
Without compromising performance, R-RLE reduces the length of semantic descriptors by 74% and speeds up inference by 3× on average.
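The sketch below illustrates the idea behind Row-wise Run-Length Encoding on such label sequences: within each row, consecutive identical labels are collapsed into label/count runs, which is where the large reduction in sequence length comes from. The `*count` notation and separators are assumptions for illustration, not the paper's exact syntax.

```python
from itertools import groupby

def rrle_encode(patch_labels):
    """Row-wise Run-Length Encoding: within each row, collapse consecutive
    identical labels into label/count runs. The '*count' and separator
    syntax is an illustrative assumption."""
    rows = []
    for row in patch_labels:
        runs = [f"{label} *{len(list(group))}" for label, group in groupby(row)]
        rows.append(", ".join(runs))
    return "\n".join(rows)

def rrle_decode(text):
    """Invert the toy encoding above, recovering the full per-patch grid."""
    grid = []
    for line in text.split("\n"):
        labels = []
        for run in line.split(", "):
            label, count = run.rsplit(" *", 1)
            labels.extend([label] * int(count))
        grid.append(labels)
    return grid

row = ["others", "others", "dog", "dog", "dog", "others"]
print(rrle_encode([row]))                        # others *2, dog *3, others *1
assert rrle_decode(rrle_encode([row])) == [row]  # lossless round trip
```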
The left side of the figure illustrates the proposed visual instruction data format, and the right side illustrates the model architecture of Text4Seg. Text4Seg can be built seamlessly upon existing MLLMs without any modification to the model architecture.
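For intuition, a hypothetical LLaVA-style training sample for referring segmentation could look like the sketch below; the field names, prompt wording, and the run-length-encoded answer string are assumptions for illustration, not the released data format.

```python
# Hypothetical referring-segmentation instruction sample in a LLaVA-style
# conversation layout; all field names and strings below are assumptions.
sample = {
    "image": "coco/train2017/000000000123.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nPlease segment the dog lying on the grass "
                     "and answer with semantic descriptors.",
        },
        {
            "from": "gpt",
            # The mask is returned as plain text: row-wise run-length
            # encoded per-patch labels for a toy 4x4 grid.
            "value": "others *4\n"
                     "others *1, dog *2, others *1\n"
                     "others *1, dog *2, others *1\n"
                     "others *4",
        },
    ],
}
```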
Referring expression segmentation performance on the refCOCO series benchmarks.
Performance on the gRefCOCO benchmark.
Text4Seg can be applied directly to object detection with a simple mask2box paradigm, which first generates a segmentation mask from the input and then derives the bounding box from that mask.
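A minimal sketch of the mask2box step, assuming the predicted mask is available as a binary NumPy array: the box is simply the extent of the foreground pixels.

```python
import numpy as np

def mask2box(mask):
    """Derive an axis-aligned box (x_min, y_min, x_max, y_max) from a
    binary segmentation mask; returns None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: a 2x3 foreground block inside an 8x8 mask.
mask = np.zeros((8, 8), dtype=bool)
mask[2:4, 3:6] = True
print(mask2box(mask))  # (3, 2, 5, 3)
```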
Referring expression comprehension performance on the refCOCO series benchmarks.
Text4Seg is built upon LLaVA-1.5-7B and trained on the COCOStuff-171 dataset.
Performance on the PASCAL VOC-20 benchmark.
Performance on the PASCAL Context-59 benchmark.
Our text-as-mask paradigm allows for seamless integration of downstream segmentation tasks into the pre-training of MLLMs. Text4Seg, built upon stage-2 of LLaVA-1.5-7B, is trained on both the LLaVA-v1.5-mix665k dataset and our referring segmentation datasets.
Performance comparison on visual understanding.
@misc{lan2024text4seg,
title={Text4Seg: Reimagining Image Segmentation as Text Generation},
author={Mengcheng Lan and Chaofeng Chen and Yue Zhou and Jiaxing Xu and Yiping Ke and Xinjiang Wang and Litong Feng and Wayne Zhang},
year={2024},
eprint={2410.09855},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.09855},
}