Computer Science > Computer Vision and Pattern Recognition

arXiv:2208.05647 (cs)

[Submitted on 11 Aug 2022]

Title: PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding

Authors: Zihan Ding, Zi-han Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Si Liu

Abstract: Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to segment visual objects of things and stuff categories described by dense narrative captions of a still image. The previous two-stage approach first extracts segmentation region proposals with an off-the-shelf panoptic segmentation model, then conducts coarse region-phrase matching to ground the candidate regions for each noun phrase. However, the two-stage pipeline is usually limited by low-quality proposals in the first stage, loses spatial details through region feature pooling, and requires complicated strategies designed separately for things and stuff categories. To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentation by simple combination. Thus, our model can exploit sufficient and finer cross-modal semantic correspondence from the supervision of densely annotated pixel-phrase pairs rather than sparse region-phrase pairs. We also propose a Language-Compatible Pixel Aggregation (LCPA) module to further enhance the discriminative ability of phrase features through multi-round refinement, which selects the most compatible pixels for each phrase to adaptively aggregate the corresponding visual context. Extensive experiments show that our method achieves new state-of-the-art performance on the PNG benchmark, with an absolute Average Recall gain of 4.0.
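
The two ideas in the abstract, matching each phrase directly to pixels and refining phrase features from their most compatible pixels, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring function or the aggregation rule, so the dot-product-plus-sigmoid score, the top-k averaging, the residual update, and the names match_pixels_to_phrases and refine_phrase are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def match_pixels_to_phrases(pixel_feats, phrase_feats, threshold=0.5):
    """Return one binary mask per phrase over the flattened pixels.

    Assumption: compatibility is a sigmoid over a dot product. The abstract
    only states that phrases are matched to pixels directly.
    """
    masks = []
    for phrase in phrase_feats:
        scores = [1.0 / (1.0 + math.exp(-dot(phrase, px))) for px in pixel_feats]
        masks.append([1 if s > threshold else 0 for s in scores])
    return masks


def refine_phrase(phrase, pixel_feats, k=2):
    """One hypothetical refinement round in the spirit of LCPA.

    Assumption: take the k pixels most compatible with the phrase, average
    their features, and add the result to the phrase feature.
    """
    top = sorted(pixel_feats, key=lambda px: dot(phrase, px), reverse=True)[:k]
    context = [sum(col) / len(top) for col in zip(*top)]
    return [p + c for p, c in zip(phrase, context)]


if __name__ == "__main__":
    # Four pixels with 2-D features; two phrases with 2-D features (toy data).
    pixels = [[2.0, 0.0], [1.5, 0.1], [0.0, 2.0], [-1.0, 1.0]]
    phrases = [[1.0, -1.0], [-1.0, 1.0]]
    print(match_pixels_to_phrases(pixels, phrases))
    print(refine_phrase(phrases[0], pixels, k=2))
```

On this toy input the first phrase claims the first two pixels and the second phrase claims the last two, so the per-phrase masks can be combined into a single segmentation without any region proposals. A multi-round variant would alternate refine_phrase and match_pixels_to_phrases.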

Comments:	Accepted by ACM MM 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2208.05647 [cs.CV]
	(or arXiv:2208.05647v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2208.05647

Submission history

From: Zihan Ding
[v1] Thu, 11 Aug 2022 05:42:12 UTC (2,453 KB)
