SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

ECCV 2024

Sayan Nag¹,², Koustava Goswami², Srikrishna Karanam²

¹University of Toronto, ²Adobe Research

arXiv Code (Coming Soon)

Primary Contributions

Novel Task: To the best of our knowledge, ours is the first work to consider an accurate representation of the Weakly-Supervised Referring Expression Segmentation (WS-RES) task by addressing a novel, more practical, and more challenging scenario with limited box and mask annotations, where box % equals mask %.
Cross-modal Alignment: The novel X-FACt module fosters the prediction of high-quality masks by improving cross-modal alignment, especially where abundant ground-truth annotations are not available.
Self-Labeling: Using SpARC, a novel zero-shot REC technique, the mask validity filtering stage together with the bootstrapping pipeline improves the system's self-labeling capabilities.
Strong Generalization: SafaRi demonstrates strong generalization capabilities when evaluated on an unseen referring video object segmentation task in a zero-shot manner.
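
As a rough illustration of the self-labeling idea above, the following minimal Python sketch keeps a pseudo-mask only if its bounding box agrees with a reference box, such as one a zero-shot REC model might propose. The box-IoU criterion, the threshold of 0.5, and all function names are assumptions made for illustration. They are not taken from the paper and should not be read as the actual filtering rule.

```python
# Hypothetical sketch of pseudo-mask filtering; the criterion and the
# threshold are illustrative assumptions, not the paper's exact procedure.

def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) of a binary mask, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return (min(xs), min(ys), max(xs), max(ys))


def box_iou(a, b):
    """IoU of two boxes given in inclusive pixel coordinates."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def is_valid_pseudo_mask(mask, reference_box, iou_threshold=0.5):
    """Keep a pseudo-mask only if its box overlaps the reference box enough."""
    box = mask_to_box(mask)
    return box is not None and box_iou(box, reference_box) >= iou_threshold


pseudo_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
keep = is_valid_pseudo_mask(pseudo_mask, reference_box=(1, 1, 2, 2))
```

In a bootstrapping loop, only the pseudo-masks that pass such a check would be added to the training set for the next retraining step.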

SAFARI Framework

Architectural components of SafaRi. (i) We introduce X-FACt, composed of normalized gated cross-attention based Fused Feature Extractors and Attention Mask Consistency Regularization (AMCR), to enhance cross-modal synergy and spatial localization of target objects. The fused output is subsequently fed to a Sequence Transformer that predicts contour points. (ii) We design a Mask Validity Filtering (MVF) strategy that selects valid pseudo-masks using the SpARC module, a zero-shot REC approach with spatial reasoning capabilities.
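
To make the gated cross-attention idea more concrete, here is a minimal pure-Python sketch of one common formulation: visual tokens are layer-normalized, attend over text tokens, and the attended features are added back through a tanh gate. This is an illustrative assumption rather than the X-FACt implementation. It uses a single head with no learned projections, and the interpretation of "normalized" as layer normalization of the queries, as well as all names, are hypothetical.

```python
import math

# Illustrative single-head gated cross-attention without learned weights.
# This is an assumed formulation, not the actual X-FACt module.

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(visual_tokens, text_tokens, gate):
    """Fuse text into visual tokens; returns (fused_tokens, attention_weights)."""
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    g = math.tanh(gate)  # gate == 0 leaves the visual tokens unchanged
    fused, attention = [], []
    for v in visual_tokens:
        q = layer_norm(v)
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in text_tokens]
        weights = softmax(scores)
        attended = [sum(w * t[d] for w, t in zip(weights, text_tokens)) for d in range(dim)]
        fused.append([vi + g * ai for vi, ai in zip(v, attended)])
        attention.append(weights)
    return fused, attention


visual = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
text = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
fused_closed, attn = gated_cross_attention(visual, text, gate=0.0)
fused_open, _ = gated_cross_attention(visual, text, gate=1.0)
```

With the gate at zero the layer acts as an identity on the visual stream, which is why gated designs of this kind are often used to inject a second modality gradually during training.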

Main Result

Comparison with the state of the art on the RES task. SafaRi substantially outperforms the state-of-the-art SeqTR on the fully supervised benchmark. SafaRi also yields significant gains over the Partial-RES baseline on the WS-RES task, even without using 100% box annotations. † denotes training on extra data combining the RefCOCO datasets. ♠ indicates our reimplementation of Partial-RES with a Swin-B backbone, which obtains better mIoUs than the originally reported values.
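
For readers unfamiliar with the metric, mIoU is commonly computed as the average of per-sample intersection-over-union scores between predicted and ground-truth masks. The sketch below shows that computation on binary masks stored as nested lists. The function names and the convention of scoring two empty masks as 1.0 are assumptions for illustration and may differ from the evaluation code behind the reported numbers.

```python
# Minimal mIoU sketch on binary masks (nested lists of 0/1).
# The empty-mask convention is an assumption, not taken from the paper.

def mask_iou(pred, target):
    """Intersection-over-union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-sample IoU over a set of predictions."""
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_iou(preds, targets)  # per-sample IoUs are 0.5 and 1.0
```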

Cross-attention Maps and corresponding predictions

Cross-attention Maps and corresponding predictions showing strong cross-modal alignment learned by SafaRi.

Cross-attention Maps with and without AMCR

Qualitative differences between cross-attention maps and predicted masks in the presence and absence of AMCR. Without AMCR, the model attends to some regions outside the object boundary, which degrades the quality of the predicted masks.
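
One simple way to quantify the effect described in this caption is to measure how much attention mass falls outside the object mask. The sketch below computes that fraction for a single attention map. It is a hypothetical diagnostic written for illustration, not the AMCR loss itself, and the function name is an assumption.

```python
# Hypothetical diagnostic: fraction of attention mass outside the object mask.
# This is not the AMCR objective, only an illustration of what it discourages.

def outside_attention_fraction(attention, mask):
    """Share of total attention that lands on pixels where mask == 0."""
    total = 0.0
    outside = 0.0
    for attention_row, mask_row in zip(attention, mask):
        for a, m in zip(attention_row, mask_row):
            total += a
            if not m:
                outside += a
    return outside / total if total > 0 else 0.0


object_mask = [[1, 0], [0, 0]]
leaky_attention = [[0.5, 0.5], [0.0, 0.0]]
focused_attention = [[1.0, 0.0], [0.0, 0.0]]

leaky = outside_attention_fraction(leaky_attention, object_mask)
focused = outside_attention_fraction(focused_attention, object_mask)
```

A regularizer that pushes this fraction toward zero would encourage attention maps that stay within the object boundary.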

Predictions with varying label-rates.

Predictions with varying label-rates. Prediction quality improves as the percentage of mask annotations increases.

Predictions with varying bootstrapping steps.

Examples of masks with increasing WS-RES bootstrapping runs (steps) for 10% annotations. Localization improves significantly as the number of retraining steps increases, illustrating the efficacy of our approach.

Zero-shot Results with weakly-supervised model on Video datasets

a sliver car going from shade to sunlight

a man in a red sweatshirt performing breakdance

a boy is standing_up

a man jumping across a wall

the woman in red walking

a lady is bending

BibTeX

@article{nag2024safari,
  title={SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation},
  author={Nag, Sayan and Goswami, Koustava and Karanam, Srikrishna},
  journal={arXiv preprint arXiv:2407.02389},
  year={2024}
}

Acknowledgement

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The template of this website is borrowed from the Nerfies website.