Novel Task: To the best of our knowledge, ours is the first work to consider an accurate representation of the Weakly-Supervised Referring Expression Segmentation (WS-RES) task by introducing a novel, more practical, and more challenging scenario in which only a limited fraction of box and mask annotations is available and the box percentage equals the mask percentage.
Cross-modal Alignment: The novel X-FACt module fosters the prediction of high-quality masks by improving cross-modal alignment, especially where abundant ground-truth annotations are not available.
Self-Labeling: The mask validity filtering stage, which utilizes SpARC, a novel zero-shot REC technique, and the bootstrapping pipeline together improve the system’s self-labeling capabilities.
Strong Generalization: SafaRi demonstrates strong generalization capabilities when evaluated on an unseen referring video object segmentation task in a zero-shot manner.