Abstract: Learning cross-modal features is an essential task for many multimedia applications such as sound localization, audio-visual alignment, and image/audio retrieval. Most existing methods ...