/microsoft/ Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning

Code Link
https://github.com/microsoft/BridgeTower
Description
Current vision-language (VL) models either use lightweight uni-modal encoders and learn to extract, align, and fuse both modalities simultaneously in a cross-modal encoder, or feed the last-layer uni-modal features directly into the top cross-modal encoder, ignoring the semantic information at different levels of the deep uni-modal encoders.
Retrieved
2022/06/21
Stars
33