Transformer combining Vision and Language? ViLBERT - NLP meets Computer Vision
- Published 17 Jul 2024
- If you always wanted to know how to integrate both text and images in one single MULTIMODAL Transformer, then this is the video for you!
Multimodality🔥 + Transformers💪
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
Content:
* 00:00 Multimodality and Multimodal Transformers
* 02:08 ViLBERT
* 02:39 How does ViLBERT work?
* 05:49 How is ViLBERT trained?
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🎞️ Useful: Ms. Coffee Bean explained the Transformer here • The Transformer neural...
📄 ViLBERT paper -- Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." In Advances in Neural Information Processing Systems, pp. 13-23. 2019. papers.nips.cc/paper/8297-vilb...
📄 For even more similar architectures, check out the multi-modal section of github.com/tomohideshibata/BE...
💻 With code available at github.com/facebookresearch/v...
🔗 Links:
YouTube: / @aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #multimodal #multimodality #ViLBERT #MachineLearning #AI #research
Video and thumbnails contain emojis designed by OpenMoji - the open-source emoji and icon project. License: CC BY-SA 4.0
Funny how I am currently working on image retrieval and was also studying this paper 🙂... Thanks for the video, great work as always.
This is the best lucid explanation I've heard for any architecture! Kudos!!
Thank you, this is a great thing to say!
very helpful video, thanks for this.
I love every time Letitia refers to one of these models as a "monster".
🙊
Your channel is amazing!! The explanations are great and very helpful. You should create a podcast :)
Thanks! I will have a podcast in the future, but right now it is hard enough to find time for making these videos.
Thanks!
Thank you! Wow, this is an old video, now that I think about it. 😅
Great video. Now you have a new subscriber :)
Thanks, nice to have you!
Great explanation! Thank you!
Can this model be fine-tuned to classify memes with multiple labels?
Very cool video! Keep it going with multi-modality.
A few months back I was looking at Multimodal Transformers and was a bit overwhelmed by all the different papers out there. I decided to look at the one with the most citations and available code: ViLBERT. So I was glad to see that I wasn't the only one who thought there were really a lot of those papers out there, and that, according to you, the papers don't seem to differ too much. But I am still curious... Could you (or Ms. Coffee Bean) explain a bit what makes them different? Slight tweaks in the layers, applied to different tasks, trained on different data? To get through the review process, they must've done something different, right? ;)
Thank you for your positive feedback!
Regarding your question: Well noted. Usually papers have to do something different to get accepted at conferences. Or at least that is how it should be, but the novelty or difference argument does not easily hold when the papers are published concurrently (a couple of them were published on arXiv the same day!). If all of them went to the same conference, they did not even have to cite each other.
But in the end, the papers do differ through diverging architecture design, experiment design, and dataset choices. The fundamental idea stays the same, though: after you read and fully understand one of the papers, you will have no problem quickly understanding the others. If you want a quick (but not complete) overview of many Multimodal Transformers, check Table 5 in the Appendix of the VL-BERT paper: arxiv.org/pdf/1908.08530.pdf. Ms. Coffee Bean thinks that table is a very good start to spot the differences! :)
Oh wow, several on the same day on arXiv :D I guess once BERT got so much attention (lol), Lang+Vis was the obvious next step.
Thanks for the pointer to Table 5! That's exactly what I needed :) I was honestly too lazy to go over all the papers and compare by myself.
Hi, thank you so much for your wonderful explanations. 😇😇 One question: can we use ViLBERT-style visual + text transformers for grounded word embedding generation? And for text-only downstream tasks?
Hi, thank you for this video. Can I classify social media memes using ViLBERT?
I think you'd need to fine-tune it. If you want something that will work out of the box, try something like MAGMA or OpenFlamingo or LLaVA.
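For the fine-tuning route, the usual recipe is to put a sigmoid-per-label classification head on top of the model's pooled image-text representation and train it with binary cross-entropy, so each meme can receive several labels at once. Here is a minimal NumPy sketch of just that head; the 1024-dim feature vector, the label set, and the `predict_labels` helper are illustrative stand-ins, not ViLBERT's actual API, and in practice you would fine-tune the whole encoder end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LABELS = 5      # e.g. hateful / humorous / political / ... (illustrative)
FEATURE_DIM = 1024  # assumed size of the encoder's pooled image-text vector

# Head parameters; these would be learned during fine-tuning.
W = rng.normal(scale=0.02, size=(FEATURE_DIM, NUM_LABELS))
b = np.zeros(NUM_LABELS)

def predict_labels(pooled_features, threshold=0.5):
    """One independent sigmoid per label -> multi-label prediction."""
    logits = pooled_features @ W + b
    probs = 1.0 / (1.0 + np.exp(-logits))   # each label scored independently
    return (probs >= threshold).astype(int), probs

# Random vector standing in for the encoder's pooled output for one meme.
features = rng.normal(size=FEATURE_DIM)
labels, probs = predict_labels(features)
print(labels, probs.round(3))
```

Because each label gets its own sigmoid (rather than one softmax over all labels), any number of labels can fire for the same meme, which is exactly the multi-label setting the question asks about.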
I want to implement video recognition using ViLBERT... Is it possible?
Not out of the box, try something like VideoBERT for that.
Or just google "video recognition with transformer". The first results I got seemed relevant; I found the "Video Action Transformer Network", Girdhar et al. 2019: openaccess.thecvf.com/content_CVPR_2019/papers/Girdhar_Video_Action_Transformer_Network_CVPR_2019_paper.pdf
@@AICoffeeBreak hehe