Transformer combining Vision and Language? ViLBERT - NLP meets Computer Vision
- Published 17 Jul 2024
- If you always wanted to know how to integrate both text and images in one single MULTIMODAL Transformer, then this is the video for you!
Multimodality🔥 + Transformers💪
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
Content:
* 00:00 Multimodality and Multimodal Transformers
* 02:08 ViLBERT
* 02:39 How does ViLBERT work?
* 05:49 How is ViLBERT trained?
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to boost our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🎞️ Useful: Ms. Coffee Bean explained the Transformer here • The Transformer neural...
📄 ViLBERT paper -- Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." In Advances in Neural Information Processing Systems, pp. 13-23. 2019. papers.nips.cc/paper/8297-vilb...
📄 For even more similar architectures, check out the multi-modal section of github.com/tomohideshibata/BE...
💻 With code available at github.com/facebookresearch/v...
🔗 Links:
YouTube: / @aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #multimodal #multimodality #ViLBERT #MachineLearning #AI #research
Video and thumbnails contain emojis designed by OpenMoji - the open-source emoji and icon project. License: CC BY-SA 4.0
Funny how I am currently working on image retrieval and was also studying this paper 🙂... Thanks for the video, great work as always.
This is the best lucid explanation I've heard for any architecture! Kudos!!
Thank you, this is a great thing to say!
very helpful video, thanks for this.
I love every time Letitia refers to one of these models as a "monster".
🙊
Your channel is amazing!! The explanations are great and very helpful. You should create a podcast :)
Thanks! I will have a podcast in the future, but right now it is hard enough to find time for making these videos.
Thanks!
Thank you! Wow, this is an old video, now that I think about it. 😅
Great video. Now you have a new subscriber :)
Thanks, nice to have you!
Great explanation! Thank you!
Can this model be fine-tuned to classify memes with multiple labels?
Very cool video! Keep it going with multi-modality.
A few months back I was looking at Multimodal Transformers and was a bit overwhelmed by all the different papers out there. I decided to look at the one with the most citations and available code: ViLBERT. So I was glad to see that I wasn't the only one who thought there were really a lot of those papers out there, and that, according to you, the papers don't seem to differ too much. But I am still curious... Could you (or Ms. Coffee Bean) explain a bit what makes them different? Slight tweaks in the layers, applied to different tasks, trained on different data? To get through the review process, they must've done something different, right? ;)
Thank you for your positive feedback!
Regarding your question: Well noted. Usually papers have to do something different to get accepted at conferences. Or at least that is how it should be, but the novelty or difference argument does not easily hold when the papers are published concurrently (a couple of them were published on arXiv the same day!). If all of them went to the same conference, they did not even have to cite each other.
But in the end, the papers do differ through diverging architecture design, experiment design, and dataset choices. The fundamental idea stays the same, though: after you read and fully understand one of the papers, you will have no problem quickly understanding the others. If you want a quick (but not complete) overview of many Multimodal Transformers, check Table 5 in the Appendix of the VL-BERT paper: arxiv.org/pdf/1908.08530.pdf. Ms. Coffee Bean thinks that table is a very good start to spot the differences! :)
Oh wow, several on the same day on arXiv :D I guess once BERT got so much attention (lol), Lang+Vis was the obvious next step.
Thanks for the pointer to Table 5! That's exactly what I needed :) I was honestly too lazy to go over all the papers and compare by myself.
Hi, thank you so much for your wonderful explanations. 😇😇 One question: can we use ViLBERT-style visual + text transformers for grounded word embedding generation? And for text-only downstream tasks?
Hi, thank you for this video. Can I classify social media memes using ViLBERT?
I think you'd need to fine-tune it. If you want something that will work out of the box, try something like MAGMA or OpenFlamingo or LLaVA.
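For the fine-tuning route, the usual recipe is to put a sigmoid-per-label classification head on top of the model's pooled image-text representation and train it with binary cross-entropy, so each meme can receive several labels at once. Here is a minimal NumPy sketch of just that head; the 1024-dim feature vector, the label set, and the `predict_labels` helper are illustrative stand-ins, not ViLBERT's actual API, and in practice you would fine-tune the whole encoder end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LABELS = 5      # e.g. hateful / humorous / political / ... (illustrative)
FEATURE_DIM = 1024  # assumed size of the encoder's pooled image-text vector

# Head parameters; these would be learned during fine-tuning.
W = rng.normal(scale=0.02, size=(FEATURE_DIM, NUM_LABELS))
b = np.zeros(NUM_LABELS)

def predict_labels(pooled_features, threshold=0.5):
    """One independent sigmoid per label -> multi-label prediction."""
    logits = pooled_features @ W + b
    probs = 1.0 / (1.0 + np.exp(-logits))   # each label scored independently
    return (probs >= threshold).astype(int), probs

# Random vector standing in for the encoder's pooled output for one meme.
features = rng.normal(size=FEATURE_DIM)
labels, probs = predict_labels(features)
print(labels, probs.round(3))
```

Because each label gets its own sigmoid (rather than one softmax over all labels), any number of labels can fire for the same meme, which is exactly the multi-label setting the question asks about.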
I want to implement video recognition using ViLBERT... Is it possible?
Not out of the box, try something like VideoBERT for that.
Or just google "video recognition with transformer". The first results I got seemed relevant; I found the "Video Action Transformer Network", Girdhar et al. 2019: openaccess.thecvf.com/content_CVPR_2019/papers/Girdhar_Video_Action_Transformer_Network_CVPR_2019_paper.pdf
@@AICoffeeBreak hehe