BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

  • Added 29. 08. 2024

Comments • 33

  • @mrdbourke
    @mrdbourke 2 years ago +10

    Perfect timing! I’ve got this paper on my desk to read

    • @vaibhavnakrani2983
      @vaibhavnakrani2983 a year ago

      Do you print the paper or just read it on a tablet?
      Thinking of printing papers now and then reading them.

  • @user-xs9ey2rd5h
    @user-xs9ey2rd5h 2 years ago +2

    Your ads are legit the only ones that are worth watching, you always get good sponsors

  • @Pmaisterify
    @Pmaisterify 2 years ago +3

    I really like this paper; I like how, slowly, more and more people are trying to "wrap" AI around the entirety of the problem they are trying to solve, instead of just the underlying task.

  • @norik1616
    @norik1616 2 years ago +1

    I just love how you are trying (and succeeding) at morphing/expanding your channel!

  • @omgwenxx
    @omgwenxx a year ago

    Just wanted to say thank you for explaining this paper thoroughly! 🙏

  • @user-my1tx4dc2w
    @user-my1tx4dc2w 11 months ago +1

    The way you summarize is absolutely amazing!! Could you please summarize BLIP-2 too?🤩

  • @keikaku9298
    @keikaku9298 2 years ago

    This was super-informative! Thanks for going through this in-depth. Kudos to the author. BLIP results are really impressive especially on zero-shot video retrieval.

  • @abdurrahmangumus704
    @abdurrahmangumus704 2 years ago +3

    Yannic, thanks for your great effort. It would be nice if you used a highlighter for your mouse cursor, such as PenAttention. Right now it's hard to follow where you are pointing.

  • @chriswang2464
    @chriswang2464 5 months ago

    Yannic is all you need!

  • @s.d.7472
    @s.d.7472 2 years ago +12

    Next is ALIP

    • @bsdjns
      @bsdjns 2 years ago

      good one haha

  • @Veptis
    @Veptis a year ago

    I read most of the CLIP paper when I tried to use it for a shared task. But now BLIP has shown up a few more times, so I will watch your video instead of reading the paper myself.

  • @lewingtonn
    @lewingtonn a year ago

    immense fan of your work as always

  • @herp_derpingson
    @herp_derpingson 2 years ago +6

    I wonder how many times we can repeat the bootstrap before it hits diminishing returns.

    Using multiple sources of gradient from the same datapoint is definitely a pragmatic thing to do. I tried some experiments at a smaller scale and it worked really nicely.

    • @YannicKilcher
      @YannicKilcher 2 years ago

      I asked myself the same thing; I guess it would give diminishing returns pretty quickly.

    • @Basant5911
      @Basant5911 a year ago

      You mean sampling data from LLMs is gonna lead to a slow death of generative AI?

  • @AngadChandhok02
    @AngadChandhok02 a year ago

    This was very well explained, thank you!

  • @McSwey
    @McSwey 2 years ago +5

    Petition to make dark mode paper review

  • @ghazalsahebzamani4925

    Great explanation! Thank you

  • @UGSFlo
    @UGSFlo 21 days ago

    I have a question about BLIP-2 which probably relates to this as well. The output tokens of the Q-Former (or here, the transformer adapter) serve as input to the LLM. Those tokens capture a lot of context as I understand it, because especially in BLIP-2 there are only 16 or so learned query embeddings. Those tokens should also find their place in the embedding space of the frozen LLM. For a decoder-only LLM, the input tokens at the first transformer layer only represent subwords like "as" or "ing", as I understand it. Not much context, since they haven't attended to each other yet. That is different from the last transformer layer, where each token probably captures a lot of context. Thus, I don't understand how the visual tokens can serve as input right next to the text embeddings at the first level. In my understanding, the visual tokens should be in the same embedding space as the encoder output of an encoder-decoder network. Then, additionally, the user's question also gets encoded, and both embeddings are concatenated and go through the decoder.
    This kind of workflow is shown neither here nor in BLIP-2, and I don't get it. In the visualization of BLIP-2, the visual tokens go directly to the start of the decoder or encoder, depending on which architecture you have. If the LLM is frozen, how can this be in the same space as the first embeddings?
    I really hope somebody understands me and can answer this, because I see this kind of architecture everywhere, even in the 3D case as in OpenScene or 3D-LLM, and I must have some major misunderstanding. My assumption is of course that the first layer of a decoder-only architecture, or the first layer of an encoder, doesn't have much context, but the last layers do. Maybe I'm already wrong here?
    Greetings to the ML community ;)
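
    For reference, a minimal PyTorch-style sketch of one common reading of this part of BLIP-2: the Q-Former's query outputs are passed through a learned linear projection into the frozen LLM's input embedding dimension and simply prepended to the text token embeddings as soft visual prompt tokens, which the LLM's self-attention then contextualizes from its first layer onward. All names and shapes below are illustrative assumptions, not the actual BLIP-2 code.

        import torch
        import torch.nn as nn

        # Illustrative sizes, not the real BLIP-2 configuration.
        NUM_QUERIES, QFORMER_DIM, LLM_DIM = 32, 768, 2560

        # Learned projection from the Q-Former output space into the frozen
        # LLM's *input* embedding space (a single fully connected layer).
        vision_to_llm = nn.Linear(QFORMER_DIM, LLM_DIM)

        def build_llm_inputs(query_outputs, text_token_ids, llm_embed):
            """query_outputs: (B, NUM_QUERIES, QFORMER_DIM) from the Q-Former.
            text_token_ids: (B, T) token ids of the prompt / question.
            llm_embed:      the frozen LLM's token-embedding layer."""
            visual_tokens = vision_to_llm(query_outputs)           # (B, Q, LLM_DIM)
            text_embeds = llm_embed(text_token_ids)                # (B, T, LLM_DIM)
            # Prepend the visual soft-prompt tokens; the decoder-only LLM then
            # attends over [visual tokens, text tokens] from its very first layer.
            return torch.cat([visual_tokens, text_embeds], dim=1)  # (B, Q+T, LLM_DIM)

    In that reading, the visual tokens never have to coincide with any subword embedding; they only have to land somewhere the frozen LLM can attend to usefully, which is exactly what the projection (and the Q-Former before it) is trained to achieve.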

  • @user-ub3uq9bs3o
    @user-ub3uq9bs3o 9 months ago

    The reflection of the light on his glasses keeps pulling my attention away

  • @NOYHanan
    @NOYHanan 10 months ago

    Great explanation! Could you please summarize BLIP-2 too?

  • @drdca8263
    @drdca8263 2 years ago +1

    This thing about composing pieces like this... it makes me wonder if it could be possible to like, produce guarantees for some kind of invariants, in a way that would be compatible with such compositions?
    I’m not sure how that could work though..
    And like, what could said invariants even be? What would you want these modules to preserve?
    I guess there are things like “it is piecewise linear (because we only used ReLU and linear maps)”, which is preserved under composition, but that doesn’t really say much about what the function accomplishes.
    If these modules were like, normal blocks of code written by people, what kinds of guarantees would we want the code to have?
    There could be stuff like translation invariance, but that is something usually baked into the shape of the model if it is an important goal?
    But, I guess for image transformers it isn’t built in exactly, and is learned from the amount of data, including exceptions to it? (I could be mistaken about any or all of this.)
    Uh, if one had a generated guarantee that something was *approximately* translation invariant..
    well, often the final output isn’t also an image (unless doing segmentation or something), so composing two things with that property might not make sense?
    Uh.
    Hm.
    Ok yeah I’m struggling to think of a case where there is even an invariant I can think of that one would really want to show that a network preserves.
    I guess one place invariants often show up is when you want to apply something repeatedly.
    So, maybe some tasks with RNNs, or maybe some reinforcement learning agent? But then, when would you compose those bits with novel things?
    Ok, I guess this idea I thought of doesn’t have much applicability.
    Initially I thought it would be useful if it could be done, but would just be too hard to make work, but now I realize I can’t think of a concrete use-case.

    • @YannicKilcher
      @YannicKilcher 2 years ago +1

      Your idea sounds interesting, but probably impossible as long as the modules are neural layers and once fine-tuning is applied.

  • @Ziko92i
    @Ziko92i 2 years ago

    RE the asymmetry criticism, wouldn't it be possible to "mirror" Figure 2's architecture?
    What I mean is: also have a text-grounded image encoder with cross-attention and an ITM loss, and a GAN of some sort (for example Transformer-based GANs, to incorporate a shared cross-attention module) for text-based image generation.
    Maybe the reason the authors didn't do it is the sheer compute/memory cost, but it sounds straightforward conceptually. Wdyt?
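
    For context, a rough sketch (illustrative layer sizes and names, not the paper's code) of the existing direction in Figure 2 that the comment proposes to mirror: BLIP's image-grounded text encoder applies text self-attention, cross-attention from text queries into image features, and a binary ITM (image-text matching) head. The mirrored variant would swap the roles of the two modalities.

        import torch.nn as nn

        class ImageGroundedTextBlock(nn.Module):
            """Toy version of one image-grounded text-encoder layer plus ITM head."""
            def __init__(self, dim=768, heads=12):
                super().__init__()
                self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                self.itm_head = nn.Linear(dim, 2)  # matched vs. not matched

            def forward(self, text_embeds, image_embeds):
                x, _ = self.self_attn(text_embeds, text_embeds, text_embeds)
                x, _ = self.cross_attn(x, image_embeds, image_embeds)  # text attends to image patches
                x = x + self.ffn(x)
                return self.itm_head(x[:, 0])  # classify from the first (encode) token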

  • @eliaweiss1
    @eliaweiss1 8 months ago

    Link to the interview please

  • @thegistofcalculus
    @thegistofcalculus 2 years ago

    Doesn't the topic of the webpage text influence the alt-text description to some extent? Why ignore it? Why not encode a bunch of surrounding text into an embedding vector and have some fun? I guess only wiser minds will know.

  • @theocachet6496
    @theocachet6496 2 years ago

    Hi @Yannic, can you share with us how you find the papers you present? Best

  • @ashleyalexander7388
    @ashleyalexander7388 2 years ago +1

    How the hell did I get here...

  • @bhaskarpandey8586
    @bhaskarpandey8586 3 months ago

    Am I the only one who didn't understand shit from his explanation?