How AI 'Understands' Images (CLIP) - Computerphile

  • Published 24 Apr 2024
  • With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to turn text strings into plausible images? Dr Mike Pound expands on his explanation of Diffusion models.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com
    Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com

Comments • 230

  • @michaelpound9891
    @michaelpound9891 16 days ago +265

    As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me!

    • @adfaklsdjf
      @adfaklsdjf 15 days ago +8

      we gotcha 💚

    • @harpersneil
      @harpersneil 14 days ago +1

      Phew, for a second there I thought you were dramatically more intelligent than I am!

    • @ArquimedesOfficial
      @ArquimedesOfficial 12 days ago +4

      Omg, I’ve been your fan since Spiderman 😆, thanks for the lesson!

  • @adfaklsdjf
    @adfaklsdjf 15 days ago +62

    thank you for "if you want to unlock your face with a phone".. i needed that in my life

    • @alib8396
      @alib8396 14 days ago +14

      Unlocking my face with my phone is the first thing I do when I wake up every day.

  • @edoardogribaldo1058
    @edoardogribaldo1058 16 days ago +131

    Dr. Pound's videos are on another level! He explains things with such passion and such clarity rarely found on the web! Cheers

    • @joker345172
      @joker345172 14 days ago +1

      Dr Pound is just amazing. I love all his videos

  • @pyajudeme9245
    @pyajudeme9245 16 days ago +57

    This guy is one of the best teachers I have ever seen.

  • @orange-vlcybpd2
    @orange-vlcybpd2 7 days ago +6

    The legend has it that the series will only end when the last sheet of continuous printing paper has been written on.

  • @aprilmeowmeow
    @aprilmeowmeow 16 days ago +62

    Thanks for taking us to Pound town. Great explanation!

    • @pierro281279
      @pierro281279 16 days ago +3

      Your profile picture reminds me of my cat! It's so cute!

    • @pvanukoff
      @pvanukoff 16 days ago +5

      pound town 😂

    • @rundown132
      @rundown132 15 days ago +6

      pause

    • @aprilmeowmeow
      @aprilmeowmeow 14 days ago +2

      ​@@pierro281279 that's my kitty! She's a ragdoll. That must mean your cat is pretty cute, too 😊

    • @BrandenBrashear
      @BrandenBrashear 9 days ago

      Pound was hella sassy this day.

  • @keanualves7977
    @keanualves7977 16 days ago +288

    I'm a simple guy. I see a Mike Pound video, I click

    • @jamie_ar
      @jamie_ar 16 days ago +13

      I pound the like button... ❤

    • @Afr0deeziac
      @Afr0deeziac 16 days ago +1

      @@jamie_ar I see what you did there. But same here 🙂

    • @BooleanDisorder
      @BooleanDisorder 16 days ago +4

      I like to see Mike Pound videos too.

    • @kurdm1482
      @kurdm1482 16 days ago +1

      Same

    • @MikeUnity
      @MikeUnity 16 days ago +2

      We're all here for an intellectual pounding

  • @MichalKottman
    @MichalKottman 16 days ago +43

    9:45 - wasn't it supposed to be "minimize the distance on diagonal, maximize elsewhere"?

    • @michaelpound9891
      @michaelpound9891 16 days ago +34

      Absolutely yes! I definitely should have added “the distance” or similar :)

    • @ScottiStudios
      @ScottiStudios 13 days ago +3

      Yes it should have been *minimise* the diagonal, not maximise.

    • @rebucato3142
      @rebucato3142 1 day ago +1

      Or it should be “maximize the similarity on the diagonal, minimize elsewhere”
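
    For anyone who wants to see the corrected objective concretely, here is a minimal PyTorch sketch of a CLIP-style contrastive loss (toy sizes, random features and an assumed temperature of 0.07; not the actual CLIP code):

      import torch
      import torch.nn.functional as F

      # Pretend these are a batch of N image embeddings and their N matching caption embeddings.
      N, d = 8, 512
      image_emb = F.normalize(torch.randn(N, d), dim=-1)   # unit-length vectors
      text_emb = F.normalize(torch.randn(N, d), dim=-1)

      # N x N cosine-similarity matrix; entry (i, j) compares image i with caption j.
      logits = image_emb @ text_emb.T / 0.07               # 0.07 is a temperature

      # Matching pairs sit on the diagonal, so the "correct class" for row i is i.
      targets = torch.arange(N)
      loss = (F.cross_entropy(logits, targets) +           # match each image to its caption
              F.cross_entropy(logits.T, targets)) / 2      # and each caption to its image

    Minimising this loss pushes the similarities on the diagonal up and the off-diagonal ones down, which is the corrected statement above.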

  • @skf957
    @skf957 15 days ago +8

    These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.

    • @letsburn00
      @letsburn00 14 days ago +1

      YouTube is like getting the best teacher in school. The world has hundreds or thousands of experts. Being able to explain is really hard to do as well.

  • @eholloway
    @eholloway 16 days ago +67

    "There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024

    • @rnts08
      @rnts08 15 days ago +8

      Understatement of the century, even for a brit.

  • @beardmonster8051
    @beardmonster8051 15 days ago +10

    The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.

    • @JohnMiller-mmuldoor
      @JohnMiller-mmuldoor 13 days ago

      Been trying to unlock my face for 10:37 and it’s still not working!

  • @bluekeybo
    @bluekeybo 13 days ago +2

    The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.

  • @TheRealWarrior0
    @TheRealWarrior0 15 days ago +9

    A very important bit that was skipped over is how you get an LLM to talk about an image (multimodal LLM)!
    After you've got your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the embedding from the vision encoder produces the desired text output describing the image (and/or executing the instructions in the image+prompt).
    You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM).

    • @or1on89
      @or1on89 11 days ago +2

      That’s pretty much what he said after explaining how the LLM infers an image from written text. Did you watch the whole video?

    • @TheRealWarrior0
      @TheRealWarrior0 11 days ago

      @@or1on89 What? Inferring an image from written text? Is this a typo? You mean image generation?
      Anyway, did he make my same point? I must have missed it. Could you point to the minute he roughly says that? I don't think he ever said something like "projective layer" and/or talked about how multimodality in LLMs is "bolted-on". It felt to me like he was talking about the actual CLIP paper rather than how CLIP is used on the modern systems (like Copilot).

    • @exceptionaldifference392
      @exceptionaldifference392 9 days ago

      I mean the whole video was about how to align the embeddings of the visual transformer with LLM embeddings of captions of the images.

    • @TheRealWarrior0
      @TheRealWarrior0 9 days ago

      @@exceptionaldifference392 To me, the whole video seems to be about the CLIP paper, which is about "zero-shot labelling of images". But that is a prerequisite for making something like LLaVA, which is able to talk, ask questions about the image and execute instructions based on the image content! CLIP can't do that!
      I described the step from having a vision encoder and an LLM to having a multimodal LLM. That's it.

    • @TheRealWarrior0
      @TheRealWarrior0 9 days ago

      @@exceptionaldifference392 To be exceedingly clear: the video is about how you create the "vision encoder" in the first place, (which does require you also train a "text encoder" for matching the image to the caption), not how to attach the vision encoder to the more general LLM.
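
    A minimal sketch of the projection-layer idea described in this thread (dimensions, names and the frozen/trainable split are illustrative assumptions, not the exact LLaVA or Copilot recipe):

      import torch
      import torch.nn as nn

      vision_dim, llm_dim, num_patches = 1024, 4096, 256

      vision_encoder = nn.Identity()                 # stand-in for a frozen CLIP-style vision encoder
      projection = nn.Linear(vision_dim, llm_dim)    # the small trainable part

      def image_to_llm_tokens(patch_features):
          # Map vision-encoder features into the LLM's embedding space so they can be
          # prepended to the prompt like ordinary word embeddings.
          patches = vision_encoder(patch_features)   # [B, num_patches, vision_dim]
          return projection(patches)                 # [B, num_patches, llm_dim]

      img_feats = torch.randn(1, num_patches, vision_dim)
      soft_tokens = image_to_llm_tokens(img_feats)   # fed into the (frozen) LLM with the text tokens

    Only the projection is trained, against caption or instruction targets, so the part that sees gets wired into the part that speaks relatively cheaply.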

  • @rigbyb
    @rigbyb 15 days ago +6

    6:09
    "There isn't red cats"
    Mike is hilarious and a great teacher lol

  • @Shabazza84
    @Shabazza84 7 days ago +1

    Excellent. Could listen to him all day and even understand stuff.

  • @wouldntyaliktono
    @wouldntyaliktono 16 days ago +2

    I love these encoder models. And I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured freetext queries. Embeddings are so cool.

  • @musikdoktor
    @musikdoktor 16 days ago +3

    Love seeing AI problems explained on fanfold paper. Classy!

  • @codegallant
    @codegallant 16 days ago +3

    Computerphile and Dr. Pound ♥️✨ I've been learning AI myself these past few months so this is just wonderful. Thanks a ton! :)

  • @RupertBruce
    @RupertBruce 15 days ago +1

    One day, we'll give these models some high resolution images and comprehensive explanations and their minds will be blown! It's astonishing how good even a basic perceptron can be given 28x28 pixel images!

  • @user-dv5gm2gc3u
    @user-dv5gm2gc3u 16 days ago +4

    I'm an IT guy & programmer, but this is kinda hard to understand. Thanks for the video, it gives a little idea about the concepts!

    • @aspuzling
      @aspuzling 16 days ago

      I'd definitely recommend the last two videos on GPT from 3blue1brown. He explains the concept of embeddings in a really nice way.

  • @sbzr5323
    @sbzr5323 16 days ago +1

    The way he explains is very interesting.

  • @sebastianscharnagl3173

    Awesome explanation

  • @stancooper5436
    @stancooper5436 14 days ago +1

    Thanks Mike, nice clear explanation. You can still get that printer paper!? Haven't seen that since my Dad worked as a mainframe engineer for ICL in the 80s!

  • @Stratelier
    @Stratelier 15 days ago +1

    When they say "high dimensional" in the vector context, I like to imagine it like an RPG character stat sheet, as each independent stat on that sheet can be considered its own dimension.

  • @sukaina4978
    @sukaina4978 16 days ago +9

    i just feel 10 times smarter after watching any computerphile video

  • @jonyleo500
    @jonyleo500 16 days ago +5

    At 9:30, doesn't a distance of zero mean the image and caption have the same "meaning", therefore, shouldn't we want to minimize the diagonal, and maximize the rest?

    • @michaelpound9891
      @michaelpound9891 16 days ago +9

      Yes! We want to maximise the similarity measure on the diagonal - I forgot the word similarity!

    • @romanemul1
      @romanemul1 16 days ago

      @@michaelpound9891 C'mon. It's Mike Pound!

  • @xersxo5460
    @xersxo5460 11 days ago

    Just writing this to crystallize my understanding: (and for others to check me for accuracy)
    So by circumventing the idea of trying to instill “true” understanding (which is a hard incompatibility in this context, due to our semantics); On a high level it’s substituting case specific discrepancies (like how a digital image is made of pixels, so only pixel related properties are important: like color and position) and filtering against them, because it happens to be easier to tell what something isn’t than what it is in this case (like there are WAAAY more cases where a random group of pixels isn’t an image of a cat, so your sample size for correction is also WAAY bigger.) And if you control for the specific property that disqualifies the entity (in this case, of the medium: discrete discrepancies), as he stated with the “ ‘predisposed noise’ subtraction to recreate a clean image’“ training, you can be even more efficient and effective by starting with already relevant cases. Once again because a smattering of colors is not a cat so it’s easier to go ahead and assume your images will already be in some assortment of colors similar to a cat to train on versus the near infinite combinations of random color pixel images.
    And then in terms of the issue of accuracy through specificity versus scalability, it was just easier to use the huge sample size as a tool to approximate accuracy between the embedded images and texts because as a sample size increases, precision also roughly increases given a rule, (in crude terms). And that it’s also a way to circumvent “ mass hard coding” associations to approximate “meaning” because the system doesn’t even have to deal directly with the user inputs in the first place, just their association value within the embedded bank.
    I think that’s a clever use of the properties of a system as limitations to solve for our human “black box” results. Because the two methods, organic and mathematical, converge due to a common factor:
    The fact that digital images in terms of relevance to people are also useful approximations, because we literally can only care about how close an “image” is to something we know, not if it actually is or not, which is why we don’t get tripped up over individual pixels in determining the shape of a cat in the average Google search. So in the same way by relying on pixel resolution and accuracy as variables you can quantify the properties so a computer can calculate a useable result. That’s so cool!

  • @Misiok89
    @Misiok89 11 days ago

    6:30 If for an LLM you have nodes of meaning, then you could look for "nodes of meaning" in the description and make classes based on those "nodes". If you are able to represent every language with the same "nodes of meaning", that is even better for translating text from one language to another than an average translator that is not an LLM, and then you should be able to use it for classification too.

  • @zzzaphod8507
    @zzzaphod8507 16 days ago

    4:35 "There is a lot of stuff on the internet, not all of it good." Today I learned 😀
    6:05 I enjoyed that you mentioned the issues of red/black cats and the problem of cat-egorization
    Video was helpful, explained well, thanks

  • @LupinoArts
    @LupinoArts 14 days ago +1

    3:55 As someone born in the former GDR, I find it cute to label a Trabi as "a car"...

  • @IceMetalPunk
    @IceMetalPunk 15 days ago +2

    For using CLIP as a classifier: couldn't you train a decoder network at the same time as you train CLIP, such that you now have a network that can take image embeddings and produce semantically similar text, i.e. captions? That way you don't have to guess-and-check every class one-by-one?
    Anyway, I can't believe CLIP has only existed for 3 years... despite the accelerating pace of AI progress, we really are still in the nascent stages of generalized generative AI, aren't we?

  • @VicenteSchmitt
    @VicenteSchmitt 14 days ago

    Great video!

  • @GeoffryGifari
    @GeoffryGifari 16 days ago +4

    Can AI say "I don't know what I'm looking at"? Is there a limit to how much it can recognize parts of an image?

    • @throttlekitty1
      @throttlekitty1 15 days ago +1

      No, but it can certainly get it wrong! Remember that it's looking for a numerical similarity to things it does know, and by nature has to come to a conclusion.

    • @OsomPchic
      @OsomPchic 15 days ago +3

      Well, in some way. It would say that the picture has these embeddings: cat: 0.3, rainy weather: 0.23, white limo: 0.1, each number representing how "confident" it is. So with a lot of scores below 0.5 you can say it has no idea what's in that picture.

    • @ERitMALT00123
      @ERitMALT00123 15 days ago +1

      Monte-Carlo dropout can produce confidence estimations of a model. If the model doesn't know what it's looking at then the confidence should be low. CLIP natively doesn't have this though

    • @el_es
      @el_es 14 days ago

      The 'I don't know' answer is not treated very kindly by users, and therefore an understandable aversion to it gets embedded into the model ;) possibly because it also means more work for the programmers... Therefore it would rather hallucinate than say it doesn't know something.
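
    A minimal sketch of the score-threshold idea mentioned in this thread (the scores and the 0.5 cut-off are made up, and CLIP's softmax is only relative, so this is a crude heuristic rather than real uncertainty estimation):

      import torch

      labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
      # Pretend these are zero-shot softmax scores for one image over the candidate captions.
      scores = torch.tensor([0.41, 0.33, 0.26])

      if scores.max() < 0.5:                      # nothing is a confident match
          print("I don't know what I'm looking at")
      else:
          print(labels[int(scores.argmax())])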

  • @martin777xyz
    @martin777xyz 3 days ago

    Really nice explanation 👍👍

  • @zxuiji
    @zxuiji 15 days ago +1

    Personally I would have just done the colour comparison by putting the 24-bit RGB integer colour into a double (the 64-bit floating-point type) and dividing one by the other. If the result is greater than 0.01 or less than -0.01 then they're not close enough to deem the same overall colour, and thus not part of the same facing of a shape.
    **Edit:** When searching for images it might be better to use a simple line path (both a 2D and a 3D one) matching the given text of what to search for, and compare the shapes identified in the images to those 2 paths. If at least 20% of the line paths match a shape in the image set then it likely contains what was being searched for.
    Similarly, when generating images the line paths should then be traced to produce each image, then layered onto one image. Finally, for identifying shapes in a given image you just iterate through all stored line paths. I believe this is how our brains conceptualise shapes in the first place, given that our brains have nowhere to draw shapes to compare to. Instead they just have connections between... cells? neurons? Someone will correct me. Anyway, they just have connections between what are effectively physical functions that equate to something like this in C:
    int neuron( float connections[CHAR_BIT * sizeof(uint)] );
    Which tells me the same subshapes share neurons for comparisons, which means a bigger shape will likely be just an initial neuron to visit, how many neurons to visit, and what angle to direct the path at to identify the next neuron to visit. In other words, every subshape would be able to revisit a previous subshape's neuron/function. There might be an extra value or 2 but I'm no neural expert, so a rough guess should be accurate enough to get the ball rolling.

  • @Foxxey
    @Foxxey 16 days ago +3

    14:36 Why can't you just train a network that would decode the vector in the embedded space back into text (being either fixed sized or using a recurrent neural network)? Wouldn't it be as simple as training a decoder and encoder in parallel and using the text input of the encoder as the expected output in the decoder?

    • @or1on89
      @or1on89 11 days ago

      Because that’s a whole different class of problem and would make the process highly inefficient. There are better ways just to do that using a different approach.

  • @FilmFactry
    @FilmFactry 16 days ago

    When will we see multimodal LLMs able to answer a question with a generated image? It could be "how do you wire an electric socket", and it would generate either a diagram or an illustration of the wire colors and positions. It should be able to do this but it can't yet. Next would be a functional use of SORA rendering a video of how you install a starter motor in a Honda.

  • @Funkymix18
    @Funkymix18 14 days ago +1

    Mike is the best

  • @barrotem5627
    @barrotem5627 11 days ago

    Brilliant, Mike!

  • @pickyourlane6431
    @pickyourlane6431 12 days ago

    i was curious, when you are showing the paper from above, are you transforming the original footage?

  • @IOSARBX
    @IOSARBX 16 days ago

    Computerphile, This is great! I liked it and subscribed!

  • @jonathan-._.-
    @jonathan-._.- 16 days ago

    approx how many samples do i need when i just want to do image categorisation (but with multiple categories per image)

  • @lancemarchetti8673
    @lancemarchetti8673 10 days ago

    Amazing. Imagine the day when AI is able to detect digital image steganography. Not by vision primarily, but by bit inspection.... iterating over the bytes and spitting out the hidden data. I think we're still years away from that though.

  • @aleksszukovskis2074
    @aleksszukovskis2074 11 days ago +1

    there is stray audio in the background that you can faintly hear at 0:05

  • @StashOfCode
    @StashOfCode 12 days ago

    There is a paper on The Gradient about reverting embeddings to text ("Do text embeddings perfectly encode text?")

  • @el_es
    @el_es 14 days ago

    @dr Pound: Sorry if this is off topic here, but I wonder if the problem of hallucinations in AI comes from us treating a model's 'I don't know what I'm looking at' answer as a very negative outcome? If we treated it as a valid neutral answer, could it reduce the rate of hallucinations?

  • @MilesBellas
    @MilesBellas 15 days ago

    Stable Diffusion 3 = potential topic
    Optimum workflow strategies using ControlNets, LoRAs, VAEs etc.?

  • @Holycrabbe
    @Holycrabbe 4 days ago

    So the CLIP array training the diffusion would have 400 million entries? So it defines a "corner" of the space spanned by the 400 million photos and photo descriptions?

  • @WilhelmPendragon
    @WilhelmPendragon 6 days ago

    So the vision-text encoder is dependent on the quality of the captioned photo dataset? If so, where do you find quality datasets?

  • @thestormtrooperwhocanaim496

    A good edging session (for my brain)

  • @JT-hi1cs
    @JT-hi1cs 16 days ago

    Awesome! I always wondered how the hell the AI “gets” that an image is made with a certain type of lens or film stock. Or how the hell AI generates things that were never filmed that way, say, The Matrix filmed on fisheye and Panavision in the 1950s.

  • @genuinefreewilly5706
    @genuinefreewilly5706 15 days ago

    Great explainer. Appreciated. I hope someone will cover AI music next

    • @suicidalbanananana
      @suicidalbanananana 15 days ago +1

      In super short:
      Most "AI music stuff" is literally just running stable diffusion in the backend, they train a model on the actual images of spectrograms of songs, then ask it to make an image like that & then convert that spectrogram image back to sound.

    • @genuinefreewilly5706
      @genuinefreewilly5706 15 days ago

      @@suicidalbanananana Yes, I can see that; however, AI music has made a sudden, marked departure in quality of late.
      It's pretty controversial among musicians.
      I can wrap my head around narrow AI applications in music, i.e. mastering, samples etc. It's been a mixed bag of results until recently.

    • @or1on89
      @or1on89 11 days ago +1

      It surely would be interesting…I can see a lot of people embracing it for pop/trap music and genres with “simple” compositions…my worry as a musician is that it would make the landscape more boring than boy bands in the 90s (and somewhat already is without AI being involved).
      As a software developer I would love instead to explore the tool to refine filters, corrections and sampling during the production process…
      It’s a bit of a mixed bag…the generative aspect is being marketed as the “real revolution” and that’s a bit scary…knowing more the tech and how ML can help improve our tools would be great…
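
    For the spectrogram approach described earlier in this thread, a minimal sketch of the "spectrogram image back to sound" step using librosa's Griffin-Lim (the diffusion model that would generate the spectrogram is omitted, and the parameters are illustrative):

      import numpy as np
      import librosa

      n_fft, hop = 1024, 256

      # Stand-in for a generated spectrogram "image": a magnitude spectrogram of shape
      # (1 + n_fft // 2, frames). A real pipeline would decode this from the pixels the
      # diffusion model produced.
      magnitude = np.abs(np.random.randn(1 + n_fft // 2, 400))

      # Griffin-Lim iteratively estimates the missing phase and inverts the STFT.
      audio = librosa.griffinlim(magnitude, n_fft=n_fft, hop_length=hop)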

  • @nenharma82
    @nenharma82 16 days ago +1

    This is as simple as it’s ingenious and it wouldn’t be possible without the internet being what it is.

    • @IceMetalPunk
      @IceMetalPunk 15 days ago

      True! Although it also requires Transformers to exist, as previous AI architectures would never be able to handle all the varying contexts, so it's a combination of the scale of the internet and the invention of the Transformer that made it all possible.

    • @Retrofire-47
      @Retrofire-47 14 days ago

      @@IceMetalPunk the transformer, as someone who is ignorant, what is that? I only know a transformer as a means of converting electrical voltage from AC - DC

  • @zurc_bot
    @zurc_bot 15 days ago +1

    Where did they get those images from? Any copyright infringement?

  • @dimitrifogolin
    @dimitrifogolin 2 days ago

    Amazing

  • @LukeTheB
    @LukeTheB 15 days ago

    Quick question from someone outside computer science:
    Does the model actually instill "meaning" into the embedded space?
    What I mean is:
    Is the Angel between "black car" and "Red car" smaller than "black car" and "bus" and that is smaller than "black car" and "tree"?

    • @suicidalbanananana
      @suicidalbanananana 15 days ago +2

      Yeah that's correct, "black car" and "red car" will be much closer to each other than "black car" and "bus" or "black car" and "tree" would be. It's just pretty hard to visualize this in our minds because we're talking about some strange sort of thousands-of-dimensions-space with billions of data points in it. But there's definitely discernable "groups of stuff" in this data.
      (Also, "Angle" not "Angel" but eh, we get what you mean ^^)

  • @unvergebeneid
    @unvergebeneid 16 days ago

    It's a bit confusing, though, to say that you want to maximise the distances on the diagonal. Of course you can define things however you want, but usually you'd say you want to maximise the cosine similarity and thus minimise the cosine distance on the diagonal.

  • @MattMcT
    @MattMcT 15 days ago

    Do any of you ever get this weird feeling that you need to buy Mike a beer? Or perhaps, a substantial yet unknown factor of beers?

  • @j3r3miasmg
    @j3r3miasmg 14 days ago

    I didn't read the cited paper, but if I understood correctly, the 5 billion images need to be labeled for the training step?

  • @utkua
    @utkua 16 days ago

    How do you go from embeddings to text for something never seen before?

  • @ginogarcia8730
    @ginogarcia8730 14 days ago

    I wish I could hear Professor Brailsford's thoughts on AI these days man

  • @AZTECMAN
    @AZTECMAN 16 days ago +7

    CLIP is fantastic.
    It can be used as a 'zero-shot' classifier.
    It's both effective and easy to use.
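
    A minimal sketch of that zero-shot use, assuming the Hugging Face transformers CLIP wrapper (the class list, prompt template and file name are placeholders):

      import torch
      from PIL import Image
      from transformers import CLIPModel, CLIPProcessor

      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      classes = ["cat", "dog", "boat", "car"]
      prompts = [f"a photo of a {c}" for c in classes]
      image = Image.open("example.jpg")                 # any local image

      inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
      with torch.no_grad():
          logits = model(**inputs).logits_per_image     # similarity of the image to each prompt
      probs = logits.softmax(dim=-1)[0]
      print(classes[int(probs.argmax())], float(probs.max()))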

  • @GeoffryGifari
    @GeoffryGifari 16 days ago +1

    How can AI determine the "importance" of parts of an image? Why would it output "people in front of boat" instead of "boat behind people" or "boat surrounded by people"?
    Or maybe the image is a grid of square white cells. One cell then gets its color progressively darkened to black. Would the AI describe these transitioning images differently?

    • @michaelpound9891
      @michaelpound9891 16 days ago +3

      Interesting question! This very much comes down to the training data in my experience. For the network to learn a concept such as "depth ordering", where something is in front of another, what we are really saying is it has learnt a way to extract features (numbers in grids) representing different objects, and then recognize that an object is obscured or some other signal that indicates this concept of being in front of. For this to happen in practice, we will need to see many examples of this in the training data, such that eventually such features occurring in an image lead to a predictable text response.

    • @GeoffryGifari
      @GeoffryGifari 16 days ago

      @@michaelpound9891 The man himself! thank you for your time

    • @GeoffryGifari
      @GeoffryGifari 16 days ago +1

      @@michaelpound9891 I picked that example because... maybe it's not just depth? Maybe there are a myriad of factors that the AI summarized as "important".
      For example the man is in front of the boat, but the boat is far enough behind that it looks somewhat small... Or maybe that small boat has a bright color that contrasts with everything else (including the man in front).
      But your answer makes sense, that it's the training data.

    • @Jononor
      @Jononor 1 day ago +1

      @@GeoffryGifari Salience, and salience detection, is what this concept is usually called in computer vision. CLIP-style models will learn it as a side effect.

  • @bogdyee
    @bogdyee 16 days ago

    I'm curious about a thing. If you have a bunch of millions of photos of cats and dogs and they are also correctly labeled (with descriptions), but all these photos have the cats and dogs in the bottom half of the image, will the transformer be able to correctly classify them after training if they are put in the upper half of the image? (Or the images are rotated, color changed, filtered, etc.)

    • @Macieks300
      @Macieks300 16 days ago +1

      Yes, it may learn it wrong. That's why scale is necessary for this. If you have a million photos of cats and dogs it's very unlikely that all of them are in the bottom half of the image.

    • @bogdyee
      @bogdyee 16 days ago

      @@Macieks300 That's why for me it poses a philosophical question. Will these things actually solve intelligence at some point? If so, what exactly might be the difference between a human brain and an artificial one?

    • @IceMetalPunk
      @IceMetalPunk 15 days ago

      @@bogdyee Well, think of it this way: humans learn very similarly. It may not seem like it, because the chances of a human only ever seeing cats in the bottom of their vision and never anywhere else is basically zero... but we do. The main difference between human learning and AI learning, with modern networks, is the training data: we're constantly learning and gathering tons of data through our senses and changing environments, while these networks learn in batches and only get to learn from the training data we curate, which tends to be relatively static. But give an existing AI model the ability to do online learning (i.e. continual learning, not "look up on the internet" 😅) and put it in a robot body that it can control? And you'll basically have a human brain, perhaps at a different scale. And embodied AIs are constantly being worked on now, and continual learning for large models... I'm not sure about. I think the recent Infini-Attention is similar, though, so we might be making progress on that as well.

    • @suicidalbanananana
      @suicidalbanananana 15 days ago

      @@bogdyee Nah they won't solve intelligence at some point when going down this route they are currently going down, AI industry was working on actual "intelligence" for a while but all this hype about shoving insane amounts of training data into "AI" has reduced the field to really just writing overly complex search engines that sort of mix results together... 🤷‍♂
      It's not trying to think or understand anything at all at this stage (as is the actual goal of the AI field), it's really just trying to match patterns. "Ah, the user talked about dogs, my training data contains the following info about dog type a/b/c; oh, the user asks about trees, training data contains info about tree type a/b/c", etc.
      Actual AI (not even getting to the point of 'general AI' yet, but certainly getting somewhere much better than what we have now) would have little to no training data at all; instead it would start 'learning' as it's running, so you would talk to it about trees and it would go "idk what a tree is, please tell me more", and then later on it might have some basic understanding of "ah yes, trees, I have heard about them, person x explained them to me, they let you all breathe & exist in types a/b/c, right? please tell me more about trees"
      Where the weirdness lies is that the companies behind current "AI" are starting to tell the "AI" to respond in a similar smart manner, so they are starting to APPEAR smart, but they're not actually capable of learning. All the current AI's do not remember any conversation they have had outside of training, because that makes it super easy to turn Bing (or whatever) into yet another racist twitter bot (see microsoft's history with ai chatbots)

    • @suicidalbanananana
      @suicidalbanananana 15 days ago

      @@IceMetalPunk The biggest difference is that we (or any other biological intelligence) don't need insanely large amounts of training data: show a baby some spoons and forks and how to use them and that baby/person will recognize and be able to use 99.9% of spoons and forks correctly for the rest of its life. Current overhyped AIs would have to see thousands of spoons and forks to maybe get it right 75% of the time, & that's just recognizing it; we're not even close yet to 'understanding how to use'.
      Also worth noting is how we (and again, any other biological intelligence) are always gathering "training data" and are much more versatile when it comes to new things: if you train an AI to recognize spoons and forks and then show it a knife, it's just going to classify it as a fork or spoon, whereas we would go "well, that's something I've not seen before, so it's NOT a spoon and NOT a fork".

  • @bennettzug
    @bennettzug 15 days ago

    13:54 you actually probably can, at least to an extent
    there’s been some recent research on the idea of going backwards from embeddings to text, maybe look at the paper “Text Embeddings Reveal (Almost) As Much As Text” (Morris et al)
    the same thing has been done with images from a CNN, see “Inverting Visual Representations with Convolutional Networks” (Dosovitskiy et al)
    neither of these are with CLIP models so maybe future research? (not that it’d produce better images than a diffusion model)

    • @or1on89
      @or1on89 11 days ago

      You can, using a different type of network/model. We need to remember that everything he said is in the context of a specific type of model and not in absolute terms; otherwise the lesson would very quickly go out of scope and be hard to follow.

    • @bennettzug
      @bennettzug 11 days ago

      @@or1on89 i don’t see any specific reason why CLIP model embeddings would be especially intractable though

  • @donaldhobson8873
    @donaldhobson8873 15 days ago

    Once you have CLIP, can't you train a diffusion model on pure images, just by putting an image into CLIP and training the diffusion model to output the same image?

  • @MikeKoss
    @MikeKoss 14 days ago

    Can't you do something analogous to stable diffusion for text classification? Get the image embedding, and then start with random noisy text, and iteratively refine it in the direction of the image's embedding to get a progressively more accurate description of the image.

    • @quonxinquonyi8570
      @quonxinquonyi8570 9 days ago

      Image manifolds are of huge dimension compared to text manifolds… so guided diffusion from a low-dimensional manifold to a very high-dimensional manifold would have less information and more noise. Basically, information-theoretic bounds still hold when you transform from a high-dimensional space to a low-dimensional embedding, but the other way around isn't as intuitive… some prior might have to be taken into account… but it is still a hard problem.

  • @robosergTV
    @robosergTV 9 hours ago

    Please make a Playlist only about GenAI or a separate AIphile channel. I care only about genAI.

  • @ianburton9223
    @ianburton9223 16 days ago

    Difficult to see how convergence can be ensured. Lots of very different functions can be closely mapped over certain controlled ranges, but then are wildly different outside those ranges. What I have missed in many AI discussions is these concepts of validity matching and range identities to ensure that there's some degree of controlled convergence. Maybe this is just a human fear of the unknown.

  • @proc
    @proc 16 days ago

    9:48 I didn't quite get how similar embeddings end up close to each other if we maximize the distances to all other embeddings in the batch? Wouldn't two images of dogs in the same batch be pulled further apart, just like an image of a dog and a cat would? Explain like Dr. Pound please.

    • @drdca8263
      @drdca8263 16 days ago

      First: I don’t know.
      Now I’m going to speculate:
      Not sure if this had a relevant impact, but: probably there are quite a few copies of the same image with different captions, and of the same caption for different images?
      Again, maybe that doesn’t have an appreciable effect, idk.
      Oh, also, maybe the number of image,caption pairs is large compared to the number of dimensions for the embedding vectors?
      Like, I know the embedding dimension is pretty high, but maybe the number of image,caption pairs is large enough that some need to be kinda close together?
      Also, presumably the mapping producing the embedding of the image, has to be continuous, so, images that are sufficiently close in pixel space (though not if only semantically similar) should have to have similar embeddings.
      Another thing they could do, if it doesn’t happen automatically, is to use random cropping and other small changes to the images, so that a variety of slightly different versions of the same image are encouraged to have similar embeddings to the embedding of the same prompt.

  • @fredrik3685
    @fredrik3685 14 days ago

    Question 🤚
    Up until recently all images of a cat on the internet were photos of real cats and the system could use them in training.
    But now more and more cat images are AI generated.
    If future systems use generated images in training it will be like the blind leading the blind. More and more distortion will be added. Or? Can that be avoided?

    • @quonxinquonyi8570
      @quonxinquonyi8570 8 days ago

      Distortion and perceptual quality are the tradeoff we make when we use generative AI.

  • @NeinStein
    @NeinStein 15 days ago

    Oh look, a Mike!

  • @nightwishlover8913
    @nightwishlover8913 15 days ago

    5:02 Never seen a "boat wearing a red jumper" before lol

  • @charlesgalant8271
    @charlesgalant8271 16 days ago +1

    The answer given for the "we feed the embedding into the denoise process" still felt a little hand-wavey to me as someone who would like to understand better, but overall good video.

    • @michaelpound9891
      @michaelpound9891 16 days ago +3

      Yes I'm still skipping things :) The process this uses is called attention, which basically is a type of layer we use in modern deep networks. The layer allows features that are related to share information amongst themselves. Rob Miles covered attention a little in the video "AI Language Models & Transformers", but it may well be time to revisit this since attention has become quite a lot more mainstream now, being put in all kinds of networks.

    • @IceMetalPunk
      @IceMetalPunk 15 days ago

      @@michaelpound9891 It is, after all, all you need 😁 Speaking of attention: do you think you could do a video (either on Computerphile or elsewhere) about the recent Infini-Attention paper? It sounds to me like it's a form of continual learning, which I think would be super important to getting large models to learn more like humans, but it's also a bit over my head so I feel like I could be totally wrong about that. I'd appreciate an overview/rundown of it, if you've got the time and desire, please 💗
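
    A minimal sketch of the attention operation referred to above (generic scaled dot-product cross-attention; real layers also learn query/key/value projections, which are omitted here):

      import torch
      import torch.nn.functional as F

      def cross_attention(image_feats, text_feats, d_k=64):
          # image_feats: [N_img, d_k] act as queries; text_feats: [N_txt, d_k] act as keys/values.
          scores = image_feats @ text_feats.T / d_k ** 0.5   # relevance of each text token
          weights = F.softmax(scores, dim=-1)                # attention weights per image feature
          return weights @ text_feats                        # weighted mix of text information

      img = torch.randn(16, 64)    # e.g. spatial features inside the denoising network
      txt = torch.randn(8, 64)     # e.g. token embeddings from the text encoder
      out = cross_attention(img, txt)                        # [16, 64]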

  • @klyanadkmorr
    @klyanadkmorr 15 days ago +1

    Heyo, a Pound dogette here!

  • @Rapand
    @Rapand 16 days ago

    Each time I watch one of these videos, I might as well be watching Apocalypto without subtitles. My brain is not made for this 🤓

  • @MedEighty
    @MedEighty 12 days ago

    10:37 "If you want to unlock a face with your phone". Ha ha ha!

  • @hehotbros01
    @hehotbros01 13 days ago

    Poundtown.. sweet...

  • @EkShunya
    @EkShunya 14 days ago

    I thought diffusion models had a VAE and not a ViT.
    Correct me if I'm wrong.

    • @quonxinquonyi8570
      @quonxinquonyi8570 8 days ago

      A diffusion model is an upgraded version of a VAE, with limitations in sampling speed.

  • @creedolala6918
    @creedolala6918 15 days ago

    'and we want an image of foggonstilz'
    me: wat
    'we want to pass the text of farngunstills'
    me: u wot m8

  • @eigd
    @eigd 16 days ago

    9:48 Been a while since I did machine learning class... Anyone care to tell me why I'm thinking of PCA? What's the connection?

    • @Hexanitrobenzene
      @Hexanitrobenzene 1 day ago

      Hm, I'm not an expert either, but... AFAIK, Principal Component Analysis finds directions which maximise/minimise the variance of the data, which can be thought of as average distance. The drawback is that it's only a linear method and it cannot deal with high dimensional data such as images effectively.

  • @babasathyanarayanathota8564

    Me: added to resume ai expert

  • @MuaddibIsMe
    @MuaddibIsMe 16 days ago +3

    "a mike"

  • @bryandraughn9830
    @bryandraughn9830 16 days ago

    I wonder if every cat image has specific "cat" types of numerical curves, textures, eyes and so on, so a completely numerical calculation would conclude that the image is of a cat.
    There's only so much variety of pixel arrangements at some resolution; it seems like images could be reduced to pure math. I'm probably so wrong.
    Just curious.

    • @quonxinquonyi8570
      @quonxinquonyi8570 8 days ago

      You are absolutely right… images are of very high dimension, but their image manifold is still considered to cover and fill a very low-dimensional part of the whole image hyperspace… the only way to manipulate or tweak that image manifold is by adding noise… but noise is of very low dimension compared to that high-dimensional image manifold… so that perturbation or guidance of the image manifold in the form of noise disturbs it along one of its many inherent directions… this is similar to finding the slope of a curve (manifold) by linearly approximating it with a line (noise)… this is the method you learn in your high-school maths… if you want to discuss more, I will clarify it further…

  • @kbabiy
    @kbabiy 1 day ago

    [15:00] It's supposed to have a tail

  • @JeiShian
    @JeiShian 15 days ago

    The exchange at 6:50 made me laugh out loud and I had to show that part of the video to the people around me😆😆

  • @MilesBellas
    @MilesBellas 15 days ago

    Stable Diffusion needs a CEO BTW
    ....just saying ...
    😅

  • @FLPhotoCatcher
    @FLPhotoCatcher 16 days ago

    At 16:20 the 'cat' looks more like a shower head.

  • @CreachterZ
    @CreachterZ 16 days ago +1

    How does he stay on top of all of this technology and still have time to teach? …and sleep?

  • @RawrxDev
    @RawrxDev 16 days ago +7

    Truly a marvel of human applications of mathematics and engineering, but boy do I think these tools have significantly more cons than pros in practical use.

    • @aprilmeowmeow
      @aprilmeowmeow 16 days ago +3

      agreed. The sheer power required is an ethical concern

    • @suicidalbanananana
      @suicidalbanananana 15 days ago +2

      We're currently experiencing an "AI bubble" that will pop within 2-3 years or less, no doubt about that at all. Companies are wasting money and resources trying to be the first to make something crappy appear less crappy than it actually is, but they don't fully realize yet that that's a harder task than it might seem & it's going to be extremely hard to monetize the end result.
      We need to move back to AI research trying to recreate a biological brain; somehow the field has suddenly been reduced to people trying to recreate a search engine that mixes results or something, which is just ridiculous & running in the opposite direction from where the AI field should be heading.

    • @RawrxDev
      @RawrxDev 15 days ago

      @@suicidalbanananana That's my thought as well, I even recently watched a clip from sam altman saying they have no idea how to actually make money from AI without investors, and that he is just going to ask the AGI how to make a return once they achieve AGI, which to me seems..... optimistic.

  • @djtomoy
    @djtomoy 14 days ago

    Why is there always so much mess and clutter in the background of these videos? Do you film them in abandoned buildings?

  • @YouTubeCertifiedCommenter

    This must have been the entire purpose of Google's Picasa.

  • @grantc8353
    @grantc8353 16 days ago

    I swear that P took longer to come up than the rest.

  • @Ginto_O
    @Ginto_O 16 days ago

    a yellow cat is called red cat

  • @MagicPlants
    @MagicPlants 16 days ago +2

    the gorilla camera moving around all the time is making me dizzy

  • @SkEiTaDEV
    @SkEiTaDEV 15 days ago

    Isn't there an AI that fixed shaky video by now?

    • @creedolala6918
      @creedolala6918 15 days ago

      Isn't that a problem that's been solved without AI already? Someone can ride a mountain bike that's violently shaking down a forest trail, with a GoPro on his helmet, and we get perfectly smooth video of it somehow.

  • @willhart2188
    @willhart2188 15 days ago

    AI art is great.

  • @artseiff
    @artseiff 8 days ago

    Oh, no… All cats are missing a tail 😿

  • @diegoyotta
    @diegoyotta 3 days ago

    Mike Pound, the cousin of Mike Dollar and Mike Ruble