DETR: End-to-End Object Detection with Transformers (Paper Explained)
- Published 17 Jul 2024
- Object detection in images is a notoriously hard task! Objects can be of a wide variety of classes, can be numerous or absent, they can occlude each other or be out of frame. All of this makes it even more surprising that the architecture in this paper is so simple. Thanks to a clever loss function, a single Transformer stacked on a CNN is enough to handle the entire task!
OUTLINE:
0:00 - Intro & High-Level Overview
0:50 - Problem Formulation
2:30 - Architecture Overview
6:20 - Bipartite Match Loss Function
15:55 - Architecture in Detail
25:00 - Object Queries
31:00 - Transformer Properties
35:40 - Results
ERRATA:
When I introduce bounding boxes, I say they consist of x and y, but you also need the width and height.
My Video on Transformers: • Attention Is All You Need
Paper: arxiv.org/abs/2005.12872
Blog: / end-to-end-object-dete...
Code: github.com/facebookresearch/detr
Abstract:
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at this https URL.
Authors: Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
This is a gift. The clarity of the explanation, the speed at which it comes out. Thank you for all of your work.
I had seen your Attention is all you need video and now watching this, I am astounded by the clarity you give in your videos. Subscribed!
Yup. Subscribed with notifications. I love that you enjoy the content of the papers. It really shows! Thank you for these videos.
Really appreciate the efforts you are putting into this. You paper explanations make my day everyday!
Awesome video. Highly recommend reading the paper first and then watching this to solidify understanding. This definitely helped me understand the DETR model more.
Greatest find on YouTube for me to date!! Thank you for the great videos!
Thank you for your wonderful video. When I first read this paper, I couldn't understand what the input of the decoder (the object queries) is, but after watching your video I finally got it: random vectors!
A great paper and a great review of the paper! As always nice work!
Wow, the way you've explained and broken down this paper is spectacular,
Thx mate
The attention visualizations are practically instance segmentations; very impressive results, and great job untangling it all
YES! I was waiting for this!
Was waiting for this. Thanks a lot! Also dude, how many papers do you read everyday?!!!
Loved the video! I was just reading the paper.
Just wanted to point out that Transformers, or rather Multi Head Attention, naturally processes sets, not sequences, this is why you have to include the positional embeddings.
Do a video about the Set Transformer! In that paper they call the technique used by the decoder in this paper "Pooling by Multihead Attention".
Very true, I was just still in the mode where transformers are applied to text ;)
What are positional encodings?
@@princecanuma The positional encoding simply encodes the index of each token in the sequence.
@@snippletrap I had a feeling it was gonna be something that simple. 🤦🏾♂️ AI researchers' naming conventions aren't helping the community, in terms of accessibility lmao
Thank you for the one-line summary of "Pooling by Multihead Attention". This makes it 10x clearer about what exactly the decoder is doing. I was feeling that the "decoder + object seeds" is doing similar things to ROI pooling, which is gathering relevant information for a possible object. I also recommend reading the set transformer paper, which enhanced my limited knowledge of attention models. Thanks again for your comment!
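To make the thread above concrete, here is a minimal sketch of the sinusoidal positional encoding from "Attention Is All You Need": each position index is mapped to a vector of sines and cosines at different frequencies. (This is my own illustrative code, not DETR's; DETR itself uses 2D spatial encodings, but the idea is the same.)

```python
import math

def positional_encoding(num_positions, d_model):
    """Map each position index to a d_model-dim vector of sinusoids
    (a simplified 1D sketch of the Attention Is All You Need scheme)."""
    table = []
    for pos in range(num_positions):
        vec = []
        for i in range(0, d_model, 2):
            freq = 1.0 / (10000 ** (i / d_model))
            vec.append(math.sin(pos * freq))  # even dimensions: sine
            vec.append(math.cos(pos * freq))  # odd dimensions: cosine
        table.append(vec[:d_model])
    return table

pe = positional_encoding(num_positions=4, d_model=8)
# pe[0] starts with sin(0) = 0.0, cos(0) = 1.0; every position gets a
# distinct vector, which is exactly the information a set-processing
# transformer would otherwise lack.
```

Adding these vectors to the token features is what turns a set processor into a sequence (or grid) processor.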
Great!!! Absolutely great! Fast, to the point, and extremely clear. Thanks!!
This video was absolutely amazing. You explained this concept really well, and I loved the bit at 33:00 about flattening the image twice and using the rows and columns to create an attention matrix where every pixel can relate to every other pixel. Also loved the bit at the beginning where you explained the loss in detail. A lot of other videos just gloss over that part. Have liked and subscribed
Thank you for this content! I have recommended this channel to my colleagues.
You are a godsend! Please keep up the good work!
"Maximal benefit of the doubt" - love it!
Thanks so much for making it so easy to understand these papers.
Very informative. Thanks for the explanation!
Very well done and understandable. Thank you!
Fantastic explanation 👌 looking forward for more videos ❤️
infinite respect for the ali G reference
Haha someone noticed :D
What an amazing paper and an explanation!
You saved my project. Thank you 🙏🏻
Thanks for great explanation!
Really smart idea about how the (HxW)^2 matrix naturally embeds bounding box information. I am impressed :)
You explained it so well. Thanks . best of luck
Thanks for the walkthrough!
Very very nice explanation, I really subscribed for that quadratic attention explanation. Thanks! :D
I like the way you DECIPHER things! thanks!
Amazing explanation. Keep up the great work.
Thank you very much, this was really good.
Thank you very much. This is a very good video. Very easy to understand.
really thank you for your explanation!
Excellent work,Thanks!
34:08 GOAT explanation of the bbox in the attention feature map.
thank u so much for the video! it's so amazing and helped me understand this paper much better ^^
So cool! You are great!
Thank you sooo much for this explanation!!
very clear explanation, great work sir. thanks
Holy shit. Instant subscribe within 3 minutes. Bravo!!
Great content!
Very cool video, thank you!
Love this content bro thank you so much, hoping to get a Mac in Artificial Intelligence
Great explanation
This is a really great idea
really quite quick. thanks. make more...
Thanks for the explanation
Awesome 🔥🔥🔥
Thank you for providing such interesting paper reading ! Yannic Kilcher
I love your channel thank you soooo much
Hi Yannic, amazing video and great improvements in the presentation (time-sections in youtube etc.) I really like where this channel is going, keep it up.
I've been reading through the paper myself yesterday as I've been working with that kind of attention for CNNs a bit and I really liked the way you described the mechanism behind the different attention heads in such a simplistic and easily understandable way!
Your idea of directly inferring bboxes from two attending points in the "attention matrix" sounds neat and hadn't crossed my mind yet. But I guess you'd probably have to use some kind of NMS again if you do that?
One engineering problem I came across, especially with those full (HxW)^2 attention matrices, is that they blow up your GPU memory insanely. Thus you can only use a fraction of the batch size, and a (HxW)^2 multiplication also takes forever, which is why the model takes much longer to train (and, I think, to infer).
What impressed me most was that an actually very "unsophisticated learned upscaling and argmax over all attention maps" achieved such great results for panoptic segmentation!
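To put rough numbers on that memory blow-up (a back-of-the-envelope sketch of my own; the sizes are illustrative, assuming float32 and a typical 32x downsampling CNN backbone):

```python
def attention_matrix_bytes(h, w, bytes_per_elem=4):
    """Memory for one dense (H*W) x (H*W) attention-weight matrix
    over a flattened h-by-w feature map, in bytes (float32 default)."""
    n = h * w                      # sequence length after flattening
    return n * n * bytes_per_elem

# A 32x32 feature map (e.g. a 1024x1024 image downsampled 32x by the CNN):
small = attention_matrix_bytes(32, 32)      # ~4.2 MB per head per layer
# The same attention at full 1024x1024 pixel resolution:
huge = attention_matrix_bytes(1024, 1024)   # ~4.4 TB -- completely infeasible,
# which is why the CNN downsampling before the transformer is essential.
```

The quadratic growth in sequence length is exactly why the batch size has to shrink and why training slows down so much.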
One thing that I did not quite get: can the multiple attention heads actually "communicate" with each other during the "look-up"? Going by the description in Attention Is All You Need ("we then perform the attention function in parallel, yielding d_v-dimensional output values") and the formula Concat(head_1, ..., head_h)W^O, it looks to me like the attention heads do not share information while attending to things. Only W^O might, during backprop, be able to reweight the attention heads if they have overlapping attention regions?
Yes, I see it the same way: the individual heads do independent operations in each layer. I guess the integration of information between them would then happen in higher layers, where their signals could be aggregated by a single head.
Also, thanks for the feedback :)
@@YannicKilcher The multi-head part is the only confusion I have about this great work. In NLP multi-head attention makes total sense: an embedding can "borrow" features/semantics from multiple words at different feature dimensions. But in CV it seems it's not necessary? The authors didn't do an ablation study on the number of heads. My suspicion is that a single head works almost as well as 8 heads. Would test it once I get a lot of GPUs...
Awesome!
Great video, thank you
"First paper ever to cite a YouTube channel." ...challenge accepted.
Thanks a lot for this really helpful
Excellent
Are you even human? You're really quick.
Nope .. A Bot
@@m.s.d2656 I don't actually know which is the most impressive
There's a bird!!! There's a bird...
@@krishnendusengupta6158 bird, bird, bird, bird, bird, bird, bird, bird, its a BIRD
Awesome!!! Yannic, by any chance, would you mind reviewing the paper (1) Fawkes: Protecting Personal Privacy against Unauthorized Deep Learning Models or (2) Analyzing and Improving the Image Quality of StyleGAN? I would find it helpful to have those papers deconstructed a bit!
Thanks for this vid, really fast. I still (after 2 days) haven't tried to run it on my data - feeling bad
I love how it understands which part of the image belongs to which object (elephant example) regardless of overlapping. Kind of understands the depth. Maybe transformers can be used for depth-mapping?
Great sharing! I'd like to ask: is there any guideline for deciding how many object queries to use for a particular object detection problem? Thanks!
Great!
Excellent job as usual. Congrats on your Ph.D.
Cool trick adding position encoding to K,Q and leaving V without position encoding. Is this unique to DETR?
I'm guessing the decoder learns an offset from these given positions, analogous to more traditional bounding-box algorithms finding boxes relative to a fixed grid, with the extra twist that the decoder also eliminates duplicates.
This is the same thing I wanted to ask. Why leave out V? It's not even described in the paper.
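For anyone puzzling over this design, here is a tiny numpy sketch of my own (not DETR's actual code) of the idea: the positional encoding is added to the queries and keys, so positions influence *where* attention looks, while the values that get mixed together remain position-free content features.

```python
import numpy as np

def attention_qk_pos(x, pos):
    """Single-head self-attention where the positional encoding is
    added to Q and K only; V carries pure content features."""
    q = x + pos                              # where to look: position-aware
    k = x + pos
    v = x                                    # what to mix: content only
    scores = (q @ k.T) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # 5 tokens, 8-dim features
pos = rng.normal(size=(5, 8))
out = attention_qk_pos(x, pos)               # shape (5, 8)
```

One plausible reading of the choice: since the output is a weighted sum of V, leaving the positions out of V keeps positional signal from being smeared into the content representation.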
Great video, very speedy :). How well does this compare to YOLOv4?
No idea, I've never looked into it.
I think it might not be as good rn but the transformer part can be scaled like crazy.
Thanks Yannick! Great explanation. Since the object queries are learned and I assume they remain fixed after training, why do we keep the lower self-attention part of the decoder block during inference, and not just replace it with the precomputed Q values?
Great video! What about a video on this paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows? They split the images into patches, use self-attention locally on every patch, and then shift the patches. Would be great to hear your explanation of this!
thanks! Which one do you think is better compared to YOLOv8, for example?
Thank you very much for the explanation! I have a couple of questions:
1. Can we consider object queries to be analogous to anchor boxes?
2. Does the attention visualization highlight those parts of the image which the network gives the highest importance to while predicting?
1. Somewhat, but object queries are learned and initially completely independent of the datapoint.
2. Yes, there are multiple ways, but roughly it's what you're saying
2:47 worth pointing out that the CNN reduces the size of the image while retaining high level features and so massively speeds up computation
Hi, thanks Yannic for all the videos. I have a question about recognizing digits in images that are not handwritten: how can we find digits in the street, like building numbers or license plates? Thanks in advance
the object queries remind me of latent variables in variational architectures (VAEs, for example). In those architectures, the LVs are constrained with a prior. Is this done for the object queries? Would that be a good idea?
Has anyone tried to run this on a Jetson Nano to compare with previous approaches? How much faster is it compared with a MobileNet SSD v2?
So basically little people asking lots of questions... nice!
PS. Thanks Yannic for the great analogy and insight...
I'm a bit confused. At 17:17, you are drawing vertical lines, meaning that you unroll the channels (ending up with a vectors of features per pixel that are fed into the transformer, "pixel by pixel"). Is that how it's being done? Or should there be horizontal lines (WH x C), where you feed one feature at a time for the entire image into the transformer?
Yes; if you think of text transformers as consuming one word vector per word, the analogy here is that you consume all channels of one pixel, pixel by pixel
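In code, that flattening step looks something like the following (my own numpy sketch; the sizes are examples, not DETR's exact shapes): the CNN output (C, H, W) becomes a sequence of H*W "tokens", each a C-dim feature vector.

```python
import numpy as np

# Sketch of flattening a CNN feature map into a transformer sequence:
# (C, H, W) -> (H*W, C), one C-dim token per spatial position.
C, H, W = 256, 16, 20                        # illustrative sizes
features = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

tokens = features.reshape(C, H * W).T        # -> (H*W, C)

assert tokens.shape == (H * W, C)
# Token 0 contains all C channel values of spatial position (0, 0),
# i.e. the "vertical line" through the feature volume at that pixel:
assert (tokens[0] == features[:, 0, 0]).all()
```

So the transformer's "sequence length" is the number of spatial positions, and its "embedding dimension" is the number of CNN channels.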
Great channel, subscribed! How does this approach compare to models optimized for size and inference speed on mobile devices, like SSD MobileNet? (See the detection model zoo on the TF GitHub)
No idea, I'm sorry :)
can you do one about Efficient-det?
AI Developer:
AI: 8:36 BIRD! BIRD! BIRD!
I wonder if we can use this to generate captions from image using pure transformers
And also for VQA like we can give question encoding as input in decoder
How do you make the bipartite matching loss differentiable?
the matching itself isn't differentiable, but the loss on the matched pairs is, so you just backprop through that.
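Concretely, the matching step can be sketched like this (a toy brute-force version of my own; DETR actually uses the Hungarian algorithm via scipy's `linear_sum_assignment`, which scales to the real 100-query setting):

```python
from itertools import permutations

def best_matching(cost):
    """Bipartite matching by brute force: assign each prediction i to a
    ground-truth object perm[i] so the total cost is minimal. Only
    feasible for tiny N; DETR uses the Hungarian algorithm instead."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# cost[i][j]: how badly prediction i matches ground-truth object j
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.9, 0.3]]
match, total = best_matching(cost)   # match == (0, 1, 2)
```

During training, the permutation found here is treated as a constant (no gradient through the argmin); gradients flow only through the classification and box loss terms of the matched pairs.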
Hi Yannic! Great video! I am working on a project, just for fun because I want
to get better at deep learning, about predicting sale prices at auctions
based on a number of features over time and also the state of the economy,
probably represented by the stock market or GDP. So it's a time-series prediction project.
And I want to use transfer learning, finding a good pretrained model I can use.
As you seem to be very knowledgeable about state-of-the-art deep learning,
I wonder if you have any idea about a model I could use?
Preferably I should be able to use it with TensorFlow.
Wow, no clue :D You might want to look for example in the ML for medicine field, because they have a lot of data over time (heart rate, etc.) or the ML for speech field if you have really high sample rates. Depending on your signal you might want to extract your own features or work with something like a fourier transform of the data. If you have very little data, it might make sense to bin it into classes, rather than use its original value. I guess the possibilities are endless, but ultimately it boils down to how much data you have, which puts a limit on how complicated of a model you can learn.
I am having problems understanding the size of the trainable queries. I know it's a random vector, but of what size? If we want the output to be 1. bounding box (query_num, x, y, W, H) and 2. class (query_num, num_classes), will each object query be a 1x5 vector [class, x, y, W, H]?
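On this question, a shape-only sketch may help (my own numpy illustration, not DETR's code): each object query is a learned vector of the transformer's hidden size (256 in the paper), not a 5-dim [class, x, y, w, h] vector. The boxes and class logits are produced by small prediction heads on top of the decoder outputs.

```python
import numpy as np

# DETR's setting: 100 queries, hidden dim 256, 91 COCO classes
# plus one extra "no object" class.
num_queries, hidden_dim, num_classes = 100, 256, 91

queries = np.random.randn(num_queries, hidden_dim)    # learned embeddings
decoder_out = queries                                 # stand-in for the decoder output

W_box = np.random.randn(hidden_dim, 4)                # box head: (cx, cy, w, h)
W_cls = np.random.randn(hidden_dim, num_classes + 1)  # +1 for "no object"

boxes = decoder_out @ W_box     # (100, 4)
logits = decoder_out @ W_cls    # (100, 92)
```

So the 5 numbers you listed are outputs of the heads, not the query representation itself.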
A naive question... at 39:17, are the attention maps you mention generated within the model itself, or are they fed in from outside at that stage?
They are from within the model
At 16:27 it is claimed that "the transformer is naturally a sequence processing unit" - is it? Isn't it naturally a set processing unit? And isn't that why we put a positional encoding block before it?
Please make a video to train this model on our own custom datasets
I wonder how “Object Query” is different from “Region Proposal Network” in RCNN detector
It looks like Faster RCNN may still be better than DETR on smaller objects.
First difference that comes to mind is that the RPN has a chance to look at the image before outputting any region proposal, while the object queries don't. The RPN makes suggestions like "there's something interesting at this location of this image, we should look more into it". The object queries instead are learned in an image-agnostic fashion, meaning they are more like standing questions, e.g. "is there any small object in the bottom-left corner?"
can u train this for live vr/ar data?
This is probably quite a stupid question, but can we just train end to end, from image embedding to a string of symbols containing all the necessary information for object detection? I'm not arguing that would be efficient, because of obvious problems with representing numbers as text, but it could work, right? If yes, then we could remove the requirement for a predefined maximum number of objects to detect.
I guess technically you could solve any problem by learning an end-to-end system to predict its output in form of a string. T5 is already doing sort-of this for text tasks, so it's not so far out there, but I think these custom approaches still work better for now.
Maybe! but getting the neural-network to converge to that dataset would be a nightmare. The gradient-descent-algorithm only cares about one thing, "getting down that hill fast", with that sort of tunnel-vision, it can easily miss important features. So forcing gradient-descent to look at the scenery as it climbs down the mountain, you might get lucky and find a helicopter😆
@@YannicKilcher
Guess it works now. :)
Pix2seq: A Language Modeling Framework for Object Detection
(sorry if I tagged you twice, the first comment had a Twitter link and got removed instantly.)
Interesting to compare to YOLOv4, which claims to get 65.7% AP50?
But YOLO can't do instance segmentation yet, so Mask R-CNN is probably a better comparison. Also, YOLO probably runs faster than either of these.
How are those object queries learnt?
I've always wondered where we could find the code for ML research papers (In this case, we're lucky to have Yannic sharing everything)... Can anyone in the community help me out?
Sometimes the authors create a github repo or put the code as additional files on arxiv, but mostly there's no code.
paperswithcode.com/
I have never been so confused as when you started saying "diagonal" and then going from bottom left to top right. So used to the matrix paradigm. 32:40 Absolutely great otherwise.
8:35 here's a bird! here's a bird! here's a bird! here's a bird! :D
It's definitely not AGI, following your argument - which is true.
It seems to do more filtering, interpolation than actual reasoning.
I kinda feel disappointed. But this is good progress.
I'm still amateur in AI by the way.
Wow, it's the same as how human attention works
When we focus on one thing, we ignore other things in an image
Thank you for your detailed explanation. But I still cannot follow the idea of object queries in the transformer decoder. Based on your explanation, N people are trained to each find a different region starting from random values. Then why don't we directly grid the image into N parts and get rid of the randomness? In object detection, we do not need the stochasticity of a "generator."
I just realized youtube added labels for parts of the video. I wonder what kind of AI Google will train using this data. :O
35:20 Thats a very interesting interpretation.
Yea I still have to provide the outline, but hopefully in the future that's done automatically, like subtitles
Drink one shot whenever he says "sort of" :-D
Crap now I'm drunk :D
Will you do gpt 3 ?
I have. Check it out :) czcams.com/video/SY5PvZrJhLE/video.html