DETR: End-to-End Object Detection with Transformers (Paper Explained)
- Published 17 Jul 2024
- Object detection in images is a notoriously hard task! Objects can be of a wide variety of classes, can be numerous or absent, they can occlude each other or be out of frame. All of this makes it even more surprising that the architecture in this paper is so simple. Thanks to a clever loss function, a single Transformer stacked on a CNN is enough to handle the entire task!
OUTLINE:
0:00 - Intro & High-Level Overview
0:50 - Problem Formulation
2:30 - Architecture Overview
6:20 - Bipartite Match Loss Function
15:55 - Architecture in Detail
25:00 - Object Queries
31:00 - Transformer Properties
35:40 - Results
ERRATA:
When I introduce bounding boxes, I say they consist of x and y, but you also need the width and height.
My Video on Transformers: • Attention Is All You Need
Paper: arxiv.org/abs/2005.12872
Blog: / end-to-end-object-dete...
Code: github.com/facebookresearch/detr
Abstract:
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at this https URL.
Authors: Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
This is a gift. The clarity of the explanation, the speed at which it comes out. Thank you for all of your work.
I had seen your Attention is all you need video and now watching this, I am astounded by the clarity you give in your videos. Subscribed!
Yup. Subscribed with notifications. I love that you enjoy the content of the papers. It really shows! Thank you for these videos.
Really appreciate the efforts you are putting into this. You paper explanations make my day everyday!
Awesome video. Highly recommend reading the paper first and then watching this to solidify understanding. This definitely helped me understand the DETR model more.
Greatest find on YouTube for me to date!! Thank you for the great videos!
Thank you for your wonderful video. When I first read this paper, I couldn't understand what the input of the decoder (the object queries) is, but after watching your video I finally got it: random vectors!
A great paper and a great review of the paper! As always nice work!
Wow, the way you've explained and broken down this paper is spectacular,
Thx mate
The attention visualizations are practically instance segmentations; very impressive results, and great job untangling it all
YES! I was waiting for this!
Was waiting for this. Thanks a lot! Also dude, how many papers do you read everyday?!!!
Loved the video! I was just reading the paper.
Just wanted to point out that Transformers, or rather Multi Head Attention, naturally processes sets, not sequences, this is why you have to include the positional embeddings.
Do a video about the Set Transformer! In that paper they call the technique used by the decoder in this paper "Pooling by Multihead Attention".
Very true, I was just still in the mode where transformers are applied to text ;)
What are positional encodings?
@@princecanuma The positional encoding simply encodes the index of each token in the sequence.
@@snippletrap I had a feeling it was gonna be something that simple. 🤦🏾♂️ AI researchers' naming conventions aren't helping the community, in terms of accessibility lmao
Thank you for the one-line summary of "Pooling by Multihead Attention". This makes it 10x clearer about what exactly the decoder is doing. I was feeling that the "decoder + object seeds" is doing similar things to ROI pooling, which is gathering relevant information for a possible object. I also recommend reading the set transformer paper, which enhanced my limited knowledge of attention models. Thanks again for your comment!
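To make the thread above concrete, here is a minimal sketch of the sinusoidal positional encoding from "Attention Is All You Need": each position index is mapped to a vector of sines and cosines at different frequencies. (This is my own illustrative code, not DETR's; DETR itself uses 2D spatial encodings, but the idea is the same.)

```python
import math

def positional_encoding(num_positions, d_model):
    """Map each position index to a d_model-dim vector of sinusoids
    (a simplified 1D sketch of the Attention Is All You Need scheme)."""
    table = []
    for pos in range(num_positions):
        vec = []
        for i in range(0, d_model, 2):
            freq = 1.0 / (10000 ** (i / d_model))
            vec.append(math.sin(pos * freq))  # even dimensions: sine
            vec.append(math.cos(pos * freq))  # odd dimensions: cosine
        table.append(vec[:d_model])
    return table

pe = positional_encoding(num_positions=4, d_model=8)
# pe[0] starts with sin(0) = 0.0, cos(0) = 1.0; every position gets a
# distinct vector, which is exactly the information a set-processing
# transformer would otherwise lack.
```

Adding these vectors to the token features is what turns a set processor into a sequence (or grid) processor.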
Great!!! Absolutely great! Fast, to the point, and extremely clear. Thanks!!
This video was absolutely amazing. You explained this concept really well, and I loved the bit at 33:00 about flattening the image twice and using the rows and columns to create an attention matrix where every pixel can relate to every other pixel. Also loved the bit at the beginning where you explained the loss in detail. A lot of other videos just gloss over that part. Have liked and subscribed
Thank you for this content! I have recommended this channel to my colleagues.
You are a godsend! Please keep up the good work!
"Maximal benefit of the doubt" - love it!
Thanks so much for making it so easy to understand these papers.
Very informative. Thanks for the explanation!
Very well done and understandable. Thank you!
Fantastic explanation 👌 looking forward for more videos ❤️
infinite respect for the ali G reference
Haha someone noticed :D
What an amazing paper and an explanation!
You saved my project. Thank you 🙏🏻
Thanks for great explanation!
Really smart idea about how the (HxW)^2 matrix naturally embeds bounding box information. I am impressed :)
You explained it so well. Thanks . best of luck
Thanks for the walkthrough!
Very very nice explanation, I really subscribed for that quadratic attention explanation. Thanks! :D
I like the way you DECIPHER things! thanks!
Amazing explanation. Keep up the great work.
Thank you very much, this was really good.
Thank you very much. This is a very good video. Very easy to understand.
really thank you for your explanation!
Excellent work,Thanks!
34:08 GOAT explanation of the bbox in the attention feature map.
thank u so much for the video! it's so amazing and helped me understand this paper much better ^^
So cool! You are great!
Thank you sooo much for this explanation!!
very clear explanation, great work sir. thanks
Holy shit. Instant subscribe within 3 minutes. Bravo!!
Great content!
Very cool video, thank you!
Love this content bro thank you so much, hoping to get a Mac in Artificial Intelligence
Great explanation
This is a really great idea
really quite quick. thanks. make more...
Thanks for the explanation
Awesome 🔥🔥🔥
Thank you for providing such interesting paper reading ! Yannic Kilcher
I love your channel thank you soooo much
Hi Yannic, amazing video and great improvements in the presentation (time-sections in youtube etc.) I really like where this channel is going, keep it up.
I've been reading through the paper myself yesterday as I've been working with that kind of attention for CNNs a bit and I really liked the way you described the mechanism behind the different attention heads in such a simplistic and easily understandable way!
Your idea of directly inferring bboxes from two attending points in the "attention matrix" sounds neat and hadn't crossed my mind yet. But I guess you'd probably have to use some kind of NMS again if you do that?
One engineering problem I came across, especially with those full (HxW)^2 attention matrices, is that they blow up your GPU memory insanely. Thus you can only use a fraction of the batch size, and a (HxW)^2 multiplication also takes forever, which is why the model takes much longer to train (and, I think, to infer).
What impressed me most was that an actually very "unsophisticated learned upscaling and argmax over all attention maps" achieved such great results for panoptic segmentation!
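To put rough numbers on that memory blow-up (a back-of-the-envelope sketch of my own; the sizes are illustrative, assuming float32 and a typical 32x downsampling CNN backbone):

```python
def attention_matrix_bytes(h, w, bytes_per_elem=4):
    """Memory for one dense (H*W) x (H*W) attention-weight matrix
    over a flattened h-by-w feature map, in bytes (float32 default)."""
    n = h * w                      # sequence length after flattening
    return n * n * bytes_per_elem

# A 32x32 feature map (e.g. a 1024x1024 image downsampled 32x by the CNN):
small = attention_matrix_bytes(32, 32)      # ~4.2 MB per head per layer
# The same attention at full 1024x1024 pixel resolution:
huge = attention_matrix_bytes(1024, 1024)   # ~4.4 TB -- completely infeasible,
# which is why the CNN downsampling before the transformer is essential.
```

The quadratic growth in sequence length is exactly why the batch size has to shrink and why training slows down so much.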
One thing that I did not quite get: can the multiple attention heads actually "communicate" with each other during the "look-up"? Going by the description in Attention Is All You Need ("we then perform the attention function in parallel, yielding d_v-dimensional output values") and the formula Concat(head_1, ..., head_h)W^O, it looks to me like the attention heads do not share information while attending to things. Only W^O might, during backprop, be able to reweight the attention heads if they have overlapping attention regions?
Yes, I see it the same way: the individual heads do independent operations in each layer. I guess the integration of information between them would then happen in higher layers, where their signals could be aggregated by a single head.
Also, thanks for the feedback :)
@@YannicKilcher The multi-head part is the only confusion I have about this great work. In NLP multi-head attention makes total sense: an embedding can "borrow" features/semantics from multiple words at different feature dimensions. But in CV it seems it's not necessary? The authors didn't do an ablation study on the number of heads. My suspicion is that a single head works almost as well as 8 heads. Would test it once I get a lot of GPUs...
Awesome!
Great video, thank you
"First paper ever to cite a YouTube channel." ...challenge accepted.
Thanks a lot for this really helpful
Excellent
Are you even human? You're really quick.
Nope .. A Bot
@@m.s.d2656 I don't actually know which is the most impressive
There's a bird!!! There's a bird...
@@krishnendusengupta6158 bird, bird, bird, bird, bird, bird, bird, bird, its a BIRD
Awesome!!! Yannic, by any chance, would you mind reviewing the paper (1) Fawkes: Protecting Personal Privacy against Unauthorized Deep Learning Models or (2) Analyzing and Improving the Image Quality of StyleGAN? I would find it helpful to have those papers deconstructed a bit!
Thanks for this vid, really fast. I still (after 2 days) haven't tried to run it on my data - feeling bad
I love how it understands which part of the image belongs to which object (elephant example) regardless of overlapping. Kind of understands the depth. Maybe transformers can be used for depth-mapping?
Great sharing! I'd like to ask: is there any guideline for deciding how many object queries to use for a particular object detection problem? Thanks!
Great!
Excellent job as usual. Congrats on your Ph.D.
Cool trick adding position encoding to K,Q and leaving V without position encoding. Is this unique to DETR?
I'm guessing the decoder learns an offset from these given positions, analogous to more traditional bounding-box algorithms finding boxes relative to a fixed grid, with the extra twist that the decoder also eliminates duplicates.
This is the same thing I wanted to ask. Why leave out V? It's not even described in the paper.
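For anyone puzzling over this design, here is a tiny numpy sketch of my own (not DETR's actual code) of the idea: the positional encoding is added to the queries and keys, so positions influence *where* attention looks, while the values that get mixed together remain position-free content features.

```python
import numpy as np

def attention_qk_pos(x, pos):
    """Single-head self-attention where the positional encoding is
    added to Q and K only; V carries pure content features."""
    q = x + pos                              # where to look: position-aware
    k = x + pos
    v = x                                    # what to mix: content only
    scores = (q @ k.T) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # 5 tokens, 8-dim features
pos = rng.normal(size=(5, 8))
out = attention_qk_pos(x, pos)               # shape (5, 8)
```

One plausible reading of the choice: since the output is a weighted sum of V, leaving the positions out of V keeps positional signal from being smeared into the content representation.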
Great video, very speedy :). How well does this compare to YOLOv4?
No idea, I've never looked into it.
I think it might not be as good rn but the transformer part can be scaled like crazy.
Thanks Yannick! Great explanation. Since the object queries are learned and I assume they remain fixed after training, why do we keep the lower self-attention part of the decoder block during inference, and not just replace it with the precomputed Q values?
Great video! What about a video on this paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows? They split the images into patches, use self-attention locally on every patch, and then shift the patches. Would be great to hear your explanation of this!
thanks! Which one do you think is better compared to YOLOv8, for example?
Thank you very much for the explanation! I have a couple of questions:
1. Can we consider object queries to be analogous to anchor boxes?
2. Does the attention visualization highlight those parts of the image which the network gives the highest importance to while predicting?
1. Somewhat, but object queries are learned and initially completely independent of the datapoint.
2. Yes, there are multiple ways, but roughly it's what you're saying
2:47 worth pointing out that the CNN reduces the size of the image while retaining high level features and so massively speeds up computation
Hi, thanks Yannic for all the videos. I have a question about recognizing digits in images that are not handwritten: how can we find digits in the street, like building numbers or license plates? Thanks in advance
the object queries remind me of latent variables in variational architectures (VAEs, for example). In those architectures, the LVs are constrained with a prior. Is this done for the object queries? Would that be a good idea?
Has anyone tried to run this on a Jetson Nano to compare with previous approaches? How much faster is it compared with a MobileNet SSD v2?
So basically little people asking lots of questions... nice!
PS. Thanks Yannic for the great analogy and insight...
I'm a bit confused. At 17:17, you are drawing vertical lines, meaning that you unroll the channels (ending up with a vectors of features per pixel that are fed into the transformer, "pixel by pixel"). Is that how it's being done? Or should there be horizontal lines (WH x C), where you feed one feature at a time for the entire image into the transformer?
Yes; if you think of text transformers as consuming one word vector per word, the analogy here is that you consume all channels of one pixel, pixel by pixel
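In code, that flattening step looks something like the following (my own numpy sketch; the sizes are examples, not DETR's exact shapes): the CNN output (C, H, W) becomes a sequence of H*W "tokens", each a C-dim feature vector.

```python
import numpy as np

# Sketch of flattening a CNN feature map into a transformer sequence:
# (C, H, W) -> (H*W, C), one C-dim token per spatial position.
C, H, W = 256, 16, 20                        # illustrative sizes
features = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

tokens = features.reshape(C, H * W).T        # -> (H*W, C)

assert tokens.shape == (H * W, C)
# Token 0 contains all C channel values of spatial position (0, 0),
# i.e. the "vertical line" through the feature volume at that pixel:
assert (tokens[0] == features[:, 0, 0]).all()
```

So the transformer's "sequence length" is the number of spatial positions, and its "embedding dimension" is the number of CNN channels.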
Great channel, subscribed! How does this approach compare to models optimized for size and inference speed on mobile devices, like SSD MobileNet? (See the detection model zoo on the TF GitHub)
No idea, I'm sorry :)
can you do one about Efficient-det?
AI Developer:
AI: 8:36 BIRD! BIRD! BIRD!
I wonder if we can use this to generate captions from image using pure transformers
And also for VQA like we can give question encoding as input in decoder
How do you make the bipartite matching loss differentiable?
the matching itself isn't differentiable, but the loss on the matched pairs is, so you just backprop through that.
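Concretely, the matching step can be sketched like this (a toy brute-force version of my own; DETR actually uses the Hungarian algorithm via scipy's `linear_sum_assignment`, which scales to the real 100-query setting):

```python
from itertools import permutations

def best_matching(cost):
    """Bipartite matching by brute force: assign each prediction i to a
    ground-truth object perm[i] so the total cost is minimal. Only
    feasible for tiny N; DETR uses the Hungarian algorithm instead."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# cost[i][j]: how badly prediction i matches ground-truth object j
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.9, 0.3]]
match, total = best_matching(cost)   # match == (0, 1, 2)
```

During training, the permutation found here is treated as a constant (no gradient through the argmin); gradients flow only through the classification and box loss terms of the matched pairs.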
Hi Yannic! Great video! I am working on a project, just for fun because I want
to get better at deep learning, about predicting sale prices at auctions
based on a number of features over time and also the state of the economy,
probably represented by the stock market or GDP. So it's a time-series prediction project.
And I want to use transfer learning, finding a good pretrained model I can use.
As you seem to be very knowledgeable about state-of-the-art deep learning,
I wonder if you have any idea about a model I could use?
Preferably I should be able to use it with TensorFlow.
Wow, no clue :D You might want to look for example in the ML for medicine field, because they have a lot of data over time (heart rate, etc.) or the ML for speech field if you have really high sample rates. Depending on your signal you might want to extract your own features or work with something like a fourier transform of the data. If you have very little data, it might make sense to bin it into classes, rather than use its original value. I guess the possibilities are endless, but ultimately it boils down to how much data you have, which puts a limit on how complicated of a model you can learn.
I am having problems understanding the size of the trainable queries. I know it's a random vector, but of what size? If we want the output to be 1. bounding box (query_num, x, y, W, H) and 2. class (query_num, num_classes), will each object query be a 1x5 vector [class, x, y, W, H]?
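On this question, a shape-only sketch may help (my own numpy illustration, not DETR's code): each object query is a learned vector of the transformer's hidden size (256 in the paper), not a 5-dim [class, x, y, w, h] vector. The boxes and class logits are produced by small prediction heads on top of the decoder outputs.

```python
import numpy as np

# DETR's setting: 100 queries, hidden dim 256, 91 COCO classes
# plus one extra "no object" class.
num_queries, hidden_dim, num_classes = 100, 256, 91

queries = np.random.randn(num_queries, hidden_dim)    # learned embeddings
decoder_out = queries                                 # stand-in for the decoder output

W_box = np.random.randn(hidden_dim, 4)                # box head: (cx, cy, w, h)
W_cls = np.random.randn(hidden_dim, num_classes + 1)  # +1 for "no object"

boxes = decoder_out @ W_box     # (100, 4)
logits = decoder_out @ W_cls    # (100, 92)
```

So the 5 numbers you listed are outputs of the heads, not the query representation itself.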
A naive question... at 39:17, are the attention maps you mention generated within the model itself, or are they fed in from outside at that stage?
They are from within the model
At 16:27 it is claimed that "the transformer is naturally a sequence processing unit" - is it? Isn't it naturally a set processing unit? And isn't that why we put a positional encoding block before it?
Please make a video to train this model on our own custom datasets
I wonder how “Object Query” is different from “Region Proposal Network” in RCNN detector
It looks like Faster RCNN may still be better than DETR on smaller objects.
First difference that comes to mind is that the RPN has a chance to look at the image before outputting any region proposal, while the object queries don't. The RPN makes suggestions like "there's something interesting at this location of this image, we should look more into it". The object queries instead are learned in an image-agnostic fashion, meaning they are more like standing questions, e.g. "is there any small object in the bottom-left corner?"
can u train this for live vr/ar data?
This is probably quite a stupid question, but can we just train end to end, from image embedding to a string of symbols containing all the necessary information for object detection? I'm not arguing that would be efficient, because of obvious problems with representing numbers as text, but it could work, right? If yes, then we could remove the requirement for a predefined maximum number of objects to detect.
I guess technically you could solve any problem by learning an end-to-end system to predict its output in form of a string. T5 is already doing sort-of this for text tasks, so it's not so far out there, but I think these custom approaches still work better for now.
Maybe! but getting the neural-network to converge to that dataset would be a nightmare. The gradient-descent-algorithm only cares about one thing, "getting down that hill fast", with that sort of tunnel-vision, it can easily miss important features. So forcing gradient-descent to look at the scenery as it climbs down the mountain, you might get lucky and find a helicopter😆
@@YannicKilcher
Guess it works now. :)
Pix2seq: A Language Modeling Framework for Object Detection
(sorry if I tagged you twice, the first comment had a Twitter link and got removed instantly.)
Interesting to compare to YOLOv4, which claims to get 65.7% AP50?
But YOLO can't do instance segmentation yet, so Mask R-CNN is probably a better comparison. Also, YOLO probably runs faster than either of these.
How are those object queries learnt?
I've always wondered where we could find the code for ML research papers (In this case, we're lucky to have Yannic sharing everything)... Can anyone in the community help me out?
Sometimes the authors create a github repo or put the code as additional files on arxiv, but mostly there's no code.
paperswithcode.com/
I have never been so confused as when you started saying "diagonal" and then going from bottom left to top right. So used to the matrix paradigm. 32:40 Absolutely great otherwise.
8:35 here's a bird! here's a bird! here's a bird! here's a bird! :D
It's definitely not AGI, following your argument - which is true.
It seems to do more filtering, interpolation than actual reasoning.
I kinda feel disappointed. But this is good progress.
I'm still amateur in AI by the way.
Wow, it's the same as how human attention works
When we focus on one thing, we ignore other things in an image
Thank you for your detailed explanation. But I still cannot follow the idea of object queries in the transformer decoder. Based on your explanation, N people are trained to each find a different region starting from random values. Then why don't we directly grid the image into N parts and get rid of the randomness? In object detection, we do not need the stochasticity of a "generator."
I just realized youtube added labels for parts of the video. I wonder what kind of AI Google will train using this data. :O
35:20 Thats a very interesting interpretation.
Yea I still have to provide the outline, but hopefully in the future that's done automatically, like subtitles
Drink one shot whenever he says "sort of" :-D
Crap now I'm drunk :D
Will you do gpt 3 ?
I have. Check it out :) czcams.com/video/SY5PvZrJhLE/video.html