C 5.0 | Object Localization | Bounding Box Regression | CNN | Machine Learning | EvODN

  • Added 29. 08. 2024
  • In the previous chapters we discussed Image Classification. That is, given an image with one object, we say what kind of object it is.
    Next comes Localization, where we not only say what kind of object it is, but also locate it in the image. Programmatically speaking, we have to draw the correct bounding box around it.
    If there is more than one object in the image, we have to locate and identify all of them. That task is Object Detection.
    But before jumping into Object Detection, let's first understand Object Localization. This is the task of locating an object in an image: there is a single object in the image, and we have to correctly draw the bounding box around it.
    Once we know how to do classification, it is easy to extend the network to do Localization.
    In classification, the last FC layer outputs 1 value per class. For Localization, we need 4 outputs per class, corresponding to the x1, y1, x2, y2 coordinates of the bounding box. We just need to train the network to output the correct values for these coordinates.
    The difference from classification is that there is no Softmax layer, and the loss function used is L2 loss.
    L2 loss is simply the sum of the squared differences between the expected and predicted values.
    A few things about bounding boxes:
    1. The box should fit the object tightly, ending at its borders. The box should not lie inside the object either.
    2. Even if the object is only partially visible, the network can still infer the full extent of the object. You can imagine how well these CNNs learn the general properties of an object.
    Because of this, the predicted box may sometimes exceed the boundaries of the image itself. In such cases, we just trim it to the image boundaries.
    3. It should also be obvious that the more of the object is visible, the more accurate the bounding box is likely to be.
    Finally, combining the results of the classifier and the BBox regressor, we infer both the type of the object and its location.
    Note that for the same object, if there are 10 classes in your dataset, the network will output 10 classification scores and 10 sets of BBox coordinates. We only keep the box belonging to the class with the highest classification score, as sketched in the code below.
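
    A minimal NumPy sketch of this inference step (the tensor names and shapes are illustrative assumptions, not the course's actual code): pick the class with the highest score, take the 4 box outputs for that class, and trim the box to the image boundaries as described above.

    import numpy as np

    def localize(cls_scores, bbox_preds, img_w, img_h):
        # cls_scores: (num_classes,)    one confidence score per class
        # bbox_preds: (num_classes, 4)  one (x1, y1, x2, y2) box per class
        best = int(np.argmax(cls_scores))      # class with the highest confidence
        x1, y1, x2, y2 = bbox_preds[best]
        # A predicted box may exceed the image; trim it to the image boundaries.
        x1, x2 = np.clip([x1, x2], 0, img_w)
        y1, y2 = np.clip([y1, y2], 0, img_h)
        return best, (x1, y1, x2, y2)

    # 10 classes -> 10 scores and 10 candidate boxes from the network
    scores = np.random.rand(10)
    boxes = np.random.rand(10, 4) * 800
    print(localize(scores, boxes, img_w=640, img_h=480))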
    ------------------------
    This is a part of the course 'Evolution of Object Detection Networks'.
    See full playlist here: • Evolution Of Object De...
    ------------------------
    Credits:
    host.robots.ox....
    farm1.staticfli...
    farm8.staticfli...
    farm5.staticfli...
    commons.wikime...
    Copyright Disclaimer: Under section 107 of the Copyright Act 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Comments • 94

  • @anandphilip
    @anandphilip 3 years ago +4

    This is the simplest, most lucid explanation for the topic I've heard.

  • @PJDuro
    @PJDuro 4 years ago +5

    There are very few good explanations of this on the net, including in the online courses. You really made it easier to understand, thanks!

    • @Replcate
      @Replcate a year ago

      hey can you help with coding parts i am a beginner

  • @ImranKhan-tc8jz
    @ImranKhan-tc8jz 4 years ago +2

    Thank you so much, sir. I was looking for an explanation and unfortunately could not find one even after watching a lot of YouTube videos and reading articles. But after watching this video, my confusion is cleared up. Thanks again.

  • @valentinfontanger4962
    @valentinfontanger4962 3 years ago

    This is excellent! If you want to understand YOLO, YOU NEED TO WATCH THIS VIDEO!!! I had no idea how bounding boxes were predicted. Now it's clear; the only thing I have to figure out is how to split my last layers into a classifier and a regressor.
    Thanks !!!!!

    • @Cogneethi
      @Cogneethi 3 years ago

      Welcome Valentin!

    • @Replcate
      @Replcate a year ago

      @@Cogneethi The theory explanation is good, but can you also share the code for this?

  • @ganeshchalamalasetti2884
    @ganeshchalamalasetti2884 3 years ago +1

    That was an amazing explanation and insight about the bounding box regressor. Rare video about this topic. I appreciate your efforts.

    • @Cogneethi
      @Cogneethi 3 years ago +1

      Thanks Ganesh!

    • @ganeshchalamalasetti2884
      @ganeshchalamalasetti2884 3 years ago

      @@Cogneethi By the way, are you still working on covering the YOLO framework?

    • @Cogneethi
      @Cogneethi 3 years ago

      @@ganeshchalamalasetti2884 Not yet. Caught up with some projects, so not getting the time. Maybe later.

    • @ganeshchalamalasetti2884
      @ganeshchalamalasetti2884 3 years ago

      @@Cogneethi Makes sense. All the best 👍

  • @SouravDas-eg8ok
    @SouravDas-eg8ok 4 years ago +1

    Very good videos. Short and crisp. Easy to understand. Thank you very much.

  • @abdussametturker
    @abdussametturker 3 years ago +1

    Thank you

  • @dynocodes
    @dynocodes 3 years ago

    The most simplified video on YouTube, keep it up bro, hats off for your explanation. I would like to learn the coding part for it.

  • @DrMukeshBangar
    @DrMukeshBangar 2 years ago

    Great 👍🙏

  • @pallawirajendra
    @pallawirajendra 4 years ago +1

    Very clear explanation. Keep creating.

  • @akhilsraj6698
    @akhilsraj6698 4 years ago +2

    Thank you!!

  • @anonymosranger4759
    @anonymosranger4759 4 years ago +2

    Amazing video!!! You deserve more subs!

  • @Life_on_wheeel
    @Life_on_wheeel 2 years ago

    Thanks for the crystal clear explanation.

  • @medhavimonish41
    @medhavimonish41 4 years ago +1

    Best explanation, thank you sir.

  • @sahilmakandar773
    @sahilmakandar773 4 years ago +1

    very good

  • @maschleimichael16
    @maschleimichael16 4 years ago

    Thanks for the explanation, it is very clear and easy to understand.

  • @mukundsrinivas8426
    @mukundsrinivas8426 4 years ago +1

    wonderful video

  • @shobhitsharma1022
    @shobhitsharma1022 2 years ago

    Which model is good for detecting bounding boxes of customer demographics on national ID cards?

  • @trungphamduc8271
    @trungphamduc8271 3 years ago

    Many thanks for this video.

  • @chyldstudios
    @chyldstudios 2 years ago

    Amazing explanation!

  • @marlene5547
    @marlene5547 2 years ago

    Thank you so much for explaining this!!

  • @BiancaAguglia
    @BiancaAguglia 4 years ago +1

    You're a good teacher. I know that, as you said on your website, video/audio recording is time consuming and is not for the faint of heart 😀, but I hope you'll continue to do these tutorials.
    One question: do you have examples of actually coding the neural networks you explain in your videos? I looked on your website and your github account but didn't find anything. It might be that I didn't do a very good job at searching. 😊

    • @Cogneethi
      @Cogneethi 4 years ago

      @Bianca,
      Regarding the code, I have not posted it on github/website. I will probably post it to github. I will comment here and let you know as soon as I do. (But first I have to find it, I don't know where I saved it :( )
      Thank you for the encouragement. I will try to do more of these as and when I find time. :)
      And let me know if I have made any mistakes and how I can improve, since, like you, I am still learning and not an expert yet!

    • @Cogneethi
      @Cogneethi 4 years ago +1

      Meanwhile, these are the libraries that I used in this tutorial:
      HOG: scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html
      SVM: scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
      VGG & Faster RCNN: github.com/endernewton/tf-faster-rcnn

    • @Cogneethi
      @Cogneethi 4 years ago

      Code and PPT are here: drive.google.com/drive/folders/120KC9i3F0WMhqksngS-dWS1iJNP-mXAv?usp=sharing

  • @waterspray5743
    @waterspray5743 2 years ago

    When training the neural network, should all the other bounding boxes be zeros? Say there are three classes: *people, boat, tv.* If the image contains only a boat, what is the ground truth for people and tv?

    • @Cogneethi
      @Cogneethi a year ago

      We will still have some bounding box estimate for the other classes, but they will not be considered. Only the BBox corresponding to the highest-scoring class will be considered.
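
      In code, this means only the 4 outputs belonging to the ground-truth class enter the loss, so no ground truth is needed for 'people' and 'tv' at all. A minimal PyTorch sketch of the idea (tensor names and the example box are illustrative assumptions, not the video's actual code):

      import torch
      import torch.nn.functional as F

      classes = ["people", "boat", "tv"]
      pred = torch.randn(1, len(classes) * 4, requires_grad=True)  # network output

      # The image contains only a boat: no ground-truth box exists for
      # 'people' and 'tv', so those 8 outputs are simply left out.
      gt_class = classes.index("boat")
      gt_box = torch.tensor([[200., 250., 600., 400.]])            # x1, y1, x2, y2

      pred_gt = pred[:, gt_class * 4 : gt_class * 4 + 4]
      loss = F.mse_loss(pred_gt, gt_box)   # L2 loss on the boat's box only
      loss.backward()                      # other classes' boxes get zero gradient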

  • @kinsung85
    @kinsung85 2 years ago

    Since the feature maps separate into two paths, one for classification and the other for the bounding box regressor, how does one know which classification result (object1, object2, object3) belongs to which bounding box (box1, box2, box3)?
    What I mean is, the result could be object1 -> box2, but in your explanation, object1 -> box1.

    • @kinsung85
      @kinsung85 2 years ago

      My question is: how do we decide which object belongs to which bounding box?

  • @studentpointofview9328
    @studentpointofview9328 3 years ago +1

    amazing tutorial!

  • @kirushikeshdb1885
    @kirushikeshdb1885 3 years ago

    I have a doubt. You told us that we do backpropagation based on the L2 loss between the actual and the predicted bounding box coordinates. But we actually have bounding box coordinates only for the correct class, which means the coordinates for all the incorrect classes are fixed to zero.

  • @tiasm919
    @tiasm919 4 years ago

    There should be 2 loss functions used here, right?
    1. For the classification layer
    2. For the regression layer
    Do we sum the losses before backpropagating through the network? What I understand is: we use the classification loss to backpropagate "only" through the classification layer, AND we use the regression loss to backpropagate "only" through the regression layer (only for the 4 neurons corresponding to the predicted class in the classification layer). Both losses are then "added" at the neurons of the last shared FC layer, let's say FC7, through the backprop steps of both branches.
    Is this right? Can you please clarify this for me?
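
    One common pattern (an assumption here, since the thread has no confirmed answer) is to sum both losses into a single scalar and call backward once: each branch then receives only its own gradient, while the shared FC7 features accumulate the sum of both. A minimal PyTorch sketch with illustrative shapes:

    import torch
    import torch.nn as nn

    fc7 = torch.randn(1, 4096, requires_grad=True)   # shared features
    cls_layer = nn.Linear(4096, 10)                  # 10 class scores
    reg_layer = nn.Linear(4096, 10 * 4)              # 4 coords per class

    gt_class = torch.tensor([3])
    gt_box = torch.tensor([[200., 250., 600., 400.]])

    cls_loss = nn.CrossEntropyLoss()(cls_layer(fc7), gt_class)
    reg_loss = nn.MSELoss()(reg_layer(fc7)[:, 3 * 4 : 3 * 4 + 4], gt_box)

    # One scalar, one backward pass: each head gets only its own gradient,
    # while fc7 (and every layer below it) receives the sum of both.
    (cls_loss + reg_loss).backward()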

  • @yusufbalik6784
    @yusufbalik6784 2 years ago +1

    Sir, what you said is theoretically very good, thank you. I trained the model with Faster RCNN. The model gives me good outputs: it draws the boxes, but I can't get the coordinates of the boxes with the code. How can I do this?

  • @pavithrans7153
    @pavithrans7153 4 years ago +2

    How to change the last fully connected layer to give the co-ordinates of bounding boxes?

    • @Cogneethi
      @Cogneethi 4 years ago +1

      The output you get in the last FC layer depends on the loss function that you use. If you use softmax and use the class labels to calculate the loss, then eventually, after many steps of training, it will learn to predict the correct class labels. Here you need just 1 output per class. This is a case of 'classification'.
      In the code, you might use something like this: pytorch.org/docs/stable/nn.html?highlight=loss#torch.nn.CrossEntropyLoss
      Instead, if you use L2 loss, and use the coordinates of the 'ground truth' bounding box as input to the L2 loss function, then after many training steps it will learn to predict the coordinates of the bounding box. Only, in this case, since a bbox needs 4 points, your last FC layer will have 4 outputs per class. This is a case of 'regression'.
      In the code, you might use something like this: pytorch.org/docs/stable/nn.html?highlight=loss#torch.nn.MSELoss
      In general, it all depends on what output you are expecting and what kind of loss function you are using. Based on this, you decide the number of outputs in your last FC layer.
      If you see pytorch.org/docs/stable/search.html?q=loss&check_keywords=yes&area=default, there are different types of loss functions suitable for different use cases.
      Let me know if I need to elaborate further.

    • @Cogneethi
      @Cogneethi 4 years ago

      Let's say your last fully connected layer before classification and bbox regression is called 'fc7'.
      Then, from this, to get the classification probabilities, you do:
      cls_score = fully_connected(fc7, num_classes,...)
      cls_prob = softmax(cls_score)
      # here number of outputs = num_classes
      And to get the bbox coordinates, you do:
      bbox = fully_connected(fc7, num_classes * 4,...)
      # here number of outputs = num_classes * 4
      That is all there is to it. In fact, when I was studying, I too was confused by it. But after seeing the code, it was clear.
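
      For readers who want a runnable version, the same branching can be written as a small PyTorch module. This is only a sketch: the 4096 fc7 width is an assumption borrowed from VGG-16, and num_classes is illustrative.

      import torch
      import torch.nn as nn

      class TwoHeads(nn.Module):
          def __init__(self, fc7_dim=4096, num_classes=10):
              super().__init__()
              self.cls_score = nn.Linear(fc7_dim, num_classes)  # num_classes outputs
              self.bbox = nn.Linear(fc7_dim, num_classes * 4)   # num_classes * 4 outputs

          def forward(self, fc7):
              cls_prob = torch.softmax(self.cls_score(fc7), dim=1)
              return cls_prob, self.bbox(fc7)

      heads = TwoHeads()
      cls_prob, bbox = heads(torch.randn(1, 4096))
      print(cls_prob.shape, bbox.shape)  # torch.Size([1, 10]) torch.Size([1, 40])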

    • @Cogneethi
      @Cogneethi 4 years ago

      The coding part, I have covered a bit more in the last chapter 'Faster-RCNN'.
      czcams.com/video/09DRku3USAs/video.html

    • @RnFChannelJr
      @RnFChannelJr 4 years ago

      @@Cogneethi If possible, may I see the code implementing this lecture?

    • @tiasm919
      @tiasm919 4 years ago

      Does the last FC layer branch into 2 different layers:
      1. For classification, consisting of Nclass+1 (background) neurons and a softmax function
      2. For bounding box regression, consisting of Nclass*4 neurons (one for each coordinate)
      Is this true? (I believe this is what Pavithran was asking.)
      Edit: reading the code you provided in the first reply to this comment, I believe what I said is right. Thanks :)

  • @muhammadabubakar9688
    @muhammadabubakar9688 2 years ago

    For the bounding box coordinates, we have to use an FC layer without softmax. What is the last layer? Please explain it.

  • @secretfolder8870
    @secretfolder8870 2 years ago

    At 4:10, what do you mean by modifying the last FC layer to obtain the BBox coordinates? Are these coordinates obtained from 4 layers of the FC block or only from the last FC layer? Please clarify.

  • @khabarsilva6850
    @khabarsilva6850 4 years ago

    Good explanation 👍

  • @sathishbabu3867
    @sathishbabu3867 4 years ago +1

    Sir, how do we find the distance from the camera to the bounding box?

  • @annie157
    @annie157 3 years ago

    How did you find the coordinates (200,250), (600,400)? I didn't understand, please explain.

  • @rs9130
    @rs9130 2 years ago

    Thank you for the good explanation. Can we find an approximation of the bbox from semantic segmentation masks?

  • @Replcate
    @Replcate a year ago

    The theory explanation is good, but can you also share the code for this?

  • @aakashr4974
    @aakashr4974 3 years ago

    Can you please explain the loss function? Will you put in the correct coordinates only for the actual label?

    • @Cogneethi
      @Cogneethi 3 years ago +1

      It is the L2 loss used in this example: heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
      "Will you put in the correct coordinates only for the actual label?"
      That is correct.

  • @chandrabindu4440
    @chandrabindu4440 3 years ago

    It would be extremely fruitful if you explained the code along with this theory.

    • @Cogneethi
      @Cogneethi 3 years ago

      Yes, I think that is one mistake that I made, which I realised later. "If" I make some videos in future, I will definitely include code. :)

  • @sahasradalghara9904
    @sahasradalghara9904 2 years ago

    Do we need a 4-output layer for the bounding box regressor?

  • @RnFChannelJr
    @RnFChannelJr 4 years ago +1

    Hello sir, thanks for the explanations. But if possible, can you map that theory onto real source code?

    • @Cogneethi
      @Cogneethi 4 years ago

      Unfortunately I have not covered the coding part. Maybe at a later date I will add some kind of explanation for the code.
      But at the end, I have covered the code for Faster RCNN.
      czcams.com/video/09DRku3USAs/video.html
      drive.google.com/drive/folders/120KC9i3F0WMhqksngS-dWS1iJNP-mXAv?usp=sharing
      github.com/endernewton/tf-faster-rcnn

  • @aviseklahiri3864
    @aviseklahiri3864 4 years ago

    Thanks for the awesome tutorial. Just one doubt: so this video assumes that a given image has only 1 instance of the object?

    • @Cogneethi
      @Cogneethi 4 years ago

      Yes, that is the assumption in case of 'Localization'. For 'Detection' there will be multiple objects in an image.

  • @harishkumaranandan5946

    Hi, between (6:00 - 6:03), regarding the initial bbox that you mentioned as a hypothetical one: will it be the bbox of one of the feature maps among the stack in the last FC layer? Also, given that the last FC layer basically holds the stack of different features as a vector, will the stack have the entire boat as one of the features, and are we then using the ground-truth coordinates and bbox regression with the L2 loss to narrow down on that one as the location? Basically, backtracking from the 4 ground-truth bbox coordinates to the bbox coordinates in feature-vector space for any input?

    • @Cogneethi
      @Cogneethi 4 years ago

      No, the entire boat will not be a feature.
      It is difficult to gauge exactly what is happening.
      You may check this: distill.pub/2020/circuits/zoom-in/
      and this:
      czcams.com/video/AgkfIQ4IGaM/video.html
      The network basically learns patterns in the data. And based on the patterns, it will approximately gauge the location of the object.
      And we fine-tune the detection part based on the Ground Truth.

    • @harishkumaranandan5946
      @harishkumaranandan5946 4 years ago

      @@Cogneethi Hi, thanks for getting back. I had a look at the links and am also aware of the visualization toolbox. However, the thing that is still not clear is the initial hypothetical bbox coordinates that you mentioned. Let us say that when we start training, even before the first backpropagation step, the network's FC layer will have a stack of feature maps, and as you say, none of them will be an entire boat or whatever object we are trying to detect. If this is the case, then the output bbox coordinates we try to predict are around an entire boat, right? So how does the stack of feature maps in the FC layer, each representing just a part of a feature extracted through the CNN operations coupled with pooling, stride, etc., return a set of bbox coordinates that is supposed to represent a bbox for an entire boat?
      2nd question: if the 1st bbox coordinate is hypothetical, then it does not correlate with the features of the boat, and it is through the ground truth and the L2 loss that we force the network to spit out the final numbers. Or, if the initial bbox coordinates do correlate with or are formed from the features of the boat, can you show an explanation, similar to the one you gave for HOG + SVM, of how we form the bbox coordinates from the features in the FC layer stack (the transformation), even though they are not accurate?

    • @Cogneethi
      @Cogneethi 4 years ago +1

      @@harishkumaranandan5946
      Sorry, at this point in time, I don't have an easy answer to the 1st question.
      Regarding the 2nd one, I will have to dig deeper into visualization and show some examples as you suggested.
      I have received similar queries from other viewers.
      But to briefly answer your 2nd question:
      Yes, initially the network just spits out random numbers.
      Later on, as the training proceeds and the network sees 1000s of boat images, we use the ground truth values to force the network to learn the correct bbox coordinates from the feature maps extracted by the CNN.
      This way, the network learns to read the feature maps and guess the correct bbox values.
      I have given some sort of imperfect demo at the end of the 8th chapter. That might help your intuition a bit.
      Meanwhile, I will keep this in mind when I try to expand the course; I will probably include more visualizations for better understanding.

    • @harishkumaranandan5946
      @harishkumaranandan5946 4 years ago

      @@Cogneethi Hi Cogneethi, thanks for getting back. I appreciate it. I will keep in touch.

  • @durgabhavanitirumala9632
    @durgabhavanitirumala9632 4 years ago +1

    Sir, could you point me to some very good reference videos to understand how YOLOv3 works?

  • @vineethgogu2309
    @vineethgogu2309 3 years ago

    Hello sir,
    Can we completely remove/wipe out text from an image? Using Python libraries like EasyOCR or pytesseract?

    • @Cogneethi
      @Cogneethi 3 years ago

      Maybe you can first identify the text position, crop it out, and use the 'image inpainting' technique to fill the gaps.
      Not sure about the quality, but it should work.

    • @vineethgogu2309
      @vineethgogu2309 3 years ago

      @@Cogneethi Hey, please, why can't you make a video on it and explain how it actually works? It would be very beneficial to me.

    • @vineethgogu2309
      @vineethgogu2309 3 years ago

      And moreover, I identified the text position in an image using EasyOCR.
      Have a try with EasyOCR?

    • @Cogneethi
      @Cogneethi 3 years ago

      @@vineethgogu2309 Unfortunately, as of now, I don't have the bandwidth for new videos.
      But I found a blog on inpainting which might help:
      heartbeat.fritz.ai/guide-to-image-inpainting-using-machine-learning-to-edit-and-correct-defects-in-photos-3c1b0e13bbd0
      paperswithcode.com/task/image-inpainting
      Once you have the text coordinates from any OCR library, you can just set them to ones or zeros and try inpainting.
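
      A minimal OpenCV sketch of that pipeline (the filename and box coordinates are placeholders; the boxes stand in for whatever your OCR library returns):

      import cv2
      import numpy as np

      img = cv2.imread("card.jpg")                      # hypothetical input image
      boxes = [(50, 30, 220, 60), (80, 100, 300, 130)]  # (x1, y1, x2, y2) from OCR

      # Mask is 255 over the text regions, 0 elsewhere.
      mask = np.zeros(img.shape[:2], dtype=np.uint8)
      for x1, y1, x2, y2 in boxes:
          mask[y1:y2, x1:x2] = 255

      # Fill the masked pixels from their surroundings and save the result.
      clean = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
      cv2.imwrite("card_clean.jpg", clean)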

    • @vineethgogu2309
      @vineethgogu2309 3 years ago

      @@Cogneethi Thank you so much, sir, for providing the blog links.
      👍👍👍👍👍👍👍

  • @anirudhbabu8496
    @anirudhbabu8496 4 years ago

    What if the camera sensor outputs bounding boxes as a 3rd-order polynomial?
    How do we decode that?

    • @Cogneethi
      @Cogneethi 4 years ago

      Sorry, I don't know about this.

  • @vcvracarkad
    @vcvracarkad 4 years ago

    Can anyone link to a Keras implementation of the object detector model?

  • @SouravDas-eg8ok
    @SouravDas-eg8ok 4 years ago

    Please let me know which libraries you have used for coding.

    • @Cogneethi
      @Cogneethi 4 years ago

      @Saurav, these are the libraries that I used in this tutorial:
      HOG: scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html
      SVM: scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
      VGG & Faster RCNN: github.com/endernewton/tf-faster-rcnn

  • @adityarajora7219
    @adityarajora7219 4 years ago

    Explain YOLO... it would be a great help.

  • @durgabhavanitirumala9632

    You did not cover the YOLO model?

    • @Cogneethi
      @Cogneethi 4 years ago

      Not yet, will do so in a few months' time.

  • @vikramreddy5631
    @vikramreddy5631 4 years ago

    How do you get the expected values to compare against?

    • @Cogneethi
      @Cogneethi 4 years ago

      It is manually annotated for each image by some person. See this czcams.com/video/e4G9H18VYmA/video.html

  • @Cogneethi
    @Cogneethi 5 years ago

    See full course on Object Detection: czcams.com/play/PL1GQaVhO4f_jLxOokW7CS5kY_J1t1T17S.html and Subscribe to my channel
    If you found this tutorial useful, please share with your friends (WhatsApp/iMessage/Messenger/WeChat/Line/KaTalk/Telegram) and on social media (LinkedIn/Quora/Reddit).
    Tag @cogneethi on twitter.com
    Let me know your feedback @ cogneethi.com/contact