Kernel Size and Why Everyone Loves 3x3 - Neural Network Convolution

  • Published 26 Jun 2024
  • Patreon: / animated_ai
    Find out what the Kernel Size option controls and which values you should use in your neural network.

Comments • 42

  • @IoannisKazlaris
    @IoannisKazlaris 1 year ago +66

    The basic reason we don't use (even number) x (even number) kernels is that those kernels don't have a "center". Having a "center" pixel (as in a 3x3 configuration) is very useful for max and average pooling - it's much more convenient for us.

    • @100deep1001
      @100deep1001 7 months ago

      I didn't understand that.
      At the end of the day, for an even-sized filter you could consider any pixel to be the center pixel, right?
      It will end up giving similar, though not identical, values.
      Also, max pooling and average pooling work on an output feature map, so how is this related?

  • @axelanderson2030
    @axelanderson2030 1 year ago +8

    This is honestly the best machine-learning video I have seen; amazing work. Most people just pull architectures out of thin air or add a clumsy disclaimer telling you to experiment with the numbers. This video shows 3D visual representations of popular CNN architectures and really helps you build CNNs in general.

  • @matthewboughton8320
    @matthewboughton8320 1 year ago +1

    Such an amazing video. You're going to hit 50k soon! Keep this up!!!

  • @alansart5147
    @alansart5147 10 months ago

    Freaking love your videos! Keep up the awesome work! :D

  • @newperspective5918
    @newperspective5918 1 year ago +6

    I think odd-sized filters are mainly used because we often use a stride of 1. Each pixel (except at the edges) is then filtered based on its surrounding pixels (defined by the kernel size). If the kernel size is even, the pixel the kernel represents would be the average of the 4 middle pixels, which introduces a shift of 0.5 pixels. That might be fine mathematically speaking, but it feels odd or wrong. Also, if you have worked with Gaussian filters (which I assume many CNN researchers have), you are literally forced to use odd-sized filters there.
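The half-pixel shift described here can be made concrete: with stride 1, a "same"-size output needs kernel_size - 1 total pixels of padding, and that total splits evenly around a center pixel only when the kernel size is odd. A minimal sketch (the helper name is ours, not from the video):

```python
def same_padding(kernel_size):
    # "same" output at stride 1 needs kernel_size - 1 total pixels of padding;
    # only odd kernels split it symmetrically around a center pixel
    total = kernel_size - 1
    return (total // 2, total - total // 2)

print(same_padding(3))  # (1, 1) - symmetric
print(same_padding(4))  # (1, 2) - asymmetric: the 0.5-pixel shift
print(same_padding(5))  # (2, 2) - symmetric
```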

  • @schorsch7400
    @schorsch7400 3 months ago

    Thanks for the effort of making this excellent visualization! It creates a very good intuition for how convolutions work and why 3x3 is dominant.

  • @rewanthnayak2972
    @rewanthnayak2972 1 year ago +1

    Great work on the animation and research

  • @ankitvyas8534
    @ankitvyas8534 1 year ago

    Good explanation. Looking forward to more.

  • @josephpark2093
    @josephpark2093 1 year ago

    I happened to have this very question, and there just had to be a great video on the internet telling me the exact reason why. Bless!

  • @md.zahidulislam3548
    @md.zahidulislam3548 1 year ago +2

    Good work, amazing explanation

  • @maxlawwk
    @maxlawwk 1 year ago +9

    Perhaps the 2x2 kernel is a common trick for a learnable stride-2 downsampling kernel or an upsampling deconvolution kernel. It is more likely about computational efficiency than network performance, because such kernels are almost equivalent to a downsample/upsample followed by a 3x3 kernel. In this regard, a 2x2 kernel combined with stride-2 down/upsampling does not shrink the resultant feature map the way a 3x3 kernel does, which is possibly beneficial for image generation tasks. In GANs, 2x2 or 4x4 kernels are commonly found in discriminators, which favor non-overlapping kernels to avoid grid artifacts.
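The non-overlapping behavior mentioned here follows from the standard output-size formula: a 2x2 kernel at stride 2 tiles the input exactly, while a 3x3 kernel at stride 2 needs padding and its patches overlap. A quick sketch (the helper name is illustrative):

```python
def conv_out_size(n, k, stride, pad=0):
    # standard convolution output-size formula
    return (n + 2 * pad - k) // stride + 1

# 2x2 kernel at stride 2 tiles a 64-pixel input exactly: no overlap, no padding
print(conv_out_size(64, k=2, stride=2))         # 32
# a 3x3 kernel at stride 2 needs padding to halve cleanly, and patches overlap
print(conv_out_size(64, k=3, stride=2, pad=1))  # 32
```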

  • @ocamlmail
    @ocamlmail 1 year ago

    Super cool, thank you!

  • @travislee5486
    @travislee5486 1 year ago

    Great work, your videos help me a lot 👍

  • @vikramsharma720
    @vikramsharma720 1 year ago

    Great video, keep going like this 😊

  • @benc7910
    @benc7910 1 year ago

    this is amazing.

  • @j________k
    @j________k 1 year ago

    Nice video I like it!

  • @pritomroy2465
    @pritomroy2465 2 months ago

    In U-Net and GAN architectures, when a feature map half its input size has to be generated, a 4x4 kernel size is used.

  • @bengodw
    @bengodw 1 month ago

    Hi Animated AI, thanks for your great video. I have a question:
    4:45 indicates that the colors of the filters (red, yellow, green, blue) represent the "Features". But a filter (e.g. the red one) is itself 3-dimensional (Height, Width, Feature), so it also includes "Feature". Thus "Feature" appears twice. Could you advise why we need "Feature" twice?

  • @naevan1
    @naevan1 1 year ago +3

    Wow, really beautiful animations, great job! However, I got kind of confused since I had always seen convolution in 2D haha

    • @animatedai
      @animatedai  1 year ago +10

      Yes, I imagine that many AI students who have only seen 2D animations are surprised to learn that 2D convolutions actually work with 3D tensors (or 4D if batched). That was one of my main motivations for creating these animations :)

  • @yoursubconscious
    @yoursubconscious 2 months ago

    "we dont talk about the goose goblin" - MadTV

  • @fosheimdet
    @fosheimdet 1 year ago +1

    Is there a good reason why even filter sizes aren't used at all, other than that the padding will be uneven when using "same"?

  • @haofanren6284
    @haofanren6284 1 year ago

    Regarding 2x2 filters, there is a paper that may be helpful

  • @danychristiandanychristian1060

    Really helpful for understanding the concept. Correct me if I'm wrong: for the first conv2d layer, the input will always contain 1 feature for a black-and-white image and 3 features for an RGB image. After that, the number of features depends on the number of filters used in the convolution.

  • @aalaptube
    @aalaptube 1 year ago +5

    Why would just 3 channels at the beginning make a 5x5 or 7x7 kernel preferable to 3x3?
    5x5x3 = 75 and 7x7x3 = 147
    3x3x3 + 3x3x? must be lower than 75 or 147 to make the stack preferable => ? < 5.33 or 13.33
    This means the 3x3 stack is only preferable over 5x5 (or 7x7) if the second layer has fewer than 6 (or 14) channels. This second layer is under the model developer's control, so it should still be okay. Or did I miss anything?

    • @animatedai
      @animatedai  1 year ago +13

      I'm planning to make a future video on the math that will go over this in detail.
      For now, let's just look at the number of parameters needed. (Total floating-point operations is roughly correlated with the number of parameters, but we would also need to consider stride to calculate it precisely.) You can calculate the parameter count with filter_height * filter_width * filter_count * input_feature_count. Note that we need to include the filter_count of our layer.
      A realistic first layer might be 7x7 with 64 filters. So the parameter count would be 7*7*3*64 = 9,408.
      We can compare that to a stack of 3x3 layers with filter counts of 16, 32, and 64. Their parameter count would be 3*3*3*16 + 3*3*16*32 + 3*3*32*64 = 23,472.
      Another possibility might be a 5x5 layer with 32 filters. Its parameter count would be 5*5*3*32 = 2,400.
      We could compare that to a stack of 3x3 layers with filter counts of 16 and 32. Their parameter count would be 3*3*3*16 + 3*3*16*32 = 5,040. If the first layer in the 3x3 stack had a filter count less than 8, then the stack would be more efficient. However, I haven't seen a filter count that low in practice.
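The counts in this reply can be reproduced in a few lines of Python; the helper below just encodes the filter_height * filter_width * input_feature_count * filter_count formula from the reply:

```python
def conv_params(kh, kw, in_features, filters):
    # filter_height * filter_width * input_feature_count * filter_count
    return kh * kw * in_features * filters

# 7x7 first layer with 64 filters on RGB input
print(conv_params(7, 7, 3, 64))  # 9408

# stack of 3x3 layers with 16, 32, and 64 filters
print(conv_params(3, 3, 3, 16)
      + conv_params(3, 3, 16, 32)
      + conv_params(3, 3, 32, 64))  # 23472

# 5x5 layer with 32 filters vs. a 3x3 stack with 16 and 32 filters
print(conv_params(5, 5, 3, 32))                              # 2400
print(conv_params(3, 3, 3, 16) + conv_params(3, 3, 16, 32))  # 5040
```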

    • @Anodder1
      @Anodder1 1 year ago +3

      @@animatedai Thank you very much for the examples and the explanation! The video is also very solid!

  • @kznsq77
    @kznsq77 1 year ago +1

    An even kernel size does not allow symmetric coverage of the area around a pixel

  • @tantzer6113
    @tantzer6113 1 year ago +1

    Wait, I didn't get why 5x5 or 7x7 works better for the first layer.

    • @animatedai
      @animatedai  1 year ago

      Check out my reply to this comment for an explanation: czcams.com/video/V9ZYDCnItr0/video.html&lc=UgweJZ_Bri8emvyNAMF4AaABAg.9hOBaZTROlX9hQHXzxdOZM

  • @bangsa_puja
    @bangsa_puja 5 months ago

    What about the 1x7 and 7x1 kernels in Inception module C? Please help me

  • @thivuxhale
    @thivuxhale 10 months ago

    At 4:23 you said that the exception to the rule "3x3 filters are more efficient than larger filters" is the first layer, since the input only has 3 channels. I still don't get this part. I thought that when comparing the number of weights needed for each kind of filter, only the size of the filter matters, not the number of channels in the input.

    • @animatedai
      @animatedai  10 months ago +1

      I've been trying to avoid equations in the videos, but the formula for the total number of weights needed is (filter width * filter height * filter count * input feature count). You can see this represented visually in my filter count video.
      Assuming that the filter count is the same as the input feature count, it's more efficient to break large (5x5, 7x7, ...) filters into multiple 3x3 filters. A concrete example where all inputs and outputs have F feature dimensions:
      One 7x7: (7 * 7 * F * F) = 49 * F^2
      Three 3x3s: (3 * 3 * F * F) + (3 * 3 * F * F) + (3 * 3 * F * F) = 27 * F^2
      But for the first layer, the filter count is usually much higher than the input feature count of 3. It's more efficient to perform this dramatic increase in feature count using one large filter than with multiple smaller ones. A concrete example where the first layer has 16 filters:
      One 7x7: (7 * 7 * 3 * 16) = 2352
      Three 3x3s: (3 * 3 * 3 * 16) + (3 * 3 * 16 * 16) + (3 * 3 * 16 * 16) = 5040
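The two regimes described in this reply can be checked numerically; a small sketch (function and variable names are ours):

```python
def params(k, c_in, c_out):
    # weights in one k x k conv layer: kernel area * input features * filters
    return k * k * c_in * c_out

# equal feature count F in and out: the 3x3 stack is cheaper (27*F^2 < 49*F^2)
F = 64
assert 3 * params(3, F, F) < params(7, F, F)

# first layer: only 3 input channels but 16 filters -> the single 7x7 is cheaper
print(params(7, 3, 16))                          # 2352
print(params(3, 3, 16) + 2 * params(3, 16, 16))  # 5040
```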

    • @agmontpetit
      @agmontpetit 5 months ago +1

      @animatedai Thanks for taking the time to explain this!

  • @Antagon666
    @Antagon666 9 months ago

    Wait, so why do we need larger filters in the first layer? To extract more features from only the 3 channels?
    And which is better: more chained filters with a lower channel count, or fewer chained filters with more channels?

    • @animatedai
      @animatedai  9 months ago +1

      The filters in the first layer don't need to be larger. There's just no performance benefit to splitting them into a chain of smaller filters. And the reason for that is that the number of features increases dramatically from the input (typically 3 channels for RGB) to something like 16 or 32. The performance benefit of splitting a large filter into smaller filters assumes the number of features stays the same from input to output.
      > And what is better, more chained filters with lower channel count, or lesser amount of chained filters with more channels?
      This really depends on the data, how long the chain is, and how many filters you have. It's an ongoing area of research where researchers have found great results in both cases.

  • @ati43888
    @ati43888 2 months ago

    Nice

  • @Firestorm-tq7fy
    @Firestorm-tq7fy 2 months ago

    I don't see a reason for 1x1. All you achieve is losing information, while also creating N features, each scaled by a certain factor. This can also be achieved within a normal layer (the scaling, I mean). There is really no point.
    Obviously outside of the depthwise-pointwise combo.
    Please correct me if I'm missing something.
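On the 1x1 question: a 1x1 convolution does more than scale each feature. It computes a learned linear combination across all input channels at every pixel, which is how networks mix features and change channel count cheaply. A minimal NumPy sketch (shapes and names are illustrative, not from the video):

```python
import numpy as np

# A 1x1 convolution is a learned linear map across channels, applied
# independently at every pixel: it mixes features, not just scales them.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 32))   # H x W x C_in feature map
w = rng.standard_normal((32, 16))     # C_in x C_out, i.e. sixteen 1x1 filters

y = x @ w                             # equivalent to a 1x1 convolution
print(y.shape)  # (8, 8, 16): channels mixed 32 -> 16, spatial size unchanged
```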