BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper Explained)

  • Added 29. 08. 2024

Comments • 131

  • @tanmay8639
    @tanmay8639 Před 4 lety +91

    This guy is epic.. never stop making videos bro

    • @Denetony
      @Denetony Před 4 lety +4

      Don't say that "never stop" ! He has been posting every single day! He deserves a holiday or at least a day break haha

  • @1Kapachow1
    @1Kapachow1 Před 3 lety +12

    There is actual motivation behind the projection step. The idea is that in our self-supervised task we try to reach a representation which doesn't change with augmentations of the image. However, we don't know the actual downstream task, and we might remove TOO MUCH information from the representation. Therefore, adding a final "projection head" and taking the representation before that projection increases the chance of having "a bit more" useful information for other tasks as well. It is mentioned/described a bit in SimCLR. (A sketch of this setup follows below.)

    • @sushilkhadka8069
      @sushilkhadka8069 Před 5 měsíci

      So the vector representation before the projection layer would contain fine-grained information about the input, and the vector after the projection would contain more semantic meaning?
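
A minimal PyTorch-style sketch of the setup this thread describes (hypothetical class and variable names, not the authors' code): the representation y used for downstream tasks is taken before the projection MLP g, whose output z is what the self-supervised loss sees. The dimensions follow the paper's ResNet-50 setup (2048 -> 4096 -> 256).

    import torch.nn as nn

    class EncoderWithProjection(nn.Module):
        def __init__(self, backbone, feat_dim=2048, hidden_dim=4096, proj_dim=256):
            super().__init__()
            self.backbone = backbone                  # encoder f, e.g. a ResNet trunk
            self.projector = nn.Sequential(           # projection MLP g
                nn.Linear(feat_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, proj_dim),
            )

        def forward(self, x):
            y = self.backbone(x)     # representation y: kept for downstream tasks
            z = self.projector(y)    # projection z: used only in the training loss
            return y, z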

  • @BradNeuberg
    @BradNeuberg Před 4 lety +16

    Just a small note on how much I appreciate these in-depth paper reviews you've been doing! Great work, and it's a solid contribution to the broader community.

  • @PhucLe-qs7nx
    @PhucLe-qs7nx Před 4 lety +23

    This can be seen from the Yann Lecun talk as a method in "Regularized Latent Variable Energy-based Model", instead of the "Contrastive Energy-based Model". In the Contrastive setting, you need positive and negative samples to know where to push up and down your energy function. This method doesn't need negative samples because it's implicitly constrained to have the same amount of energy as the "target network", only gradually shifted around by the positive samples.
    So I agree with Yannic: this must rely on some kind of balance in the initialization for it to work.

    • @YannicKilcher
      @YannicKilcher  Před 4 lety +8

      Very good observation

    • @RaviAnnaswamy
      @RaviAnnaswamy Před 4 lety +3

      This is a deeper principle. It is much easier to learn from what is working than from what is NOT correct, but slower.

  • @ensabinha
    @ensabinha Před 5 měsíci +1

    16:20 using the projection at the end allows the encoder to learn and capture more general and robust features beneficial for various tasks, while the projection, closely tied to the specific contrastive learning task, focuses on more task-specific features.

  • @actualBIAS
    @actualBIAS Před rokem +1

    Incredible. Your channel is a true gold mine. Thanks for the content!

  • @AdamRaudonis
    @AdamRaudonis Před 4 lety +4

    These are my favorite videos! Love your humor as well. Keep it up!

  • @dorbarber1
    @dorbarber1 Před 2 lety +1

    Hi Yannic, thanks for your great surveys of the best DL papers, I really enjoy them. Regarding your comments on the necessity of the projection: it is better explained in the SimCLR paper (from Hinton's group), which was published around the same time as this one. In short, they showed empirical results that the projection does help. The reasoning behind the idea is that you would like the “z” representation to be agnostic to the augmentation, but you don't need the “y” representation to be agnostic to the augmentation. You would want the “y” representation to still be separable with respect to the augmentation, so that it is easy to "get rid of" this information in the projection stage. The difference between y and z is not only the dimensionality: z is used for the loss calculation, and y is the representation you will eventually use in the downstream task.

  • @RohitKumarSingh25
    @RohitKumarSingh25 Před 4 lety +10

    I think the idea behind the projection layer might be to avoid the curse of dimensionality in the loss function. In a low-dimensional space, distances between points are more meaningful, so that might help with the l2 norm. As for why there is no collapse: no idea. 😅

  • @sushilkhadka8069
    @sushilkhadka8069 Před 5 měsíci

    Okay, for those of you wondering how it avoids model collapse (learning a constant representation for all inputs) when there is no contrastive term in the approach:
    it's because of the BatchNormalisation layers and the EMA update of the target network.
    Both BN and EMA encourage variability in the model's representations.
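
A minimal sketch of the EMA update mentioned above (assumed function and variable names, not the authors' code): the target parameters trail the online ones, so the two branches never output exactly the same thing while training is still moving.

    import torch

    @torch.no_grad()
    def update_target(online, target, tau=0.996):
        # xi <- tau * xi + (1 - tau) * theta, the paper's exponential moving average
        for p_online, p_target in zip(online.parameters(), target.parameters()):
            p_target.mul_(tau).add_(p_online, alpha=1.0 - tau)

    # The target is typically initialised as a copy of the online network
    # (e.g. copy.deepcopy(online)) and never receives gradients.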

  • @siyn007
    @siyn007 Před 4 lety +23

    The broader impact rant is hilarious

    • @veedrac
      @veedrac Před 4 lety +1

      What's wrong with it though? The paper has no clear broader ethical impacts beyond those generally applicable to the field, so it makes sense that they've not said anything specific to their own work.

    • @YannicKilcher
      @YannicKilcher  Před 4 lety +8

      Nothing is wrong. It's just that it's pointless, which was my entire point about broader impact statements.

    • @veedrac
      @veedrac Před 4 lety

      ​@@YannicKilcher If we only had papers like these, sure, but not every paper is neutral-neutral alignment.

    • @theodorosgalanos9663
      @theodorosgalanos9663 Před 4 lety +3

      The 'broader impact Turing test': if a paper requires a broader impact statement for readers to see its impact outside the context of the paper, then it doesn't really have one. The broader impact of your work should be immediately evident after the abstract and introduction, where the real problem you're trying to solve is described.
      Most papers in ML nowadays don't really do that; real problems are tough, and they don't let you set a new SOTA on ImageNet.
      p.s. As a practitioner in another domain, I admit my bias towards real-life applications in this rant; I'm not voting for practice over research, I'm just voting against separating them.

    • @softerseltzer
      @softerseltzer Před 3 lety

      Probably it's a field required by a journal.

  • @cameron4814
    @cameron4814 Před 4 lety +3

    this video is better than the paper. i'm going to try to reimplement this based on only the video.

  • @Marcos10PT
    @Marcos10PT Před 4 lety +32

    Damn Yannick you are fast! Soon these videos will start being released before the papers 😂

  • @JTMoustache
    @JTMoustache Před 4 lety +2

    For the projection, the parameter count goes from O(Y x N) to O(Y x Z + Z x N)... maybe that is the reason?
    Of course that only helps if YN > ZY + ZN => N (Y - Z) > ZY => N > ZY / (Y - Z), with Z < Y.
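
A quick numeric check of the inequality above, plugging in the paper's dimensions Y = 2048 (representation) and Z = 256 (projection); N is left as an illustrative free variable.

    Y, Z = 2048, 256
    threshold = Z * Y / (Y - Z)       # ~292.6; Y->Z->N only saves parameters when N exceeds this
    for N in (128, 512, 4096):
        direct = Y * N                # parameters of a single Y -> N map
        factored = Y * Z + Z * N      # parameters of Y -> Z followed by Z -> N
        print(N, direct, factored, factored < direct)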

  • @NewTume123
    @NewTume123 Před 3 lety +1

    Regarding the projection head: the SimCLR/SupCon papers showed that you get better results by using a projection head (you just need to remove it for later fine-tuning). That's why in MoCo v2 they also started applying a projection head.

  • @bensas42
    @bensas42 Před 2 lety +1

    This video was amazing. Thank you so much!!

  • @LidoList
    @LidoList Před 2 lety +1

    Super simple, all credits go to Speaker. Thanks bro ))))

  • @jwstolk
    @jwstolk Před 4 lety +10

    8 hours on 512 TPUs and it still doesn't end up with a constant output. I'd say it works.

  • @franklinyao7597
    @franklinyao7597 Před 3 lety +3

    Yannic, at 18:01 you said there is no need for a projection. My answer is that maybe the intermediate representation is better than the final representation, because earlier layers learn universal representations while final layers give you representations that are more specific to the particular task.

  • @Schrammletts
    @Schrammletts Před 3 lety +3

    I strongly suspect that normalization is a key part of making this work. If you had a batchnorm on the last layer, then you couldn't map everything to 0, because the output must have mean 0 and variance 1. So you have to find the mean-0, variance-1 representation that's invariant to the data augmentations, which starts to sound much more like a good representation.
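
A tiny demo of that normalisation argument (a sketch, not anything from the paper): with BatchNorm in training mode, every feature is forced to zero mean and unit variance across the batch, so the batch cannot collapse onto one constant vector.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    bn = nn.BatchNorm1d(4, affine=False)
    x = torch.randn(8, 4)                    # a batch of 8 four-dimensional outputs
    out = bn(x)                              # training mode: normalise over the batch
    print(out.mean(dim=0))                   # ~0 for every feature
    print(out.std(dim=0, unbiased=False))    # ~1 for every feature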

  • @alifarrokh9863
    @alifarrokh9863 Před rokem

    Hi Yannic, I really appreciate what you do. Thanks for the great videos.

  • @herp_derpingson
    @herp_derpingson Před 4 lety +14

    00:00 I wonder if there is a correlation between number of authors and GPU/TPU hours used.
    .
    I don't really see the novelty in this paper. They are just flexing their GPU budget on us.

    • @YannicKilcher
      @YannicKilcher  Před 4 lety +6

      No, I think it's more like how many people took 5 minutes to make a spurious comment on the design doc at the beginning of the project.

  • @adamrak7560
    @adamrak7560 Před 3 lety +2

    You are right that there are no strong guarantees to stop it from reaching the global minimum, which is degenerate.
    But the way they train it, with the averaged version, creates a _massive_ mountain around the degenerate solution. Gradients will point away from the degenerate case unless you are already too close, because of the "target" model. The degenerate case is only a _local_ attractor (and the good solutions are local attractors too, but more numerous).
    This means that unless the step size is too big, this network should robustly converge to an adequate solution instead of the degenerate one.
    In the original setup, where you differentiate through both paths, you always get a gradient which points somewhat towards the degenerate solution, because you use the same network in the two paths and sum the gradients: in a local linear approximation this will tend towards the degenerate case. There the degenerate case is a _global_ attractor (and the good solutions are only local attractors).

  • @darkmythos4457
    @darkmythos4457 Před 4 lety +18

    Using stable targets from an EMA model is not really new; it is used a lot in semi-supervised learning, for example in the Mean Teacher paper. As for the projection, SimCLR explains why it is important.

    • @DanielHesslow
      @DanielHesslow Před 4 lety +2

      So if I understand it correctly, the projection is useful because we train the representations to be invariant to data augmentations; if we instead use the layer before the projection as the representation, we can also keep information that does vary over data augmentations, which may be useful in downstream tasks?
      In SimCLR they also show that the output size of the projection is largely irrelevant. However, for this paper I wonder if there is not a point in projecting down to a lower dimension, in that it would increase the likelihood of getting stuck in a local minimum rather than degenerating to constant representations? Although 256 dimensions is still fairly large, so maybe that doesn't play a role.

    • @gopietz
      @gopietz Před 4 lety +1

      I still see his argument, because to my understanding q should be able to replace the projection step and do the same thing.

    • @DanielHesslow
      @DanielHesslow Před 4 lety +3

      @@gopietz The prediction network, q, is only applied to the online network, so if we were to yank out the projection network, our embeddings would need to be invariant to data augmentations, so I do think there is a point to having it there.
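
A sketch of that asymmetry (assumed names, using the (representation, projection) return convention from the earlier sketch): the predictor q sits only on the online branch, and the target branch runs under no_grad, so its projection acts as a fixed regression target for the step.

    import torch

    def forward_pair(online_net, predictor, target_net, view_a, view_b):
        z_online = online_net(view_a)[1]      # g_theta(f_theta(v)), online projection
        p = predictor(z_online)               # q_theta(z_theta): online branch only
        with torch.no_grad():                 # stop-gradient on the target branch
            z_target = target_net(view_b)[1]  # g_xi(f_xi(v')), target projection
        return p, z_target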

  • @Phobos11
    @Phobos11 Před 4 lety +2

    This configuration of the two encoders sounds exactly like stochastic weight averaging, only that the online and sliding-window parameters are both being used actively 🤔. Going by SWA, the second encoder should end up in a wider minimum, helping it generalize better.

  • @clara4338
    @clara4338 Před 4 lety +1

    Awesome explanations, thank you for the video!!

  • @jonatan01i
    @jonatan01i Před 4 lety +1

    The target network's representation of an original image is a planet (moving slowly) that pulls the augmented versions' representations to itself. The other original images' planet representations are far away and only if there are many common things in them do they have a strong attraction towards each other.

    • @dermitdembrot3091
      @dermitdembrot3091 Před 4 lety +1

      Agreed! These representations form a cloud where there is a tiny pull of gravity toward the center. This pull is however insignificant compared to the forces acting locally. It would probably take a huge lot more training for a collapse to materialize.

    • @dermitdembrot3091
      @dermitdembrot3091 Před 4 lety +1

      I suspect that collapse would first happen locally so that it would be interesting to test whether this method has problems encoding fine differences between similar images.

  • @daisukelab
    @daisukelab Před 3 lety +3

    Here we have one mystery partly confirmed; this nice article shows that BN in the MLP has an effect like a contrastive loss:
    untitled-ai.github.io/understanding-self-supervised-contrastive-learning.html
    And the paper:
    arxiv.org/abs/2010.00578

  • @daniekpo
    @daniekpo Před 3 lety

    When I first saw the authors, I wondered what DeepMind was doing with computer vision (this is maybe the first vision paper I've seen from them), but after reading the paper it totally made sense, because they use the same idea as DQN and some other reinforcement learning algorithms.

  • @Phobos11
    @Phobos11 Před 4 lety +2

    Self-supervised week 👏

  • @behzadbozorgtabar9413
    @behzadbozorgtabar9413 Před 4 lety +1

    It is very similar to MoCo v2; the only differences are a deeper network (the prediction network, which I guess is where the gain comes from) and an MSE loss instead of the InfoNCE loss.

  • @ayankashyap5379
    @ayankashyap5379 Před 4 lety

    Thank you for all your videos

  • @roman6575
    @roman6575 Před 4 lety +1

    Around the 17:30 mark: I think the idea behind using the projection MLP 'g' was to make the encoder 'f' learn more semantic features by using higher-level features for the contrastive loss.

    • @WaldoCampos1
      @WaldoCampos1 Před 2 lety

      The idea of using a projection comes from SimCLR's architecture. In their paper they proved that it improves the quality of the representation.

  • @ai_station_fa
    @ai_station_fa Před 2 lety

    Thanks again!

  • @thak456
    @thak456 Před 4 lety

    love your honesty....

  • @citiblocsMaster
    @citiblocsMaster Před 4 lety +15

    33:40 Yeah let me just spin up my 512 TPUv3 cluster real quick

  • @citiblocsMaster
    @citiblocsMaster Před 4 lety +4

    31:12 I disagree slightly. I think the power of the representations comes from the fact that they throw an insane amount of compute at the problem. Approaches such as Adversarial AutoAugment (arxiv.org/abs/1912.11188) or AutoAugment more broadly show that it's possible to learn such augmentations.

    • @YannicKilcher
      @YannicKilcher  Před 4 lety +2

      Yes in part, but I think a lot of papers show just how important the particulars of the augmentation procedure are

  • @asadek100
    @asadek100 Před 4 lety +4

    Hi Yannic, can you make a video explaining Graph Neural Networks?

  • @Ma2rten
    @Ma2rten Před 3 lety +2

    I'm not sure if I can put a link in a YouTube comment, but if you google "Understanding self-supervised and contrastive learning with Bootstrap Your Own Latent", someone figured out why it works and doesn't produce the degenerate solution.

  • @lamdang1032
    @lamdang1032 Před 4 lety +1

    My 2 cents on why collapse doesn't happen: for a ResNet, collapse means one of the few projection layers must have all of its weights zeroed. This is very difficult to reach, since it is only 1 global minimum vs. an infinite number of local minima. The moving average also helps, because for the weights to become completely zeroed, the moving average must also be zeroed, which means the weights have to stay at zero for a lot of iterations. Mode collapse may have happened partially in the experiment where they remove the moving average.

    • @sushilkhadka8069
      @sushilkhadka8069 Před 3 měsíci

      A model can collapse when it spits out the same vector representation for all inputs, i.e. any constant vector, not only the zero vector. In the video @yannic simply gives the zero vector as an example.

    • @lamdang1032
      @lamdang1032 Před 3 měsíci

      @@sushilkhadka8069 Please read my comment again; I'm talking about the weights being zero, not the activations. For the activations to be constant, W must be 0, and then the activation equals the bias term.

  • @tsunamidestructor
    @tsunamidestructor Před 4 lety +4

    Hi Yannic, just dropping in (as usual) to say that I love your content!
    I was just wondering - what h/w and/or s/w do you use to draw around the PDFs?

  • @korydonati8926
    @korydonati8926 Před 4 lety +7

    I think you pronounce this "B-Y-O-L". It's a play on "B-Y-O-B", which stands for "bring your own booze", i.e. bring your own drinks to a party or gathering.

  • @bzqp2
    @bzqp2 Před 3 lety +3

    29:36 All the performance gains are from the seed=1337

  • @florianhonicke5448
    @florianhonicke5448 Před 4 lety +1

    Great summary!

  • @d13gg0
    @d13gg0 Před 2 lety

    @Yannic I think the projections are important because otherwise the representations are too sparse to calculate a useful loss.

  • @MH-km9gd
    @MH-km9gd Před 4 lety +1

    So it's basically an iterative mean teacher with a symmetrical loss?
    But the projection layer is a neat thing.
    It would have been nice if they had shown it working on a single GPU.

  • @marat61
    @marat61 Před 4 lety

    If you compare against the representation y', which is obtained using an exponential mean of the past parameters, the new parameters will stay as close to the past ones as possible in spite of all possible augmentations. So the model is indeed forced to stay near the same spot, and to move only in the direction of ignoring the augmentations.

  • @jasdeepsinghgrover2470
    @jasdeepsinghgrover2470 Před 4 lety +1

    I guess it's because completely ignoring the input is kind of hard when there is a separate network that is tracking the parameters. Forgetting the input would be a huge change from the initial parameters, so it should increase the loss between the two networks and could thus be stopped.

  • @anthonybell8512
    @anthonybell8512 Před 3 lety

    Couldn't you predict the distance between two patches to prevent the network from degenerating to the constant representation c? This seems like it would help the model even more for learning representations than a yes/no label because it would also encode a notion of distance - i.e. I expect patches at opposite corners of the image to be less similar than patches right next to each other.

  • @krystalp9856
    @krystalp9856 Před 3 lety +4

    That broader impact section seems like an undergrad wrote it lol. I would know cuz that's exactly how I write reports for projects in class.

  • @StefanausBC
    @StefanausBC Před 4 lety +1

    Hey Yannic, thank you for all these great videos, always enjoying them. Yesterday ImageGPT got published; I would be interested to get your feedback on that one too!

  • @AIology2022
    @AIology2022 Před 3 lety +1

    How is this different from a Siamese neural network? I did not see anything new.

  • @maxraphael1
    @maxraphael1 Před 4 lety +1

    The idea of a moving average of the weights in a teacher-student architecture was also used in this other paper: arxiv.org/abs/1703.01780

  • @wagbagsag
    @wagbagsag Před 4 lety +1

    Why wouldn't the q(z) network (prediction of the target embedding) just be the identity? q(z) doesn't know which transformations were applied, so the only thing that differs consistently between the two embeddings is that one was produced by a target network whose weights are filtered by an exponential moving average. I would think that the expected difference between a signal and its EMA-filtered target should be zero, since EMA is an unbiased filter...? (See the numeric sketch after this thread.)

    • @YannicKilcher
      @YannicKilcher  Před 4 lety +2

      I don't think EMA is an unbiased filter. I mean, I'm happy to be convinced otherwise, but it's just my first intuition.

    • @wagbagsag
      @wagbagsag Před 4 lety

      ​@@YannicKilcher I guess the EMA-filtered params lag the current params, and in general that means that they should be higher loss than the current params. So the q network is basically learning which direction is uphill on the loss surface? I agree that it's not at all clear why this avoids mode collapse

    • @YannicKilcher
      @YannicKilcher  Před 4 lety

      @@wagbagsag yes that sounds right. And yes, it's a mystery to me too 😁
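
A tiny numerical illustration of that lag (made-up numbers, not from the paper): for a parameter that drifts steadily during training, its exponential moving average settles a fixed distance behind it, so the target is systematically "older" than the online value rather than an unbiased copy of it.

    tau = 0.99
    theta, ema = 0.0, 0.0
    for step in range(1, 1001):
        theta = 0.01 * step                  # online parameter drifting linearly
        ema = tau * ema + (1 - tau) * theta
    print(theta, ema, theta - ema)           # the lag settles near tau * 0.01 / (1 - tau) ~ 0.99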

  • @snippletrap
    @snippletrap Před 4 lety +4

    Yannic, it's probably pronounced bee-why-oh-ell, after BYOB, which stands for "Bring Your Own Beer".

  • @user-jj2bx7kt4d
    @user-jj2bx7kt4d Před měsícem

    I don't think the trivial zero solution, or a nearby possible solution (a local minimum), is what this algorithm ends up at; even if that is an option, there are ways to negate it. Suppose
    x^2 - 2x = 0
    x(x - 2) = 0
    so there is 0 as a solution, but 2 is also there, and
    if we slightly introduce e

  • @Gabend
    @Gabend Před 3 lety

    "Why is this exactly here? Probably simply because it works." This sums up the whole deep learning paradigm.

  • @bzqp2
    @bzqp2 Před 3 lety +1

    10:30 The local-minimum idea doesn't make too much sense with so many dimensions. Without any preventive mechanism it would always collapse to a constant. It's really hard for me to believe it's just a matter of initialization balance.

    • @user-rh8hi4ph4b
      @user-rh8hi4ph4b Před 2 lety +2

      I think the different "target parameters" are crucial in preventing collapse here. As long as the target parameters are different from the online parameters, it would be almost impossible for both online and target parameters to produce the same constant output, _unless_ the model is fully converged (as in the parameters no longer change, causing the online and target parameters to become the same over time). So one might argue that the global minimum of "it's just the same constant output regardless of input" doesn't exist, since that behavior could never yield a minimal loss because of the difference in online and target parameters. If that difference is large enough, such as with a very high momentum in the update of the target parameters, the loss of the trivial/collapsed behavior might even be worse than that of non-trivial behaviors, preventing collapse that way.

    • @sushilkhadka8069
      @sushilkhadka8069 Před 3 měsíci

      @@user-rh8hi4ph4b the same idea hit my mind when I asked myself why would it prevent model collapse. very good observation, thanks for sharing!

  • @qcpwned07
    @qcpwned07 Před 4 lety +1

    czcams.com/video/YPfUiOMYOEE/video.html
    I don't think the projection can be ignored here. The ResNet has been trained for a long while, so its weights are fairly stable; it would probably take a long time to nudge them in the right direction, and you would lose some of its capacity to represent the input set. By adding the projection, you have a new naive network to which you can apply a greater learning rate without fear of disturbing the structure of the pre-learned model. -- Basically it serves somewhat the same function as the classification layer one would usually add for fine-tuning.

  • @mlworks
    @mlworks Před 3 lety

    It does sound a little like GANs, except that instead of random noise and a ground-truth image, we have two variants of the same image learning from each other. Am I misinterpreting this completely?

  • @mohammadyahya78
    @mohammadyahya78 Před rokem

    Thank you very much. Can this be used with NLP instead of images?

  •  Před 2 lety

    Wait, at 18:26, shouldn't it be q_theta(f_theta(v)) instead of q_theta(f_theta(z))?
    Anyway, great video!

  • @grafzhl
    @grafzhl Před 4 lety +3

    16:41 4092 is my favorite power of 2 😏

  • @dshlai
    @dshlai Před 4 lety +1

    I just saw that leaderboard on Papers with Code.

  • @3dsakr
    @3dsakr Před 4 lety +3

    Very interesting! Thanks a lot for your videos; I just found your channel yesterday and I've already watched like 5 videos :)
    I'd like your help in understanding just one thing that I feel lost on,
    which is: learning what a single neuron does.
    For example: you have an input jpg image of 800x600, passed to Conv2D layer 1, then Conv2D layer 2.
    The question is: how do you get the final result from layer 1 (as an image), and how do you find out what each neuron is doing in that layer? (same for layer 2)
    Keep up the good work.

    • @YannicKilcher
      @YannicKilcher  Před 4 lety +2

      That's a complicated question, especially in a convolutional network. Have a look at OpenAI Microscope.

    • @3dsakr
      @3dsakr Před 4 lety

      @@YannicKilcher Thanks, after some search I found this, github.com/tensorflow/lucid
      and this
      microscope.openai.com/models/inceptionv1/mixed5a_3x3_0/56
      and this
      distill.pub/2020/circuits/early-vision/#conv2d1

    • @3dsakr
      @3dsakr Před 4 lety

      @@YannicKilcher and this, interpretablevision.github.io/

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 Před 4 lety +1

    One question that occurs to me is about the batch size. They show it matters less than for methods that mine negatives within the batch, obviously. But why does it matter at all? Just because of batchnorm-related concerns? There are some good solutions to this available nowadays, are there not? If I used groupnorm for instance, and my dataset was not too big / my patience sufficed (kind of the situation I am in), could I make this work on a single GPU with tiny batches? I don't see any reason why not. Just trying to get a sense of how important the bazillion TPUs are in the big picture. (A groupnorm sketch follows after this thread.)

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 Před 4 lety +1

      Ah right, they explicitly claim it's just the batch norm. Interesting; I would not expect it to matter that much with triple-digit batch sizes.

    • @YannicKilcher
      @YannicKilcher  Před 4 lety

      It's also the fact that there are better negatives in larger batches

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 Před 4 lety

      @@YannicKilcher Yeah, it matters a lot if you are mining within the batch; been there, done that. I'm actually surprised at how little it seems to matter for the negatives in their example. My situation is a very big, memory-hungry model in a hardware-constrained setting; I'm having to deal with batches of size 4 a lot of the time. Sadly they don't go that low in this paper, but if it's true that batchnorm is the only bottleneck, that's very promising to me.

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 Před 4 lety

      ...I'm not sampling negatives from the batch if the batch size is 4, obviously. Memory banks perform decently, but I also find them very tuning-sensitive: how tolerant is one supposed to be of the 'staleness' of entries in the memory bank, etc.? Actually finding negatives of the right hardness consistently isn't easy. It's kind of interesting that the staleness of the memory bank entries can be a feature, if this paper is to be believed.
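
A sketch of that groupnorm idea (the commenter's suggestion, not something the paper does): swapping the BatchNorm in the projection MLP for GroupNorm removes the dependence on batch statistics, though whether it trains as well with tiny batches is exactly the open question here.

    import torch.nn as nn

    def projector(feat_dim=2048, hidden_dim=4096, proj_dim=256, groups=32):
        return nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GroupNorm(groups, hidden_dim),   # normalises per sample, not per batch
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, proj_dim),
        )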

  • @youngjin8300
    @youngjin8300 Před 4 lety

    You rolling my friend rolling!

  • @teslaonly2136
    @teslaonly2136 Před 4 lety +1

    Nice!

  • @004307ec
    @004307ec Před 4 lety +1

    Off-policy design from RL?

  • @BPTScalpel
    @BPTScalpel Před 4 lety +5

    I respectfully disagree about the pseudocode release: with it, it took me half a day to implement the paper, while it took me way more time to replicate SimCLR, because the codebase the SimCLR authors published was disgusting, to say the least.
    One thing that really grinds my gears, though, is that they (deliberately?) ignored the semi-supervised literature.

    • @ulm287
      @ulm287 Před 4 lety +2

      Are you able to reproduce the results? I would imagine 512 TPUs must cost a lot of money, or do you just let it run for days?
      My main concern, as in the video, is why it doesn't collapse to a constant representation... From a theoretical perspective, you are literally exploiting the fact that a NN cannot be optimized to its global minimum, which is weird. If they used their loss for validation, for example, that would mean they are not cross-validating over the init, as a constant zero init would give 0 loss...
      Did you find any hacks they used to avoid this? Like "anti-" weight decay or so, lol

    • @BPTScalpel
      @BPTScalpel Před 4 lety +5

      ​@@ulm287 I have access to a lot of compute for free thanks to the french gov. Right now, we replicated MoCo, MoCo v2 and SimCLR with our codebase and BYOL is currently running.
      I think the main reason it works is that, because of the EMA and the different views, the collapse is very unlikely: while it seems obvious to us that outputting constant features minimises the loss, this solution is very hard to find because of the EMA updates. To the network, it's just one of the solutions, and not even an easy one. The reason it learns anything at all is that there is already some very weak signal in a randomly initialised network, signal that gets refined over time.
      Some patterns are so obvious that even a randomly initialised target network with a modern architecture will be able to output features that contain information related to them (1.4% accuracy on ImageNet with random features). The online network will pick this signal up and refine it to make it robust to the different views (18.4% accuracy on ImageNet with BYOL and a random target network).
      Now you can replace the target network with this trained online network and start again. The idea is that this new target network will be able to pick up signals from more patterns than the original random one. You iterate this process many times until convergence.
      That's basically what they are doing, but instead of waiting for the online network to converge before updating the target network, they maintain an EMA of it. (A compact sketch of one training step follows after this thread.)

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 Před 4 lety +1

      ​@@BPTScalpel Yeah it sort of clicks but I have to think about this a little longer, before I convince myself I really understand it. Curious to hear if your replication code works; if it does, id love to see a github link. Not sure which is better; really clean pseudocode or trash real code. But why not both?
      Having messed around with quite a lot of negative-mining schemes, I know how fragile and prone to collapse that is for the type of problems I've worked on, and what I like about this method is not so much the 1 or 2% here or there, but that it might (emphasis on might) do away with all that tuning.
      So yeah, pretty cool stuff; interesting applications, interesting conceptually... Perhaps the only downside is that it may not follow a super efficient path to its end state and may require quite some training time. But I can live with that for my applications.

    • @YannicKilcher
      @YannicKilcher  Před 4 lety +1

      I'm not against releasing pseudocode. But why can't they just release the actual code?

    • @BPTScalpel
      @BPTScalpel Před 4 lety +1

      @@YannicKilcher My best guess is that the actual code contains a lot of boilerplate code to make it work on their amazing infra and they could not be bothered to clean it =P
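
A compact sketch of one BYOL-style training step, pulling together the pieces discussed in this thread (assumed names, following the paper's published pseudocode rather than DeepMind's actual implementation; the networks are assumed to return (representation, projection) pairs as in the earlier sketch).

    import torch
    import torch.nn.functional as F

    def regression_loss(p, z):
        # 2 - 2 * cosine similarity, i.e. the MSE between l2-normalised vectors
        p = F.normalize(p, dim=-1)
        z = F.normalize(z, dim=-1)
        return 2 - 2 * (p * z).sum(dim=-1)

    def train_step(online, predictor, target, optimizer, view1, view2, tau=0.996):
        p1 = predictor(online(view1)[1])
        p2 = predictor(online(view2)[1])
        with torch.no_grad():                  # no gradients ever flow into the target
            t1 = target(view1)[1]
            t2 = target(view2)[1]

        # symmetrised loss: each view predicts the other view's target projection
        loss = (regression_loss(p1, t2) + regression_loss(p2, t1)).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        with torch.no_grad():                  # EMA update of the target parameters
            for po, pt in zip(online.parameters(), target.parameters()):
                pt.mul_(tau).add_(po, alpha=1 - tau)
        return loss.item()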

  • @MariuszWoloszyn
    @MariuszWoloszyn Před 4 lety

    Actually it's the other way around. Transformers are inspired by transfer learning on image recognition models.

  • @daisukelab
    @daisukelab Před 4 lety

    "if you construct augmentations smartly" ;)

  • @tshev
    @tshev Před 3 lety

    It would be interesting to see what numbers you can get with a 3090 in one week.

  • @tuanphamanh7919
    @tuanphamanh7919 Před 4 lety

    Nice, have a good day!

  • @Agrover112
    @Agrover112 Před 4 lety +1

    By the time I enter a Master's, DL will be so fucking different

  • @theodorosgalanos9663
    @theodorosgalanos9663 Před 4 lety +2

    Alternate name idea: BYOTPUs

  • @CristianGarcia
    @CristianGarcia Před 4 lety

    I asked a question about the mode collapse and a researcher friend pointed me to Mean Teacher (awesome name), where they also do exponential averaging; they might have some insights into why it works: arxiv.org/abs/1703.01780

  • @larrybird3729
    @larrybird3729 Před 4 lety +5

    DeepMind's next paper:
    We solved general intelligence.
    === Here is the code ===
    def general_intelligence(information):
        return "intelligence"

  • @dippatel1739
    @dippatel1739 Před 4 lety +2

    As I said earlier.
    Label exists.
    Augmentation: I am about to end this man's career.

  • @RickeyBowers
    @RickeyBowers Před 4 lety

    Augmentations work to focus the network - it's too great an oversimplification to say the algorithm learns to ignore the augmentations, imho.

  • @sui-chan.wa.kyou.mo.chiisai

    self-supervised is hot

  • @amitsinghrawat8483
    @amitsinghrawat8483 Před 2 lety

    What the hell is this? Why am I here?

  • @sphereron
    @sphereron Před 4 lety +1

    First