Soroush Mehraban
Diffusion Models (DDPM & DDIM) - Easily explained!
In this video I review how diffusion models work for the task of image generation.
DDPM paper: arxiv.org/abs/2006.11239
DDIM paper: arxiv.org/abs/2010.02502
P.S.: I used Invincible for visualization of my explanations, as I find the series pretty cool. :D
Table of Contents:
00:00 Intro
00:23 DDPM
11:26 DDIM
18:14 Outro
Icon made by Freepik from flaticon.com
947 views

Videos

GLIGEN (CVPR2023): Open-Set Grounded Text-to-Image Generation
225 views · 3 months ago
In this video, I review the GLIGEN paper from CVPR2023, which proposes a new type of diffusion model that can receive grounding conditions (bounding boxes, depth maps, etc.) in addition to text to generate images under more constraints. Project page: gligen.github.io/ Table of Contents: 00:00 Intro 01:00 Grounding Tokenization 03:55 Architecture 07:21 Scheduled Sampling 09:21 Results Icon made by Freep...
The Entropy Enigma: Success and Failure of Entropy Minimization
655 views · 3 months ago
In this video, I review the Entropy Enigma paper, which proposes a new method to estimate a model's accuracy without any labels. Paper link: arxiv.org/abs/2405.05012 Table of Contents: 00:00 Intro 00:27 Tent 01:15 Excluding samples 03:31 The two-phase clustering 06:00 Label flips 07:06 Weighted Flips 08:24 Results Icon made by Freepik from flaticon.com
Tent: Fully Test-time Adaptation by Entropy Minimization
206 views · 3 months ago
In this video, I review the Tent approach, which increases test-time accuracy by minimizing prediction entropy at test time. Paper link: arxiv.org/abs/2006.10726 Table of Contents: 00:00 Intro 05:40 Entropy analysis 07:05 TENT Icon made by Freepik from flaticon.com
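For intuition, here is a toy sketch of the quantity Tent minimizes — my own illustration, not the paper's code; the logits and shapes below are made up, and in the real method only the BatchNorm affine parameters are updated against this objective:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def prediction_entropy(logits):
    """Mean Shannon entropy of the batch predictions -- the test-time
    objective Tent minimizes (no labels needed)."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

confident = np.array([[8.0, 0.0, 0.0]])   # peaked prediction: low entropy
uncertain = np.array([[1.0, 1.0, 1.0]])   # uniform prediction: high entropy

assert prediction_entropy(confident) < prediction_entropy(uncertain)
print(round(prediction_entropy(uncertain), 4))  # ln(3) ≈ 1.0986
```

Driving this entropy down pushes the model toward confident predictions on the (unlabeled) test distribution.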
VPD (ICCV2023): Unleashing Text-to-Image Diffusion Models for Visual Perception
199 views · 3 months ago
In this video, I review the VPD paper from ICCV2023, which proposes using a diffusion model as the backbone for visual perception tasks. Paper link: arxiv.org/abs/2303.02153 Table of Contents: 00:00 Intro 02:49 VPD Icon made by Freepik from flaticon.com
TokenHMR (CVPR2024): Advancing Human Mesh Recovery with a Tokenized Pose Representation
337 views · 3 months ago
In this video, I review the TokenHMR paper, which reconstructs a 3D human mesh from an RGB image. Project page: tokenhmr.is.tue.mpg.de/ Table of Contents: 00:00 Intro 00:26 SMPL 02:04 Current issue 03:08 Foreshortened legs 05:31 Analysis 11:45 TALS loss function 15:16 Tokenization 18:47 TokenHMR 22:16 Results Icon made by Freepik from flaticon.com
SHViT (CVPR2024): Single-Head Vision Transformer with Memory Efficient Macro Design
603 views · 3 months ago
In this video, we review the SHViT (Single-Head Vision Transformer) paper, which introduces a memory-efficient Vision Transformer with competitive performance. SHViT reduces computational redundancy with larger-stride patch embeddings and a single-head attention module. Paper link: arxiv.org/abs/2401.16456 Table of Contents: 00:00 Intro 00:54 FastViT 01:33 EfficientFormerV2 02:48 Macro Design An...
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
457 views · 4 months ago
Introducing InstaFlow: A game-changer in text-to-image generation! This one-step diffusion model, leveraging Rectified Flow's 'reflow' technique, achieves SD-level image quality in milliseconds. With an FID of 23.3 on MS COCO 2017-5k and training taking just 199 GPU days, InstaFlow sets new standards in speed and quality. Paper link: arxiv.org/abs/2309.06380 You can also read: arxiv.org/abs/220...
FastV: An Image is Worth 1/2 Tokens After Layer 2
384 views · 5 months ago
Large Vision-Language Models (LVLMs) tackle question answering over images and videos, but they use attention inefficiently, resulting in high computational costs during inference. The FastV paper introduces a method to prune up to 50% of image tokens while maintaining performance comparable to the original model. Paper link: arxiv.org/pdf/2403.06764.pdf Table of Contents: 00:00 Intr...
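A back-of-the-envelope sketch of the pruning criterion described above — my own toy code, not the paper's implementation; the attention matrix, token count, and the "keep top 50%" cut are illustrative assumptions:

```python
import numpy as np

# Hypothetical setup: `attn` is the attention that all queries in one
# layer pay to N image tokens (already averaged over heads).
rng = np.random.default_rng(0)
N = 8
attn = rng.random((N, N))
attn = attn / attn.sum(axis=1, keepdims=True)   # each row sums to 1

# FastV-style idea (sketch): score each image token by the average
# attention it receives, then keep only the top 50% in later layers.
scores = attn.mean(axis=0)
keep = np.argsort(scores)[::-1][: N // 2]       # indices of tokens kept

print(len(keep))  # 4
```

The dropped tokens simply never enter subsequent layers, which is where the inference savings come from.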
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
1.4K views · 5 months ago
Large language models (LLMs) typically demand substantial GPU memory, rendering training impractical on a single consumer GPU, especially for a 7-billion-parameter model that necessitates 58GB of memory. In response, the GaLore paper introduces an innovative strategy that projects gradients into a low-rank space, enabling the model to fit within the constraints of a single GPU. Remarkably, this...
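For intuition, here is a toy sketch of the projection trick the description mentions — my own illustration, not GaLore's code; the sizes and rank are made up, and the real method refreshes the projection only periodically while keeping the Adam-style optimizer state in the small r×n space:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4                      # toy sizes; real layers are much larger
G = rng.standard_normal((m, n))          # full gradient of one weight matrix

# GaLore idea (sketch): work in a rank-r subspace of the gradient.
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]                             # projection onto top-r directions

G_low = P.T @ G                          # r x n: optimizer state lives here
update = P @ G_low                       # project the (toy) update back to m x n

# Memory for optimizer state drops from m*n to roughly r*n per matrix.
print(G_low.shape, update.shape)  # (4, 64) (64, 64)
```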
PoseGPT (ChatPose): Chatting about 3D Human Pose
681 views · 7 months ago
Human pose estimation models struggle to grasp contextual information in images or video frames. Meanwhile, text-to-pose generation models, having limited training data, cannot effectively generate accurate poses for novel prompts. PoseGPT, a novel multimodal language model, not only comprehends 3D human pose but also processes image and text data. This innovative model excels in speculative po...
MotionAGFormer (WACV2024): Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network
900 views · 8 months ago
In this video, I review the MotionAGFormer paper for the task of monocular 3D human pose estimation. Paper link: arxiv.org/abs/2310.16288 GitHub link: github.com/TaatiTeam/MotionAGFormer Table of Contents: 00:00 Intro 00:10 MetaFormer and GCFormer 01:25 MotionAGFormer 03:52 GCNFormer's Adjacency Matrix 06:34 MotionAGFormer Variants 06:56 Results and Comparison
HD-GCN (ICCV2023): Skeleton-Based Action Recognition
1.3K views · 9 months ago
In this video, I review the paper "Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition", published in ICCV2023. Paper link: arxiv.org/abs/2208.10741 Table of Contents: 00:00 Intro 00:24 ST-GCN overview 01:28 MS-G3D overview 02:25 CTR-GCN overview 03:31 Hierarchically Decomposed Graph 20:55 A-HA module 27:12 Six-way Ensemble 29:41 Network Architectu...
ST-GCN: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
3.1K views · 10 months ago
ST-GCN is the first GCN-based method for skeleton-based action recognition. In this video, I explain how it works. Paper link: arxiv.org/abs/1801.07455 Icon made by Freepik from flaticon.com
Graph Convolutional Networks (GCN): From CNN point of view
4.6K views · 10 months ago
Table of Contents: 00:00 CNN Summary 00:58 Analogy of CNN with Graph 03:00 Self-loop connection 04:22 GCN Paper link: arxiv.org/abs/1609.02907 Icon made by Freepik from flaticon.com
DINO: Self-Supervised Vision Transformers
2.6K views · 11 months ago
MoCo (+ v2): Unsupervised learning in computer vision
2.6K views · a year ago
ViTPose: 2D Human Pose Estimation
3.3K views · a year ago
TrackFormer: Multi-Object Tracking with Transformers
4.5K views · a year ago
MetaFormer is Actually What You Need for Vision
939 views · a year ago
ConvNet beats Vision Transformers (ConvNeXt) Paper explained
1.6K views · a year ago
Swin Transformer V2 - Paper explained
3.1K views · a year ago
Masked Autoencoders (MAE) Paper Explained
3K views · a year ago
Relative Position Bias (+ PyTorch Implementation)
3.6K views · a year ago
Swin Transformer - Paper Explained
11K views · a year ago
Vision Transformer (ViT) Paper Explained
2.6K views · a year ago
Convolutional Block Attention Module (CBAM) Paper Explained
6K views · a year ago
Squeeze-and-Excitation Networks (SENet) paper explained
4.8K views · a year ago
Faster R-CNN: Faster than Fast R-CNN!
6K views · a year ago
Receptive Fields: Why 3x3 conv layer is the best?
7K views · a year ago

Comments

  • @ravibhushandixit3500

    can i implement EfficientNet with Squeeze-and-Excitation?

  • @ravibhushandixit3500

    can i implement EfficientNet + SE...???

  • @rojinapanta-q3i · 1 day ago

    i still cannot understand how self.relative_position_bias is changed during training. Could you please elaborate?

  • @shklbor · 3 days ago

    how do they detect poses from heatmaps for, say, 'k' people?

    • @shklbor · 3 days ago

      nevermind, it doesn't detect multiple poses

  • @anupammishra8273 · 6 days ago

    Great explanation

  • @pakalapatisanjay1068 · 10 days ago

    Thanks for the Explanation!!! Loved it

  • @pakalapatisanjay1068 · 10 days ago

    Excellent Explanation. Thanks for that!!

  • @MehdiBarzegar-x6l · 12 days ago

    Great Explanation!!! One of the best videos I've ever seen on GCN, thank you

  • @amirhosseinmohammadi4731

    It was very comprehensive, thanks a lot Soroush

  • @GopalSharma-sf1zz · 19 days ago

    Nice short explanation!

  • @vidaadelimosabeb6689 · 22 days ago

    Great video, keep it up!

  • @HassanHamidi-v8s · 22 days ago

    Wow! Great video, thanks a lot.

  • @ziku8910 · 24 days ago

    Best explanation of GCNs! Thank you.

  • @sanurcucuyeva1958 · 24 days ago

    I really appreciate it, very good explanation. Thanks!

  • @armanhatami5706 · 25 days ago

    awesome Soroush, nice and clear explanation

  • @ai1998 · 25 days ago

    you are great, please keep going

  • @user-qp9so1by1j · a month ago

    Such a wonderful and clear video! Thank you

  • @lucacazzola5327 · a month ago

    sick channel, you're a nice orator! Working on a TTA project right now in Uni 💪

    sick channel, you're a nice orator! Working on a TTA project right now in Uni 💪

    • @soroushmehraban · a month ago

      @@lucacazzola5327 Thank you so much Luca! I've put more recent TTA papers on my agenda to create videos about. Stay tuned🙂

    • @lucacazzola5327 · a month ago

      @@soroushmehraban I'll definitely check it out! Do you have by chance any paper suggestions which specifically target improving over TPT? I'm running out of ideas (and time 😢)

    • @soroushmehraban · a month ago

      @@lucacazzola5327 What is TPT?

    • @lucacazzola5327 · a month ago

      @@soroushmehraban Paper: "Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models". Basically a TTA solution which uses CLIP as backbone

    • @soroushmehraban · a month ago

      @@lucacazzola5327 Oh sorry, I didn't know about that. Thanks for introducing it though.

  • @laspinetta2954 · a month ago

    This part of the Swin Transformer paper is the least understood and took me a long time; I finally understood it clearly thanks to this lecture. I would really appreciate it if you could find these points in many papers in the future and explain them easily! Super Thanks!

    • @soroushmehraban · a month ago

      @@laspinetta2954 I have to understand them first lol. I like to focus on them and make cool videos, but recently I got so busy unfortunately

    • @proterotype · a month ago

      Yeah I'm with this guy

  • @paniProvorova · a month ago

    thank you for great explanations!

  • @efeburako.9670 · a month ago

    nice one, thx

  • @Karthik-kt24 · a month ago

    very nicely explained, thank you! likes are at 314 so didn't hit like😁 subbed instead

  • @angnguyenkhoa5093 · a month ago

    The best video on this topic, hands down

  • @tomactor50 · a month ago

    Fudge, you copy others' work

  • @dslkgjsdlkfjd · a month ago

    2:43 C would be equal to the number of filters, not the number of kernels. In the torch.nn.Conv2d operation being performed, we have 3 kernels for each input channel and then C filters. Each filter has 3 kernels, not C kernels.
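The filter/kernel distinction in this comment can be checked from the weight shape alone — a toy NumPy sketch (all sizes here are hypothetical, chosen only to mirror the 3-channel example):

```python
import numpy as np

# A conv layer with C_in = 3 input channels and C_out = C output channels
# stores its weights as (C_out, C_in, kH, kW): C filters, each holding
# one kernel per input channel (3 kernels), as the comment says.
C_in, C_out, kH, kW = 3, 8, 3, 3
weights = np.random.randn(C_out, C_in, kH, kW)

num_filters = weights.shape[0]          # C_out filters -> C output channels
kernels_per_filter = weights.shape[1]   # one kernel per input channel

print(num_filters, kernels_per_filter)  # 8 3
```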

  • @noony31122009 · 2 months ago

    awesome

  • @marioparreno24 · 2 months ago

    Thanks for the intuitions, FAQs and clearly explained topics!

    • @soroushmehraban · 2 months ago

      Glad you liked it Mario🙂

    • @marioparreno24 · 2 months ago

      @@soroushmehraban Just one question. Why is centering only applied to the teacher and sharpening to both the student and the teacher? Could we not apply centering to both? Maybe if we add both operations to both sides we play a zero-sum game and have the collapse problem again, I don't know 😅 Maybe we then need to artificially create an imbalance

    • @soroushmehraban · 2 months ago

      @@marioparreno24 From my understanding, sharpening makes the model more confident that a sample belongs to a certain pseudo-class (the output label of the model, for which we don't have ground truth). We want the student to stay certain about it, so we sharpen it: the less certain the student is, the less it can differentiate samples from different images. For the teacher we do both to prevent mode collapse. But this is just based on my intuition, don't quote me on that lol.
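The centering/sharpening interplay discussed in this thread can be sketched in a few lines — my own toy illustration, not DINO's code; the logits are random and the temperatures are just plausible values (in the paper the center is an exponential moving average, not a batch mean):

```python
import numpy as np

def softmax(z, temp):
    z = z / temp                           # low temp = sharpening
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 10))
student_logits = rng.standard_normal((4, 10))
center = teacher_logits.mean(axis=0)       # EMA over batches in the real method

# Teacher: centered AND sharpened. Centering pushes toward uniform,
# sharpening toward one-hot -- applying both balances the two collapse modes.
teacher_p = softmax(teacher_logits - center, temp=0.04)
# Student: only sharpened (higher temperature than the teacher).
student_p = softmax(student_logits, temp=0.1)

loss = -(teacher_p * np.log(student_p + 1e-12)).sum(axis=-1).mean()
print(loss > 0)  # True
```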

  • @MadinideAlwis · 2 months ago

    Very interesting! Need more videos.

  • @jialiangxu1657 · 2 months ago

    Hi, I'm still a bit confused, so could you please tell me how you solve the 3D pose jitter? The 2D pose contains jitter, but I cannot find it after lifting to the 3D pose in the demo video of your code. Thank you.

    • @soroushmehraban · 2 months ago

      Hi Jialiang, throughout training the model also sees 2D poses with jitter, but the ground-truth output it sees is motion-capture 3D, and we have a velocity loss (multiplied by 20 to make it 20 times more important than MPJPE) that makes the model's estimate match the ground-truth velocity and penalizes it if it has jitter. So in addition to lifting the input from 2D to 3D and inferring the underlying 3D structure, the model also has to denoise the input.
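The velocity loss mentioned in this reply can be sketched as follows — a toy illustration with made-up shapes and data, not the repository's code:

```python
import numpy as np

def velocity_loss(pred, gt):
    """Sketch of a velocity term: match frame-to-frame differences so the
    3D estimate stays as smooth as the motion-capture ground truth."""
    v_pred = pred[1:] - pred[:-1]          # (T-1, J, 3) per-frame velocities
    v_gt = gt[1:] - gt[:-1]
    return float(np.linalg.norm(v_pred - v_gt, axis=-1).mean())

rng = np.random.default_rng(0)
gt = rng.standard_normal((16, 17, 3))              # T frames, 17 joints, xyz
smooth = gt.copy()                                  # perfect, jitter-free estimate
jittery = gt + 0.1 * rng.standard_normal(gt.shape)  # same poses plus noise

assert velocity_loss(smooth, gt) < velocity_loss(jittery, gt)
# Per the reply above, this term is weighted 20x relative to MPJPE.
```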

  • @pranavgandhiprojects · 2 months ago

    Very, very well explained... I also loved your video on Fast R-CNN :) amazing work

  • @pranavgandhiprojects · 2 months ago

    Wow, so well explained... thank you very much :)

  • @yakuzi07 · 2 months ago

    Is there a way to use Grad-CAM on a Siamese CNN network? I'm getting a graph-disconnect error whenever I try, and I have read that it's because Grad-CAM was originally designed to accept a single input instead of multiple inputs.

  • @VedantJoshi-mr2us · 2 months ago

    By far one of the best and most complete Swin Transformer explanations on the entire Internet.

  • @hamidrezahemati8837 · 2 months ago

    Great video. Keep up the good work

  • @SaraTaro · 2 months ago

    This made it so much clearer!! Great job :)

  • @user-gl5ys8nr2u · 2 months ago

    Excellent video! Would you recommend any resources that explain the theorems they propose for low-rank gradients and their convergence in depth? Also, what tools do you use to create such cool animations?

  • @victormanuel8767 · 2 months ago

    I may not be fully caught up, but this gives some context around why cross-entropy loss is minimized as a criterion during training. Thanks for this overview.

  • @mjavadrajabi7401 · 3 months ago

    Perfect!!

  • @rohollahhosseyni8564 · 3 months ago

    Great video Soroush. Thanks.

  • @NarkeEmpire · 3 months ago

    You are a great teacher 🙏

  • @user-zb9ub5nd1z · 3 months ago

    Hello Soroush, how can I contact you please? I am working on my thesis and wanted your input on something. Thanks

    • @soroushmehraban · 3 months ago

      Hello, just search my name on Google and you'll find me on Twitter or LinkedIn. My email is also shared here on YouTube

  • @alinaderiparizi7193 · 3 months ago

    <3

  • @alinaderiparizi7193 · 3 months ago

    Liked (❤)

  • @alinaderiparizi7193 · 3 months ago

    Perfect, thank you.

  • @ericsy78 · 3 months ago

    Fantastic👌

  • @ericsy78 · 3 months ago

    You're amazing, create more!

  • @senpanwu5163 · 3 months ago

    Great work! You explained it 1000 times better than my uni lecturer :D

  • @subramanyabhat446 · 3 months ago

    The loss functions were definitely a bit tricky to get my head around, but that was a really cool video! One thing you could also have touched upon is the usage of Deformable DETR in place of DETR. I can see the TrackFormer code does incorporate it, but I wanted to know: what changes in TrackFormer when you switch from DETR to the deformable one?

  • @hasanghavidel2701 · 3 months ago

    you explain complicated stuff very clearly... thx

  • @user-ui5dg3nr3r · 3 months ago

    useful