Soroush Mehraban
Diffusion Models (DDPM & DDIM) - Easily explained!
In this video I review how diffusion models work for the task of image generation.
DDPM paper: arxiv.org/abs/2006.11239
DDIM paper: arxiv.org/abs/2010.02502
P.S.: I used Invincible for visualization of my explanations, as I find the series pretty cool. :D
Table of Contents:
00:00 Intro
00:23 DDPM
11:26 DDIM
18:14 Outro
Icon made by Freepik from flaticon.com
947 views

Videos

GLIGEN (CVPR2023): Open-Set Grounded Text-to-Image Generation
225 views · 3 months ago
In this video, I review the GLIGEN paper from CVPR2023, which proposes a new type of diffusion model that can receive grounding conditions (bounding boxes, depth maps, etc.) in addition to text to generate images under more constraints. Project page: gligen.github.io/ Table of Contents: 00:00 Intro 01:00 Grounding Tokenization 03:55 Architecture 07:21 Scheduled Sampling 09:21 Results Icon made by Freep...
The Entropy Enigma: Success and Failure of Entropy Minimization
655 views · 3 months ago
In this video, I review the Entropy Enigma paper, which proposes a new method to estimate a model's accuracy without any labels. Paper link: arxiv.org/abs/2405.05012 Table of Contents: 00:00 Intro 00:27 Tent 01:15 Excluding samples 03:31 The two-phase clustering 06:00 Label flips 07:06 Weighted Flips 08:24 Results Icon made by Freepik from flaticon.com
Tent: Fully Test-time Adaptation by Entropy Minimization
206 views · 3 months ago
In this video, I review the Tent approach, which increases test-time accuracy by minimizing prediction entropy at test time. Paper link: arxiv.org/abs/2006.10726 Table of Contents: 00:00 Intro 05:40 Entropy analysis 07:05 TENT Icon made by Freepik from flaticon.com
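For intuition, here is a toy sketch of the quantity Tent minimizes — my own illustration, not the paper's code; the logits and shapes below are made up, and in the real method only the BatchNorm affine parameters are updated against this objective:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def prediction_entropy(logits):
    """Mean Shannon entropy of the batch predictions -- the test-time
    objective Tent minimizes (no labels needed)."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

confident = np.array([[8.0, 0.0, 0.0]])   # peaked prediction: low entropy
uncertain = np.array([[1.0, 1.0, 1.0]])   # uniform prediction: high entropy

assert prediction_entropy(confident) < prediction_entropy(uncertain)
print(round(prediction_entropy(uncertain), 4))  # ln(3) ≈ 1.0986
```

Driving this entropy down pushes the model toward confident predictions on the (unlabeled) test distribution.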
VPD (ICCV2023): Unleashing Text-to-Image Diffusion Models for Visual Perception
199 views · 3 months ago
In this video, I review the VPD paper from ICCV2023, which proposes using a diffusion model as the backbone for visual perception tasks. Paper link: arxiv.org/abs/2303.02153 Table of Contents: 00:00 Intro 02:49 VPD Icon made by Freepik from flaticon.com
TokenHMR (CVPR2024): Advancing Human Mesh Recovery with a Tokenized Pose Representation
337 views · 3 months ago
In this video, I review the TokenHMR paper, which reconstructs a 3D human mesh from an RGB image. Project page: tokenhmr.is.tue.mpg.de/ Table of Contents: 00:00 Intro 00:26 SMPL 02:04 Current issue 03:08 Foreshortened legs 05:31 Analysis 11:45 TALS loss function 15:16 Tokenization 18:47 TokenHMR 22:16 Results Icon made by Freepik from flaticon.com
SHViT (CVPR2024): Single-Head Vision Transformer with Memory Efficient Macro Design
603 views · 3 months ago
In this video, we review the SHViT (Single-Head Vision Transformer) paper, which introduces a memory-efficient Vision Transformer with competitive performance. SHViT reduces computational redundancy with larger-stride patch embeddings and a single-head attention module. Paper link: arxiv.org/abs/2401.16456 Table of Contents: 00:00 Intro 00:54 FastViT 01:33 EfficientFormerV2 02:48 Macro Design An...
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
457 views · 4 months ago
Introducing InstaFlow: A game-changer in text-to-image generation! This one-step diffusion model, leveraging Rectified Flow's 'reflow' technique, achieves SD-level image quality in milliseconds. With an FID of 23.3 on MS COCO 2017-5k and training taking just 199 GPU days, InstaFlow sets new standards in speed and quality. Paper link: arxiv.org/abs/2309.06380 You can also read: arxiv.org/abs/220...
FastV: An Image is Worth 1/2 Tokens After Layer 2
384 views · 5 months ago
Large Vision-Language Models (LVLMs) tackle question answering over images and videos, but they use attention inefficiently, resulting in high computational costs during inference. The FastV paper introduces a method to prune up to 50% of image tokens while maintaining performance comparable to the original model. Paper link: arxiv.org/pdf/2403.06764.pdf Table of Contents: 00:00 Intr...
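A back-of-the-envelope sketch of the pruning criterion described above — my own toy code, not the paper's implementation; the attention matrix, token count, and the "keep top 50%" cut are illustrative assumptions:

```python
import numpy as np

# Hypothetical setup: `attn` is the attention that all queries in one
# layer pay to N image tokens (already averaged over heads).
rng = np.random.default_rng(0)
N = 8
attn = rng.random((N, N))
attn = attn / attn.sum(axis=1, keepdims=True)   # each row sums to 1

# FastV-style idea (sketch): score each image token by the average
# attention it receives, then keep only the top 50% in later layers.
scores = attn.mean(axis=0)
keep = np.argsort(scores)[::-1][: N // 2]       # indices of tokens kept

print(len(keep))  # 4
```

The dropped tokens simply never enter subsequent layers, which is where the inference savings come from.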
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
1.4K views · 5 months ago
Large language models (LLMs) typically demand substantial GPU memory, rendering training impractical on a single consumer GPU, especially for a 7-billion-parameter model that necessitates 58GB of memory. In response, the GaLore paper introduces an innovative strategy that projects gradients into a low-rank space, enabling the model to fit within the constraints of a single GPU. Remarkably, this...
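For intuition, here is a toy sketch of the projection trick the description mentions — my own illustration, not GaLore's code; the sizes and rank are made up, and the real method refreshes the projection only periodically while keeping the Adam-style optimizer state in the small r×n space:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4                      # toy sizes; real layers are much larger
G = rng.standard_normal((m, n))          # full gradient of one weight matrix

# GaLore idea (sketch): work in a rank-r subspace of the gradient.
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]                             # projection onto top-r directions

G_low = P.T @ G                          # r x n: optimizer state lives here
update = P @ G_low                       # project the (toy) update back to m x n

# Memory for optimizer state drops from m*n to roughly r*n per matrix.
print(G_low.shape, update.shape)  # (4, 64) (64, 64)
```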
PoseGPT (ChatPose): Chatting about 3D Human Pose
681 views · 7 months ago
Human pose estimation models struggle to grasp contextual information in images or video frames. Meanwhile, text-to-pose generation models, having limited training data, cannot effectively generate accurate poses for novel prompts. PoseGPT, a novel multimodal language model, not only comprehends 3D human pose but also processes image and text data. This innovative model excels in speculative po...
MotionAGFormer (WACV2024): Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network
900 views · 8 months ago
In this video, I review the MotionAGFormer paper for the task of monocular 3D human pose estimation. Paper link: arxiv.org/abs/2310.16288 GitHub link: github.com/TaatiTeam/MotionAGFormer Table of Contents: 00:00 Intro 00:10 MetaFormer and GCFormer 01:25 MotionAGFormer 03:52 GCNFormer's Adjacency Matrix 06:34 MotionAGFormer Variants 06:56 Results and Comparison
HD-GCN (ICCV2023): Skeleton-Based Action Recognition
1.3K views · 9 months ago
In this video, I review the paper "Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition", published in ICCV2023. Paper link: arxiv.org/abs/2208.10741 Table of Contents: 00:00 Intro 00:24 ST-GCN overview 01:28 MS-G3D overview 02:25 CTR-GCN overview 03:31 Hierarchically Decomposed Graph 20:55 A-HA module 27:12 Six-way Ensemble 29:41 Network Architectu...
ST-GCN: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
3.1K views · 10 months ago
ST-GCN is the first GCN-based method for skeleton-based action recognition. In this video, I explain how it works. Paper link: arxiv.org/abs/1801.07455 Icon made by Freepik from flaticon.com
Graph Convolutional Networks (GCN): From CNN point of view
4.6K views · 10 months ago
Table of Contents: 00:00 CNN Summary 00:58 Analogy of CNN with Graph 03:00 Self-loop connection 04:22 GCN Paper link: arxiv.org/abs/1609.02907 Icon made by Freepik from flaticon.com
DINO: Self-Supervised Vision Transformers
2.6K views · 11 months ago
MoCo (+ v2): Unsupervised learning in computer vision
2.6K views · a year ago
ViTPose: 2D Human Pose Estimation
3.3K views · a year ago
TrackFormer: Multi-Object Tracking with Transformers
4.5K views · a year ago
MetaFormer is Actually What You Need for Vision
939 views · a year ago
ConvNet beats Vision Transformers (ConvNeXt) Paper explained
1.6K views · a year ago
Swin Transformer V2 - Paper explained
3.1K views · a year ago
Masked Autoencoders (MAE) Paper Explained
3K views · a year ago
Relative Position Bias (+ PyTorch Implementation)
3.6K views · a year ago
Swin Transformer - Paper Explained
11K views · a year ago
Vision Transformer (ViT) Paper Explained
2.6K views · a year ago
Convolutional Block Attention Module (CBAM) Paper Explained
6K views · a year ago
Squeeze-and-Excitation Networks (SENet) paper explained
4.8K views · a year ago
Faster R-CNN: Faster than Fast R-CNN!
6K views · a year ago
Receptive Fields: Why 3x3 conv layer is the best?
7K views · a year ago

Comments

  • @ravibhushandixit3500

    can i implement EfficientNet with Squeeze-and-Excitation?

  • @ravibhushandixit3500

    can i implement EfficientNet + SE...???

  • @rojinapanta-q3i · 1 day ago

    i still cannot understand how self.relative_position_bias is changed during training. Could you please elaborate?

  • @shklbor · 3 days ago

    how do they detect poses from heatmaps for, say, 'k' people?

    • @shklbor · 3 days ago

      nevermind, it doesn't detect multiple poses

  • @anupammishra8273 · 6 days ago

    Great explanation

  • @pakalapatisanjay1068 · 10 days ago

    Thanks for the Explanation!!! Loved it

  • @pakalapatisanjay1068 · 10 days ago

    Excellent Explanation. Thanks for that!!

  • @MehdiBarzegar-x6l · 12 days ago

    Great Explanation!!! One of the best videos I've ever seen on GCN, thank you

  • @amirhosseinmohammadi4731

    It was very comprehensive, thanks a lot Soroush

  • @GopalSharma-sf1zz · 19 days ago

    Nice short explanation!

  • @vidaadelimosabeb6689 · 22 days ago

    Great video, keep it up!

  • @HassanHamidi-v8s · 22 days ago

    Wow! Great video, thanks a lot.

  • @ziku8910 · 24 days ago

    Best explanation of GCNs! Thank you.

  • @sanurcucuyeva1958 · 24 days ago

    I really appreciate it, very good explanation. Thanks!

  • @armanhatami5706 · 25 days ago

    awesome Soroush, nice and clear explanation

  • @ai1998 · 25 days ago

    you are great, please keep going

  • @user-qp9so1by1j · a month ago

    Such a wonderful and clear video! Thank you

  • @lucacazzola5327 · a month ago

    sick channel, you're a nice orator! Working on a TTA project right now in Uni 💪

    sick channel, you're a nice orator! Working on a TTA project right now in Uni 💪

    • @soroushmehraban · a month ago

      @@lucacazzola5327 Thank you so much Luca! I've put more recent TTA papers on my agenda to create videos about. Stay tuned🙂

    • @lucacazzola5327 · a month ago

      @@soroushmehraban I'll definitely check it out! Do you have by chance any paper suggestions which specifically target improving over TPT? I'm running out of ideas (and time 😢)

    • @soroushmehraban · a month ago

      @@lucacazzola5327 What is TPT?

    • @lucacazzola5327 · a month ago

      @@soroushmehraban Paper: "Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models". Basically a TTA solution which uses CLIP as backbone

    • @soroushmehraban · a month ago

      @@lucacazzola5327 Oh sorry, I didn't know about that. Thanks for introducing it though.

  • @laspinetta2954 · a month ago

    This part of the Swin Transformer paper is the least understood and took me a long time; I finally understood it clearly thanks to this lecture. I would really appreciate it if you could find these points in many papers in the future and explain them easily! Super Thanks!

    • @soroushmehraban · a month ago

      @@laspinetta2954 I have to understand them first lol. I like to focus on them and make cool videos, but recently I got so busy unfortunately

    • @proterotype · a month ago

      Yeah I'm with this guy

  • @paniProvorova · a month ago

    thank you for great explanations!

  • @efeburako.9670 · a month ago

    nice one, thx

  • @Karthik-kt24 · a month ago

    very nicely explained, thank you! likes are at 314 so didn't hit like😁 subbed instead

  • @angnguyenkhoa5093 · a month ago

    The best video on this topic, hands down

  • @tomactor50 · a month ago

    Fudge, you copy others' work

  • @dslkgjsdlkfjd · a month ago

    2:43 C would be equal to the number of filters, not the number of kernels. In the torch.nn.Conv2d operation being performed, we have 3 kernels for each input channel and then C filters. Each filter has 3 kernels, not C kernels.
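The filter/kernel distinction in this comment can be checked from the weight shape alone — a toy NumPy sketch (all sizes here are hypothetical, chosen only to mirror the 3-channel example):

```python
import numpy as np

# A conv layer with C_in = 3 input channels and C_out = C output channels
# stores its weights as (C_out, C_in, kH, kW): C filters, each holding
# one kernel per input channel (3 kernels), as the comment says.
C_in, C_out, kH, kW = 3, 8, 3, 3
weights = np.random.randn(C_out, C_in, kH, kW)

num_filters = weights.shape[0]          # C_out filters -> C output channels
kernels_per_filter = weights.shape[1]   # one kernel per input channel

print(num_filters, kernels_per_filter)  # 8 3
```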

  • @noony31122009 · 2 months ago

    awesome

  • @marioparreno24 · 2 months ago

    Thanks for the intuitions, FAQs and clearly explained topics!

    • @soroushmehraban · 2 months ago

      Glad you liked it Mario🙂

    • @marioparreno24 · 2 months ago

      @@soroushmehraban Just one question. Why is centering only applied to the teacher and sharpening to both the student and the teacher? Could we not apply centering to both? Maybe if we add both operations to both sides we play a zero-sum game and have the collapse problem again, I don't know 😅 Maybe we then need to artificially create an imbalance

    • @soroushmehraban · 2 months ago

      @@marioparreno24 From my understanding, sharpening makes the model more confident that a sample belongs to a certain pseudo-class (the output label of the model, for which we don't have ground truth). We want the student to stay certain about it, so we sharpen it: the less certain the student is, the less it can differentiate samples from different images. For the teacher we do both to prevent mode collapse. But this is just based on my intuition, don't quote me on that lol.
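The centering/sharpening interplay discussed in this thread can be sketched in a few lines — my own toy illustration, not DINO's code; the logits are random and the temperatures are just plausible values (in the paper the center is an exponential moving average, not a batch mean):

```python
import numpy as np

def softmax(z, temp):
    z = z / temp                           # low temp = sharpening
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 10))
student_logits = rng.standard_normal((4, 10))
center = teacher_logits.mean(axis=0)       # EMA over batches in the real method

# Teacher: centered AND sharpened. Centering pushes toward uniform,
# sharpening toward one-hot -- applying both balances the two collapse modes.
teacher_p = softmax(teacher_logits - center, temp=0.04)
# Student: only sharpened (higher temperature than the teacher).
student_p = softmax(student_logits, temp=0.1)

loss = -(teacher_p * np.log(student_p + 1e-12)).sum(axis=-1).mean()
print(loss > 0)  # True
```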

  • @MadinideAlwis · 2 months ago

    Very interesting! Need more videos.

  • @jialiangxu1657 · 2 months ago

    Hi, I'm still a bit confused, so could you please tell me how you solve the 3D pose jitter? The 2D pose contains jitter, but I cannot find it after lifting to the 3D pose in the demo video of your code. Thank you.

    • @soroushmehraban · 2 months ago

      Hi Jialiang, throughout training the model also sees 2D poses with jitter, but the ground-truth output it sees is motion-capture 3D, and we have a velocity loss (multiplied by 20 to make it 20 times more important than MPJPE) that makes the model's estimate match the ground-truth velocity and penalizes it if it has jitter. So in addition to lifting the input from 2D to 3D and inferring the underlying 3D structure, the model also has to denoise the input.
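The velocity loss mentioned in this reply can be sketched as follows — a toy illustration with made-up shapes and data, not the repository's code:

```python
import numpy as np

def velocity_loss(pred, gt):
    """Sketch of a velocity term: match frame-to-frame differences so the
    3D estimate stays as smooth as the motion-capture ground truth."""
    v_pred = pred[1:] - pred[:-1]          # (T-1, J, 3) per-frame velocities
    v_gt = gt[1:] - gt[:-1]
    return float(np.linalg.norm(v_pred - v_gt, axis=-1).mean())

rng = np.random.default_rng(0)
gt = rng.standard_normal((16, 17, 3))              # T frames, 17 joints, xyz
smooth = gt.copy()                                  # perfect, jitter-free estimate
jittery = gt + 0.1 * rng.standard_normal(gt.shape)  # same poses plus noise

assert velocity_loss(smooth, gt) < velocity_loss(jittery, gt)
# Per the reply above, this term is weighted 20x relative to MPJPE.
```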

  • @pranavgandhiprojects · 2 months ago

    Very, very well explained... I also loved your video on Fast R-CNN :) amazing work

  • @pranavgandhiprojects · 2 months ago

    Wow, so well explained... thank you very much :)

  • @yakuzi07 · 2 months ago

    Is there a way to use Grad-CAM on a Siamese CNN network? I'm getting a graph-disconnect error whenever I try, and I have read that it's because Grad-CAM was originally designed to accept a single input instead of multiple inputs.

  • @VedantJoshi-mr2us · 2 months ago

    By far one of the best and most complete Swin Transformer explanations on the entire Internet.

  • @hamidrezahemati8837 · 2 months ago

    Great video. Keep up the good work

  • @SaraTaro · 2 months ago

    This made it so much clearer!! Great job :)

  • @user-gl5ys8nr2u · 2 months ago

    Excellent video! Would you recommend any resources that explain the theorems they propose for low-rank gradients and their convergence in depth? Also, what tools do you use to create such cool animations?

  • @victormanuel8767 · 2 months ago

    I may not be fully caught up, but this gives some context around why cross-entropy loss is minimized as a criterion during training. Thanks for this overview.

  • @mjavadrajabi7401 · 3 months ago

    Perfect!!

  • @rohollahhosseyni8564 · 3 months ago

    Great video Soroush. Thanks.

  • @NarkeEmpire · 3 months ago

    You are a great teacher 🙏

  • @user-zb9ub5nd1z · 3 months ago

    Hello Soroush, how can I contact you please? I am working on my thesis and wanted your input on something. Thanks

    • @soroushmehraban · 3 months ago

      Hello, just search my name on Google and you'll find me on Twitter or LinkedIn. My email is also shared here on YouTube

  • @alinaderiparizi7193 · 3 months ago

    <3

  • @alinaderiparizi7193 · 3 months ago

    Liked (❤)

  • @alinaderiparizi7193 · 3 months ago

    Perfect, thank you.

  • @ericsy78 · 3 months ago

    Fantastic👌

  • @ericsy78 · 3 months ago

    You're amazing, create more!

  • @senpanwu5163 · 3 months ago

    Great work! You explained it 1000 times better than my uni lecturer :D

  • @subramanyabhat446 · 3 months ago

    The loss functions were definitely a bit tricky to get my head around, but that was a really cool video! One thing you could also have touched upon is the usage of Deformable DETR in place of DETR. I can see the TrackFormer code does incorporate it, but I wanted to know: what changes in TrackFormer when you switch from DETR to the deformable one?

  • @hasanghavidel2701 · 3 months ago

    you explain complicated stuff very clearly... thx

  • @user-ui5dg3nr3r · 3 months ago

    useful