Sergey Feldman: You Should Probably Be Doing Nested Cross-Validation | PyData Miami 2019

  • Added 25. 07. 2024
  • It is common to perform model selection while also attempting to estimate accuracy on a held-out set. The traditional solution is to split a dataset into training, validation, and test subsets; on small datasets, however, this strategy suffers from high variance. A common way to reuse a small number of samples for model selection is cross-validation, typically applied across the entire non-test portion of the data, after which the best model is evaluated on the test set. This approach has a fundamental flaw: if the test set is small, the performance estimate has high variance. The solution is double (or nested) cross-validation, which is explained in this talk.
    www.pydata.org
    PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
    PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
    00:00 Welcome!
    00:10 Help us add time stamps or captions to this video! See the description for details.
    Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: github.com/numfocus/YouTubeVi...
  • Science & Technology
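The nested cross-validation scheme described in the abstract can be sketched with scikit-learn. This is a minimal illustration, not the talk's actual code: the dataset, estimator, and hyperparameter grid are placeholder assumptions.

```python
# Minimal sketch of nested (double) cross-validation with scikit-learn.
# Inner loop: hyperparameter selection. Outer loop: performance estimation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Placeholder dataset standing in for a real (small) dataset.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # accuracy estimate

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=inner_cv)

# For each outer fold, GridSearchCV runs the inner 5-fold CV on the outer
# training split to pick C, refits, and is then scored on the held-out fold.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
```

Because every outer fold is scored by a model that never saw it during hyperparameter selection, the mean of `scores` is an (approximately) unbiased performance estimate, unlike reusing a single small test set.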

Comments • 6

  • @iancherabier5920 · 5 months ago

    Thanks a lot, an extremely clear explanation of nested CV!

  • @bryanparis7779 · 1 year ago

    THANK YOU so helpful! so interesting so so so :)

  • @BulkySplash169 · 2 years ago

    Nice, thx!

  • @QIQIWU-fd1xz · 1 year ago +1

    This is really helpful! Thanks for sharing. One question: at 10:55, when running the 5-fold CV, shouldn't we use X_train_val instead of X_train? Since the splitting is done by sklearn, we don't need to hold out a separate validation set.
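    A minimal sketch of the pattern this comment describes: pass the combined train+validation pool to GridSearchCV and let it carve out the validation folds itself. The variable name X_train_val follows the comment; the dataset and estimator are illustrative assumptions, not the talk's slides.

    ```python
    # Sketch: sklearn does the train/validation splitting internally,
    # so only a final test set needs to be held out manually.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=150, random_state=0)

    # Hold out a test set; everything else is the train+validation pool.
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X_train_val, y_train_val)  # 5-fold CV splits X_train_val itself
    print(search.best_params_, search.score(X_test, y_test))
    ```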

  • @nespereira · 9 days ago

    Very useful! One question: in many medical datasets, especially in single-group research settings, sample sizes are around 100 or fewer (samples in the thousands are rare). With this number of subjects, one worry is that setting aside subjects for testing removes samples in a context where there is not much data to begin with. Then you need to think about how many features you can afford, etc...
    Don't get me wrong, I'm all in for nested cross-validation, but I'm curious to hear your thoughts on this type of scenario, where getting data is really expensive.