Brian Kent: Density Based Clustering in Python

  • Added 5 July 2024
  • PyData NYC 2015
    Clustering data into similar groups is a fundamental task in data science. Probability density-based clustering has several advantages over popular parametric methods like K-Means, but practical usage of density-based methods has lagged for computational reasons. I will discuss recent algorithmic advances that are making density-based clustering practical for larger datasets.
    Clustering data into similar groups is a fundamental task in data science applications such as exploratory data analysis, market segmentation, and outlier detection. Density-based clustering methods are based on the intuition that clusters are regions where many data points lie near each other, surrounded by regions without much data.
    Density-based methods typically have several important advantages over popular model-based methods like K-Means: they do not require users to know the number of clusters in advance, they recover clusters with more flexible shapes, and they automatically detect outliers. On the other hand, density-based clustering tends to be more computationally expensive than parametric methods, so density-based methods have not seen the same level of adoption by data scientists.
    Recent computational advances are changing this picture. I will talk about two density-based methods and how new Python implementations are making them more useful for larger datasets. DBSCAN is by far the most popular density-based clustering method. A new implementation in Dato's GraphLab Create machine learning package dramatically speeds up DBSCAN computation by taking advantage of GraphLab Create's multi-threaded architecture and using an algorithm based on the connected components of a similarity graph.
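    To make the connected-components idea concrete, here is a minimal pure-Python sketch of DBSCAN phrased that way. It is not GraphLab Create's implementation or API; the function name `dbscan_components` and the toy data are hypothetical, and the brute-force neighbor search is only suitable for tiny inputs.

```python
from collections import deque
from math import dist


def dbscan_components(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    Hypothetical helper for illustration only.
    """
    n = len(points)
    # Similarity graph: i and j are neighbors if they lie within eps of each
    # other. Each point counts as its own neighbor, as in standard DBSCAN.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts neighbors.
    core = [len(nb) >= min_pts for nb in neighbors]

    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if not core[start] or labels[start] != -1:
            continue
        # Breadth-first search over core points: one connected component
        # of the core-point graph becomes one cluster.
        labels[start] = cluster
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster      # border or core point joins
                    if core[j]:
                        queue.append(j)      # only core points keep expanding
        cluster += 1
    return labels


# Toy data (assumed): two tight squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
labels = dbscan_components(points, eps=1.5, min_pts=3)
print(labels)
```

    Note that the number of clusters is never passed in, and the isolated point is labeled -1 (noise) rather than being forced into a cluster.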
    The density Level Set Tree is a method first proposed theoretically by Chaudhuri and Dasgupta in 2010 as a way to represent a probability density function hierarchically, enabling users to use all density levels simultaneously, rather than choosing a specific level as with DBSCAN. The Python package DeBaCl implements a modification of this method and a tool for interactively visualizing the cluster hierarchy.
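    The idea of looking at several density levels at once can be sketched as follows. This is a simplified illustration, not DeBaCl's API and not the exact Chaudhuri-Dasgupta estimator: it assumes a k-nearest-neighbor graph, uses the inverse k-NN radius as an unnormalized density proxy, and the function names and toy data are hypothetical.

```python
from math import dist


def knn_graph_and_density(points, k):
    """Build a symmetrized k-NN graph and a crude per-point density proxy.

    Assumes all points are distinct (otherwise a k-NN radius could be zero).
    """
    n = len(points)
    graph, density = [], []
    for i in range(n):
        # Other points sorted by distance to point i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        graph.append(set(order[:k]))
        radius = dist(points[i], points[order[k - 1]])
        density.append(1.0 / radius)  # larger value = denser neighborhood
    # Symmetrize: if j is a neighbor of i, make i a neighbor of j.
    for i in range(n):
        for j in list(graph[i]):
            graph[j].add(i)
    return graph, density


def components_at_level(graph, density, level):
    """Connected components among points whose density is at least `level`."""
    keep = {i for i, d in enumerate(density) if d >= level}
    seen, components = set(), []
    for start in sorted(keep):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in graph[i]:
                if j in keep and j not in seen:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(component))
    return components


# Toy 1-D data (assumed): two dense groups joined by a sparse bridge point.
points = [(0,), (1,), (2,), (5,), (8,), (9,), (10,)]
graph, density = knn_graph_and_density(points, k=2)
for level in (0.3, 0.4):
    print(level, components_at_level(graph, density, level))
```

    At the lower level all points form one component; raising the level drops the sparse bridge point and the component splits in two. Recording those splits across all levels is what yields a tree, whereas DBSCAN corresponds to fixing a single level up front.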
    Slides available here: speakerdeck.com/papayawarrior...
    Notebooks: nbviewer.ipython.org/github/pa...
    nbviewer.ipython.org/github/pa...
  • Science & Technology

Comments • 6

  • @aristoi, 7 years ago

    Great and very clear explanation. I'll be checking out DeBaCl

  • @floyddsouza8855, 2 years ago, +1

    Are level set trees similar to HDBSCAN?

  • @shobhitverma9467, 2 years ago

    Wow!

  • @meghanashankar6628, 8 years ago

    awesome explanation...great work

  • @Grepoan, 7 years ago

    Were the clusters in the hurricane data/figure correlated with time or season or temperature or CO2 level? :)

  • @shruthihariharapura, 8 years ago

    Very informative Lecture