"Probabilistic scripts for automating common-sense tasks" by Alexander Lew

  • Published 29. 08. 2024

Comments • 74

  • @nuclearlion • 4 years ago • +42

    As a former trucker, could have used a trigger warning for that low bridge :) anything less than 13' 6" clearance is potentially traumatic. Loving this talk, thank you!

    • @grex2595 • 1 year ago

      Don't look up can opener bridge.

  • @davidknipe1179 • 4 years ago • +30

    10:20: "At HP in Cambridge they've made a neon sign of the formula." I used to work at that office. That sign was removed a few years ago. The rumour was that it was judged to be intimidating to customers. And in fact the sign predated the takeover of that company (Autonomy) by HP, so it's not true to say that HP made the sign. I believe they also removed one that said "S = k log W" (entropy) at the same time.

  • @rban123 • 4 years ago • +6

    When you said “cleaning data is the most time-consuming part”....... I felt that

  • @GopinathSadasivam • 5 years ago • +48

    Excellent, Concise and Clear presentation!
    Thanks!

  • @mateuszbaginski5075 • 2 years ago • +1

    That's awesome. Maybe one tiny but significant step closer to formalizing and operationalizing common sense.

  • @sau002 • 4 years ago • +1

    I like the scenario based approach. Demonstrate the problem from an end user's perspective and then arrive at the solution.

  • @catcatcatcatcatcatcatcatcatca

    ”We can’t just train a neural network on a slide-deck and tell it to use common sense” captures quite well how far we have gotten in three years.
    More formal rules might still provide better data, and there is no way chatGPT could parse such a dataset in one go (and so it can’t utilise the already included data properly).
    But still, we now can. And if properly implemented it is likely on par with human common sense, if not just better. It might not even need the slide-deck! That’s wild

  • @gregmattson2238 • 3 years ago

    wow, three talks, three times my mind has been blown. in particular, the idea of iterating over the data itself to get valid values is just plain insane. I may actually use alloy, pclean, stabilizer and coz.

  • @thestopper5165 • 4 years ago • +6

    This is a nice way of presenting stuff that good data-mungers have done since I was a grad student (last millennium).
    Before everyone gets all wet about it, you need to consider that the proposed solution *requires data that is not part of the original table* - specifically, population by city by state. May as well just assume that your dodgy data can be saved by some other data you've got lying around. Good old *Deus ex machina*.
    Also, not for nuthin'... *rental prices are also subject to typos* - so in the example at 16:00 the sign of the outcome is reversed if there is a typo in the rental price (a toy sketch of this appears after this comment).
    Let's say it should have been $400 rather than $4000 - although there will be more available rentals in Beverly Hills CA, very few of them will be at $400/room per month (or more accurately, $400/month will be further from the mean for BH, CA than it is for BH, MO).
    TL;DR: CompSci guys need to learn actual statistics before they learn statistics buzzwords - and not just a 1-semester Intro Stats that people take as undergrad, but something with some genuine meat.
    Once you accept that *everything* is subject to typos, the number of tailored kludges needed to clean a set of data is arrived at by iteration.
    And that's leaving aside things like spatial data: if you know that a polygon representing a property boundary can't have self-intersections, what is the correct mechanism when ST_IsValid() fails? (Answer: it depends).
    If parcel boundaries and property boundaries can overlap, and sets of properties and parcels represent a specific set of coordinates, what is wrong when the total area of the parcels exceeds that of the properties? (Answer: it depends).
    Part of my role used to involve cleaning a geospatial dataset that contained 3 million property boundaries and 3.7 million parcel boundaries covering a state. It had to be done every month, even though a very large proportion of the data had identifiers showing it was unchanged for the month (more polygon ST_IsValid() errors would pop up in the supposedly unchanged data than total actual changes due to subdivisions or consolidations). *FML*, in other words.
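
    A toy enumeration in the spirit of the comment above: if the model is also allowed to hypothesize a typo in the rent itself, the highest-posterior reading with these invented numbers is still (CA, no typo), which is exactly the failure mode described when the listed $4000 is itself a typo for $400. All priors, rent distributions, and the "extra zero" typo model below are made up for illustration; this is not PClean's actual model, and it assumes the Distributions.jl package is installed.

    ```julia
    using Distributions

    # Invented per-city rent models and priors -- purely illustrative numbers.
    rent_dist   = Dict("CA" => Normal(4000, 1000),   # Beverly Hills, CA
                       "MO" => Normal(700, 200))     # Beverly Hills, MO
    prior_state = Dict("CA" => 0.9, "MO" => 0.1)     # e.g. proportional to listing counts
    p_rent_typo = 0.01                               # prior chance the listed rent is a typo

    observed_rent = 4000.0
    strip_extra_zero(r) = r / 10                     # one simple typo model: an extra zero was typed

    # Enumerate joint hypotheses (state, rent-is-typo) and normalize.
    posterior = Dict{Tuple{String,Bool},Float64}()
    for s in keys(prior_state), typo in (false, true)
        rent = typo ? strip_extra_zero(observed_rent) : observed_rent
        posterior[(s, typo)] = prior_state[s] *
                               (typo ? p_rent_typo : 1 - p_rent_typo) *
                               pdf(rent_dist[s], rent)
    end
    total = sum(values(posterior))
    for (h, p) in sort(collect(posterior); by = x -> -x[2])
        println(h, " => ", round(p / total; digits = 3))
    end
    ```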

  • @jimmy21584 • 4 years ago • +2

    That was a fantastic talk. I was waiting for some pseudocode or a deeper technical breakdown of how they implemented it, but it was great for an introduction.

    • @ukaszMarianszki • 4 years ago • +1

      What? You got both actual code [or if what he showed wasn't actual code, then it's most certainly pseudocode] and a bit of a peek at how they actually implemented it [although not a complete breakdown].

  • @123TeeMee • 4 years ago • +3

    Wow, amazing talk! Really enjoyed it.
    I can see this being refined into a powerful tool that many people can and should use.

  • @Spookyhoobster • 4 years ago • +2

    This seems so crazy, definitely going to give Gen a look. Thanks for the video.

  • @adamdude • 8 months ago

    The most important question is: does doing data cleaning this way improve your final result? Like, does it make your final conclusions about the data more accurate to the real world? Does it help you gain insight you didn't already have? Does it make it more or less likely to reveal new true conclusions about the data?
    I'd imagine that by making so many assumptions about messy data, you're not learning anything new from that data.

  • @nikhilbatta4601 • 4 years ago • +6

    Thanks for this, it helped me open my mind and think of a new way to solve a problem.

  • @borstenpinsel • 4 years ago • +29

    2009: this is the web, user input must be cleaned and validated, best to give them a list of states.
    2019: let's assume a person meant to post an ad for Beverly Hills, CA, simply because more people posted ads there before.
    Reminds me of my dad trying to order a "Big Mac" at Burger King. Of course they insisted that they don't have a Big Mac. Should they have assumed he wanted a "Whopper"? Maybe this is exactly what he wanted and he just mixed up the names, or maybe he wanted a Big Mac and mixed up the stores. In this case he would have been disappointed.
    I guess you could conjure up all kinds of shenanigans when you know there's a computer program guessing missing inputs. Like listing a ton of fake apartments in a rural area so that any subsequent, erroneous listing would be placed in the wrong state.
    Just think how many times you thought "I don't quite understand, he probably means this and that" and you turned out to be wrong.
    This is all very interesting but also kind of dangerous.

    • @buttonasas • 4 years ago • +3

      That's why the feature of the added metadata seems really good to me - you could run this script anyways and then choose whether to discard dirty data or keep it. Or maybe change your approach afterwards.

    • @RegularExpression1 • 4 years ago • +6

      Having done many database conversions of one kind or another over the years, I can say that sometimes it is necessary to “make a call.” If you’re lacking a State but have a ZIP, fine. But if you have a policy number that begins with R but don’t know the name of the insurance carrier, and the person is a government employee, you have a pretty good sense they’re on Federal BCBS and you can safely make the change. The damage done if it is wrong is a denied claim, which is what you’d have either way.
      In real life you get data that is lousy and you can’t just refuse to work with it. You refine and re-refine until the data is as clean as you can get it. I like this guy's implementation of Bayesian techniques, and one could see a strong, canned solution for basic functionality with Bayes. I like it.

    • @benmuschol1445 • 4 years ago

      I mean, obviously validate input where possible. This talk is referring to one of the many, many circumstances where you don't control the data set but have to do some sort of analysis. There are obviously dumb ways to use this tech. Don't use it in a web app like the one in your example. But it's still a helpful and pretty innocuous technology.

    • @ukaszMarianszki • 4 years ago

      Well, this is a bit more meant as a tool for data scientists to use to clean their data sets for use with things like neural networks, which usually can detect outliers and in fact benefit from having them in the data set in some cases. So I don't really think you would actually use this to validate user input in your live database.

  • @aikimark1955 • 4 years ago • +1

    Wonderful presentation, Alex.

  • @timh.6872 • 4 years ago • +4

    I wonder if certain human computer interaction situations can be framed as an inference over unknowns. Not from the standpoint of making a computer guess the intent of the human, but from the "embedded domain knowledge" standpoint.
    Logic programming and super high-level type theory also have this mindset, but approach it from a very cut and dry formal proof perspective, where the programmer has to fully specify what they mean and then the computer checks to see if that makes sense according to deterministic rules.
    I'm not sure if this probabilistic programming works in this way, but it seems to open up an avenue to quantify and reason about unknowns. Consider a self-employed user who wants to have their computer organize documents for their business. Between hard logical constraints (need to know the "shape" of the data before reasoning about the contents, need to know about the overall format of the files before figuring out the shape) and some existing knowledge (preferences and opinions the user holds, internal implementation knowledge, general encyclopedic knowledge in areas the user looks up), the computer should be able to use this probabilistic inference to start asking "useful" questions, in the sense that statistical uncertainty in the data set indicates a need to consult the user. "I found some FooBar forms from customers X, Y, and Z. They don't match the current categories very well; where should they go?"
    If probabilistic programming could actually work in that way, it starts to make "interactive" interfaces possible, in the sense that it provides a question-asking heuristic. A computer transforms from a petulant tool that must be placated and carefully calibrated into a useful assistant.

  • @Here0s0Johnny • 4 years ago • +15

    Good talk, but couldn't this screw up the dataset? You might just remove outliers and fill the table with biases.

    • @Zeturic • 4 years ago • +21

      Isn't that true of any attempt at cleaning a dataset?

  • @DylanMaddocks • 4 years ago • +2

    When he first proposed the problem I immediately thought neural networks would be the best way to go. It really makes me wonder how many of the problems that are being solved by neural networks would be better solved by this probabilistic programming approach. I'm also wondering how fast/slow the expectation maximization algorithm is, because that could be a big constraint.

    • @tkarmadragon • 4 years ago

      The reason GPT-2 and other GAN-AIs are so powerful is that they are also using probabilistic programming to generate their own training sets ad infinitum.

  • @fabriceaxisa • 4 years ago

    I implemented such an algorithm in 2016. I am always happy to learn more about it.

  • @DKL997 • 4 years ago • +1

    Great talk! A very useful subject, and presented quite well.

  • @jn-iy3pz • 4 years ago • +4

    Typo in the slide at 5:20: the if statement should be `if r[:city] in cities[s]` (see the sketch after this thread).

    • @buttonasas • 4 years ago

      Same mistake was in the @pclean part, it seems...
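
    For readers following along, a minimal base-Julia sketch of the corrected membership test from the comment above. The record `r` and the `cities` lookup table here are invented stand-ins, not the slide's actual data.

    ```julia
    # A record and a lookup table mapping each state to its known city names.
    r = Dict(:city => "Beverly Hills", :state => "MO")
    cities = Dict("MO" => ["Beverly Hills", "Kansas City"],
                  "CA" => ["Beverly Hills", "Los Angeles"])

    s = r[:state]
    if r[:city] in cities[s]          # the corrected membership test
        println("city is consistent with state ", s)
    else
        println("possible typo in city or state")
    end
    ```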

  • @jjurksztowicz • 4 years ago

    Great talk! Excited to dig in and try Gen out.

  • @mjbarton7891 • 4 years ago • +4

    What is the benefit of generating the most likely value for the record? How is it more beneficial than eliminating the record from the data set entirely?

    • @alchemication • 4 years ago • +4

      Hi Mike, this is only a simple example. Imagine if your data set has 100 columns and in a random record only 2 columns are missing values. By discarding this observation entirely you might be losing important information. Now imagine that 90% of your data has at least 1 column missing. Are you seeing a pattern here? Hope it helps. (A toy illustration follows after this thread.)

    • @brianrobertson781 • 4 years ago • +1

      Two reasons: 1. You might not be able to delete any records (for instance, real-time service delivery, or the data has a sufficiently-high error rate you’d be eliminating most of the data set); and 2. You don’t know which records are of issue - you have to validate both complete and incomplete fields and a likelihood score tells you where to focus your efforts.
      I really could have used this 10 years ago. And whole armies of financial analysts are about to get automated.
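
    A toy illustration of the point above about discarding observations, using an invented 10,000 x 100 table in which roughly 1% of cells are missing at random: dropping every row that has any missing value throws away the majority of rows even though almost all cells are present.

    ```julia
    using Random
    Random.seed!(1)

    n_rows, n_cols = 10_000, 100
    # Each cell is independently missing with probability 0.01 (made-up data).
    data = [rand() < 0.01 ? missing : rand() for _ in 1:n_rows, _ in 1:n_cols]

    rows_dropped  = count(any(ismissing, data[i, :]) for i in 1:n_rows)
    cells_missing = count(ismissing, data)

    println("rows you would drop:    ", round(100 * rows_dropped / n_rows; digits = 1), "%")
    println("cells actually missing: ", round(100 * cells_missing / (n_rows * n_cols); digits = 1), "%")
    ```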

  • @drdca8263 • 4 years ago • +3

    Wow! This sounds really cool!
    I wonder if this would work for, like,
    Ok so, where I was working until recently (left to go to grad school), they had an “auto assign” process, where certain tasks are assigned to different people according to a number of heuristics that we had to change from time to time.
    I’m wondering if pclean (or if not pclean, something based on gen?) could help with that.
    Because putting in all those heuristics got complicated.

  • @artzoc • 4 years ago

    Excellent and superb! And a little bit unbelievable. Thank you.

  • @chaosordeal294 • 4 years ago • +8

    Does this method ever render demonstrably better results than just ejecting dirty (and "suspected dirty") data?

    • @quasa0 • 4 years ago • +1

      It does; the quality of the results is comparable to the results of by-hand data cleaning, and in most cases, I believe, it would be much better than a human-made script, just because we are removing the human error and bias from the equation.

    • @y.z.6517 • 4 years ago

      Ejecting an entry just because New York was written as NewYork would lead to unrepresentative data.

    • @y.z.6517 • 4 years ago

      If a row has 10 columns, falsifying 1 column still allows the other 9 to be used. Better than ejecting all 10. An alternative approach is to take the average value for the 1 invalid value, so it's equivalent to being ejected.

  • @fedeoasi • 4 years ago • +2

    I thought this was one of the best talks at Strange Loop this year. Does anyone know if pclean is available somewhere? I see that Gen is available but I found no mention of pclean in the Gen project or anywhere else.

  • @wujacob4642 • 4 years ago • +2

    In the preliminary results slide, PClean and HoloClean have exactly the same scores (Recall 0.713, F1 Score 0.832). Is it a coincidence (looks unlikely), or does HoloClean have something in common with PClean?

  • @pgoeds7420 • 4 years ago • +3

    27:28 Might those be independent typos or copy/paste from the same landlord?

  • @joebloggsgogglebox • 4 years ago • +1

    The text description under the video says that Metaprob is written in ClojureScript, but the video makes it clear that Julia was used. Is there also a ClojureScript version?

  • @PinataOblongata • 4 years ago • +4

    This might be a silly question, but I don't understand how you could possibly get such messy data in the first place - you don't create online web forms where people can just enter whatever they like, you use drop-down boxes where they have to choose an option that will be 100% correct unless they somehow click on the wrong one without realising or purposefully mislead. You can't even type numbers into date fields anymore for this reason! Is this more a problem for paper forms that are scanned, or am I missing something?

    • @buttonasas • 4 years ago • +1

      Yes - something like scanning paper forms. You never know when you get some nulls in your data, even when completely digital.

  • @lorcaranr • 4 years ago • +3

    So why do we trust what the user entered for the rent? Perhaps they entered that incorrectly. If they can't spell, then surely they can make a mistake entering the rent?

    • @tkarmadragon • 4 years ago

      You are right. Actually the beauty of the proposed software is that if you do not trust the rent, then you can choose to incorporate the median rent. It's your call as the admin.

  • @arielspalter7425 • 4 years ago

    Fantastic talk.

  • @phillipotey9736 • 4 years ago • +2

    Amazing, manually programmable AI. Let me know where I can get this scripting language.

  • @Larszon89 • 3 years ago • +1

    Very interesting talks. Any news on the package, when will it be available? Tried looking at the related articles on arXiv that were supposed to contain the source code for the examples as well, but didn't manage to find it either.

  • @Keepedia99 • 1 year ago

    Not the point of the talk, but I wonder if we can make programs guess/correct their own bugs by checking the distribution of their other outputs.

  • @verduras • 4 years ago

    fantastic talk. thank you

  • @aristoi • 4 years ago

    Great talk.

  • @walterhjelmar1772 • 4 years ago • +2

    Really interesting talk. Is the pclean implementation publicly available?

    • @SaMusz73 • 4 years ago • +1

      As said in the talk, it's at github.com/probcomp/Gen

    • @gabbrii88 • 4 years ago • +1

      @@SaMusz73 It is not in the source code of Gen. I think they did not make it available as a built-in library in Gen.

  • @sefirotsama • 4 years ago

    I still don't see in what areas I can use that sort of programming, other than guessing fuzzy values in bulky data.

  • @anteconfig5391 • 4 years ago

    Is it possible to use what was seen in this video to write summaries of what was in a chapter of a book or website?

  • @ewafabian5521 • 4 years ago

    This is brilliant.

  • @Muzika_Gospel • 4 years ago

    Excellent

  • @guilloutube • 4 years ago

    Great work. Awesome. Is it open source? I didn't find the pclean source code. I'll take a look at Gen.

    • @SaMusz73 • 4 years ago

      Go there: github.com/probcomp/Gen

  • @remram44 • 1 year ago

    Using probabilistic methods and a priori knowledge to generate data encoded as definite facts (rather than something that would come with confidence values) seems really dangerous. In your example, your data is no longer an unbiased source to conclude anything about the distribution of rents in the US, because you assumed the conclusion to produce the data: of course the results will show that rents are higher in California than Missouri, because otherwise you would have "corrected" a good portion of it as typos!
    It would be nice to have a framework that goes end-to-end, providing you with a way to check those assumptions.

  • @averageengineeer • 4 years ago

    Based on the naive Bayes theorem?
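
    For context, the cleaning in the talk is framed as Bayesian inference. Roughly, the quantity being weighed for the city/state example is the posterior below; this is a sketch of the general form, not the talk's exact model, and whether the likelihood factorizes "naively" across columns is not settled here.

    ```latex
    P(\text{city}, \text{state} \mid \text{observed row})
      \;\propto\;
    P(\text{observed row} \mid \text{city}, \text{state}) \, P(\text{city}, \text{state})
    ```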

  • @F1mus • 4 years ago

    Wow so interesting.

  • @MaThWa92 • 4 years ago

    Wouldn't making estimators for P(B|A) and P(A) from the dataset and then using these to evaluate the same dataset be an example of overfitting?

  • @PopeGoliath • 4 years ago • +3

    I'm 20 minutes in, and the domain knowledge he needs to have to write the checker seems like exactly what the data set is designed to unveil. Is this a chicken/egg problem? Your study is to answer questions about a knowledge domain, but you need to already have that knowledge to error check your data. Where does it start?
    Edit: At 22 minutes he says he'll get to that. Oh good.

  • @user-lt9oc8vf9y • 1 year ago

    This tool sounds like it could become discriminatory really quickly if you don't pay attention to what the columns mean.