MapReduce - Computerphile

Sdílet
Vložit
  • čas přidán 9. 07. 2024
  • Peforming operations in parallel on big data. Rebecca Tickle explains MapReduce.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Komentáře • 63

  • @jaybakerwork
    @jaybakerwork Před 5 lety +168

    From reading the comments, I think what some people are missing is the following: the value of MapReduce as it was implemented at Google (and in Hadoop) is that it is a framework for doing these sorts of jobs at scale. Furthermore, the framework allows one to plug in your map and reduce functions independently of everything else. That is why the shuffle is important for example. Often, people see the word count example and they want to just take that out and count at that point. And that might be better for this particular example, but the idea of the framework/algorithm is that it does not "know" at the shuffle phase that you are going to count. I think more examples would really help people understand it better.

  • @szhangster
    @szhangster Před 5 lety +360

    Professor Baclawski developed the search engine technology now called MapReduce in 1993. Northeastern University patented this technology, and the patent has not been successfully challenged. Northeastern University sued Google for patent infringement, and Google awarded a settlement to Northeastern University.

  • @NikolajLepka
    @NikolajLepka Před 5 lety +286

    map and reduce also just happen to be the two most commonly used functions in functional programming, and have been for decades
    this tech isn't new, it's just being applied at a large scale

    • @BarbarianGod
      @BarbarianGod Před 5 lety +38

      Glad I'm not the only one bothered by the opening of this video

  • @No0utlet
    @No0utlet Před 5 lety +76

    I remember learning about Hadoop Map Reduce in 2013 and being completely confused about it. Thanks!

  • @tomburns5231
    @tomburns5231 Před 5 lety +46

    I don't understand what a, b, and c are in the example. I thought they were words, but then the presenter calls them letters. In any case, given the example case of distributing a word count over parallel processors, why isn't the answer simply splitting the whole document/data into chucks, and asking the processors to count the words of one chuck each. Afterwards we can just sum the result. What am I missing?

  • @cl0udbear
    @cl0udbear Před 5 lety +72

    I don't get it. So you shuffle the keys between the nodes so that each node is counting all the instances of the same key? Wouldn't that shuffling have been the perfect time to just do the count and get it done with?

  • @philippk7554
    @philippk7554 Před 5 lety +9

    You can actually use k-means with MapReduce. It basically works by splitting the data into sets and running k-means on different nodes with the same cluster values. After that, the new means are calculated and broadcasted to all nodes for the next iteration.

  • @cthudo
    @cthudo Před 5 lety +208

    I admit I'm an old fogy programmer... I have a massive problem with crediting this to Google. They *may* have been the ones to put a name on it, but that doesn't mean this isn't an exceedingly basic strategy of dividing work and collating the results that was being used by a lot of people for decades.

  • @STDrepository
    @STDrepository Před 5 lety +68

    In this example why wouldn't you just have each server count the words in its data and then send the results back to the operator's machine which adds them up?

  • @IceMetalPunk
    @IceMetalPunk Před 5 lety +5

    Why does this allow duplicate keys on the same node instead of just one key per word with the value being the total count of the word on that node? Seems like that would speed up both the shuffle and reduce phases?

  • @luciengrondin5802
    @luciengrondin5802 Před 5 lety +2

    When I hear about map/reduce I think about the corresponding commands in perl and that confuses me because they don't seem to be much related.

  • @233kosta
    @233kosta Před 5 lety +3

    Can you do a segment on the MPI protocol?

  • @Kachkavalj
    @Kachkavalj Před 3 lety +19

    it's really refreshing to hear someone explain this in a straight forward way ... kudos to that young lady!

  • @theguitarhero649
    @theguitarhero649 Před 5 lety +4

    I just had to do a mapreduce implementation with grpc for my Advanced Operating Systems class that was due last night, wish this had come out earlier -_-

  • @mrtoastyman07
    @mrtoastyman07 Před 5 lety +52

    This feels like the most impenetrable way of explaining this...

  • @MokhlesBouzaien
    @MokhlesBouzaien Před 4 lety +4

    It was confusing until I found a Computerphile video.

  • @TheAstronomyDude
    @TheAstronomyDude Před 5 lety

    How would you implement this if you want to leave your data untouched? Mirror your data?

  • @Juan-Hdez
    @Juan-Hdez Před 2 lety

    Very useful. Thank you very much!

  • @brookead
    @brookead Před 5 lety +59

    Hadoop as an approach to map reduce is so over engineered and complicated that most people avoid it. Also, unless you're at really big scale, the overhead of Hadoop is so big that any decent laptop can outperform it. That's specifically a Hadoop criticism, not a criticism of MR in general. Alternatives like DASK (simplistically: distributed python pandas) is much simpler to manage for most purposes... Not quite the same thing but ultimately, you pick your poison... And as noted by another commenter, the thing Google "pioneered" was _distributed_ MR. MR itself has been a part of many languages for many years... And the good explanation given in this video works just as well for those as it does for distributed versions thereof. :)

  • @steven1671
    @steven1671 Před 5 lety +28

    Is "reduce" the same as the fold function in Haskell?

  • @Myziikhavus
    @Myziikhavus Před 4 měsíci

    This is a good video, no idea wtf the rest are yapping about. She actually goes through the main bit of how data shuffles and moves to different nodes

  • @monthlyduc
    @monthlyduc Před 5 lety +6

    Could you do a video on concurrency and possible solutions/history video? I'm surprised you haven't yet touched on sequential and concurrent execution. Great vid!

    • @mytech6779
      @mytech6779 Před 5 lety +1

      They have covered concurrency within a CPU core, unless maybe you mean parallel execution.

  • @unvergebeneid
    @unvergebeneid Před 5 lety +371

    Is it just me or was that not a particularly good example?

    • @viorelanghel5532
      @viorelanghel5532 Před 5 lety +62

      It's the classic map reduce example but you need to imagine 10TB of text, not 4 words a b c d

    • @KylePiira
      @KylePiira Před 5 lety +33

      Perhaps not, but word counting is the typical Hello World project for MapReduce so it makes sense to use it.

    • @Ceelvain
      @Ceelvain Před 5 lety +53

      The word count is a good example. (Actually the best I came across.) She just should have choosen sentences with actual words, and put it in the context of a search engine.

  • @MayankArora
    @MayankArora Před 5 lety +1

    I love computerphile ❤️

  • @Gunbudder
    @Gunbudder Před 5 lety +7

    Map reduce gives me google interview flashbacks...

  • @mro2352
    @mro2352 Před 4 lety

    My previous project was with Apache spark which uses Hadoop and map reduce and we used it for batch jobs.

  • @ignacio560
    @ignacio560 Před 5 lety +17

    Thank you, amazing video! Any chance you’ll talk about Spark next? 😄

  • @man-alive
    @man-alive Před 5 lety +6

    if the mapping creates a single record for each occurence (e.g. a 1, b 1, c 1), why not just use an array (e.g. a, b, a), since the 1 is implied?

  • @Metruzanca
    @Metruzanca Před 5 lety +5

    Does the way this is done have any similarity to how array.prototype.map() and array.prototype.reduce() works in ES6 JS?
    Edit : i mean at a level of dividing the map process to different processes/threads/tasks to speed up compution.

    • @belst_
      @belst_ Před 5 lety +6

      map is not computed concurrently but sequential in js.

  • @worlddaves
    @worlddaves Před 5 lety +24

    Why do you always write on that particular type of paper in your videos?

    • @samsharma8621
      @samsharma8621 Před 5 lety +18

      They get that for free

    • @rich1051414
      @rich1051414 Před 5 lety +72

      Consistency. Computerphile uses dot matrix paper while numberfiile uses brown paper towels.

    • @feldinho
      @feldinho Před 5 lety +16

      it's called "continuous form" and was historically used with daisy wheel-type printers (fix type face, size and line height) to ease the reading.

    • @SteelSkin667
      @SteelSkin667 Před 5 lety +7

      Because computers.

    • @saschar.8736
      @saschar.8736 Před 5 lety +4

      @@feldinho Yeah; I guess, they have a bunch of it lying around from that era. At home we also have (or had; don't know if my father threw it away, yet) a stack of similar paper from back when we had a dot matrix printer. And I imagine, that, at a university, they got tons of that stuff left.

  • @lionp696
    @lionp696 Před 5 lety +8

    Needed this a week ago for my exam :(

  • @cediddi
    @cediddi Před 5 lety +11

    no love for filter :( it's map-reduce-filter (atleast for me)

    • @MarioWenzel
      @MarioWenzel Před 5 lety +2

      filter can be part of either phase, depending on how you see it.
      filter p l = foldl (++) [] $ map (\x -> if p then [x] else []) l

  • @MrTridac
    @MrTridac Před 5 lety +102

    When will computer scientists stop putting new fancy names on decade old concepts?

  • @WilliamAndrea
    @WilliamAndrea Před 5 lety +11

    Why are there duplicate keys? Maybe it's cause I only know Python, but I'm really confused.
    Like for the first line of text, I would write the output of the map phase as a list of tuples, [('a', 1), ('b', 1), ('a', 1)], but wouldn't it make more sense as a dict, {'a': 2, 'b': 1} ?

  • @allmhuran
    @allmhuran Před 5 lety +11

    select count(...) group by ...
    Query plan: parallelism, stream aggregate
    But sure, tell me more about how SQL is outdated.

  • @ninocraft1
    @ninocraft1 Před 5 lety

    brilliant example of a practical application

  • @sabriath
    @sabriath Před 5 lety +27

    I'm not seeing the practicality of this, shuffling seems to make it pointless. It's almost like seeing a huffman tree being formed, but then introducing it to some meth.

  • @PilotPlater
    @PilotPlater Před 5 lety +9

    forget the nasty comments, I found this fascinating, and so did 94% of people who rated.

  • @Master1906
    @Master1906 Před 5 lety +81

    She didn't explain it very well...

  • @FahadKiani1
    @FahadKiani1 Před 3 lety

    anyone knows who the genius is?

  • @coder001
    @coder001 Před 5 lety +9

    too much talking and not enough visual examples

  • @__-to3hq
    @__-to3hq Před 5 lety +3

    the sound of felt pens makes me cringe

  • @geekworthy7938
    @geekworthy7938 Před 5 lety +29

    Yeah, that was a pretty bad example and quite boring.

  • @tenseikenzx-3559
    @tenseikenzx-3559 Před 5 lety +1

    What's better MapReduce or CUDA?

    • @CJSwitz
      @CJSwitz Před 5 lety +31

      That's like comparing Apache Webserver to Fortran. They are completely different things.

    • @SteelSkin667
      @SteelSkin667 Před 5 lety +8

      I don't think these two are directly comparable. I'm pretty sure there are GPGPU implementations of MapReduce that make use of CUDA.

    • @IsYitzach
      @IsYitzach Před 5 lety +6

      MapReduce is a generic algorithm paradigm. There's a CUDA implementation for numerical purposes. This is a description for a cluster size problem rather than a GPU size problem.