Building real-time data products at LinkedIn with Apache Samza

  • Uploaded: 17 Nov 2014
  • Presented at Strata+Hadoop World, New York, 16 October 2014 strataconf.com/stratany2014/pu...
    Slides: speakerdeck.com/ept/building-...
    Abstract:
    The world is going real-time. MapReduce, SQL-on-Hadoop and similar batch processing tools are fine for analyzing and processing data after the fact - but sometimes you need to process data continuously as it comes in, and react to it within a few seconds or less. How do you do that at Hadoop scale?
    Apache Samza is an open source stream processing framework designed to solve these kinds of problems. It is built upon YARN/Hadoop 2.0 and Apache Kafka. You can think of Samza as a real-time, continuously running version of MapReduce.
    Samza has some unique features that make it powerful. It provides high performance for stateful processing jobs, including aggregation and joins between many input streams. It is designed to support an ecosystem of many different jobs written by different teams, and it isolates them from each other, so that one badly behaved job can’t affect the others.
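
As a concrete illustration of the stateful processing the abstract mentions, here is a minimal sketch of a Samza job in the low-level StreamTask API that counts page views per user. The stream names ("page-views", "page-view-counts") and the store name ("counts") are illustrative assumptions, not taken from the talk:

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCounterTask implements StreamTask, InitableTask {

    private static final SystemStream OUTPUT =
        new SystemStream("kafka", "page-view-counts");

    // Per-task local store; typically configured with a Kafka changelog
    // so the state can be restored after a failure
    private KeyValueStore<String, Integer> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        counts = (KeyValueStore<String, Integer>) context.getStore("counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Input is assumed to be keyed by user ID
        String userId = (String) envelope.getKey();
        Integer current = counts.get(userId);
        int updated = (current == null ? 0 : current) + 1;
        counts.put(userId, updated);
        collector.send(new OutgoingMessageEnvelope(OUTPUT, userId, updated));
    }
}
```

Keeping the counts in a task-local store rather than a remote database is what lets a job like this sustain a high message rate: reads and writes stay local, and the changelog provides recovery.
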
    At LinkedIn, we have been using Samza in production for some time, both for internal analytics purposes and for data products that are served on the live site. In this talk, we’ll discuss our experience of working with Samza. You’ll learn about:
    - What kinds of real-time data problems you can solve with Samza
    - How Samza reliably scales to millions of messages per second
    - How Samza compares to other stream processing frameworks
    - How Samza can help collaboration between different data science, product, and engineering teams within an organization
    - How to avoid implementing the same data pipeline twice (once for offline/batch processing and once for real-time/stream processing)
    - Lessons we learnt on how to structure real-time data pipelines for scale and flexibility

Comments • 11

  • @MrMukulj • 7 years ago

    Great talk Martin. Very well done!

  • @SamBessalah • 9 years ago

    Great talk Martin.

  • @m1169199 • 9 years ago • +4

    I like your slides; what did you use to make them?

  • @pollathajeeva23 • 1 year ago

    TimeSeries tool may be more than

  • @houssemghazala • 10 months ago

    👏👏👏🙏🙏🙏

  • @CoderCoronet • 1 year ago

    Hello Martin!
    Thank you very much for sharing such valuable content.
    I’m trying to find your video about building robust data infrastructure with logs. The link to the video on the talk transcript is broken. Can you share a new link to that video?
    Thank you!

  • @pinhusdash6895 • 9 years ago

    What happens when a user from one partition views a user from another partition? How does the enrichment happen? Do you send a copy of the event to both partitions?

    • @pinhusdash6895 • 9 years ago

      I think you kind of answer this at 46 minutes. But it does seem to double the time.
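
For anyone with the same question: the design described in the talk (around the 46-minute mark, per the reply above) co-partitions both streams by user ID and keeps a local replica of the relevant profiles in each task. A rough sketch under those assumptions; the stream names ("profile-edits", "page-views"), field names, and store name are made up for illustration:

```java
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class PageViewProfileJoinTask implements StreamTask, InitableTask {

    private static final SystemStream OUTPUT =
        new SystemStream("kafka", "page-views-with-profile");

    // Local replica of this partition's slice of the profile table,
    // restored from a Kafka changelog after a failure
    private KeyValueStore<String, Map<String, Object>> profiles;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        profiles = (KeyValueStore<String, Map<String, Object>>) context.getStore("profiles");
    }

    @Override
    @SuppressWarnings("unchecked")
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Both input streams are assumed to be partitioned by the same user ID
        String stream = envelope.getSystemStreamPartition().getStream();
        Map<String, Object> message = (Map<String, Object>) envelope.getMessage();

        if ("profile-edits".equals(stream)) {
            // Profile edit: update the local replica for this user
            profiles.put((String) message.get("userId"), message);
        } else {
            // Page view: enrich with the viewer's profile from the local store
            String viewerId = (String) message.get("viewerId");
            message.put("viewerProfile", profiles.get(viewerId));
            collector.send(new OutgoingMessageEnvelope(OUTPUT, viewerId, message));
        }
    }
}
```

If an event also needs the viewed user's profile, the event stream has to be re-partitioned a second time, keyed by the viewed user's ID, which seems to be the doubling of work the reply above refers to.
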

  • @dudeabideth4428 • 4 years ago

    Isn't that a big database of profiles to have a copy of? Or is it only the subset we care about? It sounded like the profiles replica got created from every profile edit event, so it sounds like a full replica.

    • @tomhpolo • 3 years ago

      I might be wrong, but in his talk it sounded like there are two levels of partitioning: by user and by job.
      By job: the stream processor for PageViewEventWithViewerProfile doesn't need all the data from the EditUserProfile event, so it grabs/replicates only the fields it wants from that event.
      By user: if you partition users across different processors (i.e. profile['id'] modulo N), then each replica only holds 1/N of the users (see the sketch below).
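
A tiny sketch of the co-partitioning idea in that last point: if both input streams are partitioned with the same function of the user ID, every event about a given user lands on the same task instance, so each task's profile store only needs its own 1/N slice. The hash-and-modulo function below is illustrative (Kafka's default partitioner behaves similarly):

```java
public class CoPartitioning {

    // Mask off the sign bit so the result is non-negative,
    // even when hashCode() returns Integer.MIN_VALUE
    static int partitionFor(String userId, int numPartitions) {
        return (userId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int n = 8; // both streams must use the same partition count
        // A profile edit and a page view for the same user map to the
        // same partition, so the join task for that partition sees both.
        System.out.println(partitionFor("user-42", n)); // profile-edits stream
        System.out.println(partitionFor("user-42", n)); // page-views stream
    }
}
```
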

  • @GlebWritesCode • 8 years ago

    I would say this talk has very little to do with Samza; it's just a general view of how LinkedIn does stream processing.