Demystifying Sharding in MongoDB (MongoDB World 2022)

  • Added on 17 July 2022
  • The number one mistake made with sharding is not planning for it early on. Like most daunting things, the “ignorance is bliss” approach is appealing until it’s unboxing day...
    You’re in luck. It turns out sharding your cluster is not so scary after all! This talk explores the concepts and architecture behind sharding while demonstrating its use cases and strengths. Furthermore, we outline how to pick a shard key and ultimately boost your application's performance through our sharding technology.
    Demystify one of MongoDB’s most powerful features with one of our sharding engineers. The talk is crafted to be educational for all expertise levels.
    Subscribe to MongoDB ➡️ bit.ly/3bpg1Z1
  • Science & Technology

Comments • 4

  • @ahmadawad4782, 1 year ago

    Thanks for this exciting lecture. However, in the example at around minute 21, using _id as a shard key seems questionable. Range queries on this key are unlikely to happen, and a B-tree index would already have solved the problem of direct access to the required document, wouldn't it?

    • @galeop, 1 year ago

      I am not sure I understand your question, but I would say that high cardinality and an even distribution of data are the most important criteria when choosing a shard key. You want your partitions to be as small as possible, so that the cluster can split your collection as finely as it needs to.
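
A toy illustration of the cardinality point above, in Python. It is only a sketch under stated assumptions: the `split_limits` helper and the sample documents are made up for illustration, and it does not model how MongoDB actually splits chunks. It only shows that the number of distinct shard key values caps how many partitions are possible, and that all documents sharing one key value must stay together.

```python
from collections import Counter


def split_limits(docs, shard_key):
    """Return (max possible partitions, size of the largest unsplittable group).

    Hypothetical helper: documents with the same shard key value cannot be
    separated, so distinct values bound the number of partitions.
    """
    counts = Counter(doc[shard_key] for doc in docs)
    return len(counts), max(counts.values())


# 900 made-up documents: a unique _id and one of three countries.
docs = [{"_id": i, "country": ["US", "FR", "DE"][i % 3]} for i in range(900)]

low_cardinality = split_limits(docs, "country")  # few distinct values
high_cardinality = split_limits(docs, "_id")     # every value is distinct

print("country:", low_cardinality)
print("_id:", high_cardinality)
```

With `country` as the key, the data can never be spread over more than three partitions, however many shards exist; with `_id`, each document can in principle be placed independently.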

    • @manasmagdum, 1 year ago

      We also want our queries to hit as few shards as possible, am I right?

    • @galeop, 1 year ago

      You can read and (most importantly) write in parallel from/to different shards at the same time.
      So to come back to your question, I'd say it depends on your use case:
      - For transactional apps, you want to be able to scale your write throughput by parallelising the writes across different shards at the same time.
      - For analytical apps, you want to avoid reading unnecessary data that would then have to be filtered out, so yes, you would want to store data that is often read together on the same partition. In practice, it means you'd partition your table on a column that is often used as a filtering criterion or JOIN criterion.
      As you can see in the "Analytics" point, I'm using SQL wording ("table", "JOIN"), because sharding in that context would be done by an RDBMS, such as AWS Redshift.
      But my understanding is that MongoDB falls into the first category of workloads: transactional apps. In that case you want to partition your data so that no single partition handles most of the write operations, because you want to parallelize _writes_ to get the best write throughput.
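
The write-parallelism point can be sketched with a small Python simulation. Everything here is an assumption made for illustration: four shards, fixed range boundaries, ever-increasing integer keys, and MD5 as a stand-in hash. It is not MongoDB's router or balancer, which would also split and migrate chunks over time.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NUM_SHARDS = 4  # assumed cluster size for this toy model


def ranged_shard(key, boundaries):
    """Route by range: shard i owns the keys below boundaries[i]."""
    return bisect_right(boundaries, key)


def hashed_shard(key, num_shards=NUM_SHARDS):
    """Route by a hash of the key (MD5 used only as a stand-in)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def simulate(keys, route):
    """Count how many writes land on each shard."""
    return Counter(route(k) for k in keys)


# Existing data covered keys 0..999, split into four equal ranges.
boundaries = [250, 500, 750]
# New writes arrive with ever-increasing keys (counters, timestamps).
new_keys = range(1000, 2000)

ranged = simulate(new_keys, lambda k: ranged_shard(k, boundaries))
hashed = simulate(new_keys, hashed_shard)

print("ranged:", dict(ranged))
print("hashed:", dict(hashed))
```

In this model every new write under range partitioning lands on the last shard, because all new keys sit above the highest boundary, while hashing the same keys spreads the writes over all four shards. The trade-off is that hashed placement scatters neighbouring keys, so range queries on the key would have to contact every shard.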