Workflow Orchestration for Building Resilient Software Systems

  • Added 27. 07. 2024
  • Building resilient software systems can be difficult, especially when they are distributed. Executing a long-running business process or workflow that spans multiple service boundaries requires a lot of resiliency. Everything needs to be available and functioning, because if something fails midway through, you can be left in an inconsistent state. So what's the solution? Remove direct service-to-service communication and temporal coupling.
    🔗 Solace
    solace.com/codeopinion
    🔔 Subscribe: / @codeopinion
    đŸ’„ Join this channel to get access to source code & demos!
    / @codeopinion
    đŸ”„ Don't have the JOIN button? Support me on Patreon!
    / codeopinion
    📝 Blog: codeopinion.com
    👋 Twitter: / codeopinion
    ✹ LinkedIn: / dcomartin
    📧 Weekly Updates: mailchi.mp/63c7a0b3ff38/codeo...
    0:00 Intro
    1:03 Distributed Monolith
    3:53 Temporal Coupling
    5:50 Orchestration
    #softwarearchitecture #softwaredesign #codeopinion
  • Science & Technology

Comments • 85

  • @Pretence01
    @Pretence01 2 years ago +9

    Increased resilience is one of the added benefits of using messaging. Another important one is that it effectively lets you emulate a distributed transaction in a microservice environment: with an outbox on the sender side, publishing the outgoing message is deferred until the sender's own state has been successfully persisted.

  • @charlesopuoro5295
    @charlesopuoro5295 1 year ago

    Thank you very much for this hands-on, pragmatic approach to understanding Workflow Orchestration.

  • @shashanksingh9238
    @shashanksingh9238 2 years ago

    Thank you so much for explaining this. My company is using Netflix Conductor, and its documentation is very complex. Thank my tech gods that I landed here :-)

  • @JosueIbarraNinja
    @JosueIbarraNinja 4 months ago

    Great explanation of workflow orchestration! I’m part of the Cadence team at Uber

    • @CodeOpinion
      @CodeOpinion 4 months ago

      Nice! Thanks for the comment and some validation 😀

  • @avineshwar
    @avineshwar 2 years ago

    I usually think about services that are being built as not just one service, but, a pair of services:
    - main service ("the very core" (perhaps, an even refined version) of business logic code)
    - main's supporter service (for development simplification, e.g. steer back a database to a consistent state)
    - any other infra piece needed to support this (e.g. object store)
    Once we can model and generalize this, we should have a working model for a broad class of issues.
    I was hoping you would think about it.

  • @alexsiuwh
    @alexsiuwh 2 years ago +1

    I have been doing workflow orchestration for 20 years and I'm 100% with you technically, but what makes projects fail is mainly human factors and workplace politics in the user environment.

    • @CodeOpinion
      @CodeOpinion 2 years ago

      Ya, pretty much the case with everything.

  • @loren-sr
    @loren-sr 2 years ago +7

    Great video, thanks for making! I love Temporal, and work there, if you have any questions! Main differentiator is orchestrating via code instead of via JSON/YAML, and we have SDKs for Go, Java, PHP, TS/JS. Python/Ruby/.NET/Rust SDKs are in development.

    • @CodeOpinion
      @CodeOpinion 2 years ago +1

      Was just recently looking at the GitHub repo for the .NET SDK for point of reference to see what it looked like.

    • @avineshwar
      @avineshwar 2 years ago

      So, just to be sure how Temporal works: if an "n" step process (Workflow) is executed and a step (Activity) "x" is considered (through business logic code) to be failing, can we (or would we) re-attempt our workflow from step "x" or "x-1" (assuming idempotency)?
      If yes, does that mean Temporal is trying to address 2 (arguably huge) scenarios:
      - simplify failure management pattern (reduce code/implementation branching depth on left or right)
      - simplify developer lives by letting them not worry about simple enough things
      Those are big things, but, maybe someday we will get:
      - detect an underlying reason for a certain observation (e.g. exception due to socket close) and deal in some standard way

    • @loren-sr
      @loren-sr 2 years ago +1

      @@avineshwar yes and yes. For the last, you can certainly handle in code different errors in different ways, and can do so across all Activities and Workflows. It would be nice someday to have some automated ML-based error handling, and a system like Temporal is a necessary base for that, since we not only have information on all failures, but are also the orchestrator deciding what to do next.

    • @avineshwar
      @avineshwar 2 years ago

      @@loren-sr I see (when I think about "fast af" operations and AI, seems like apple/oranges by today's standard). Thanks. All good information.

    • @morespinach9832
      @morespinach9832 5 months ago

      Temporal doesn’t have a visual flowchart designer like Camunda, correct?

  • @buildingphase9712
    @buildingphase9712 1 year ago +1

    I get the point; however, payments are probably going to happen client side, with a redirect to the payment gateway and a success callback. But point taken in terms of async messages.

  • @sathyajithps013
    @sathyajithps013 2 years ago +3

    Cool vid. My first exposure to something like this was MassTransit sagas. I think this example can be done using MassTransit sagas + Courier. Perfect for these kinds of scenarios.

    • @TheRak00792
      @TheRak00792 2 years ago +1

      A routing slip can definitely work for the provided example. I'd prefer coupling it with a state machine for complex scenarios, though.

    • @morespinach9832
      @morespinach9832 5 months ago

      As in choreography?

  • @rafaspimenta
    @rafaspimenta 2 years ago

    Hi Derek, thank you for the great content as usual. Could you tell us about the tools you use to draw architectural diagrams and your thinking process for building one?

    • @CodeOpinion
      @CodeOpinion 2 years ago +4

      I just use PowerPoint, nothing special. In terms of how I make them, specifically for a video, I've been thinking about making a video about that

  • @softcoda
    @softcoda 1 year ago

    How do you know if a service is down, so that you can take action accordingly?

  • @morgadoapi4431
    @morgadoapi4431 2 years ago +1

    Thanks for the video

  • @morgadoapi4431
    @morgadoapi4431 2 years ago

    Kafka provides transactions, but in a more technical sense. These transactions guarantee that either a collection of messages is written to many topics or none of them are.

  • @nav201182
    @nav201182 3 months ago

    If we want to buy a tool off the market to perform workflow orchestration, which tool would you recommend?

  • @saharis811
    @saharis811 1 year ago

    Great video as usual, Derek!!
    I have one question. Suppose one of the services in the workflow is unavailable, so it becomes a bottleneck. It's not a failure at this point in time, but we don't know when it will be available again. When it comes back, it will process the request and then publish success or failure; that could happen instantly or after days, we don't really know.
    We need to respond to the client instantly. How can we handle this on the client side?

    • @CodeOpinion
      @CodeOpinion 1 year ago

      Well the key is you don't/shouldn't need to respond to the client instantly. The simplest example is placing an order. You don't need the entire workflow to succeed/fail to respond to the client. You accept the request to place the order, you do so and return to the client. The payment processing and everything else involved can be async/out of process. If the payment service is unavailable or the payment fails, you'd email the customer to notify them. I talk about this a bit more in this video: czcams.com/video/wEUTMuRSZT0/video.html
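A small sketch of the reply above, assuming an in-memory deque as a stand-in for a durable broker queue; `charge`, the message shape, and the recorded emails are hypothetical. `place_order` returns as soon as the order is recorded, and the payment runs later: if the payment service is down, the command simply waits in the queue, and a failed payment leads to a customer notification rather than an error to the original request.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class OrderingService:
    # A deque stands in for a durable broker queue (illustration only).
    queue: Deque[dict] = field(default_factory=deque)
    orders: Dict[str, str] = field(default_factory=dict)
    emails: List[str] = field(default_factory=list)

    def place_order(self, order_id: str, amount: int) -> str:
        # Accept the request, record it, and return without waiting on payment.
        self.orders[order_id] = "placed"
        self.queue.append({"type": "ProcessPayment", "order_id": order_id, "amount": amount})
        return "accepted"

    def process_pending(self, charge: Callable[[int], bool], payment_service_up: bool) -> None:
        # Out-of-process worker: while the payment service is down, messages stay queued.
        if not payment_service_up:
            return
        while self.queue:
            msg = self.queue.popleft()
            if charge(msg["amount"]):
                self.orders[msg["order_id"]] = "paid"
            else:
                self.orders[msg["order_id"]] = "payment_failed"
                # Hypothetical notification step, e.g. emailing the customer.
                self.emails.append(f"Payment failed for order {msg['order_id']}")
```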

  • @dmsanz_youtube
    @dmsanz_youtube 1 year ago

    To keep that workflow "state" is it necessary to use some kind of saga or similar? Or is it enough with having service Ordering state (i.e: aggregate) capturing all the "distributed state" in order to react with compensating actions, etc? Or would that be too aware of other domains and a violation of separation of concerns?
    I suppose sagas (with mass transit, nservicebus) help a lot with these things if we want to have this external orchestration. But what if we don't use a message broker and we simply have event streams the services are subscribed to?

    • @CodeOpinion
      @CodeOpinion 1 year ago

      What you're referring to is more event choreography, where you don't have a centralized orchestrator, but each boundary is consuming events and publishing events. Check out: czcams.com/video/TA12e2ZJcGg/video.html

  • @rcts3761
    @rcts3761 2 years ago

    Do you know some good strategies to make sure that services which publish commands reliably process all possible "response" events? For example, a developer might add a new response type to the responding service and forget to update the event handler in the commanding service.

    • @b1ueocean
      @b1ueocean 2 years ago +1

      responding service doesn’t care about the commanding service (or any other) in your scenario above - responding service simply lands a response event in the broker.
      commanding service emits messages without knowing the who and the how regarding response events.
      if a specific event handler is missing from the commanding service how has it been released while falling short of the Definition of Done? 👈
      Even if you rely on a list of supported events in the commanding service’s configuration and hand roll verification to ensure supporting handlers are available/registered, such configuration needs to be kept up-to-date.
      easiest strategy is good testing practices 😋
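One testing practice along those lines, sketched under the assumption that the responding service's event types are published in a shared contract, modeled here as a Python `Enum` (all names are hypothetical). A CI test then fails whenever a new response type is added without a matching handler in the commanding service.

```python
from enum import Enum
from typing import Callable, Dict, Set


class PaymentResponse(Enum):
    # Response event types the (hypothetical) payment service can publish.
    PAYMENT_SUCCEEDED = "PaymentSucceeded"
    PAYMENT_FAILED = "PaymentFailed"


# Handlers registered in the commanding (ordering) service.
HANDLERS: Dict[PaymentResponse, Callable[[dict], str]] = {
    PaymentResponse.PAYMENT_SUCCEEDED: lambda event: "ship order",
    PaymentResponse.PAYMENT_FAILED: lambda event: "cancel order",
}


def missing_handlers(handlers: Dict[PaymentResponse, Callable[[dict], str]]) -> Set[PaymentResponse]:
    # Every declared response type must have a handler; run this check in CI.
    return set(PaymentResponse) - set(handlers)
```

This only catches the gap if the commanding service builds against the same contract definition the responding service publishes from.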

  • @maxkomarow
    @maxkomarow 2 years ago +1

    Thanks for the video. Isn't it the saga orchestration pattern that you described? Also, I wonder how the orchestrator can be implemented without frameworks. It seems like it has to have its own tables in a service database to track completed events. If so, we need something to load and save the orchestrator. A repository? But the orchestrator doesn't seem like part of the domain model; it's more like part of an application service. Would be glad to hear your thoughts on this.

    • @CodeOpinion
      @CodeOpinion 2 years ago +4

      It fits more on top of messaging infrastructure. There can be state involved which is needed if you want to keep track of which events have occurred or something in their contents to then send a command. If state is involved, you'll likely use a library to handle this. The alternative is event choreography, which I will cover independently in a separate video.

    • @juliancyanid
      @juliancyanid 2 years ago

      @Petar Vucetin aka "Routing Slip"-pattern. My first goto when flow is linear, especially if team/codebase has no existing orchestration (it's not only "stateless", it's also dead simple).
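A rough sketch of the routing slip pattern for a linear flow, assuming in-process functions stand in for services reached through a broker (step names are hypothetical). The itinerary travels with the message, and each hop runs its step and forwards the slip onward, so no central orchestrator holds state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RoutingSlip:
    itinerary: List[str]   # remaining steps, in order
    data: dict             # payload each step may enrich
    completed: List[str] = field(default_factory=list)


Step = Callable[[dict], dict]


def handle_next(slip: RoutingSlip, steps: Dict[str, Step]) -> RoutingSlip:
    # What each service does on receipt: run its own step, record it, and
    # produce a new slip to forward to whoever is next in the itinerary.
    name, *rest = slip.itinerary
    data = steps[name](slip.data)
    return RoutingSlip(itinerary=rest, data=data, completed=slip.completed + [name])


def deliver(slip: RoutingSlip, steps: Dict[str, Step]) -> RoutingSlip:
    # Stand-in for the broker forwarding the slip hop by hop.
    while slip.itinerary:
        slip = handle_next(slip, steps)
    return slip
```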

  • @jalalalmutawa4889
    @jalalalmutawa4889 1 year ago

    Hi Derek, how should we handle the failure of a service that has updated its database but failed before sending an event/reply?

    • @CodeOpinion
      @CodeOpinion 1 year ago

      One option is the outbox pattern: czcams.com/video/u8fOnxAxKHk/video.html

  • @BonnakChea
    @BonnakChea 2 years ago +2

    Thanks for the video. It helps me a lot. However, I couldn't find a tool that supports orchestration for Node.js. I'd really appreciate a pointer to a stable one.

    • @loren-sr
      @loren-sr 2 years ago +1

      Temporal's Node/TypeScript SDK is pretty stable, and hitting v1 soon (which won't have any major breaking changes).

    • @BonnakChea
      @BonnakChea 2 years ago

      @@loren-sr Thanks a lot.

  • @rockmanjacky
    @rockmanjacky 2 years ago

    That's a very good video to explain the message queue system design, but how can we prevent the single point of failure if the message queue is down?

    • @CodeOpinion
      @CodeOpinion 2 years ago +1

      Good question! It's core infrastructure, no different than a database. What do you do to prevent your database from being a single point of failure? Generally, a cluster for high availability, and also use the outbox pattern: czcams.com/video/u8fOnxAxKHk/video.html

  • @AntonioRonde
    @AntonioRonde 2 years ago +1

    Why do we place the workflow inside a service like ordering instead of in the broker? This way the start of a workflow would go the same route as any ongoing workflow. Don't know if this is possible / advisable, just wondering about any tradeoffs.

    • @CodeOpinion
      @CodeOpinion 2 years ago

      Not entirely sure what you mean by "in the broker". Ultimately you're using a broker more in a "hub and spoke" way to broker messages between services. How you interact with and handle those messages will be something you need to write in code. Can't say I'm a fan of using anything based on markup/metadata like JSON to define the workflow.

    • @juliancyanid
      @juliancyanid 2 years ago +1

      My 2 cents:
      In this example, the _business_ of 'ordering' probably depends on the (business) capabilities of payment and warehouse. If your business had a "sales" context, maybe _that_ would talk to payment. Business workflows belong to some business context and are bound to change with the way the business works. That's also why you don't want the flow logic (orchestration) in some generic component.

    • @AntonioRonde
      @AntonioRonde 2 years ago

      @@CodeOpinion Yes, we use the broker as a "hub and spoke" for messages. Except for the invocation of the workflow at 6:15, there the client directly invokes the workflow at a service and doesn't follow the route via the broker. My question is why not use the broker here? Would it make sense to also relay this first message/call that invokes the workflow via the broker?

  • @ItzukiTheDemon
    @ItzukiTheDemon 2 years ago

    How would you notify the client about success or failure? Is your depiction here fire-and-forget from the “client” to “ordering”?

    • @CodeOpinion
      @CodeOpinion 2 years ago +2

      It depends on what the workflow is and if the client is "aware" of what's actually happening. In the case of an Order, you simply tell them the order was placed initially. If there is an issue with their credit card or payment, you can email them etc. It really depends on the exact use case. Long running process means it can take milliseconds or it can take days or weeks even. It all depends on what the workflow is. Check out the video I did on using WebSockets as a means to push down to the client. czcams.com/video/Tu1GEIhkIqU/video.html

    • @ItzukiTheDemon
      @ItzukiTheDemon 2 years ago

      @@CodeOpinion I’ll definitely check out the video. Thank you!

  • @selvakumars6487
    @selvakumars6487 2 years ago +1

    Hi Derek, isn't this choreography, since there is no master process coordinating the flow as in orchestration?

    • @CodeOpinion
      @CodeOpinion 2 years ago

      This is orchestration, as there is something controlling the workflow. I'm using a combination of sending commands and consuming events (or replies) in the orchestration. Choreography would be each service consuming events and publishing events without any central knowledge of the workflow.

    • @chengchen9032
      @chengchen9032 2 years ago

      @@CodeOpinion I have the same question about the video, because I was assuming there would be a master process with central knowledge of the workflow, which would use that knowledge to pick up events from the MQ, convert them into specific commands, and then put those back on the MQ. So the chain would go like:
      service A -> Event -> MQ -> Orchestrator -> Command -> MQ -> service B.
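A minimal sketch of the "Orchestrator" box in that chain, assuming events and commands are plain dicts and that a broker delivers events in and sends the returned commands out (all event and command names are hypothetical). The orchestrator only consumes events and decides which commands to send, including compensating commands when a later step fails.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OrderSaga:
    # Per-order workflow state; a real orchestrator would persist this.
    order_id: str
    status: str = "awaiting_payment"


def command(kind: str, order_id: str) -> dict:
    return {"type": kind, "order_id": order_id}


@dataclass
class OrderOrchestrator:
    sagas: Dict[str, OrderSaga] = field(default_factory=dict)

    def on_event(self, event: dict) -> List[dict]:
        # Consume one event and return the commands to send next.
        # No I/O here: putting the commands back on the broker is the caller's job.
        kind, order_id = event["type"], event["order_id"]
        if kind == "OrderPlaced":
            self.sagas[order_id] = OrderSaga(order_id)
            return [command("ProcessPayment", order_id)]
        saga = self.sagas[order_id]
        if kind == "PaymentSucceeded":
            saga.status = "awaiting_stock"
            return [command("ReserveStock", order_id)]
        if kind == "PaymentFailed":
            saga.status = "cancelled"
            return [command("CancelOrder", order_id)]
        if kind == "StockReserved":
            saga.status = "completed"
            return [command("ShipOrder", order_id)]
        if kind == "StockUnavailable":
            # Compensation: undo the earlier payment step, then cancel the order.
            saga.status = "compensated"
            return [command("RefundPayment", order_id), command("CancelOrder", order_id)]
        return []  # events this workflow doesn't care about are ignored
```

Keeping `on_event` free of I/O makes the workflow logic easy to test; persisting `sagas` and sending commands are left to whatever messaging library or storage is actually in use.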

    • @CodeOpinion
      @CodeOpinion 2 years ago +1

      @@chengchen9032 Exactly. There are different tools/libraries/frameworks that can help you orchestrate. In a bunch of my videos I use NServiceBus, which is a .NET library, but you can also use other tooling such as Temporal (temporal.io), which has SDKs for a bunch of languages/platforms.

    • @selvakumars6487
      @selvakumars6487 2 years ago

      I got it: the ordering service (the one with the workflow label) owns the workflow and orchestrates it. I assume that if there is any failure, the same flow is expected to issue compensations to all the involved parties.

    • @CodeOpinion
      @CodeOpinion 2 years ago

      Correct

  • @vicradon
    @vicradon 2 years ago

    So an example of such workflow orchestration tool will be Airflow right?

    • @CodeOpinion
      @CodeOpinion 2 years ago

      Yes as well as other tools like Temporal or messaging libraries that provide a stateful saga.

  • @maartenlouage
    @maartenlouage 2 years ago

    How about using Azure Durable Functions for orchestration?

    • @katjejoek
      @katjejoek 1 year ago

      That's what I was thinking as well. I guess it depends on how the asynchronous communication is done between the parts. What happens if something is not available? Will it be picked up once it is back or will it fail immediately?

  • @andreashe36
    @andreashe36 2 years ago

    Can you make a video on how to aggregate when models from a different domain are needed? E.g., if I need summarized sales of a product as a projection, but product and order may be processed in different domains and different event stores. How does the order get info about product details?

    • @CodeOpinion
      @CodeOpinion 2 years ago +1

      Ya good suggestion.

    • @mrxscheia
      @mrxscheia 2 years ago

      There’s a great video on this topic from Mauro Servienti czcams.com/video/hev65ozmYPI/video.html

  • @pelaoinfo
    @pelaoinfo 1 year ago

    I'm studying this in depth since I'm modeling a whole system, but I still can't work out whether orchestration is async (message queues) or sync (request/response); most of the info I've found says orchestration is sync request/response.

    • @CodeOpinion
      @CodeOpinion 1 year ago

      I'd say more often than not it's async, because of resilience and failures. Check out czcams.com/video/LMKVzguhFw4/video.html

  • @lilivier
    @lilivier 10 months ago

    But what if, after creating your order in the database, you never receive a message back from the next service? You have a 'ghost' order still committed in the database. How do you handle that?

    • @JosueIbarraNinja
      @JosueIbarraNinja 4 months ago

      You generally have a timeout and retry limits set as parameters when starting your workflow or calling your next activity.
      In Cadence, the Cadence service (broker) makes sure to follow up on the pending workflow. Even if the codebase changes, you can replay the workflow in its entirety.
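To illustrate the "timeout and retry limits" idea generically, here is a hand-rolled sketch; it is not Cadence's or Temporal's API (those engines configure retry policies and timeouts through their own SDKs). The helper retries an operation with exponential backoff until it succeeds, the attempt limit is reached, or the next wait would pass an overall deadline; `sleep` and `clock` are injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    pass


def call_with_retry(
    op: Callable[[], T],
    max_attempts: int = 5,
    initial_backoff: float = 0.1,
    deadline_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    # Retry op with exponential backoff until it succeeds, the attempt limit is
    # reached, or the next wait would pass the overall deadline.
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    backoff = initial_backoff
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:  # a real policy would retry transient errors only
            last_error = exc
            if attempt == max_attempts or clock() - start + backoff > deadline_seconds:
                break
            sleep(backoff)
            backoff *= 2
    raise RetriesExhausted(f"gave up after {attempt} attempt(s)") from last_error
```

When retries are exhausted, the workflow would typically move on to its failure path (for example compensating or cancelling the 'ghost' order) instead of leaving it pending forever.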

  • @sangmin7648
    @sangmin7648 2 years ago

    The comparison always seems to be between synchronous api call and asynchronous messaging. But how about asynchronous api call? Could there be any advantage of using async api call over messaging (other than ease of infra management maybe)?

    • @CodeOpinion
      @CodeOpinion 2 years ago +1

      Meaning something like gRPC's non-blocking async? It's still direct service-to-service calls. You can call a service and it's unavailable, or you can call a service async non-blocking and then your own service fails and never gets the reply. Adding durable messaging into the mix removes the need for services to be online.

    • @sangmin7648
      @sangmin7648 2 years ago

      More like a REST API call on a separate thread. I was thinking about something like the outbox pattern, but instead of publishing the event to a queue, making a REST API call on a separate thread. With this approach, other services don’t have to be online.

    • @CodeOpinion
      @CodeOpinion 2 years ago +1

      Yes to a degree. The problem with service to service is if you do it, it can become viral. A calls B that calls C that calls D. At some point that becomes unreliable. Rather deliver a message to a broker and then you're done. I have a video coming out in a few weeks that illustrates where I think rpc and service to service is viable.
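To put a rough number on "at some point that becomes unreliable": if A -> B -> C -> D are synchronous calls and each service fails independently, a request only succeeds when all of them are up, so their availabilities multiply. The 99.9% figure below is purely illustrative. Four such services in a chain give about 99.6%, i.e. roughly 35 hours of expected downtime a year instead of about 8.8 for a single service.

```python
def chain_availability(per_service: float, services: int) -> float:
    # Synchronous chain A -> B -> C -> D: every service must be up at once,
    # so with independent failures the availabilities multiply.
    return per_service ** services


def expected_downtime_hours_per_year(availability: float) -> float:
    # Fraction of time unavailable, scaled to a 365-day year.
    return (1 - availability) * 365 * 24
```

This ignores retries and correlated failures; it is only meant to show why long synchronous chains degrade, whereas a durable broker lets each service be down independently without failing the whole request.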

    • @sangmin7648
      @sangmin7648 2 years ago

      Thanks for the reply. I’ll think about how separate thread rest api call can(or cannot) work in the situation you mentioned. I was wondering about this because my company’s going the separate thread route. Having watched your videos, it felt awkward, but as a clueless junior couldn’t make an argument against it. Thanks again for the great video as always

    • @evilroxxx
      @evilroxxx 1 year ago

      My application currently works exactly as you described: one API asynchronously awaits another API's response and, once it's received, continues the rest of its own execution. However, this is called temporal coupling and must be avoided. Your first API cannot proceed with the success flow or error flow until it receives a response, and it throws an exception if it times out. So it's not really asynchronous execution if it's waiting for something to finish. Hence this more complex event-driven architecture is favorable: not only can you control the execution flow, but you can also pick up where you left off if the consumer of a message is momentarily unavailable and comes back after a short period.

  • @ThugLifeModafocah
    @ThugLifeModafocah 2 years ago

    But what happens when the broker is broken?

    • @CodeOpinion
      @CodeOpinion 2 years ago

      Use an outbox: czcams.com/video/u8fOnxAxKHk/video.html

  • @thedacian123
    @thedacian123 1 year ago

    As far as I understood, the orchestrator is a physical app within a given boundary, which is dumb and cannot contain business logic. Aren't you going to run into a single point of failure issue if this app fails? Thank you!

    • @CodeOpinion
      @CodeOpinion 1 year ago

      Generally the logic is around consuming events and sending commands. Generally no I/O or DB access. If the service owning the orchestration is down, then yes, it won't execute until it's back up.

  • @marcelbricman
    @marcelbricman 16 days ago

    Your line of argument is that it's impossible to roll back in a distributed way, and then magically with messages it works. Calling BS: the problem is still exactly the same. Don't get me wrong, temporal decoupling is good, and the centralized workflow engine can be a great benefit too, but your argument does not illustrate the point of all that. If every service can perform rollbacks asynchronously, there are plenty of other ways to resolve this problem.

  • @evilroxxx
    @evilroxxx 1 year ago

    What is the definition of a service here? Ordering, payment, warehouse are all single api endpoints? Are they console applications running processes in the background?
    If they’re APIs, then how do they pick up the events generated and process them?
    If they’re background processes how’s the client going to communicate with them?
    Let’s say Ordering service is going to communicate with products and/or pricing and/or coupons or inventory services. Is it going to send messages on the broker and wait for a response or an event from them and will that event contain the required data or will it need to do something else to get the data to prepare the order?
    How’s all this going to work to handle scale of say millions of orders daily? Won’t that overwhelm the broker? What if the queues get so full that they cannot accept any more commands? What’s going to happen to the data then?
    Can you please shed some light on these points?
    As always, I’m a super fan of your videos, Derek. Thanks so much for enhancing the community’s knowledge🎉

    • @CodeOpinion
      @CodeOpinion 1 year ago

      My definition of a service is "the authority of a set of business capabilities." They could be a combination of HTTP APIs as well as separate processes that are consuming messages off a queue/broker. A lot of your questions I actually cover in a ton of videos. You're about to go down a rabbit hole :) Here are some that might be helpful.
      re: scaling message processing - czcams.com/video/xv6Ljbq6me8/video.html
      re: communicating to client from background processing: czcams.com/video/Tu1GEIhkIqU/video.html

  • 1 year ago +1

    You forgot idempotence.

    • @CodeOpinion
      @CodeOpinion 1 year ago +2

      Messaging is a broad topic. I've got many videos that focus specifically on various aspects, one of them being idempotence.
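Since brokers and outbox relays commonly deliver messages at least once, a consumer may see the same message twice. A minimal idempotent-consumer sketch, assuming every message carries a unique `message_id` (a hypothetical field): the handler records processed ids and skips duplicates. An in-memory set keeps the example short; a real service would store the processed ids in the same database transaction as the state change.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class IdempotentPaymentHandler:
    # Processed message ids would normally be saved in the same DB transaction
    # as the state change; an in-memory set keeps the sketch short.
    processed_ids: Set[str] = field(default_factory=set)
    charged: Dict[str, int] = field(default_factory=dict)

    def handle(self, message: dict) -> bool:
        # Returns False for a duplicate delivery that was already applied.
        if message["message_id"] in self.processed_ids:
            return False
        order_id = message["order_id"]
        self.charged[order_id] = self.charged.get(order_id, 0) + message["amount"]
        self.processed_ids.add(message["message_id"])
        return True
```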

  • @haskell3702
    @haskell3702 1 year ago

    In this case the Ordering will be the orchestrator? Does this mean that Ordering has to know Payment and Warehouse (which does not look good)? Or the orchestrator should be an independent microservice of these 3?

    • @maximfateev2369
      @maximfateev2369 1 year ago

      It depends on the orchestrator's implementation. It can be separate, or each of the services can own its own orchestration and execute its own operations and compensations.
