Tuesday, August 9, 2016

The devil is in the details – eventual consistency



In https://www.nginx.com/blog/event-driven-data-management-microservices/, Chris uses an example to demonstrate the challenges and solutions in keeping data consistent across microservices.

The example is: when OrderService creates an order, it needs to check with CustomerService to see if the customer has enough credit for the order. OrderService and CustomerService have independent databases, and we do not want to use 2PC because it is neither scalable nor robust.

One solution is to combine a local transaction with an event-driven approach:
“The Order Service inserts a row into the ORDER table and inserts an Order Created event into the EVENT table. The Event Publisher thread or process queries the EVENT table for unpublished events, publishes the events, and then updates the EVENT table to mark the events as published.”
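Here is a minimal sketch of the quoted approach, using Python's built-in sqlite3 as a stand-in database. The table layouts, the `state` column and the `publish` callback are my own assumptions for illustration; the article does not prescribe them.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical tables: "orders" for the ORDER table, "event" for the EVENT table.
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "total REAL, status TEXT)"
    )
    conn.execute(
        "CREATE TABLE event (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "type TEXT, order_id INTEGER, state TEXT)"
    )


def create_order(conn, order_id, customer_id, total):
    # One local transaction: both rows are committed, or neither is.
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?, 'Pending')",
            (order_id, customer_id, total),
        )
        conn.execute(
            "INSERT INTO event (type, order_id, state) VALUES ('OrderCreated', ?, 'new')",
            (order_id,),
        )


def publish_new_events(conn, publish):
    # Query unpublished events, publish each one, then mark it as published.
    rows = conn.execute(
        "SELECT id, type, order_id FROM event WHERE state = 'new' ORDER BY id"
    ).fetchall()
    for event_id, event_type, order_id in rows:
        publish({"id": event_id, "type": event_type, "order_id": order_id})
        with conn:
            conn.execute("UPDATE event SET state = 'published' WHERE id = ?", (event_id,))
    return len(rows)
```

In this sketch `publish` can be any callable, for example `print` or a function that hands the event to a message broker client.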

This is, however, only half of the picture: it does not cover how CustomerService handles events, or how EventPublisher makes sure events actually get published.

EventPublisher gets all “new” events from the DB, publishes them, and updates them from “new” to “published”. During the update, the DB could go down, or EventPublisher itself could go down. In either case, once they are back up, EventPublisher will republish all events still marked “new”, including ones that were already published. This leads to the first design constraint: CustomerService must be able to handle duplicated events.
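One common way to tolerate duplicates is to remember which events have already been handled. The sketch below is an in-memory illustration only; the `event_id`, `customer_id` and `total` fields and the class layout are assumptions of mine, not something the article specifies.

```python
class CustomerService:
    """Hypothetical in-memory stand-in for the real CustomerService."""

    def __init__(self, credit):
        self.credit = dict(credit)  # customer_id -> available credit
        self.processed = {}         # event_id -> reply already produced

    def on_order_created(self, event):
        # A redelivered event is answered from memory and not applied again.
        if event["event_id"] in self.processed:
            return self.processed[event["event_id"]]

        customer, total = event["customer_id"], event["total"]
        if self.credit.get(customer, 0) >= total:
            self.credit[customer] -= total
            reply = "CreditReserved"
        else:
            reply = "CreditLimitExceeded"

        self.processed[event["event_id"]] = reply
        return reply
```

In a real service the processed-event record and the credit update would presumably be written in the same local transaction, so that a crash cannot leave one without the other.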

CustomerService gets an “OrderCreated” event, handles it, and sends a “CreditReserved” or “CreditLimitExceeded” event. Because the network is not reliable (see the fallacies of distributed computing), this event might not reach OrderService in time. OrderService can’t keep showing customers “your order is pending, please be patient…” forever: customers are not that patient, and they will pound on the refresh button again and again. After a certain time, OrderService has to give up and consider the order failed. Since it doesn’t know whether CustomerService has processed the order, it needs to send an “OrderUnwinded” event to it. If CustomerService hasn’t processed the order, this is a no-op for CustomerService; otherwise, CustomerService needs to unreserve the customer’s credit. In either case, CustomerService needs to send an “OrderUnwindedSuccess” event back to OrderService. This leads to the second design constraint: there needs to be a compensating mechanism.
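The compensating handler on the CustomerService side could look roughly like this. It is a sketch under my own assumptions (an in-memory `reservations` map keyed by order id, plain method calls instead of a message broker), not an implementation taken from the article.

```python
class CustomerService:
    """Hypothetical in-memory stand-in, reduced to reserve and unwind."""

    def __init__(self, credit):
        self.credit = dict(credit)  # customer_id -> available credit
        self.reservations = {}      # order_id -> (customer_id, amount)

    def on_order_created(self, order_id, customer_id, total):
        if order_id in self.reservations:
            return "CreditReserved"  # duplicate event, already reserved
        if self.credit.get(customer_id, 0) < total:
            return "CreditLimitExceeded"
        self.credit[customer_id] -= total
        self.reservations[order_id] = (customer_id, total)
        return "CreditReserved"

    def on_order_unwinded(self, order_id):
        # If the order was never processed there is nothing to undo;
        # otherwise give the reserved credit back.
        reservation = self.reservations.pop(order_id, None)
        if reservation is not None:
            customer_id, total = reservation
            self.credit[customer_id] += total
        # Either way, OrderService gets the same acknowledgement.
        return "OrderUnwindedSuccess"
```

Because the reservation is removed when it is undone, a duplicated “OrderUnwinded” event does not return the credit twice.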

But this “OrderUnwindedSuccess” event may, again, not arrive at OrderService in time. From a customer's point of view, an order is one transaction: it either succeeds (the order is created and the credit is reserved) or fails (the order is not created and the credit is not reserved), even though our system is not designed to be consistent at all times. If OrderService doesn’t get the “OrderUnwindedSuccess” event within a certain period of time, it can’t tell the customer “Your order has failed”, because at that moment it doesn’t know whether the order and the credit are consistent. OrderService might retry a couple more times, then give up and log the case as an exception somewhere (an error queue, for example), and human intervention will be needed to sort it out. This leads to the third design constraint: there needs to be a mechanism to handle exceptions.
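The retry-then-escalate logic might be sketched as follows. The function name, the `send_unwind` and `wait_for_ack` callbacks, the returned status strings and the list used as an error queue are all hypothetical; a real system would plug in its broker and its timeout handling here.

```python
def unwind_with_retries(order_id, send_unwind, wait_for_ack, error_queue, max_attempts=3):
    """Try to unwind an order; escalate to a human if no acknowledgement arrives."""
    for _attempt in range(max_attempts):
        send_unwind(order_id)       # publish "OrderUnwinded"
        if wait_for_ack(order_id):  # True once "OrderUnwindedSuccess" has arrived
            return "Failed"         # consistent again, safe to tell the customer
    # Still unknown after all attempts: park the case for human intervention.
    error_queue.append(
        {"order_id": order_id, "reason": "no OrderUnwindedSuccess", "attempts": max_attempts}
    )
    return "NeedsAttention"
```

Resending “OrderUnwinded” is only safe because CustomerService has to tolerate duplicated events anyway.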

To make the human intervention easier, it helps to have a troubleshooting tool that can dig through all related services/tables/message brokers and piece together an event flow in time order, e.g.

  • at time #, OrderService creates order#;
  • at time #, EventPublisher publishes the OrderCreated event for order#;
  • at time #, CustomerService gets the OrderCreated event for order#.
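The core of such a tool could be as simple as merging records from every source and sorting them by time. This sketch assumes each source has already been read into a list of dicts with `time`, `service`, `action` and `order_id` keys; that record shape is hypothetical.

```python
from itertools import chain


def build_timeline(order_id, *sources):
    """Merge records from several sources into one time-ordered flow for an order."""
    records = [r for r in chain.from_iterable(sources) if r["order_id"] == order_id]
    records.sort(key=lambda r: r["time"])
    return [f"at time {r['time']}, {r['service']} {r['action']}" for r in records]


# Example input: one list per service, table or broker log.
order_log = [
    {"time": 1, "service": "OrderService", "action": "creates an order", "order_id": "o1"},
]
publisher_log = [
    {"time": 2, "service": "EventPublisher", "action": "publishes OrderCreated", "order_id": "o1"},
    {"time": 2, "service": "EventPublisher", "action": "publishes OrderCreated", "order_id": "o2"},
]
customer_log = [
    {"time": 3, "service": "CustomerService", "action": "gets OrderCreated", "order_id": "o1"},
]

timeline = build_timeline("o1", customer_log, order_log, publisher_log)
```

Sorting by timestamp only gives a trustworthy order if the services' clocks are reasonably in sync, which is another detail the boxes-and-links diagram hides.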

The complete picture looks like this:

A happy path will be like this:
1) OrderService creates a new order and new OrderEvent

2) EventService queries all "new" events on OrderService
3) EventService publishes all "new" events on OrderService
4) EventService updates "new" events to "published" on OrderService

5) CustomerService gets the event
6) CustomerService processes the event
7) CustomerService updates credit table and creates a "CreditReserved" event

8) EventService queries all "new" events on CustomerService
9) EventService publishes all "new" events on CustomerService
10) EventService updates "new" events to "published" on CustomerService

11) OrderService gets the "CreditReserved" event
12) OrderService updates Order status to "Confirmed"

A sad path will be like this:
1) OrderService times out waiting for "CreditReserved"
2) OrderService creates an "OrderUnwinded" event

3) EventService publishes the "OrderUnwinded" event

4) CustomerService handles "OrderUnwinded" event
5) CustomerService sends "OrderUnwindedSuccess" event

6) EventService publishes the "OrderUnwindedSuccess" event
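Seen from OrderService, the two paths boil down to a small state machine for the order status. In the sketch below only "Confirmed" comes from the steps above; the other status names ("Pending", "Unwinding", "Failed"), the "Timeout" pseudo-event and the handling of "CreditLimitExceeded" are my own assumptions.

```python
# (current status, incoming event) -> next status
TRANSITIONS = {
    ("Pending", "CreditReserved"): "Confirmed",        # happy path
    ("Pending", "CreditLimitExceeded"): "Failed",
    ("Pending", "Timeout"): "Unwinding",               # sad path starts here
    ("Unwinding", "OrderUnwindedSuccess"): "Failed",
}


def next_status(status, event):
    # Unknown or late events leave the status unchanged, so a duplicated
    # or delayed "CreditReserved" cannot revive an order being unwound.
    return TRANSITIONS.get((status, event), status)
```

For example, `next_status("Pending", "Timeout")` gives "Unwinding", and a "CreditReserved" that shows up afterwards is simply ignored.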

When you draw an architecture diagram with boxes and links, it looks so clear and neat. But each box might be very complex. The devil is in the details.  

By the way, I do not think this is a good example of microservices. In this example, CustomerService is really a PaymentService. In real life, a PaymentService is almost always a different service or even a different system, and eventual consistency is to be expected. Perhaps replacing CustomerService with InventoryService and credit with stock would make more sense.
