Thundering Herd Problem - ASP Core Solution Architect

 


What is the Thundering herd problem?

Let's describe it with a realistic example to understand it better.

Suppose a public Weather-API returns the current temperature and is called by web and mobile apps. To keep it fast, it caches results: on each call it checks the cache first, and on a "cache miss" it calls a third-party API and caches the result for later use.

The problem appears when many calls reach Weather-API at the same time and all of them hit a "cache miss" (i.e. the data is not in the cache). Each request then calls the third-party API itself and caches the data only if its own call succeeds. The third-party API is slow and not designed for that many simultaneous requests, so it goes down, and Weather-API typically goes down or stops working along with it (e.g. because it is tightly coupled to the third-party API and does not handle exceptions properly).

This is a genuinely hard problem, especially when multiple instances of Weather-API are running and each one serves multiple concurrent requests.😃
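To make the stampede concrete, here is a minimal sketch of the naive cache-aside behavior described above. It is written with Python's asyncio as a stand-in; the post itself targets ASP.NET Core, and the names `get_temperature`, `fetch_from_third_party`, the city name, and the temperature value are all made up for illustration.

```python
import asyncio

cache = {}            # in-memory stand-in for the shared cache
upstream_calls = 0    # how many times the slow third-party API was hit


async def fetch_from_third_party(city):
    """Hypothetical slow third-party call, simulated with a short sleep."""
    global upstream_calls
    upstream_calls += 1
    await asyncio.sleep(0.05)
    return 21.5  # made-up temperature


async def get_temperature(city):
    """Naive cache-aside: every request that misses calls the upstream itself."""
    if city in cache:
        return cache[city]
    value = await fetch_from_third_party(city)
    cache[city] = value
    return value


async def simulate(requests):
    """Fire `requests` concurrent calls against a cold cache."""
    global upstream_calls
    cache.clear()
    upstream_calls = 0
    await asyncio.gather(*(get_temperature("Cairo") for _ in range(requests)))
    return upstream_calls


if __name__ == "__main__":
    print(asyncio.run(simulate(50)), "upstream calls for 50 requests")
```

Because the cache is filled only after the slow call returns, every concurrent request sees the miss, and the upstream receives one call per request.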


The Solution (based on the complex scenario above):

(i.e. multiple running instances + multiple concurrent requests)

  1. Extract the fetching and caching functionality out of Weather-API

  2. Use a distributed job mechanism to run exactly one isolated instance (the "Caching-Job") that encapsulates the extracted "fetching & caching" functionality, triggered either on a schedule or on demand

  3. Use a message broker for communication/signaling between "Weather-API" and "Caching-Job"; for this demo we will use Redis pub/sub as the broker

  4. When Weather-API gets called:
    1. Check the cache; if the data exists, return it directly
    2. If the data is not present, it is a "cache miss"
    3. Publish/dispatch a "3-party API Requesting" event to the "Caching-Job" instance
    4. Subscribe to the "3-party API Request completed" event coming from the "Caching-Job" instance
    5. Start a waiting task that completes when either of two tasks completes: a dummy "TaskCompletionSource" task (resolved when the "Caching-Job" instance publishes the "3-party API Request completed" event) or a timeout task (e.g. 9 seconds)
    6. We use Task.Delay() and Task.WhenAny() instead of Thread.Sleep() so that the thread is returned to the thread pool while waiting, which matters for scalability
    7. If the "3-party API Request completed" event arrives before the timeout task completes,
    8. then get the data from the cache
    9. If the timeout task completes before the event, return an empty response so that web/mobile clients can try again later

  5. For the single "Caching-Job" instance:
    1. Listen for "3-party API Requesting" events coming from "Weather-API"
    2. Use a semaphore to allow only one "3-party API Requesting" event to be handled at a time, blocking the handling of other identical events
    3. If the logic running under semaphore protection succeeds and "fetching & caching" has occurred, drop the other events immediately
    4. If it does not succeed, pick another event to handle
    5. Publish the "3-party API Request completed" event back to "Weather-API"
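Step 4.6 above is about waiting without holding a thread. The sketch below shows the same idea in Python's asyncio, where `asyncio.sleep` plays roughly the role that Task.Delay() plays in .NET. It is an analogue for illustration only, and `wait_briefly` and `run_waiters` are hypothetical names.

```python
import asyncio
import time


async def wait_briefly(delay):
    """Non-blocking wait: the event loop is free to run other requests meanwhile."""
    await asyncio.sleep(delay)


async def run_waiters(count, delay):
    """Start `count` waiters at once and return the total elapsed seconds."""
    started = time.perf_counter()
    await asyncio.gather(*(wait_briefly(delay) for _ in range(count)))
    return time.perf_counter() - started


if __name__ == "__main__":
    elapsed = asyncio.run(run_waiters(count=20, delay=0.05))
    print(f"20 waiters finished in {elapsed:.2f}s")
```

The twenty waits overlap, so the total time stays close to a single delay. A blocking sleep inside each waiter would instead run them one after another, and the total would grow with the number of requests.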

Querying the "Caching-Job" by event to fetch and cache data, using Redis as the pub/sub broker and a timeout task
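As a rough illustration of the Weather-API side, here is a minimal sketch in Python's asyncio rather than the author's ASP.NET Core code. An `asyncio.Future` stands in for the TaskCompletionSource, `asyncio.wait_for` stands in for the Task.WhenAny() plus Task.Delay() pair, and the Redis broker is replaced by an in-process fake. `get_weather`, `demo`, and the `"temperature"` cache key are assumptions made for the example.

```python
import asyncio

TIMEOUT_SECONDS = 9  # the example timeout mentioned in the steps above


async def get_weather(cache, publish_request, completed, timeout=TIMEOUT_SECONDS):
    """Return cached data, or ask the Caching-Job and wait for its completion event.

    `completed` is a Future that a subscriber callback is expected to resolve
    when the "3-party API Request completed" event arrives.
    """
    if "temperature" in cache:              # cache hit: answer immediately
        return cache["temperature"]
    await publish_request()                 # "3-party API Requesting" event
    try:
        # Wait for the completion event, but no longer than `timeout` seconds.
        await asyncio.wait_for(completed, timeout)
    except asyncio.TimeoutError:
        return None                         # empty response; client retries later
    return cache.get("temperature")


async def demo(job_delay, timeout):
    """Wire get_weather to a fake Caching-Job that answers after `job_delay` seconds."""
    cache = {}
    background = []                         # keeps a reference to the fake job task
    completed = asyncio.get_running_loop().create_future()

    async def fake_caching_job():
        await asyncio.sleep(job_delay)
        cache["temperature"] = 21.5         # made-up value
        if not completed.done():
            completed.set_result(True)      # "3-party API Request completed"

    async def publish_request():
        background.append(asyncio.create_task(fake_caching_job()))

    return await get_weather(cache, publish_request, completed, timeout)


if __name__ == "__main__":
    print(asyncio.run(demo(job_delay=0.01, timeout=1)))    # job answers in time
    print(asyncio.run(demo(job_delay=1, timeout=0.05)))    # timeout wins
```

If the fake job answers within the timeout, the caller reads the value from the cache; otherwise it gets an empty result and can retry later, which matches the two outcomes in steps 4.7 to 4.9.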


Using a semaphore to allow only one event at a time into the caching code and to return immediately for subsequent events
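The Caching-Job side can be sketched in the same hedged way. This Python asyncio version uses `asyncio.Semaphore(1)` as the gate, where a .NET implementation would likely use something like SemaphoreSlim. It treats "data already in the cache" as the signal that an earlier event succeeded, which is an assumption rather than the author's exact logic. The `CachingJob` class and its members are hypothetical.

```python
import asyncio


class CachingJob:
    """Sketch of the single Caching-Job instance; all names are illustrative."""

    def __init__(self, fetch):
        self._fetch = fetch                  # hypothetical third-party call
        self._gate = asyncio.Semaphore(1)    # one handler inside at a time
        self.cache = {}
        self.completed_events = 0            # stands in for publishing back

    async def handle_request_event(self):
        """Handle one "3-party API Requesting" event; return True if it fetched."""
        async with self._gate:
            if "temperature" in self.cache:
                return False                 # already cached: drop this event
            try:
                self.cache["temperature"] = await self._fetch()
            except Exception:
                return False                 # failed: the next event will retry
        self.completed_events += 1           # "3-party API Request completed"
        return True


async def demo(events):
    """Deliver `events` simultaneous request events; count upstream calls."""
    calls = 0

    async def fetch():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)            # simulated slow upstream
        return 21.5                          # made-up temperature

    job = CachingJob(fetch)
    results = await asyncio.gather(
        *(job.handle_request_event() for _ in range(events))
    )
    return calls, results.count(True)


if __name__ == "__main__":
    print(asyncio.run(demo(25)))
```

Only the first event reaches the upstream. The others wait at the gate, find the cache already filled, and return without fetching, so the third-party API sees a single call no matter how many events arrive together.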



Code:


Test Solution:
