Thundering Herd Problem - ASP Core Solution Architect
What is the Thundering herd problem?
Let's describe it with a realistic example to get a better understanding.
Suppose we expose a Weather-API that returns the current temperature and is called by web and mobile apps. To keep it fast, it uses caching: on each call it checks the cache, and on a "cache miss" it calls a third-party API and caches the result for later use.
The problem occurs when many calls hit Weather-API at the same time and there is a "cache miss" (i.e. the data is not found in the cache), so every request calls the third-party API itself and, if the call succeeds, caches the data. The third-party API is slow and not designed for this volume of requests, so it goes down, and typically Weather-API goes down or stops functioning with it (e.g. because it is tightly coupled to the third-party API and does not handle exceptions properly).
It is a really complicated problem, especially when there are multiple running instances of Weather-API plus many concurrent requests. 😃
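To make the failure mode concrete, here is a minimal C# sketch of the naive "check the cache, otherwise call the third-party API" flow described above. The names `NaiveWeatherService`, `FetchFromThirdPartyAsync`, and `HerdDemo` are hypothetical, the third-party call is simulated with a delay, and the temperature value is made up.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Naive cache-aside: every request that misses the cache calls the upstream API itself.
public class NaiveWeatherService
{
    private readonly ConcurrentDictionary<string, double> _cache =
        new ConcurrentDictionary<string, double>();
    private int _upstreamCalls;

    public int UpstreamCalls => _upstreamCalls;

    // Hypothetical stand-in for the slow third-party API.
    private async Task<double> FetchFromThirdPartyAsync(string city)
    {
        Interlocked.Increment(ref _upstreamCalls);
        await Task.Delay(200); // simulated latency
        return 21.5;           // made-up temperature
    }

    public async Task<double> GetTemperatureAsync(string city)
    {
        if (_cache.TryGetValue(city, out var cached))
            return cached; // cache hit

        // Cache miss: nothing stops other requests from doing the same thing right now.
        var fresh = await FetchFromThirdPartyAsync(city);
        _cache[city] = fresh;
        return fresh;
    }
}

public static class HerdDemo
{
    // Starts the given number of requests at once and returns how many upstream calls they caused.
    public static async Task<int> RunAsync(int concurrentRequests)
    {
        var service = new NaiveWeatherService();
        var requests = Enumerable.Range(0, concurrentRequests)
            .Select(_ => service.GetTemperatureAsync("Cairo"));
        await Task.WhenAll(requests);
        return service.UpstreamCalls;
    }
}
```

Because all requests start before the first simulated upstream call finishes, every one of them sees an empty cache, so the number of upstream calls grows with the number of concurrent requests instead of staying at one.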
The solution (based on the complex scenario above):
- Extract the "fetching & caching" functionality out of Weather-API
- Use a distributed job mechanism to run exactly one isolated instance (the "Caching-Job") that encapsulates the extracted "fetching & caching" functionality, either on a schedule or on demand
- Use a message broker for communication/signaling between "Weather-API" and "Caching-Job"; for this demo we will use Redis pub/sub as the broker
- When Weather-API gets called:
  - Check the cache; if the data exists, return it directly
  - If the data is not in the cache ("cache miss"):
    - Publish/dispatch a "3-party API Requesting" event to the "Caching-Job" instance
    - Subscribe to the "3-party API Request completed" event coming from the "Caching-Job" instance
    - Start a waiting task that completes as soon as either of two tasks completes: a placeholder "TaskCompletionSource" task (resolved when the "Caching-Job" instance publishes the "3-party API Request completed" event) or a time-out task (e.g. 9 seconds)
      - We use Task.Delay() and Task.WhenAny() instead of Thread.Sleep() so that the thread is not blocked while waiting and can be reused by the thread pool, which helps scalability
    - If the "3-party API Request completed" event arrives before the time-out task completes, get the data from the cache
    - If the time-out task completes before the event, return an empty data response so that web/mobile clients can try again later
- For the single "Caching-Job" instance:
  - Listen for "3-party API Requesting" events coming from "Weather-API"
  - Use a "Semaphore" to allow only one "3-party API Requesting" event to be handled at a time, and hold back the handling of the other identical events
    - If the logic running under the semaphore succeeds and "fetching & caching" has occurred, drop the other events immediately
    - If it does not succeed, pick another event to handle
  - Publish the "3-party API Request completed" event back to "Weather-API"
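The steps above can be sketched in a single process as follows. This is only an illustration under stated assumptions, not the demo's actual code: `InMemoryBroker` is a hypothetical in-memory stand-in for the Redis pub/sub channels, a shared `ConcurrentDictionary` stands in for the distributed cache, the third-party call is simulated with a delay, and all class and method names are invented for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for the two Redis pub/sub channels.
public class InMemoryBroker
{
    public event Action<string> FetchRequested = delegate { }; // "3-party API Requesting"
    public event Action<string> FetchCompleted = delegate { }; // "3-party API Request completed"

    public void PublishFetchRequested(string key) => FetchRequested(key);
    public void PublishFetchCompleted(string key) => FetchCompleted(key);
}

// The single "Caching-Job" instance: the only code that calls the third-party API.
public class CachingJob
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly ConcurrentDictionary<string, double> _cache;
    private readonly InMemoryBroker _broker;
    private int _upstreamCalls;

    public int UpstreamCalls => _upstreamCalls;

    public CachingJob(InMemoryBroker broker, ConcurrentDictionary<string, double> cache)
    {
        _broker = broker;
        _cache = cache;
        _broker.FetchRequested += key => { _ = HandleAsync(key); };
    }

    private async Task HandleAsync(string key)
    {
        await _gate.WaitAsync(); // only one event is handled at a time
        try
        {
            // If an earlier event already filled the cache, drop this one.
            if (!_cache.ContainsKey(key))
            {
                Interlocked.Increment(ref _upstreamCalls);
                await Task.Delay(100); // simulated third-party latency
                _cache[key] = 21.5;    // made-up temperature
            }
        }
        finally
        {
            _gate.Release();
        }
        _broker.PublishFetchCompleted(key);
    }
}

// The API side: read the cache, otherwise signal the job and wait without blocking a thread.
public class WeatherApi
{
    private readonly ConcurrentDictionary<string, double> _cache;
    private readonly InMemoryBroker _broker;

    public WeatherApi(InMemoryBroker broker, ConcurrentDictionary<string, double> cache)
    {
        _broker = broker;
        _cache = cache;
    }

    public async Task<double?> GetTemperatureAsync(string key, TimeSpan timeout)
    {
        if (_cache.TryGetValue(key, out var hit))
            return hit; // cache hit

        var completed = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        Action<string> onCompleted = k => { if (k == key) completed.TrySetResult(true); };

        _broker.FetchCompleted += onCompleted; // subscribe before publishing
        try
        {
            _broker.PublishFetchRequested(key);
            var winner = await Task.WhenAny(completed.Task, Task.Delay(timeout));
            if (winner == completed.Task && _cache.TryGetValue(key, out var fresh))
                return fresh;
            return null; // timed out: the client can retry later
        }
        finally
        {
            _broker.FetchCompleted -= onCompleted;
        }
    }
}
```

In this sketch, any number of concurrent `GetTemperatureAsync` calls for the same key results in a single simulated upstream call, because later events find the cache already filled once they get past the semaphore. Subscribing before publishing avoids missing a completion event that arrives quickly, and `RunContinuationsAsynchronously` keeps the waiting requests from resuming inside the job's publish call. In a multi-instance deployment the in-memory broker and dictionary would be replaced by the real broker and distributed cache.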