by Herb Sutter for Dr. Dobb’s Journal
“How would you write a fast, internally synchronized queue, one that callers can use without any explicit external locking or other synchronization? Let us count the ways…or four of them, at least, and compare their performance. We’ll start with a baseline program and then successively apply three optimization techniques, each time stopping to measure each change’s relative performance for queue items of different sizes to see how much each trick really bought us.
All versions of the code we’ll consider have one thing in common: They control concurrency using (the equivalent of) two spinlocks, one for each end of the queue. Our goal in refining the code is to allow separate threads calling Produce and/or Consume to run as concurrently as possible, while also improving scalability to get more total throughput through the queue when we use larger numbers of producer and/or consumer threads.”
How much parallel performance does each successive optimization really buy us, and why?


