Doron Holan has a post on his blog about choosing appropriate locking mechanisms for kernel drivers. He points out an amazingly good paper on locking on WHDC.
One thing he didn’t go into (although he may have meant to; his post ended in a comma) is the performance question. Of primary importance, of course, is program correctness, which means choosing the correct lock for the job at hand. But with that given, let’s talk about perf a bit.[1]
Consider what would happen if you simply used a spin lock at the beginning and end of every singe function in your driver. Leaving out the IRQL issues you would have (lots of DDIs can only be called at <=DISPATCH_LEVEL), you would have left yourself with a totally single-threaded driver. You would basically cause any other processor that had need of executing code in your driver to be parked while the current processor does its thing. A much better idea is to protect data structures, rather than code paths, and to hold your locks for as short a period of time as correctness will permit.
Beyond simply optimizing your locks for concurrency, there is the issue of how expensive the lock itself is. The name “fast mutex” should give you a clue – different locks have different perf implications. The slowest locks are the dispatcher objects – semaphores, kernel mutexes, and (although they’re not strictly locks) kernel events. These are certainly appropriate for some uses, but they’re not the fastest locks in the world; they all contend for the single system-wide dispatcher lock, which is one of the hottest locks in the system.
Spin locks are an interesting performance case. On the one hand, they’re extremely fast in the un-contended case, and virtually optimal in the contended case – you won’t wait many more CPU ticks than the bare minimum to acquire a lock once it is freed. On the flip side, they can cause memory bus congestion if more than one CPU is spinning on the same lock. Furthermore, acquisition of a spin lock (uniproc or multiproc) implies going off-core to hit the interrupt controller. If there are many CPUs spinning on a lock, the eventual winner of the lock is essentially unpredictable – there is a small but nonzero probability of lock starvation, purely by chance of the ordering of xchg operations on the bus.
In-stack queued spin locks are a solution to a couple of the problems associated with standard spin locks. Because each waiter supplies its own wait location to spin on, each CPU can spin on a cache line rather than locking the memory bus with each xchg. And, because they are queued, each lock holder releases the next waiter in line in the process of dropping its lock. This enforces fairness in lock wait time – first come, first served.
If you don’t need to raise to DISPATCH_LEVEL, though, you are probably better off in terms of overall system performance by choosing something else still. Some lower-IRQL lock options include fast mutexes, executive resources, and guarded mutexes. Fast mutexes imply hitting the interrupt controller to raise to APC_LEVEL, unless you put yourself in a critical or guarded region first (which is a very fast operation). Guarded mutexes merely disable all APCs on the thread by entering a guarded region; this saves them from hitting the PIC. Both are extremely fast in uncontended cases, assuming you don’t hit the PIC.
Executive resources are kind of a middle ground; they’re not quite as fast to acquire as fast mutexes or guarded mutexes, but they allow for recursion and reader/writer semantics, which can make a world of (positive) difference in driver performance situations where multi-reader/single-writer semantics make sense.
The final truth, though, is that performance tuning is hard work. Beyond some very basic architecture tuning, you’re usually better off optimizing under the direction of a profiler.
[1]A couple of good quotes come to mind: Premature optimization is the root of all evil
— Tony Hoare, but also Andy [grove] giveth, and Bill [gates] taketh away
[...] Manipulation is even simpler; you can (and should) almost always use built-in macros or functions to handle common operations: InsertHeadList, RemoveEntryList, etc. These operations are fast and easy, although if you need interlocking, be sure to use the spinlock-protected ExInterlocked* functions. Or, if spin locks aren’t right for your application, roll your own interlocked operations with the locking of your choice. [...]