Archive for the ‘Windows’ Category

Alex Ionescu is blogging

Sunday, November 19th, 2006

I’ve been meaning to post this for days but I keep forgetting. Alex Ionescu, who is another one of those guys that just seems to know way more than one person should be allowed to know about Windows internals, has started blogging (again).

He’s got a good user-mode debugging series posted, and if I know Alex, there’s plenty more good stuff still to come.

Welcome to the blogosphere, Alex.

More on object lifetimes

Saturday, November 18th, 2006

In an earlier post, I described a subtle race condition resulting from the differing lifetimes of miniport adapters and control device objects. Last week, Gianluca Varenni, the maintainer of WinPcap and one of the brains at CACE Technologies, pointed out that Microsoft had recently changed the Passthru sample to add reference counting to adapter objects in some instances. I went back and looked, and sure enough, the current WDK sample has additional reference counting built into the driver.

Microsoft didn’t add any comments to the sample describing the reference counting addition, but I found this bug myself a while ago and implemented essentially the same solution. The basic problem is that there is a race between the two different adapter tear-down paths – the one that is initiated from halting the virtual miniport itself and the other that is triggered by the halting of the underlying miniport.

Gianluca also pointed out that nobody in their right mind would write an IM driver from scratch, other than as an educational experience, because it’s entirely too difficult to get the various NDIS synchronization issues right unless you’re an absolute expert at it. Obviously, even Microsoft is still finding bugs.

The good news is that IMs are dead. Vista has a much-improved lightweight filtering architecture, so the writing is on the wall.

Vista ships

Wednesday, November 8th, 2006

It had to happen eventually. :-) What’s it going to be like getting to ship actual production code with the WDK? I can’t wait to strip out all of the #ifdefs for the old kits…

Congrats to the team(s). Shipping is hard. Shipping something that big is really hard.

It only took a week!

Monday, November 6th, 2006

The Month of Kernel Bugs blog just posted their first Windows hole. They’ve hit several other major OSes already. This one looks like it’s related to Win32k. They claim arbitrary code execution, which makes it a local administrator privilege escalation.

The best part is that it was apparently reported two years ago.

CONTAINING_RECORD for fun and profit

Saturday, November 4th, 2006

Here I want to cover two great uses for the CONTAINING_RECORD macro. CONTAINING_RECORD has been a part of the DDK since forever, and it has a second cousin in Standard C by the name of offsetof.

For those that don’t use it regularly, it looks about like this:

typedef struct _BUFFER {
	LIST_ENTRY  e;
	UCHAR      *buf;
} BUFFER, *PBUFFER;

...

PLIST_ENTRY entry = RemoveHeadList(&listHead);

PBUFFER a = CONTAINING_RECORD(entry, BUFFER, e);

The macro returns a pointer to a data structure given the address of an element inside that structure, as shown above. Although you can get by without it most of the time, there are a couple of good reasons to use it.
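To make the mechanics concrete, here is a minimal user-mode sketch of the same pattern. It is not the DDK's code: list_entry, MY_CONTAINING_RECORD, and the helper functions are stand-ins I made up so the example compiles outside the kernel, and the macro body is just the usual "member address minus offsetof" arithmetic.

```c
#include <stddef.h>   /* offsetof */

/* Hypothetical stand-in for the kernel's LIST_ENTRY. */
typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

/* Illustrative equivalent of CONTAINING_RECORD: step back from the
 * member's address by the member's offset within the enclosing type. */
#define MY_CONTAINING_RECORD(address, type, field) \
    ((type *)((char *)(address) - offsetof(type, field)))

typedef struct buffer {
    list_entry     e;
    unsigned char *buf;
} buffer;

/* An empty list is a head that points at itself in both directions. */
void init_list_head(list_entry *head)
{
    head->flink = head;
    head->blink = head;
}

/* Link entry in just before the head, i.e. at the tail. */
void insert_tail(list_entry *head, list_entry *entry)
{
    list_entry *last = head->blink;
    entry->flink = head;
    entry->blink = last;
    last->flink  = entry;
    head->blink  = entry;
}

/* Unlink and return the first entry (returns head itself if empty). */
list_entry *remove_head(list_entry *head)
{
    list_entry *first = head->flink;
    head->flink = first->flink;
    first->flink->blink = head;
    return first;
}

/* Pop the first buffer: the list only knows about the embedded entry,
 * so recover the enclosing structure from it. Assumes a non-empty list. */
buffer *pop_buffer(list_entry *head)
{
    list_entry *entry = remove_head(head);
    return MY_CONTAINING_RECORD(entry, buffer, e);
}
```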

First, it helps prevent bugs by keeping you from hard-coding structure offsets elsewhere in your code. For example, take the BUFFER struct above. You could, of course, simply cast the PLIST_ENTRY to a PBUFFER, but that encodes the position of e (offset zero) forever in your code. If you change the position (e.g. by adding a member above e in the definition), you will break all instances of code that rely on this layout.

Using CONTAINING_RECORD, on the other hand, allows you to write offset-independent code to refer to elements within the structure. Obviously this won’t work for code that was compiled against a different version of the structure, since this is a compile-time technique. But it can prevent otherwise silent errors. (Well, silent until the code path in question is executed…)

The reason that they’re silent is also the second great reason to use CONTAINING_RECORD: casts. I’ve written before about how much I like casts in code, so I’m implicitly for anything that eliminates casts. Since CONTAINING_RECORD returns the right type by definition, you no longer have to cast it. So:

PBUFFER buf = (PBUFFER)RemoveHeadList(&listHead);

becomes:

PBUFFER buf = CONTAINING_RECORD(
    RemoveHeadList(&listHead), BUFFER, e);

This has the added bonus of being able to tell when things change out from under it, so it won’t produce as many silent errors of the type described above.
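Here is a small user-mode sketch of that failure mode, under the assumption that someone later adds a member above e. The names (buffer_v2, MY_CONTAINING_RECORD, and the two helper functions) are invented for illustration. The cast still compiles without complaint but now yields the wrong address, while the offset-based version keeps working.

```c
#include <stddef.h>   /* offsetof, size_t */

/* Hypothetical stand-in for the kernel's LIST_ENTRY. */
typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

/* Illustrative equivalent of CONTAINING_RECORD. */
#define MY_CONTAINING_RECORD(address, type, field) \
    ((type *)((char *)(address) - offsetof(type, field)))

/* "Version 2" of the structure: a new member now sits above e. */
typedef struct buffer_v2 {
    size_t         length;
    list_entry     e;
    unsigned char *buf;
} buffer_v2;

/* Cast-based recovery: only correct while e is the first member. */
buffer_v2 *from_entry_by_cast(list_entry *entry)
{
    return (buffer_v2 *)entry;
}

/* Offset-based recovery: follows e wherever it moves at compile time. */
buffer_v2 *from_entry_by_offset(list_entry *entry)
{
    return MY_CONTAINING_RECORD(entry, buffer_v2, e);
}
```

Given a buffer_v2 named b, from_entry_by_offset(&b.e) comes back as &b, while from_entry_by_cast(&b.e) points into the middle of the structure.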

Two additional CSQ rules

Wednesday, November 1st, 2006

Cancel-safe queues are a fantastic addition to Windows from back in the XP timeframe, and I have come to rely on them in my drivers. There are a couple of important extra rules that aren’t reflected in the documentation that you should be aware of, though.

Rule #1

If you read the documentation for IoCsqInsertIrp or IoCsqInsertIrpEx, you’ll find that the routine can be called at <= DISPATCH_LEVEL. While this is true, you cannot call it while holding a spin lock.

To see why this is true, consider the case where a dispatch routine receives an IRP, and before it can queue the IRP, another thread swoops in and cancels it. If this happens, someone should call the cancel routine, and that someone is IoCsqInsertIrp. It eventually calls CsqCompleteCanceledIrp, which calls IoCompleteRequest with a status of STATUS_CANCELLED.

(Note the double-L. Very bad news for a guy who can’t spell in the first place. I’ll be looking that one up for the rest of my life.)

The rest should be obvious: it is in fact illegal to call IoCompleteRequest while holding a lock, which is exactly what winds up happening in this case. Therefore, you can’t call IoCsqInsertIrp or IoCsqInsertIrpEx while holding a lock.

Rule #2

The second rule is related to when IoCsqInsertIrpEx marks an IRP pending. The rule is simple: If the supplied CsqInsertIrpEx callback returns STATUS_SUCCESS, the IRP is marked pending. Otherwise, it’s not.

In the case of IoCsqInsertIrp, the IRP is unconditionally marked pending. This is even quasi-documented by the WDK cancel sample, but the corresponding startio sample doesn’t say anything about the behavior of IoCsqInsertIrpEx, which I had always assumed was the same based on that comment. It’s not. :-)

Neither of these rules shows up in the docs (at least as of now), so hopefully this will save some confusion down the road.

Vista driver verifier enhancements

Tuesday, October 31st, 2006

I just ran across this document that explains the changes present in Vista’s driver verifier. Verifier is one of the Best Things Ever.

Thanks to Dan Mihai from Microsoft for pointing this out on the newsgroups.

Keeping ExInterlocked* operations interlocked

Tuesday, October 31st, 2006

To continue on yesterday’s discussion of interlocked lists, let’s explore the nature of the interlocking done by the ExInterlocked* APIs. The ExInterlockedRemoveHeadList documentation says the following about its spin lock argument:

You must use this spin lock only with the ExInterlockedXxxList routines.

The documentation page provides a hint as to why this is the case:

The ExInterlockedRemoveHeadList routine can be called at any IRQL.

The reason that this function can be called from any IRQL is that the function acquires the spin lock at the highest IRQL in the system. To understand why this is important, we have to examine another kind of race condition – priority inversion deadlocks.

Recall that the kernel operates on a prioritization scheme implemented using IRQLs. Normal tasks run at PASSIVE_LEVEL, device interrupts are serviced at some higher IRQL (called DIRQL), and other tasks happen at various points in between. (See the DDK or any introductory driver book for more information on this.) Drivers typically acquire spin locks at DISPATCH_LEVEL, which is below all DIRQLs.

A priority inversion deadlock can happen if a driver acquires a spin lock at DISPATCH_LEVEL, and while holding that lock, is interrupted by hardware. An interrupt service routine is invoked on behalf of the hardware, and runs at DIRQL. If the ISR tries to acquire that same lock, a deadlock will occur: the ISR will spin forever, waiting for the driver to release the lock, but the driver is stuck suspended until the ISR returns.

With that in mind, let’s come back to the ExInterlocked* functions. Suppose you try to acquire the spin lock at DISPATCH_LEVEL (with KeAcquireSpinLock), perhaps for the purpose of removing an entry from the list. Suppose that your hardware interrupts in the middle of your operation, and your ISR lands on the same CPU you were just operating on. If your ISR then calls something like ExInterlockedInsertHeadList with that same lock, you’ll deadlock. The lower-priority routine will own the lock, and the higher-priority routine will wait forever trying to acquire it.

The solution is to follow the documentation’s advice and always use that spin lock exclusively with ExInterlocked* routines. When you use ExInterlockedInsertHeadList from any routine (not just an ISR), it raises the IRQL to the highest IRQL possible on that CPU, which masks out everything else in your driver – even ISRs. This prevents the priority inversion.
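The argument boils down to one comparison, which can be written as a toy model. This is a deliberately simplified sketch, not kernel code: the numeric IRQL values and EXAMPLE_DIRQL are illustrative assumptions, and the function names are mine. It only captures the single-CPU rule that an interrupt gets in when its IRQL is strictly higher than the current one.

```c
#include <stdbool.h>

/* Illustrative IRQL values only; real values depend on the platform. */
enum {
    PASSIVE_LEVEL  = 0,
    DISPATCH_LEVEL = 2,
    EXAMPLE_DIRQL  = 5,   /* hypothetical device IRQL */
    HIGH_LEVEL     = 31
};

/* On a given CPU, an interrupt is delivered only if its IRQL is
 * strictly higher than the IRQL the CPU is currently running at. */
bool can_preempt(int current_irql, int interrupt_irql)
{
    return interrupt_irql > current_irql;
}

/* A lock is held on this CPU at held_irql, and an ISR running at
 * isr_irql wants the same lock. If the ISR can preempt the holder,
 * it spins forever: the holder cannot run again until the ISR returns. */
bool same_cpu_deadlock_possible(int held_irql, int isr_irql)
{
    return can_preempt(held_irql, isr_irql);
}
```

In this model, holding the lock at DISPATCH_LEVEL leaves the door open for an ISR at EXAMPLE_DIRQL, while holding it at HIGH_LEVEL does not, which is the whole point of letting the ExInterlocked* routines take the lock for you.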

For what it’s worth, the documentation used to say something like “ExInterlocked routines are only interlocked with respect to each other.” The new wording says less but is much clearer in my opinion.

UPDATE: clarified wording to prevent deliberate mis-interpretation.

Why is there no ExInterlockedRemoveEntryList?

Monday, October 30th, 2006

A long time ago, I promised an entry on why there is no ExInterlockedRemoveEntryList function. If you search the NTDEV archives (or if you got to hear Peter Viscarola from OSR discuss it at one of the Driver DevCons a while back), you know that Microsoft left the function out intentionally due to its potential for misuse.

To understand why this is, consider one of the nice properties of a doubly-linked list: constant-time removal of an item from the middle, if you already know the item’s address. The list entries look something like this:

typedef struct _LIST_ENTRY
{
	struct _LIST_ENTRY *Flink;
	struct _LIST_ENTRY *Blink;
}
LIST_ENTRY, *PLIST_ENTRY;

To do a remove operation, you simply point the previous item’s Flink at the next item and the next item’s Blink at the previous item. No need to walk a long list of items. There’s even a macro to do this for you: RemoveEntryList.
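For reference, the pointer surgery looks roughly like this in a user-mode sketch. The type and function names here are my own stand-ins rather than the DDK definitions, and there is no locking at all.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's LIST_ENTRY. */
typedef struct list_entry {
    struct list_entry *flink;   /* next */
    struct list_entry *blink;   /* previous */
} list_entry;

/* An empty list is a head that points at itself in both directions. */
void init_list_head(list_entry *head)
{
    head->flink = head;
    head->blink = head;
}

/* Link entry in just before the head, i.e. at the tail. */
void insert_tail(list_entry *head, list_entry *entry)
{
    list_entry *last = head->blink;
    entry->flink = head;
    entry->blink = last;
    last->flink  = entry;
    head->blink  = entry;
}

/* Constant-time unlink: the neighbors are made to point at each other.
 * Note that this blindly trusts entry's own flink and blink. */
void remove_entry(list_entry *entry)
{
    list_entry *next = entry->flink;
    list_entry *prev = entry->blink;
    prev->flink = next;
    next->blink = prev;
}
```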

This process is subject to an obvious race condition and another less obvious one. The obvious race is that two different threads could try to mutate the list simultaneously. The naïve solution is to wrap the removal in locks:

LockList();             // Spin lock, mutex, whatever...
RemoveEntryList(item);
UnlockList();

That does indeed prevent two threads from making simultaneous updates, but it misses another important problem: What if the entry you’re trying to remove is no longer on the list? What if another thread has just finished removing the same item, right before your call to LockList above? You certainly have no idea if the item’s neighbors are still valid after the item has been removed from the list, so you could easily trash the list.

The only safe way to do this is to ensure that the item is still a part of the list at the time you remove it. And the only safe way to do that is to walk the list from a point that you know will always be on the list, namely, from the head.

There are, of course, situations in which you can be sure, due to other semantics of the program in question, that your item really is still on the list. In those cases, the pattern above is safe.

But in other situations, you have to walk the entire list. This can be expensive, and has to be done under the protection of whatever lock you’re using. For a list with thousands of entries on it, you would want to avoid this whenever possible, and you should probably try to set up whatever extra bookkeeping you need to take advantage of O(1) removal. But, you don’t get that bookkeeping automatically, so that’s why there’s no ExInterlockedRemoveEntryList.
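The walk-from-the-head approach can be sketched in user mode as follows. Again, this is an illustration under my own assumptions: list_entry and the function names are invented, and the locking is left to the caller, who would have to hold the list lock across the whole call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's LIST_ENTRY. */
typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

/* An empty list is a head that points at itself in both directions. */
void init_list_head(list_entry *head)
{
    head->flink = head;
    head->blink = head;
}

/* Link entry in just before the head, i.e. at the tail. */
void insert_tail(list_entry *head, list_entry *entry)
{
    list_entry *last = head->blink;
    entry->flink = head;
    entry->blink = last;
    last->flink  = entry;
    head->blink  = entry;
}

/* Unlink item only if it is still reachable from head. This is O(n),
 * and the caller must hold the list lock for the entire call.
 * Returns true if the item was found and removed. */
bool remove_if_present(list_entry *head, list_entry *item)
{
    for (list_entry *cur = head->flink; cur != head; cur = cur->flink) {
        if (cur == item) {
            cur->blink->flink = cur->flink;
            cur->flink->blink = cur->blink;
            /* Self-link so stale pointers never lead back into the list. */
            cur->flink = cur;
            cur->blink = cur;
            return true;
        }
    }
    return false;   /* someone else already removed it */
}
```

A second call for the same item simply returns false instead of rewriting neighbors that may no longer be its neighbors.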

Downlevel support for Winsock Kernel

Monday, October 30th, 2006

David Powell provided me with some insight about the possibility of downlevel support for WSK, now that TDI is being deprecated. He tells me that the WSK team has been getting lots of requests for Windows XP/2003 support lately, and that it’s high on their list of things to do as soon as they get Vista out the door.

As for Windows 2000 support, my impression is that it is pretty unlikely. If this really matters to you, I’d encourage you to follow the link in my previous post to send the WSK team feedback. Such feedback has been effective before.