Archive for the ‘Programming’ Category

Be careful with MmGetSystemRoutineAddress

Thursday, May 31st, 2007

Bill McKenzie reported on NTDEV that he re-discovered a nasty crash bug in MmGetSystemRoutineAddress. Apparently, all versions of Windows XP and everything before Windows Server 2003 SP1 will bugcheck if they’re passed an invalid system routine name.

This bug is fixed in Vista and is scheduled for fix in XP SP3.

Meanwhile, Peter Viscarola from OSR offers this advice:

The work-around is to always call MmGetSystemRoutineAddress from within a try/except block. If you get the exception, I guess you can assume the routine you’re seeking isn’t there…

UPDATE from Doron Holan:

FYI, using SEH to recover from this bug is *NOT* recommended. SEH is not a formal contract for this API and as such, we (MSFT) cannot guarantee that the OS is still in a stable state after you have caught the exception. I am working on a better solution, but for now, SEH is not the answer.

Microsoft discusses a redesigned OS

Tuesday, May 29th, 2007

I’ve had a long-running argument with anyone that will listen that multi-core computing will require a fundamentally different programming model. Of course, I’m not remotely the only person saying this, but it seems to be a bit of a contentious topic regardless.

I just ran across an article on ArsTechnica in which a Microsoft exec discusses a future version of Windows that deals with massively multi-core computers. There is some interesting stuff in the article. I’d heard through the grapevine that this was underway, and if I weren’t so busy with PhoneFactor, I’d love to code on it!

I worked up a lock-free doubly-linked list last summer, and had intended to try it out in a couple of drivers, but predictably enough, I ran out of time. That, and Doron Holan promised me that it was a waste of time, and who am I to argue with him. :-)

This stuff is going to have a massive impact on usermode software when it eventually happens. For more on the topic, there is a good list of podcasts over at Xerox PARC, including one by Herb Sutter of Microsoft.

Always use the latest WDK

Monday, March 19th, 2007

One of the first topics that came up at the MVP summit with the kit team was which DDK/WDK to use for which target OSes. The answer is simple: always use the latest released dev kit. You can build drivers for any supported OS with the latest WDK.

This is at variance with advice seen in various places (NTDEV, the release notes themselves (!), USENET), and there are people who even advocate adding the kit to source control so that a driver can be re-built at any time in the future in an identical configuration. Obviously, this has to work, since it was once the only way to do things. There are other concerns and problems with this approach, and we discussed several of them with the team.

But, for all that, the use of downlevel kits is now officially deprecated by Microsoft, so at this point, I think it makes sense to bite the bullet and upgrade kits for released drivers. There are good reasons to use newer kits as well, and regardless, I expect PSS will soon insist on troubleshooting against current kits.

NdisMRegisterDevice and Vista

Tuesday, February 20th, 2007

If you have an NDIS driver that calls NdisMRegisterDevice(), be aware of how installing under a Vista UAC elevation affects your device object’s symlink.

As Ken discusses at length, objects created by users are created, by default, in a session-local namespace. So, consider the NdisMRegisterDevice() documentation:

SymbolicName
Pointer to an NDIS_STRING type containing a Unicode string that is the Win32-visible name of the device being registered. Typically, the SymbolicName has the following format: \\DosDevices\\SymbolicName.

If you follow this advice, your driver will create its symbolic link in whatever session happens to be current. During boot-up, the right thing will happen, but during initial installation of your driver, it will be created in the namespace of whoever is doing the installing. On downlevel operating systems, this was OK too, since the person who is most likely to be running the Win32 app to connect to the device is also the user who did the installing.

But under Vista, things are different. To do the installation, users are prompted to elevate, and the link creation ends up happening in the wrong place.

The solution is to follow the advice in the WDK topic Local and Global MS-DOS Device Names:

A driver that must create its MS-DOS device names in the global \DosDevices directory can do so by creating its symbolic links in a standard driver routine that is guaranteed to run in a system thread context, such as DriverEntry. Alternatively, the global \DosDevices directory is available as \DosDevices\Global; drivers can use a name of the form \DosDevices\Global\DosDeviceName to specify a name in the global directory.

So, since binary-compatible drivers (9x) are a thing of the past, and since virtually every caller of NdisMRegisterDevice() will be in a user’s context during installation (until the first reboot), it’s safest to change to always using \\DosDevices\\Global\\<devicename>.

Subverting Patchguard v2

Monday, January 15th, 2007

It looks like Ken got bored again recently, which is always bad news for Patchguard. His Subverting Patchguard v2 paper is fantastic, again. In case you missed it, his (and Matt’s) Bypassing Patchguard on Windows x64, covering v1, is a fantastic read.

If you’re lost, this knowledge base article has the background.

Path MTU

Tuesday, December 19th, 2006

Anyone who has implemented a VPN has probably had to deal with MTU issues (unless you use a managed service). We’ve had code in our products for years to handle various MTU-related cases, and I’m going through some of it now. In double-checking our implementation, I took a peek at the relevant RFC, RFC 1191. In the introduction can be found this wonderful prediction:

It is expected that future routing protocols will be able to provide accurate PMTU information within a routing area, although perhaps not across multi-level routing hierarchies. It is not clear how soon that will be ubiquitously available, so for the next several years the Internet needs a simple mechanism that discovers PMTUs without wasting resources and that works before all hosts and routers are modified.

Woulda been nice…

At any rate, this is one of those IP features that is often handled incorrectly by firewalls. If you don’t allow Datagram Too Big messages (ICMP Type 3, Code 4) through, you are effectively killing performance (or even connectivity) for any paths with a Path MTU of less than your NIC’s MTU.
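As a rough illustration of what a firewall has to recognize, here is a minimal sketch that picks Datagram Too Big messages out of raw ICMP bytes. The function names are made up for this example; the header layout (type, code, checksum, 16 unused bits, 16-bit next-hop MTU) is the one RFC 1191 describes. It is not a complete ICMP parser and does not verify checksums.

```python
import struct

ICMP_DEST_UNREACHABLE = 3   # ICMP Type 3
ICMP_CODE_FRAG_NEEDED = 4   # Code 4: fragmentation needed and DF set


def parse_frag_needed(icmp_bytes):
    """Return the next-hop MTU if this is a Datagram Too Big message, else None.

    Hypothetical helper: expects the ICMP message starting at its type byte.
    """
    if len(icmp_bytes) < 8:
        return None  # too short to hold the fixed ICMP header
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_bytes[:8]
    )
    if icmp_type == ICMP_DEST_UNREACHABLE and code == ICMP_CODE_FRAG_NEEDED:
        # Routers that predate RFC 1191 leave this field as 0.
        return next_hop_mtu
    return None


def firewall_should_pass(icmp_bytes):
    """Hypothetical filter rule: never drop Path MTU Discovery feedback."""
    return parse_frag_needed(icmp_bytes) is not None
```

A real filter would also want to check that the embedded original header matches a flow it knows about, which this sketch leaves out.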

This is becoming more important: tunnel interfaces are proliferating, and as security consciousness increases, we’re only going to see more of this. For non-TCP-based tunneling protocols (IPSec, PPTP, GRE, L2TP, L2F, …) the carried packets can be no larger than the interface MTU minus the tunnel overhead, so if you add, say, 100 bytes of overhead per packet on a typical 1500-byte Ethernet interface, you wind up with a tunnel interface MTU of only 1400 bytes. Unless you want to have tons of frags, you have to listen for and handle Path MTU messages.
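The arithmetic is simple enough to sketch. The helpers below are illustrative only and assume a 1500-byte Ethernet link MTU, plain IPv4, and a 20-byte IP header with no options; the 100-byte overhead is the example figure from above, not a property of any particular protocol.

```python
import math


def tunnel_mtu(link_mtu, overhead):
    """MTU left for carried packets once tunnel headers are added."""
    return link_mtu - overhead


def fragment_count(ip_packet_len, path_mtu, ip_header_len=20):
    """Number of IPv4 fragments needed to fit a packet through path_mtu.

    Assumes a fixed-size header; every fragment except the last must carry
    a payload that is a multiple of 8 bytes.
    """
    if ip_packet_len <= path_mtu:
        return 1
    payload = ip_packet_len - ip_header_len
    per_fragment = (path_mtu - ip_header_len) // 8 * 8
    return math.ceil(payload / per_fragment)
```

With these assumptions, `tunnel_mtu(1500, 100)` is 1400, and `fragment_count(1500, 1400)` is 2: a full-size packet that hits the tunnel becomes two packets on the wire.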

There are two common symptoms to path MTU problems:

  1. Connection speeds are bad and/or highly variable, caused by the fact that routers have to bounce your packets off of the line card and up to the CPU to fragment them (much slower), and by the fact that you’re doubling the number of packets to send. This also introduces the possibility of out-of-order delivery and the consequent fragment queuing. And to top everything off, fragmentation paths are never as well tested as non-frag paths, so there may be OS-level perf or security bugs you’re going to tickle.
  2. Connections suddenly quit working. While there are sometimes other explanations for this, a big cause of this problem is dropped datagrams due to MTU. TCP (usually) sets the DF bit on outgoing segments. If a router receives a segment that’s too big for the egress interface, and that segment has the DF bit set, it has no choice but to drop it. It then sends back its PMTU ICMP message, but the firewall drops that message, so the sender simply thinks that the receiver has dropped off the face of the earth. Retransmissions happen and eventually the sender gives up. This generally happens with the first big packets in a session, such as (for example) when a terminal services client is painting the desktop.
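The second symptom can be reproduced in a toy model. This is only a sketch of the behavior described above, not real TCP: the hop list, the retry limit, and the function names are all invented for the illustration.

```python
def send_with_df(packet_len, hop_mtus, icmp_blocked):
    """Model one DF-marked packet crossing a path of hops.

    Returns ("delivered", None), ("too_big", mtu) when the sender hears the
    router's ICMP message, or ("lost", None) when a firewall ate it.
    """
    for mtu in hop_mtus:
        if packet_len > mtu:
            # The router must drop the packet because DF is set.
            if icmp_blocked:
                return ("lost", None)
            return ("too_big", mtu)
    return ("delivered", None)


def transfer(packet_len, hop_mtus, icmp_blocked, max_tries=5):
    """Retry until delivered, shrinking the packet only if told to."""
    size = packet_len
    for attempt in range(1, max_tries + 1):
        outcome, mtu = send_with_df(size, hop_mtus, icmp_blocked)
        if outcome == "delivered":
            return ("delivered", size, attempt)
        if outcome == "too_big":
            size = mtu  # sender lowers its path MTU estimate
        # On "lost" the sender learns nothing and retransmits the same size.
    return ("gave_up", size, max_tries)
```

On a path of `[1500, 1400, 1500]`, a 1500-byte packet gets through on the second try when ICMP is allowed, and never gets through when it is blocked. A 576-byte packet is delivered either way, which is why small exchanges at the start of a session work fine and the connection only hangs once the big packets start.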

So, if you’re an implementor of a firewall (or similar), pay attention to proper handling of Path MTU messages. And if you’re an end user and notice any of these symptoms, try removing firewalls and NAT devices and see if the problem disappears. It could be due to dropped PMTUD messages.

SMB don’t need no stinkin’ NAT

Monday, December 4th, 2006

Ken has a good write-up of a feature we found while working on a customer’s SMB networking issue: SMB is incompatible with (overload) NAT.

This is amazingly broken, and seems to me like an excellent opportunity for a DoS in a terminal server environment. Regardless, Ken’s walk-through of his debugging procedure is worth a read.

Don’t make a Rembrandt

Saturday, December 2nd, 2006

I can’t remember where I first heard the terms complexifiers and simplifiers, although it looks like they were originally coined in a blog post by Scott Berkun, but I’ve found them to be more and more useful in recent weeks.

My company is working on a Great New Service™ and we’re trying to finish up the first round of design. Virtually all of the intellectual energy I’m spending in these discussions is centered on factoring out complexity whenever possible.

Complexity is amazingly expensive:

  • It’s obviously more expensive to code.
  • It costs the business in terms of things like time-to-market.
  • It takes (much) longer to test.
  • You can wind up actually sacrificing the quality of the product you were trying to improve, by rushing the implementation process or short-changing QA (or, likely, both).
  • It presents a bigger attack surface.
  • It makes security architecture harder.
  • It is more prone to bugs.
  • It makes individual bug fixes harder.
  • It requires smarter/better/more experienced developers.
  • It makes it very hard to replace those developers.
  • It’s harder to communicate your marketing message to your audience.
  • It can negatively impact scalability, performance, and responsiveness to demand changes.
  • It can kill your ability to adapt your product to changing market conditions.

I have a rule related to this: a system’s overall design is limited to what the lead technical person can fit into her head. Once your system is too complex to fit between one person’s ears (at a sufficient level of detail), you have to start dividing the architecture and scaling the team in ways that have yet more expenses built in, in terms of design, management, and programmer interactions.

Put another way: You have a limited amount of complexity that can be spread across your product. Spend it carefully.

Simplification is difficult, because complexity is the default route of most bright people that I’ve met. It’s fun (and on a deeper level, very satisfying) to build complex things. That’s why simplification has to be a going-in design goal. When you re-state simplicity as a goal, the team’s creative talents can be applied to the problem of simplification, which makes everyone happy.

There are tons of examples of people battling with this question, ranging from this week’s Vista vs. OSX shutdown wars (more here and here and here) to more abstract designs like Networking Truth #12.

This point of view may be an oversimplification :-) or you may already be years into a shipping product, with few chances to apply this principle. But if you’re in the design stage of a new product or service, like I am right now, do yourself a favor and (as my grandfather used to say) don’t make a Rembrandt!

Why can’t you un-pend an IRP?

Thursday, November 30th, 2006

I was playing around with SDV and the pending bit the other day, and tried setting and clearing it in back-to-back lines in a dispatch routine. Having CSQ mark the IRP pending (which is automatic, if it succeeds at queuing the IRP) caused SDV to blow up with a very confusing error.

According to a PowerPoint slide by Adrian Oney, here’s the reason:

There is no IoUnmarkIrpPending because a driver above you can legally mark your stack location pending and return STATUS_PENDING

He goes on to say that PoCallDriver does this. News to me! I had always wondered why it was illegal (as opposed to simply immoral) to mark and un-mark an IRP as pending; it’s because in so doing, you would destroy the state of the driver above you who was depending on this.

Fifteen ways to turn off a laptop?

Tuesday, November 21st, 2006

Joel Spolsky is a bright guy. His latest blog posting about user interface design, Choices = Headaches (should that be == instead?), makes a case that I’ve been rolling around internally for our Next Big Product. Simplistically: sometimes too many choices are worse than you’d think.

The really amazing part is that he references some research done by Barry Schwartz, a professor at Swarthmore College. Amazingly enough, I just listened to a podcast of one of his lectures on this topic this week: Too Many Choices: Who Suffers and Why. It’s really an interesting lecture, and worth an hour of anyone’s time who is tasked with product design.

I continue to be amazed at how the Blogosphere™ cross-links itself in fascinating ways.