Archive for January, 2006

Learning languages using the 80-20 rule

Tuesday, January 31st, 2006

The 80-20 rule is amazingly useful. As someone who occasionally tries to learn new languages (human languages, in this case), I had always wondered how good language textbooks really are at guessing which words and concepts to introduce in what order. So tonight, I got tired of wondering and wrote a little program to analyze word frequencies.

My latest language learning project is Latin; I’ve tried a few times and haven’t really gotten there. Seems like it’s time to try again. Which words should I learn first, though?

To find out, I downloaded the entire New Testament of the Vulgate (the Latin version of the Bible) from Wikisource. The Vulgate is relatively straightforward, as Latin goes, and has been in active use by the Church for over 1500 years. Plus, it’s in a consistent style (for the most part), which helps the analysis.
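The counting itself is simple. What follows is a minimal Python sketch of this kind of analysis, not the Perl script mentioned in this post. The function names `count_words` and `report` are made up for illustration, and the tokenizing rule (a word is any run of letters, with case kept as-is) is an assumption rather than a description of what the original script does.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter of word frequencies, keeping each word's spelling as-is."""
    # Assumption: a "word" is any run of letters; punctuation and digits
    # (such as verse numbers) are treated as separators.
    words = re.findall(r"[^\W\d_]+", text)
    return Counter(words)


def report(counts, n):
    """Build a text summary of the n most common words in counts."""
    top = counts.most_common(n)
    instances = sum(count for _, count in top)
    distinct = len(counts)
    total = sum(counts.values())
    lines = [
        f"Most common {n} words",
        f"These {len(top)} distinct words ({instances} instances) account for:",
        f"  {100 * len(top) / distinct}% of {distinct} distinct words",
        f"  {100 * instances / total}% of {total} total words",
    ]
    lines += [f"{word} ({count})" for word, count in top]
    return "\n".join(lines)


if __name__ == "__main__":
    # Tiny made-up sample, only to show the shape of the output.
    sample = "et in et non et in est"
    print(report(count_words(sample), 2))
```

Feeding it the text of a whole book instead of the made-up sample would give a report in the same general shape, although the exact counts depend on how words are split.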

If I restrict the program to the top 10 words, by frequency, in the book of John, I get this:

Most common 10 words
These 10 distinct words (3085 instances) account for:
  0.390472471690746% of 2561 distinct words
  21.9323190672544% of 14066 total words
et (898)
in (377)
non (307)
quia (258)
est (235)
me (213)
qui (207)
autem (201)
Iesus (199)
eum (190)

Those stats are pretty impressive at a glance: about 0.4% of the distinct words make up almost 22% of the total words in the work. So I started wondering where the 80-20 rule cutoff is; what does it take to get 80% of the total words represented? After some playing around, I came up with this:

Most common 561 words
These 561 distinct words (11254 instances) account for:
  21.9055056618508% of 2561 distinct words
  80.0085312100099% of 14066 total words
et (898)
in (377)
non (307)
quia (258)
est (235)
me (213)
...

So, the 80-20 rule is amazingly close here: about 22% of the distinct words make up 80% of the word instances. But this is a pretty small sample size; what does the whole New Testament look like?

Most common 10 words
These 10 distinct words (26230 instances) account for:
  0.0579777365491651% of 17248 distinct words
  19.6052051333797% of 133791 total words
et (9404)
in (4449)
est (2195)
autem (2128)
non (1976)
qui (1897)
cum (1125)
ut (1095)
ad (1066)
enim (895)

In this case the top 10 words still account for about 20% of the total word instances. The 80% cutoff for instances is even more dramatic, though:

Most common 2641 words
These 2641 distinct words (107032 instances) account for:
  15.3119202226345% of 17248 distinct words
  79.999402052455% of 133791 total words
et (9404)
in (4449)
est (2195)
autem (2128)
non (1976)
qui (1897)
...

In this case, you only have to know 15% of the distinct words to get to 80% of the instances, but that 15% is a lot more words: 2,641 for the whole New Testament versus 561 for John alone. Any way you slice it, though, my Latin textbook (Wheelock’s, which I understand is quite popular at the university level) is only in partial alignment with this word list at best. I wonder how much this would change if I threw Caesar or Cicero at it.
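The 80% cutoffs above were found by playing around, but the cutoff can also be computed directly by walking down the frequency list and keeping a running total. Here is a minimal Python sketch; the function name `words_needed` is hypothetical, and it assumes the counts are already available as a mapping from word to count.

```python
from collections import Counter
from itertools import accumulate


def words_needed(counts, target=0.80):
    """Return how many of the most common words cover at least `target` of all instances."""
    total = sum(counts.values())
    # Frequencies sorted from most common to least common.
    freqs = sorted(counts.values(), reverse=True)
    # accumulate() yields the running total of instances covered so far.
    for n, covered in enumerate(accumulate(freqs), start=1):
        if covered / total >= target:
            return n
    return len(freqs)


if __name__ == "__main__":
    # Made-up counts, only to show the call.
    sample = Counter({"et": 6, "in": 3, "non": 1})
    print(words_needed(sample))
```

This returns the smallest number of words that reaches the target, so it can land one word away from a hand-picked cutoff that stops just short of 80%.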

Don’t take this too seriously (Lies, Damn Lies, and Statistics), because this is bound to be wrong for a number of reasons. One excellent reason is the highly inflected nature of Latin, meaning that words change by just a few letters very often. There are something like 15 different endings for the adjective magnus, for example, depending on the word’s function in a sentence.
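To make the inflection problem concrete, here is a small Python sketch. A plain word counter sees every spelling of magnus as a separate word, so the adjective's total is split across many entries. The list of forms is written out by hand (positive degree only, macrons ignored), `merge_forms` is a hypothetical helper, and the counts in the usage example are invented. Doing this for a whole text would need a real Latin lemmatizer, which is not shown here.

```python
from collections import Counter

# Distinct spellings of the Latin adjective "magnus" across its
# gender, number, and case forms (positive degree, macrons ignored).
MAGNUS_FORMS = {
    "magnus", "magne", "magni", "magno", "magnum", "magnorum", "magnis",
    "magnos", "magna", "magnae", "magnam", "magnarum", "magnas",
}


def merge_forms(counts, forms, lemma):
    """Return a new Counter with every spelling in `forms` folded into `lemma`."""
    merged = Counter()
    for word, count in counts.items():
        merged[lemma if word in forms else word] += count
    return merged


if __name__ == "__main__":
    # Invented counts, only to show the effect of merging.
    raw = Counter({"magnus": 2, "magna": 3, "magnum": 1, "et": 5})
    print(merge_forms(raw, MAGNUS_FORMS, "magnus"))
```

With the invented counts, the three spellings of magnus collapse into one entry with 6 instances, which would move it past "et" in the ranking.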

I’m curious to try this on French and maybe English, which are progressively less inflected than Latin. I’d expect the results to be even more dramatic, although I’ll have contractions to contend with.

Anyway, if you’re curious, my little perl script is posted here.

More on stack overflows

Monday, January 30th, 2006

The conversation continues regarding the stack overflow article I posted yesterday.

Quoth Matt:

The attack I was referring to was for a conventional stack overflow (return address overwrite). It’s still possible (if the stack is laid out properly) to leverage a stack overflow in the same frame that the overflow occurs. I just simplified for the conventional scenario :)

…and Ken Johnson, another Positive coder, adds:

It’s also worth pointing out that you can make the stack grow in the opposite direction on x86, just that in practice almost nobody uses that feature.

Really, a better idea is to just switch to a more modern architecture with a better designed calling convention instead of switching which way the stack grows. x64 and IA64 don’t have SEH overwrite vulnerabilities on Windows, for instance. Additionally, as I recall, IA64 doesn’t even store the return address on the primary stack, which makes it significantly harder to gain flow control through a stack overflow (though certainly not impossible if there are things like function pointers on the stack).

I’m kind of surprised that Apple didn’t just go to x64 directly and skip x86-32 entirely.

I hadn’t thought about going directly to x64 before, but now that Ken mentions it, it does make me wonder. One of the main drivers for Apple’s Intel switch was their inability to get a fast PPC chip into the laptops, due to power and heat issues. I wonder if Apple had decided that x64 wasn’t going to be ready in time with sufficiently low-power chips.

The DOJ and Google

Sunday, January 29th, 2006

Steve Rubel has an interesting article about a hypothetical DOJ breakup of Google on his blog. Among other things, he speculates that Google may have to compromise its principle of “don’t be evil” in the face of pressure to justify the market’s amazing faith in its business[1].

My question is this: isn’t Google’s unique corporate structure supposed to help insulate it from this sort of thing?

[1] Interestingly, Google’s business declined in value this week, partially as a reaction to Yahoo’s bad news. It seems that they lost the entire market cap of Amazon.com in one week. Sheesh.

Security implications of Apple’s Intel move

Sunday, January 29th, 2006

Just ran across this eWeek article: Apple’s Switch to Intel Could Allow OS X Exploits. My first reaction was that this is bogus, but I guess there is a good point to be made – there are a lot more people out there who are comfortable with the x86 architecture than with the PPC architecture, at a machine code level. That matters.

The thing that might matter more, though, is endianness. Matt Miller, a co-worker at Positive Networks and frequent security presenter at various hat-related conferences, has pointed out to me that if stacks grew in the opposite direction (i.e. big endian vs. little endian), stack overflow attacks that overwrite return addresses are more difficult. PPC is big endian, and x86 is little endian. That doesn’t do anything for other kinds of buffer overflow attacks (or any other kind of attack for that matter), but ret-based attacks are common. (I don’t know anything about PPC assembly, and Matt was talking about SPARC at the time, but I assume this still makes sense.)

One area in which Microsoft really shines is the base OS. Apple’s kernel has some architectural issues that it inherited from its open-source ancestry, and it still lacks support for things like DEP.

At any rate, it will be interesting to see what happens. It’ll be hard to tell the real truth, though, if Apple really does sell a lot more Macs.

Update: Matt sent along this clarification regarding stack-based attacks on big endian architectures:

Stack-based overflows are perfectly possible on big endian architectures. Here’s a document that describes how.

The distinction revolves around the frame to which the buffer belongs to. You need one level of nesting in order for it to be possible (due to the direction the stack is growing).

x64 signing FAQ

Saturday, January 28th, 2006

Microsoft has posted a FAQ discussing the recently announced code signing requirement for x64 drivers. I’m not sure it contains anything too new and interesting, though, if you’ve been following NTDEV.

New DDKBUILD

Saturday, January 28th, 2006

‘Tis the season for new tools. Mark Roddy has posted version 3.13 of DDKBUILD on his website. Download from http://www.hollistech.com/Resources/ddkbuild/ddkbuild3_13.zip.

From Mark’s announcement:

The latest released version is 3.13. This version adds quiet mode – the script attempts to reduce all output to stdout to the minimum, simplifying the use of ddkbuild in automated build procedures. Credit goes to Beverly Brown at Mercury Computer for the quiet mode implementation. The Vista DDK (build 5270) is supported. WDF (KMDF1.0) is released and 3.13 supports that as well. DDK’s going back to the Windows 2000 DDK continue to be supported. Visual Studio 2005 is supported.

In addition a lot of bugs have been fixed. Bug fixes were suggested by Daniel Germann, Thomas Schimming, David Craig, Norman Diamond and probably other people I forgot about.

New WinDbg release

Friday, January 27th, 2006

There’s a new WinDbg release in the world, released yesterday, as version 6.6.03.5. Download the 32-bit version here.

Engineering lessons from Challenger

Friday, January 27th, 2006

I remember watching the Challenger explode on live television. I was in third grade and had just returned from the cafeteria, where I had gone to pick up my lunch. We were eating in our classrooms so that we could see the live shuttle launch. There was a lot of discussion about Christa McAuliffe getting to go into space. We were all stunned when it blew up – I had a hard time believing it wasn’t just a stunt of some sort.

After the accident, Richard Feynman wrote an appendix to the government report discussing the reasons for the crash. It is recommended reading for anyone who works in an engineering field, including software engineering. It’s short and to the point, and as is usually the case with Feynman’s stuff, it’s a great read.

Happy birthday Wolfgang

Friday, January 27th, 2006

Today is the 250th birthday of Wolfgang Amadeus Mozart. So I’m curious: does that strike you as “Wow, he’s old”, or as “Wow, he’s not very old”? I’m in the latter camp, myself. It’s amazing to me that he was writing during the American Revolution, which seems relatively recent to me from a world history perspective. It’s also amazing to me that he only pre-dated Jane Austen by a few decades.

I just got the urge to go rent Amadeus.

PlugFest ‘06

Thursday, January 26th, 2006

Microsoft has announced the 2006 edition of IFS PlugFest:

The Microsoft Filesystems and Filters team is pleased to announce IFS Plugfest #15. This event is scheduled to be held between April 24 and 28, 2006 at the Microsoft Campus in Redmond, WA.

If your product is using file system filters technology, this is one of the premier events where you can directly interface with the filter team from Microsoft. You also get to test for interoperability scenarios with other companies using filter drivers in their products. Please visit the Plugfest web-site to find out more about this event.

http://www.microsoft.com/whdc/driver/filterdrv/IFSPlugfest.mspx

Registration for this event is now open. http://whdc.microsoft.com/plugfest/registration.aspx

Due to the increasing number of attendees to this event every year, we urge all those interested to register as early as possible to secure a seat. A separate email will be sent out to confirm your registration and give your more information about the details of the event.

Thanks,
Microsoft Filesystems and Filters team.

A great reason to go to PlugFest is just to meet other filesystems developers. Filesystems have uniquely challenging interoperability issues, and PlugFest is a great way to build up a rolodex for the times when those inevitable interop problems show up.