Wednesday 4 November 2009

The Line Endings Are Conspiring Against Me!

It's not paranoia if they're really out to get you!

At work we are gearing up for an important release of GATE. We released a beta last week so that we could get as much feedback as possible on the current build before making the final release. Unfortunately I managed to introduce a show-stopping bug into the build which rendered the default GATE application (ANNIE) unusable under Windows.

I'm pretty careful about testing code before I commit changes so I was mortified when it became obvious that the beta was badly broken. What follows is a verbatim copy of an e-mail I sent to the GATE group explaining how the bug was introduced and why it hadn't been spotted by any of our tests. The specific situation might not happen to many software developers but it's a good example of why we shouldn't rely only on automated tests for application stability.

I was very annoyed at myself for allowing the bug to slip into the beta build but at a complete loss as to exactly how it happened. I've been thinking about it some more and I now know how it happened and I think the reason is important enough to share with everyone.

This isn't an excuse for what happened but it should be a warning to everyone else that sometimes even when we are careful with tests etc we can get caught out. This will be quite a long e-mail as I want to fully explain the situation for those not familiar with the internals of the JAPE parser/compiler. If you want the take home message there is a simple summary at the bottom.

The code that fails was trying to split a block of Java code generated by parsing a JAPE file into separate lines and then return the line at which use of the deprecated annotations parameter had been spotted. The method I checked in looked like this:
public String getSource(int line) {
String[] lines = getSource().split(nl);
return lines[line-japeLine];
}
where nl is a class field that is initialized to System.getProperty("line.separator"). So this method should split a block of text into separate lines based on the platform specific line separator and then return a specific line from the resultant array. This should be safe as the code block is from the generated Java source, which uses the same nl field for separating lines.

I developed the code at home under Windows and it worked -- ANNIE ran with no sign of the exceptions reported against the beta. I checked the code in and a short while later Hudson checked it out, built it and tested it (including running ANNIE) under Linux (see here for the latest build/test results). Everything built and ran successfully.

At this point with it having been tested under Windows and Linux I was fairly happy that the code was stable (I was especially worried after Hamish had agreed to the change but had cautioned about adding new code so near to a release as statistically new code means new bugs - man was he right!).

Ian then built the beta release, publicized it on the mailing list and as we know ANNIE broke spectacularly for anyone trying it under Windows. The only question is why?

I'm willing to bet a beer/coffee to the first person with the time to checkout and build the beta from subversion (SVN) under Windows that it works. So what is different about the builds Ian pushed to the website?

The answer is that the line endings and subversion have conspired against me :(

When a JAPE source file is parsed the Java blocks that make up the RHS are added to the Java source code as is. And we use the platform specific line separator to build the rest of the source. When we check JAPE files into SVN we make sure that we set the svn:eol-style to native. So if you check out GATE from subversion all the ANNIE JAPE files have the native line endings and so the full Java source file that we eventually compile has the same line ending throughout and everything works. This is why ANNIE ran under both Windows and Linux when checked out of SVN.

When Ian built the betas I'm guessing that he did so under Linux (or on his Mac but for the purpose of this discussion that doesn't matter). So again the tests would run as the JAPE files would have Linux line endings and we would use the Linux line separator.

A user then downloads the beta from the website and tries running ANNIE under Windows -- it fails. The problem here is that the JAPE files in the beta builds have Linux line endings and we then use Windows line endings to assemble the code. I then use the Windows line separator again to split the code to get at single lines of source. Linux uses the single \n to represent the end of line while Windows uses the two character \r\n. So when I try and split code containing Linux line endings using the two character Windows line ending nothing happens and the array offset exception is thrown.

If we had built the beta under Windows then we may never have spotted the problem, as splitting on either platform using either line ending would have worked. The problem would have only arisen if someone created a JAPE file under Linux and then tried to use it under Windows without SVN in the middle, which conceivably might not have been until after the final release.

My fix was to simply change the offending method to
public String getSource(int line) {
String[] lines = getSource().split("\n");
return lines[line-japeLine];
}
which works on all platforms we support.

Looking back it's clear that there is no way we could have easily caught this using the automated tests. Fortunately we have caught the problem now rather than after the final release, yet I don't have any idea how we could stop similar problems occurring again in the future.

In Summary: The bug only appeared because the svn:eol-style was set to native, the beta release was built under Linux, and together this means we end up with a mixture of line endings when running under Windows. Had we built the installers under Windows the issue of running ANNIE would never have arisen but it would have bitten anyone editing JAPE under Linux then running under Windows (without SVN in the middle), which may not have happened until after the final release.

If anyone has any thoughts/suggestions on ways we could improve the testing to try and pick up such weird cases in the future please let me know and I'll try adding them to Hudson.

Sorry for both the bug and the long post but I thought it worth taking the time to explain how the problem arose so we can all (me especially) try and avoiding it happening again in the future,
4 November 2009 at 15:40 , Ian said...

Sounds like nl is a leaky abstraction.

Most people would have coded it the 'safe' way as you did. I wonder though whether there are other split string example in the code where just \n is already being used?

The only thing to take on board is the great advice "statistically new code means new bugs"!

4 November 2009 at 17:09 , Scriptor Senex said...

"Duh!" is about the most intelligent comment I can come up with.

4 November 2009 at 17:47 , Mark said...

Ian, just \n is being used in the same class. In fact it's being used in a method I wrote the day before the one that caused me all the problems.

I nearly always just assume \n will do as a line separator and then part way through coding I spotted the use of nl in the class and so decided to be consistent. Man was that a bad idea!

Post a Comment