Monday, August 08, 2005

Challenges of bio software

A set of conversations I've had recently with the people doing computational work in my lab has set me thinking about the fact that, as biologists use and write more and more software, they're going to have to face [at least] two issues that come with it: maintainability and correctness.

The "maintainability" issue takes many forms -- can anybody other than the person who wrote the code understand it without spending weeks trying to puzzle it out [before giving up in frustration and writing his/her own version]? Should there be minimal standards for things like the number of comments, variable names, function descriptions, etc.? As the grizzled veteran of many "coding standards" religious wars [e.g. Hungarian notation or not? K&R style? Tabs or spaces? ... the list goes on], I'm well aware of how difficult it is to get people to agree on, and stick to, such standards even in a company whose main output is software. Given that, I expect the task of trying to get [non-CS] grad students, those wild, rebellious and free spirits ;-), to follow a strict coding standard to be roughly akin in difficulty to trying to get 10 cats into a filled bathtub -- somebody's gonna get hurt.

Correctness is an even tougher nut to crack. Beyond the obvious "software is hard", a couple of other field-specific reasons come to mind:

- When examining conclusions drawn from experimental work, biologists are trained to ask "What's the control experiment?", i.e. what's the experiment that shows that the data being presented results from the effect being tested, and not from some other cause. From what I've seen, though, results based on computational work tend not to be dissected as thoroughly -- half the audience tunes out during the section of a paper where the authors start talking about p-values, cross-validation via data shuffling, the differential equations behind the model, etc. Maybe that will change as the next generation of biologists gets more computational training.

- In many labs, the grad student or post-doc is the person entrusted with both writing the code and making sure it's correct. However, very few people have the discipline to really test their own code thoroughly. It's an issue the software industry has long grappled with, and will continue to grapple with, recent methodologies like Test-Driven Development notwithstanding; some [many?] software companies even have teams dedicated to doing nothing except trying to break code written by other people. And yet commercial software isn't exactly bug-free; so, even assuming that most bio grad students aren't writing horribly complex programs, chances are pretty good that there are lots of bugs lurking in their code that will never be found.
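For what it's worth, the "data shuffling" mentioned above is about as close as computational work gets to a control experiment. Here's a minimal sketch of the idea, assuming a simple two-group comparison: shuffle the group labels many times and see how often the shuffled data produces an effect as large as the real one. The function names and parameters are my own, purely for illustration, and a real analysis would likely want a more careful statistic.

```python
import random


def mean_difference(values, labels):
    """Mean of the values labelled 1 minus the mean of the values labelled 0."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(group1) / len(group1) - sum(group0) / len(group0)


def shuffled_label_p_value(values, labels, n_shuffles=2000, seed=0):
    """Estimate how often randomly reassigned labels give an effect at least
    as large as the observed one (a two-sided permutation test).

    n_shuffles and seed are illustrative defaults, not recommendations.
    """
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)  # copy, so the caller's labels are left alone
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # group sizes stay the same, membership changes
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # The +1 counts the observed labelling itself, so the estimate is never 0.
    return (hits + 1) / (n_shuffles + 1)
```

If the shuffled data produces the "effect" about as often as the real data does, the effect probably isn't coming from the thing you think it's coming from.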

The obvious responses are "Ok, so what if the software isn't totally right and has a few bugs in it? It's not like we're trying to write production-quality software" and "Biologists have enough subject-specific knowledge that they can spot anomalous results produced by the software". That's true when you're talking about egregious errors, like, say, a simulation that predicts that your little bacterial cell will grow to the size of an elephant and be able to digest a car, in which case you may want to check your boundary conditions.

Where the argument falls down is when you're trying to extract information [that you don't know in advance] from a big pile of noisy data -- beyond isolated spot checks, how do you know that your code is actually doing the right thing all the time without doing some fairly elaborate testing? If your code is indeed doing something wrong, but only some of the time, you might end up missing something significant, so it's not a problem you can just ignore. Or, rather, it's not a problem you should ignore, yet I suspect that's what happens most of the time [in academia, at least].
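One way to go beyond spot checks, sketched here under some loud assumptions: generate synthetic data where you planted the answer yourself, bury it in noise, and count how often the analysis code digs it back out. Everything below is hypothetical; `find_peak_position` is just a stand-in for whatever your real analysis does, and the signal shape, noise level and tolerance are made-up numbers.

```python
import random


def find_peak_position(signal, window=5):
    """Hypothetical analysis step under test: index where the moving
    average of the signal is highest."""
    half = window // 2
    best_index, best_value = 0, float("-inf")
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        smoothed = sum(chunk) / len(chunk)
        if smoothed > best_value:
            best_index, best_value = i, smoothed
    return best_index


def make_noisy_signal(peak_at, length=200, height=10.0, width=3,
                      noise=1.0, rng=None):
    """Gaussian noise with a flat bump of the given height planted
    around index peak_at."""
    if rng is None:
        rng = random.Random(0)
    signal = [rng.gauss(0.0, noise) for _ in range(length)]
    for i in range(max(0, peak_at - width), min(length, peak_at + width + 1)):
        signal[i] += height
    return signal


def recovery_rate(n_trials=200, tolerance=3, seed=0):
    """Fraction of random trials in which the planted peak is found
    to within `tolerance` positions."""
    rng = random.Random(seed)
    recovered = 0
    for _ in range(n_trials):
        peak_at = rng.randrange(10, 190)  # keep the bump away from the edges
        signal = make_noisy_signal(peak_at, rng=rng)
        if abs(find_peak_position(signal) - peak_at) <= tolerance:
            recovered += 1
    return recovered / n_trials
```

The interesting part comes when you turn the noise up or the bump height down and watch where the recovery rate starts to drop; that tells you which results from the real data you should and shouldn't believe.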

I don't have any solutions to the problem of correctness [other than encouraging people to test their code thoroughly =)], but I suspect that computational biologists could learn something from how the high-energy physics and astronomy communities, both of which also generate lots of data that they then have to filter, deal with this.

