Systems biology is, like, hard and stuff
All that said, I did get some useful insight into what it's like to do "systems biology" by taking a large data set and trying to analyze it computationally. Basically, I was working on using Bayesian networks [check out the author of the quote at that link -- who knew Michael Jordan was into statistics and graph theory ;-) ?] to analyze some data about how T cells react to certain chemicals. Bayesian networks allow you to come up with statements like "according to this data, A affects B, C and D, but not E; D affects E and G, but not F" and to put a number on how likely it is that your statement is correct. [I'm going to refer to such statements as "networks" in the rest of the post, because that's what they are: networks of interactions.] These sorts of models are nice in that they are human-interpretable, i.e. there are few enough interactions at each level that a human can look at them, make some sense of them and figure out what experiments to do next.
A lot of my time was spent massaging the data into the right format and getting some existing code to run, but I also did get a chance to generate a bunch of networks. The tricky bit, I found, isn't generating the model, it's figuring out how much you trust it and what to do with it. The problem is that trying to find the absolute "best" network is NP-hard, i.e. nobody knows a way to do it that doesn't take an insane amount of computer time once the network gets big. For example, if you only have 4 nodes in the network, there are over 500 possible networks, and if you have 10 nodes, there are on the order of a billion billion [yes, that's "billion" twice] possible networks. Since we had about 12 nodes, we couldn't do an exhaustive search. The only thing to do is follow what's called a "greedy" strategy, which means making up random networks and trying to optimize each one step by step. The more such networks you can generate, the better a chance you have of finding a good one -- it's basically a question of how much computer time you can afford to spend on the problem.
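If you want to see where numbers like those come from, here is a minimal sketch. It assumes that "possible networks" means directed acyclic graphs on labeled nodes, which is the usual structure for a Bayesian network and which matches the "over 500" figure for 4 nodes. The function name `count_dags` is mine, and the counting formula is Robinson's recurrence, not anything taken from the code I was actually running.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Count directed acyclic graphs on n labeled nodes.

    Uses Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k), with a(0) = 1.
    """
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


if __name__ == "__main__":
    # Print the size of the search space for a few network sizes.
    for n in (4, 10, 12):
        print(f"{n} nodes: {count_dags(n):.3e} possible networks")
```

The count grows so fast that each extra node multiplies the search space by thousands, which is why exhaustive search stops being an option somewhere well before 12 nodes.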
In this case, we didn't have much in the way of computing resources, so we'd run the search for a couple of hours, or overnight, and see what we got, and that's where it became tricky to figure out what to do. Our runs produced networks that didn't look much like the interactions that are known from the actual biological literature, i.e. we had things supposedly affecting each other that were, uhm, "unsupported by the evidence", so to speak. Now, part of the appeal of "machine learning" procedures like Bayesian networks is that they're supposed to be able to uncover relationships that are there but have never been noticed before, so the question became "Geez, the network we generated says these two things are related, but there's nothing in the literature about it, should we go do some experiments to figure out whether they actually are related?". In order to answer that question, you have to know how much you trust the generated network, i.e. how likely it is that it's actually showing you something real that's worth investigating further. But, from an abstract perspective, you don't really know how good the network is [i.e. how much you should trust it] because it could well be that your random search only turned up crappy networks and there are much better ones out there. You could, of course, say that you won't trust any network that doesn't show you what's already known from the literature, but if you'll only trust a network that tells you what you already know, what's the point of doing all this in the first place?
In our case, the question "Should we trust this network?" wasn't really that hard to answer -- we knew that we hadn't run the search for nearly long enough and that a lot of the networks we generated were just pure bunk from a biological perspective, so there was no need to rush to the lab bench and try some more experiments. If I had to do this for real, I think the strategy I would adopt would be a combination of using the known biology to restrict the search space somewhat [without totally removing the ability to uncover unknown connections] and running the search for a long time [on the order of weeks, probably].
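To make that strategy a bit more concrete, here is a rough sketch of a greedy search with random restarts over a restricted set of edges. Everything in it is illustrative and is not the code I actually ran: the function names, the `allowed` set standing in for "known biology", and especially the toy `score` function, which just rewards agreement with a made-up "true" network. A real run would score each network against the data instead.

```python
import random
from itertools import permutations


def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while ready:
        node = ready.pop()
        seen += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return seen == len(nodes)


def neighbors(nodes, edges, allowed):
    """Yield every acyclic network one allowed edge addition or removal away."""
    for edge in allowed:
        candidate = edges ^ {edge}  # toggle: remove if present, add if absent
        if edge in edges or is_acyclic(nodes, candidate):
            yield frozenset(candidate)


def hill_climb(nodes, start, score, allowed):
    """Greedy step: keep moving to the best-scoring neighbor until none is better."""
    current, current_score = start, score(start)
    while True:
        best, best_score = current, current_score
        for candidate in neighbors(nodes, current, allowed):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best == current:
            return current, current_score
        current, current_score = best, best_score


def random_network(nodes, allowed, rng, density=0.3):
    """Build a random acyclic starting network from the allowed edges."""
    edges = frozenset()
    for edge in rng.sample(sorted(allowed), len(allowed)):
        if rng.random() < density and is_acyclic(nodes, edges | {edge}):
            edges = edges | {edge}
    return edges


def search(nodes, score, allowed, restarts=20, seed=0):
    """Run hill climbing from several random starts and keep the best result."""
    rng = random.Random(seed)
    best, best_score = frozenset(), score(frozenset())
    for _ in range(restarts):
        start = random_network(nodes, allowed, rng)
        found, found_score = hill_climb(nodes, start, score, allowed)
        if found_score > best_score:
            best, best_score = found, found_score
    return best, best_score


# Toy example (all names and edges below are made up for illustration).
nodes = ["A", "B", "C", "D", "E"]
true_edges = frozenset({("A", "B"), ("A", "C"), ("C", "D"), ("D", "E")})
forbidden = {("E", "A"), ("D", "A")}  # pretend biology rules these out
allowed = set(permutations(nodes, 2)) - forbidden


def toy_score(edges):
    """Hypothetical score: +1 per correct edge, -1 per spurious edge."""
    return len(edges & true_edges) - len(edges - true_edges)


best, best_score = search(nodes, toy_score, allowed, restarts=5)
print(sorted(best), best_score)
```

The toy score is deliberately easy, so every restart lands on the same answer. With a real data-driven score the landscape is bumpy, different restarts get stuck in different places, and that is exactly why the number of restarts you can afford matters so much.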
Of course, all this assumes that the actual data you're using is good, i.e. that everything was measured correctly, that you've actually represented the input data to your algorithm properly, etc. If that's not the case, it may well be that you run your search for years and never get anything that makes any sense. Hopefully there are some initial sanity checks you can do before wasting time and money on an extensive computer run.
So, all in all, I definitely learned something about the pains and perils of large-scale biology. My next rotation is going to be back to the small scale [in the figurative sense]: it'll be with Tom Knight [a well-known MIT computer scientist turned biologist], working on cell-to-cell communication in bacteria.