Thursday, 12 February 2009


Diversion. An interesting product, and I'm curious whether, on the side, you could use it for some rough and simple social network analysis:

You can get the IDs of your Twitter followers, and of who you're following, in XML format, strip the markup, then use those IDs to find these folks' followers and who they're following.

The useful part would be to add a filter, to check whether they had at least one post containing a keyword, e.g. bioinformatics, Ruby or RoR. That would pare down the graph a bit, as I suspect it would get horrendous after a few links deep.
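A rough sketch of that filter idea, in Python. Everything here is an assumption for illustration: the `FOLLOWS` and `POSTS` dictionaries are hand-made stand-ins for whatever the XML calls return, and the function names are hypothetical. It walks the graph breadth-first, stops at a depth limit, and only keeps (and expands) people with at least one post mentioning a keyword.

```python
from collections import deque

# Hypothetical stand-in data: who each user follows, and their recent posts.
FOLLOWS = {
    "me": ["alice", "bob"],
    "alice": ["carol", "dave"],
    "bob": ["dave"],
    "carol": ["erin"],
    "dave": [],
    "erin": [],
}
POSTS = {
    "alice": ["Playing with Ruby on Rails today"],
    "bob": ["Lunch was great"],
    "carol": ["New bioinformatics course starts"],
    "dave": ["RoR deployment woes"],
    "erin": ["bioinformatics and Perl"],
}
KEYWORDS = ("bioinformatics", "ruby", "ror")


def mentions_keyword(user, posts=POSTS, keywords=KEYWORDS):
    """True if at least one of the user's posts contains a keyword.

    Naive substring matching: "ror" would also match "error".
    """
    return any(k in post.lower() for post in posts.get(user, []) for k in keywords)


def filtered_graph(start, max_depth=2, follows=FOLLOWS):
    """Breadth-first walk from start, keeping only edges to users who pass the filter."""
    edges = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached, do not expand further
        for other in follows.get(user, []):
            if not mentions_keyword(other):
                continue  # pruned: neither recorded nor expanded
            edges.append((user, other))
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    return edges


if __name__ == "__main__":
    for source, target in filtered_graph("me"):
        print(source, "->", target)
```

Pruning before expanding is what keeps the graph small: anyone who fails the filter never has their own follow list fetched at all.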

Wednesday, 11 February 2009

Could you help with a Perl project?

I was wondering if anyone might be able to help a bit with a Perl-based project I'm currently working on. Given some peptide fragments from Mass Spec experiments (including duplicate experiments on the same protein) and the gene ID, I'm looking to compare the peptides that were found with those that weren't picked up by Mass Spec but should have been created by trypsin digest, and so could potentially have shown up through MS.

Given a text file with the structure





I'm looking to analyse this: to compare the "seen" peptides with the "unseen" peptides, e.g. by physicochemical properties.
I'm wondering what properties might make certain peptides observable and others not.

I'm looking to create a Perl script to determine the properties of these 2 groups: the observed peptides and the unseen peptides. Then compare the properties of the two groups (isoelectric point, length, MW, amino acid composition etc).

Any pointers? Or an idea through some pseudocode? I can update the post and include the code so far.

I can use BioPerl, pepstats, emowse etc, I'd imagine. Or collate the physicochemical properties in an array and then export to R, but I'm happy to just get figures, then do visual analysis / some data chewing through Perl's graph tools.
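A rough first sketch of the comparison, written in Python rather than Perl purely to keep it short. The assumptions are loud ones: the digest uses the usual simplified trypsin rule (cut after K or R unless the next residue is P) with no missed cleavages, the example sequence is made up, all function names are hypothetical, and hydropathy (Kyte-Doolittle) is my stand-in for "etc". Isoelectric point is left to a proper tool.

```python
# Average residue masses in Da; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Kyte-Doolittle hydropathy values.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def trypsin_digest(protein):
    """Cut after K or R unless followed by P (no missed cleavages modelled)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def mol_weight(peptide):
    """Average molecular weight of a peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def gravy(peptide):
    """Mean Kyte-Doolittle hydropathy over the peptide."""
    return sum(HYDROPATHY[aa] for aa in peptide) / len(peptide)


def split_seen_unseen(protein, observed):
    """Partition the predicted tryptic peptides by whether MS observed them."""
    observed = set(observed)
    predicted = trypsin_digest(protein)
    seen = [p for p in predicted if p in observed]
    unseen = [p for p in predicted if p not in observed]
    return seen, unseen


def summarise(peptides):
    """Group means for a few properties; an empty group reports only n."""
    n = len(peptides)
    if n == 0:
        return {"n": 0}
    return {
        "n": n,
        "mean_length": sum(len(p) for p in peptides) / n,
        "mean_mw": sum(mol_weight(p) for p in peptides) / n,
        "mean_gravy": sum(gravy(p) for p in peptides) / n,
    }


if __name__ == "__main__":
    protein = "MKWVTFRPLLKAAR"   # made-up sequence, not real data
    observed = {"WVTFRPLLK"}     # pretend MS only saw this one
    seen, unseen = split_seen_unseen(protein, observed)
    print("seen  ", summarise(seen))
    print("unseen", summarise(unseen))
```

The per-peptide values, rather than just the means, would be the thing to export to R for plotting the two distributions against each other.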

Wednesday, 4 February 2009

But first, some Phylogenetics

We're happily getting in outside speakers on our course; a recent one was on Phylogenetics, given ably by Julia from GSK, QSci on 27th January. The date is important, in part because the New Scientist had this cover and story for the 21st. No mention of it in the lecture, but it does throw up some points.

Cover of 24 January 2009 issue of New Scientist magazine

"Darwin was wrong". The full article is here

Phylogenetics: what, how and when

So I'm listening and watching the presentation, and reading the article at the same time, and it felt at the time that more skepticism was needed. It's sometimes hard to gauge if the speaker means it when they say they'd like questions even during the talk. I'm all for asking questions, up to the point my coursemates get annoyed, and I get close to asking too many.

So go read the article linked above. The picture in the article is close to that used in the 1st main slide, regarding "Phylogeny - a brief history" (which stated "the display of inferred relationships as a tree can be traced back to Charles Darwin"). So that's inferred relationships. This is a model. There is no spoon.

The brief history covered:
- Origin of molecular biology techniques (immunological assays, electrophoresis, DNA hybridisation)
- Protein sequencing data available
- Data used to address questions regarding evolution
- Computers began to be used to compare sequences
- 1967: Fitch & Margoliash perform the 1st study using sequence comparisons to assess phylogenetic relationships of Cytochrome C sequences in different organisms

- DNA sequencing
- Mathematical algorithms formulated to understand sequences

- Contigs of sequence available
- Gene assignment based upon sequence homology
- Expansion of BLAST, alignment & phylogenetic methods

- Genomes
- Arrival of robust & fast sequencing methods
- Integration of complex mathematics into phylogeny (e.g. Bayesian)

~20 bullet points, but if you said Tufte to most scientists, they'd go "what?" rather than even "who?". You say Dipity, they say doo dah.

So does the "kernel" need a rewrite for Phylogenetics? Seemingly they're gunning for a tree, partly from historical usage, and partly because the models can't deal with other shapes yet (see also how some systems biology seemingly can't deal with feedback loops, which are kind of important in biological systems).

It made me think of Clay Shirky's write-up on ontology here (audio here) and more specifically the File Systems and hierarchy section here. Probably the drawings: seeing the "just links, there is no filesystem" set of pictures. There are plenty of articles about his talk, and lots of feedback on it e.g. here.

Seemingly, the system of phylogenetics was one of (knowingly) using a potentially shoddy model, then retrofitting it, tweaking it to what the phylogeneticists thought was right, then giving it just a light dusting of scientific-ness.

A tree shape, only bifurcations, and problems with what lengths of things mean. There are several problem areas, it seems. The problems of rooted vs unrooted trees. Molecular clocks? What happened to them? But descent is not exclusively vertical. Which causes problems, as the visualisation of the data, in the current way, can't show the complexity. Is this, in part, a data visualisation problem? Some things just can't be easily shown on a piece of paper in a journal.
Is phylogenetics having problems with its pigeonholing? Doolittle's view that the history of life can't be properly represented as a tree seems to resonate for me. Why not at least visualise the statistical fuzziness in the lines of a tree? Are all changes equal in effect? Another problem.

Some other dates that could have gone in there, courtesy of the New Scientist article:

- DNA sequences of bacterial and archaeal genes becoming available, not just RNA. At some points, RNA was saying A is closer to B, but DNA was saying A is closer to C.

- Unicellular archaea, previously thought of as bacteria, were an undiscovered major branch of the "tree of life"

So Horizontal Gene Transfer is the big furry elephant-sized problem in the phylogeneticist's room. When you've got people saying that Homo sapiens are an exception, there's a problem. When they're saying that in eukaryotes HGT is the rule rather than the exception, that's harder still. With bacteria, archaea and unicellular eukaryotes making up > 90% of life, and multicellular life just a small part of the world we live in, there's a problem. Also see endosymbiosis, and genome realignment, and presumably several other mechanisms that'll effectively make the tree a thicket. You're back to being an archaeologist, looking for genetic fossils, to actually pin some dates and sequences down.

Then add a soupçon of the assumptions:
- All mutations are independent
- All mutations can reverse to a previous state
- Mutation processes are consistent through time
- Mutations are not influenced by a previous mutation at that site
- Lineages arise in a divergent, tree-like evolution
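To see what assumptions like these buy you, here is a minimal sketch of the Jukes-Cantor distance, about the simplest substitution model going. This is my own illustration, not something from the lecture: it assumes independent sites, equal and reversible substitution rates, and a process that doesn't change over time, and under those assumptions it corrects the raw proportion of differing sites p to d = -3/4 ln(1 - 4p/3). The function names and the example sequences are made up.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance, in expected substitutions per site."""
    if p >= 0.75:
        raise ValueError("distance undefined: sequences look saturated")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    a = "ACGTACGTAC"  # made-up aligned sequences
    b = "ACGTACGTAA"
    p = p_distance(a, b)
    print(p, jukes_cantor(p))
```

The corrected distance is always a bit larger than the raw one, because the model assumes some sites have changed more than once. If the assumptions in the list fail, the correction fails with them.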

There was a fair bit on the methods at this point, which I'll add later. You've got to deal with 3rd base wobble (some changes have more or less meaning; I think she referenced information theory, but that's kind of hard to go over if you've not read up).

An unmentioned kicker: alignment is done on primary structure, not accounting for tertiary structure. So then, it seems, you're playing around with mutation rates as a window-sized average over the primary sequence, giving a level of likelihood of change along the sequence.
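As I understand the window idea, it amounts to something like the sketch below: slide a fixed-size window along a pairwise alignment and take the fraction of mismatched positions in each window. This is my guess at the gist, not the speaker's method; the function name, the window size and the sequences are all made up.

```python
def windowed_mismatch(seq_a, seq_b, window=5):
    """Fraction of mismatched positions in each sliding window along an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if window < 1 or window > len(seq_a):
        raise ValueError("window must be between 1 and the alignment length")
    diffs = [x != y for x, y in zip(seq_a, seq_b)]
    return [
        sum(diffs[i:i + window]) / window
        for i in range(len(diffs) - window + 1)
    ]


if __name__ == "__main__":
    a = "AAAAAAAAAA"  # made-up aligned sequences
    b = "AAAAACCAAA"
    print(windowed_mismatch(a, b, window=5))
```

Note what the window can't see: two residues far apart in the primary sequence but touching in the folded protein land in different windows, which is exactly the tertiary structure problem.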

So are the new methods the Emperor's new phylogenetic garb?
Have any tree spaces actually been fully calculated?
Is the list of alternative evolutionary processes actually the other way round, with the current ones being the "alternative", though currently in fashion, ones?

It seems there's potentially some confusion through looking at an organism at a gene by gene level, versus a genome level.

I wonder if this will come up on the exam.