Monthly Archives: September 2007

The Rich PDB interface explained

A while back I caught the video on the PDB site that explained all the functionality of its search interface. Thanks to the screencast I became a much more efficient querier of the PDB, especially after they adopted the new (now almost three years old) interface.

I strongly believe that screencasting can play a role in helping us all search better.

Since I work on crystallizing membrane proteins, I found the MPDB very useful and decided to screencast its features.

I sincerely hope that database creators and users alike take to this effective medium and screencast their tips and tricks for us all to benefit from.


Why are our bioinformatics workflows so complicated!

Last week, to answer one question, I had to resort to information from several sources. Many of them contributed immense value to my “workflow”, and they ranged from very difficult to very easy to perform. For a start, I have ranked each step by both value (1 for no value to 10 for a lot of value) and ease of use (1 for very complicated to 10 for very easy).

# Assembling my sequences in DNAstar (Value 10 : Ease 7)

# Compiling my sequences and pulling them into Jalview. I ran the CLUSTALW web service on the edited alignments and realized that all of my clones had basically two sequences for their CDRs. Jalview's excellent CLUSTALW web-service interface allowed me to quickly edit the 32 sequences, align them interactively and realize they belonged to two types. This got me thinking that the primers I used to clone the CDRs from my mouse kappa light chains were probably mis-priming (Value 10 : Ease 9)

# Use PubMed to look at precedents, i.e. analyze all possible papers that had sequenced the mouse antibody kappa light chain CDR region as I had attempted to do, and derive the sequences of the primers they had used. It took forever to find the right keywords to query, and I still have only three kappa light chain primer sequences. And they are all different! (Value 10 : Ease 1)

# Use my primer sequences, compare them with the literature and figure out how I had mis-primed and why my sequences were all one of two types (still in progress; Value immense : Ease 1, i.e. still difficult to do)

# Use PubMed / NCBI Genome to understand the sequence space for mouse kappa light chains (Value 10 : Ease 4)

# Use EBI to get the same sequence data (Value 10 : Ease 8)
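The scores above can be sorted to show where the workflow hurts most. This is only a rough sketch: the step labels are shortened by me, the "immense" value is treated as 10 (my assumption), and the `pain_points` helper and its `max_ease` cutoff are made up for illustration.

```python
# (step, value, ease) as scored in the list above.
# Assumption: the "immense" value of the primer comparison step is entered as 10.
steps = [
    ("Assemble sequences in DNAstar", 10, 7),
    ("Align in Jalview with CLUSTALW", 10, 9),
    ("Find published primers via PubMed", 10, 1),
    ("Compare my primers with the literature", 10, 1),
    ("Explore sequence space via PubMed / NCBI Genome", 10, 4),
    ("Get the same sequence data from EBI", 10, 8),
]


def pain_points(steps, max_ease=4):
    """Return the steps with ease <= max_ease, hardest first.

    Ties on ease are broken by higher value; the sort is stable, so
    fully tied steps keep their original order.
    """
    hard = [s for s in steps if s[2] <= max_ease]
    return sorted(hard, key=lambda s: (s[2], -s[1]))


for name, value, ease in pain_points(steps):
    print(f"{name}: value {value}, ease {ease}")
```

With these numbers, the three steps that fall under the cutoff are the two PubMed-centred ones and the NCBI Genome one.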

This is still a work in progress. But to summarize:

The PubMed steps were the most painful. PubMed search has to improve!

Jalview contributed the most value. For a free app, it's a must-have in any bioinformatics toolkit! DNAstar played its role, but for its cost (a few thousand dollars) it sure gave a lot less value than Jalview.

All of this raises the question: why are bioinformatics workflows so difficult? We are a long way from making these things easy to do for everyone!

The RESTful NCBI query

I first caught this on Pierre's blog.

NCBI, it turns out, can be queried along REST principles (hence the RESTful in the title). Ever since learning about REST-based URLs, I have wished that more web APIs implemented the idea in their design. I was excited to learn how easy and intuitive it becomes to query a database using REST principles.

Gone are queries that looked like

http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene&cmd=search&term=dystrophin

And here come queries that look like this

http://view.ncbi.nlm.nih.gov/homologene/search/dystrophin 

which look for genes that have homology to dystrophin.
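To make the contrast concrete, here is a minimal Python sketch that builds both URL styles from a database name and a search term. The function names are my own, and the path layout (database, then `search`, then the term) is inferred from the single example above, so treat it as an assumption rather than a documented scheme. The code only builds strings; it makes no request and assumes nothing about whether these URLs still resolve.

```python
from urllib.parse import quote, urlencode


def entrez_url(db, term):
    """Old-style Entrez URL: everything goes into query-string parameters."""
    params = urlencode({"db": db, "cmd": "search", "term": term})
    return f"http://www.ncbi.nlm.nih.gov/sites/entrez?{params}"


def rest_url(db, term):
    """Path-style URL: database, action and term become path segments.

    Assumption: the /<db>/search/<term> layout is generalized from the
    one homologene example in the post.
    """
    return f"http://view.ncbi.nlm.nih.gov/{quote(db, safe='')}/search/{quote(term, safe='')}"


print(entrez_url("homologene", "dystrophin"))
print(rest_url("homologene", "dystrophin"))
```

The second form reads almost like a sentence, which is the appeal: you can guess the URL for another search just by swapping a path segment.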

Several web APIs, like the ones for Connotea and del.icio.us, are also implemented RESTfully, making them very easy to query. For example, to get all entries on Connotea or del.icio.us with the tag metagenomics you would query the URL

http://www.connotea.org/tag/metagenomics

or, on del.icio.us, the URL

http://del.icio.us/tag/metagenomics

I don't yet know how extensive the possibilities of such querying of the NCBI are, but it looks so much easier than understanding equery.

Ref: NCBI resource locator.

NCBI oddities

I have often blogged about my trials and tribulations with the NCBI database. This morning I was trying to locate all the kappa light chain genes in the NCBI database.

I tried the following search

Immunoglobulin kappa mouse in the Genome database subsection.

The results I got were a curious mix of microbial genomes ranging from Aspergillus niger to Salmonella enterica. Maybe I left my search skills at home, or my eyes are playing tricks on me.

Addendum: Eric Jane from UniProt showed me how to do the same query on UniProt beta. UniProt really rocks. Not only could I do the query, but I could also download the results in batch mode as FASTA sequences and in XML format. Thanks Eric, I would definitely recommend UniProt beta to everyone. Isabelle Phan from UniProt posted an excellent screencast detailing the features of UniProt beta at this link on Bioscreencast.com. Do check it out, as well as Eric's comments below.

Video in Science – my foray into the SecondLife metaverse

Well, I had talked about how Deepak went to SciFoo recently. It turns out that some of this year's SciFoo alumni, led by the indomitable Jean Claude Bradley (JCB or Horace Moody), started the “metaverse” version of these sessions on the Nature island in Second Life, called Second Nature.

In keeping with the “non conference” format of the original, session themes at SciFoo Lives On are decided by the attendees, in this case on the wiki that serves as its permanent home outside of Second Life. Yesterday's session was on the role of “Video in Science”, and of course we were there, with Deepak as Whitewizard Chemistry and myself as Vishwaroop Baroque.

As I awkwardly bumped into the attendees thanks to my terrible gaming skills, Whitewizard Chemistry told the audience about bioscreencast.com. This was followed by a talk by JCB on “YouTube and the Sciences” and finally one from someone at the SciVee project.

This was my first time in Second Life. I entered as a skeptic, since I had always thought Second Life was just a toy for gaming geeks and uber nerds. But I must say the poster session was just like the real thing, with some added benefits. As in the real thing, the questions made the poster session come alive, but this time you get a text transcript of all the conversations that took place and an overall rich experience. Not to mention the fact that the poster lives on on the NPG island and does not end up in my lab storage area (read: trash can).

I came away convinced that activities like this have a great value in enriching the online scientific experience.

Bertalan, one of the attendees, live blogged the event. You can also read about the goings-on at Deepak's bbgm blog and of course on the bioscreencast.com blog.

A full text transcript is available on Jean Claude Bradley's blog.

Sequence first ask questions later?

I am a little confused after reading about the metagenomics approach that identified the causative agent of colony collapse disorder, which Deepak and I blogged about.

After trawling through PubMed, it seems that a number of the potential honeybee pathogens were already quite well known. The Kashmir bee virus and the Israeli acute paralysis virus were also lurking among bee populations. Was it not then possible to query this with a quick microarray designed after some text and sequence mining?

Or maybe it's just faster to sequence the whole bee and then perform the in vitro RT-PCR experiments, which are a little more targeted.

Maybe this does say something about the difficulty of on-the-fly, bioinformatics-driven microarray fabrication. Since the closest I have come to a microarray experiment is seeing the images on the web, I was just wondering aloud. I am hardly an expert.

Addendum: There is of course no denying the added benefits of the metagenomic approach, such as the many other conclusions the paper made possible: that mite levels in CCD and non-CCD samples were similar, and that the microflora (like the bacteria in the bee gut) of Australian and American bees are similar. So I guess the question then is whether metagenomics is just so much more direct that it's going to be the first choice for all such open-ended questions, like “What causes infectious disease X?”

Metagenomics gives clues to a three year problem affecting American bees

Just yesterday I was reading about how metagenomics is proving to be a better choice for many questions than traditional DNA-amplification-centric molecular techniques such as microarrays and other PCR-based approaches.

Metagenomics, which is a fancy way of saying “sequence everything” directly from the sample and then figure out what that everything is, now has another feather in its cap: the identification of the possible causative agent of honeybee colony collapse disorder (CCD).

CCD is a scary disorder that has been affecting bee populations across the US. CCD is the name given to the end result: a colony of bees collapses because almost its entire adult bee population essentially disappears. The collapsed hives retain the queen bee and a few newly emerged adult bees and, given the social structure of bees, the colony withers away.

Without bees there is no pollination, without pollination agriculture will suffer, and without agriculture, well, there will be no food. Scientists have therefore been struggling to find out why these colonies were collapsing and the bees disappearing ever since the first signs of CCD appeared in 2004.

In the paper by Cox-Foster et al. in Science (Science Express), the scientists used pyrosequencing to obtain DNA sequence data from CCD and non-CCD bees from the US and Australia, as well as DNA from royal jelly from China. The strategy they followed was to first get all the sequence data from the CCD and non-CCD bee pools and then look for unusual sequences from bacteria, viruses and other pathogens that might signal an infection. Once the sequence data was obtained, they had some initial clues as to the foreign sequences present in these pools. The final pathogen quantitation was obtained by directing PCR primers, designed using the assembled metagenomics-derived sequence, against RNA pools from both samples.

Of all the candidate pathogens, an as yet unclassified virus from the Dicistroviridae family called the Israeli acute paralysis virus (IAPV) was the only one found predominantly in the CCD samples. The IAPV is a relative of the KBV (Kashmir bee virus) and ABPV (acute bee paralysis virus) and possibly represents an emergent virus that is the causative agent of colony collapse disorder.
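The "found predominantly in the CCD samples" step boils down to comparing how often each candidate turns up in the two groups. Here is a toy Python sketch of that comparison. It is not the statistical analysis used in the paper; the detection calls below are invented placeholders (`virus_X` and friends stand for no real organism), and the `min_ratio` cutoff is an arbitrary choice of mine.

```python
def prevalence(samples, pathogen):
    """Fraction of samples in which the pathogen was detected."""
    return sum(pathogen in s for s in samples) / len(samples)


def enriched_in_cases(cases, controls, min_ratio=2.0):
    """List pathogens whose prevalence in cases is >= min_ratio x controls.

    Returns (pathogen, case_prevalence, control_prevalence) tuples,
    sorted by pathogen name.
    """
    candidates = set().union(*cases, *controls)
    hits = []
    for p in sorted(candidates):
        pc, pn = prevalence(cases, p), prevalence(controls, p)
        if pc > 0 and pc >= min_ratio * pn:
            hits.append((p, pc, pn))
    return hits


# Made-up detection calls, one set of detected agents per sample.
ccd = [{"virus_X", "virus_Y"}, {"virus_X"}, {"virus_X", "virus_Z"}, {"virus_X", "virus_Y"}]
non_ccd = [{"virus_Y"}, {"virus_Z"}, {"virus_Y", "virus_Z"}, set()]

for name, in_ccd, in_controls in enriched_in_cases(ccd, non_ccd):
    print(f"{name}: {in_ccd:.0%} of CCD samples vs {in_controls:.0%} of non-CCD samples")
```

An agent that shows up equally often in both groups (like the placeholder `virus_Y`) drops out, which mirrors the reasoning in the post: presence alone is not interesting, only a difference between CCD and non-CCD samples is.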

This combination approach, i.e. metagenomics to get at the raw sequence data followed by precise diagnostic primers and QPCR, is a powerful one which I am sure will significantly cut down the time needed to identify the causative agents behind future emerging infections in all organisms, including humans. The rate-limiting step would be the time it takes to get the metagenomic sequence data, which promises to get shorter and shorter.

References: the bbgm article and the article in BioIT world .

The paper in Science by Cox-Foster DL et. al.

Other feathers in the metagenomics cap include the sequencing of large segments of the Neanderthal genome.