Just yesterday I was reading Tiago's blog, where he requested hosting for a computationally intensive bioinformatics web app that he wrote. The application queries and systematizes mitochondrial genome information from the Entrez databases, and I assume it would be quite useful to animal geneticists and ecologists. Tiago is physically moving institutes, and his blog post talks of his fears that the app might die if his personal computer goes down.
In one of my personal projects, I have been wrestling with cloning kappa light chains from several monoclonal antibodies that I generated. The cloning required a good knowledge of the antibody light and heavy chain leader sequences. Several papers I was reading reference the Kabat and Wu database, which catalogs thousands of sequences of antibodies and other immunological proteins from mouse and human. Sadly, the links to the Kabat and Wu database in some of these papers do not point to any meaningful location. The resulting Google and PubMed searches to find this lost data greatly increased the time and effort required to design my cloning experiments.
Which brings me to my question.
In an era when we have free wiki hosting, 4 GB of free email storage, supercomputers that power maps, gigabyte-scale free image sharing applications, and $6-per-month, terabyte-bandwidth web hosting, why are we still so far from an advertisement-supported “free” app host for meaningful scientific data?
Perhaps it's because only a few thousand people who are saving a rare turtle species somewhere on this planet will find Tiago's web app useful. Surely that's not yet worth enterprise-level attention. Or maybe we should all just write our web apps to run off Facebook!
It was a while back that I caught the video on the PDB site that explained all the functionality of its search interface. Thanks to the screencast I became a much more efficient querier of the PDB, especially after they adopted the new (now almost three-years-old) interface.
I strongly believe that screencasting can play a role in helping us all search better.
Since I work on crystallizing membrane proteins, I found the MPDB very useful and decided to screencast its features.
I sincerely hope that database creators and users alike take to this effective medium and screencast their tips and tricks for us all to benefit from.
Why are our bioinformatics workflows so complicated?
Last week, to answer one question, I had to resort to information from several sources. Each contributed a different amount of value to my “workflow,” and each was either difficult or easy to perform. I have ranked them on both Value (1 for no value to 10 for a lot of value) and Ease of use (1 for very complicated to 10 for very easy):
# Assembling my sequences in DNAstar (Value 10 : Ease 7)
# Compiling my sequences and pulling them into Jalview. I ran the CLUSTALW web service on edited alignments and realized that all of my clones had basically two sequences for their CDRs. Jalview's excellent CLUSTALW web-service interface allowed me to quickly edit the 32 sequences, align them interactively, and see that they belonged to two types. This got me thinking that the primers I used to clone my CDRs from my mouse kappa light chains were probably mis-priming. (Value 10 : Ease 9)
# Using PubMed to look at precedents, i.e., analyzing all possible papers that had sequenced the mouse antibody kappa light chain CDR region as I had attempted to do, and deriving the sequences of the primers they had used. It took forever to get the right keywords to query, and I still have only three kappa light chain primer sequences. And they are all different! (Value 10 : Ease 1)
# Using my primer sequences, comparing them with the literature, and figuring out how I had misprimed and why my sequences were all of either of two types. (Still in progress. Value immense : Ease 1, i.e., still difficult to do)
# Using PubMed / NCBI Genome to understand the sequence space for mouse kappa light chains (Value 10 : Ease 4)
# Using the EBI to get the same sequence data (Value 10 : Ease 8)
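The sequence-retrieval steps above can be scripted rather than done by hand. A minimal sketch, assuming NCBI's E-utilities efetch endpoint and plain FASTA output (the accessions used here are just examples):

```python
# Build an NCBI E-utilities efetch URL for a batch of protein accessions.
# This only constructs the request; fetching it requires network access.
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accessions, db="protein", rettype="fasta"):
    """Return an efetch URL retrieving the given IDs as FASTA text."""
    params = {"db": db, "id": ",".join(accessions),
              "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

url = efetch_url(["Q57T52", "Q325Y4"])
# To actually download:
#   from urllib.request import urlopen
#   fasta = urlopen(url).read().decode()
```

Batching the IDs into one request like this is also kinder to the servers than querying one accession at a time.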
This is still work in progress, but to summarize:
The PubMed steps were the most painful. PubMed search has to improve!
Jalview contributed the most value. For a free app, it's a must-have in any bioinformatics toolkit! DNAstar played its role, but for its cost (a few thousand dollars) it sure gave a lot less value than Jalview.
All of this begs the question: why are bioinformatics workflows so difficult? We are a long way from making these things easy for everyone!
I have been working a lot with alignments in Jalview and had blogged about how Google can find UniProt IDs better than NCBI. Well, it turns out that NCBI did indeed have most of the UniProt sequences I was looking for. The fault was mine, for not using the correct form of the UniProt ID.
I had to say just Q57T52 instead of Q57T52_SALCH, and Q325Y4 instead of Q325Y4_SHIBS.
Which brings me to one incredible thing about Google: Google Suggest and spelling correction. NCBI recently added a spelling correction feature, but it still does not have something that would have told me to try Q57T52 instead of the old-style Q57T52_SALCH UniProt ID query.
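In practice the fix amounts to dropping the _SALCH-style species suffix from the UniProt name before querying NCBI. A tiny helper (a sketch; it simply splits on the first underscore):

```python
def accession_from_entry_name(name):
    """Convert a UniProt name like 'Q57T52_SALCH' to the bare
    accession 'Q57T52' that NCBI's query box understands.
    Names without an underscore are returned unchanged."""
    return name.split("_", 1)[0]

accession_from_entry_name("Q57T52_SALCH")  # → 'Q57T52'
accession_from_entry_name("Q325Y4_SHIBS")  # → 'Q325Y4'
```

Running every ID in an alignment through a helper like this before querying would have saved me the whole detour.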
So, all in all, out of the 742 sequences that the manually curated PFAM database had used in its voltage_clc family alignment, I could find almost 640 of them at NCBI using the NCBI web service. All it took was understanding the existence of the deprecated UniProt ID.
When I similarly tested the EBI web service for the same 742 sequences, only 582 were obtainable in UniProt XML format from the UniProtKB database.
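For completeness, the EBI side of this test can be scripted too. A sketch, assuming EBI's dbfetch interface and its uniprotxml output format (the exact parameter names are my assumption about the current dbfetch URL scheme):

```python
# Build an EBI dbfetch URL for one UniProtKB entry in UniProt XML.
from urllib.parse import urlencode

DBFETCH = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"

def dbfetch_url(entry_id, db="uniprotkb", fmt="uniprotxml"):
    """Return a dbfetch URL for a single entry, raw (unstyled) output."""
    params = {"db": db, "id": entry_id, "format": fmt, "style": "raw"}
    return DBFETCH + "?" + urlencode(params)

url = dbfetch_url("Q57T52")
```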
As a final try, a Google search for some of the sequences that were missing from the better-performing NCBI database returned a match in the first few results. So Google is still quite amazing in its ability to index even probably poorly page-ranked terms like Q40LF7_DESAC. The day they take on bioinformatics in a formal way will surely be a fun day to look forward to.
References: bbgm on a Google for Bioinformatics
Powered by ScribeFire.
My good friend Deepak had a quote on his blog from Lincoln Stein about making bioinformatics as much an everyday tool for the practicing biologist as a pipettor (a device used to dispense liquids by experimental biologists and chemists).
I totally agree, but think we are quite far away. For example, this morning I had to obtain the sequences of 772 SwissProt entries, which were part of an alignment, for some downstream analysis. Of course my first choice was to query the NCBI Entrez database. I soon realized that the NCBI query box did not return any results for the first few queries I tried, all of which were probably new UniProt/SwissProt IDs (e.g. sequence IDs Q57T52_SALCH, Q325Y4_SHIBS).
Disappointed, I turned to the EBI search engine. Within seconds I realized that the EBI does indeed serve up all of the entries. So there is a subset of UniProt entries that NCBI does not have in its database.
Out of sheer curiosity I entered the queries that drew a blank at the NCBI into Google.
Wonder of wonders, Google pulled up every one of the hard-to-find UniProt entries as the very first match.
Thanks to the increasing use of publicly accessible web service APIs, Google is becoming more and more aware of a lot of very specific sequence data.
I will be very happy when I can type Q57T52_SALCH calc=MW and get an answer back right inside Google. Maybe that day bioinformatics will move one step closer to becoming just another tool.
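Until such a query exists, molecular weight is easy enough to estimate locally once you have the sequence in hand. A rough sketch using average residue masses (no post-translational modifications; the tripeptide at the end is an arbitrary example, not the Q57T52 sequence):

```python
# Average amino acid residue masses in Daltons (monomer minus water).
RESIDUE_MASS = {
    "G": 57.05,  "A": 71.08,  "S": 87.08,  "P": 97.12,  "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water added back for the free N- and C-termini

def molecular_weight(seq):
    """Approximate average molecular weight of a protein, in Daltons."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER

molecular_weight("MKT")  # made-up tripeptide, ≈ 378.5 Da
```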
Till then I am stuck with learning about Equery, WSDL, SOAP, and so on.
I have recently become addicted to the TED talks. I caught the TED talk by Craig Venter on various projects stemming from the initiatives undertaken by the Venter Institute and his affiliated companies. One of the exciting things he talked about was the coming field of combinatorial genomics (CG). CG is basically a marriage between synthetic biology and genomics: creating “synthetic” life forms with desired properties, obtained by screening a library of such microbes made by combining genes from a multitude of organisms.
This is of course possible given the following four technologies.
Knowledge of a minimal gene subset: work on the “minimal genome project” resulted in the minimal set of genes required to have a living, reproducing bug or virus.
The ability to synthesize large amounts of large DNA: in his talk, Craig Venter described their work in synthesizing the genome of Phi-X174 fully in two weeks.
The ability to assemble stretches of synthesized DNA quickly and combinatorially: here the amazing bug Deinococcus radiodurans comes to the rescue. Deinococcus radiodurans is able to re-assemble its genome from the thousands of small bits that result from very harsh radiation or severe drying. Exploiting its mechanism for achieving this amazing feat, it should be entirely possible to fully reconstitute an intact genome from a multitude of pieces.
The final piece of the puzzle is the genomics toolset itself. It is possible to assemble specialized gene subsets for any desired function by comparing genomes that carry out a particular function with closely related ones that do not.
So, given all this, Craig Venter talks of assembling a million chromosomes per day, transplanting them into cells or synthetic cells, and screening for a desired effect. This he dubs the emerging field of combinatorial genomics. A few of these desired functions are the stuff of biotech's promise since its inception: making hydrogen from photosynthetic bacteria, digesting cellulose to make ethanol, and making small molecules by metabolic pathway engineering.
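The combinatorial scale behind those numbers is easy to appreciate with a toy enumeration. A sketch, assuming a purely illustrative library of interchangeable gene cassettes (the pathway slots and variant names below are invented for the example):

```python
from itertools import product

# Hypothetical parts library: each pathway slot can be filled by a
# gene variant borrowed from a different organism.
library = {
    "hydrogenase": ["hydA_v1", "hydA_v2", "hydA_v3"],
    "ferredoxin":  ["fdx_v1", "fdx_v2"],
    "photosystem": ["psbA_v1", "psbA_v2", "psbA_v3", "psbA_v4"],
}

# Every combination of one variant per slot is a candidate chromosome
# design to synthesize, assemble, and screen.
designs = [dict(zip(library, combo)) for combo in product(*library.values())]
print(len(designs))  # 3 * 2 * 4 = 24 candidate designs
```

With even a few dozen variants per slot across a handful of slots, the design space quickly reaches the millions Venter describes, which is why high-throughput assembly and screening are the bottleneck.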
There is more on the technological aspects of combinatorial genomics at syntheticgenomics.com, one of Craig Venter's companies. The TED talk above is also an excellent listen.
See also: the Nature podcast section on Deinococcus radiodurans and its mp3 file.
While I was researching an article on the opsins that are revolutionizing neuronal investigation, I came across a small blurb on the Boyden lab homepage with a link to Addgene that said “Request Boyden Lab plasmids through Addgene”.
Addgene, as its webpage states, is a non-profit research support service dedicated to archiving and distributing plasmids that appear in published articles. The plasmids in Addgene's database are classified by the principal investigator depositing them and by gene name. Each plasmid also has a lot of detailed information available, making its use in other experiments quite easy.
It was very heartening to see that several leading labs had deposited their plasmids for academic use at this service. In the world of open science, services such as this are invaluable. I don't know how long Addgene has been around, but it will be quite something if the NIH makes it mandatory to have all plasmids used in published work deposited at such a service.
While it is still a little involved to get plasmids from Addgene, since some paperwork exchange is required, there is no denying that a centrally managed repository such as addgene.org will ease the load for end-user and innovator labs as well as tech transfer offices.