We have a project in the lab that involves screening small-molecule inhibitors of the transport activity of a membrane protein on a "lab scale". Having identified one such inhibitor, we intended to look for similar molecules that share the same substructure. Substructure querying is a standard procedure in cheminformatics. In the past I have screencast the use of the Sigma-Aldrich service to identify molecules from Sigma's catalog based on similarity. However, considering the wealth of biochemically relevant information PubChem offers, I was curious to try out the substructure query at PubChem. The PubChem service works great and is very feature-rich (screencast coming soon), and it gave me several molecules that could be of interest in my screen.
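The same substructure query can also be scripted against PubChem's PUG-REST interface rather than the web UI. Here is a minimal sketch that just builds the request URL; the SMILES string (benzoic acid) and the `MaxRecords` value are illustrative choices, not anything from my screen.

```python
from urllib.parse import quote

# Base URL for PubChem's PUG-REST service.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def substructure_cids_url(smiles, max_records=50):
    """Build a PUG-REST URL that returns CIDs of compounds containing
    the given substructure (synchronous 'fastsubstructure' form)."""
    return (f"{BASE}/compound/fastsubstructure/smiles/{quote(smiles)}"
            f"/cids/JSON?MaxRecords={max_records}")

# Illustrative query: compounds containing a benzoic acid substructure.
print(substructure_cids_url("c1ccccc1C(=O)O"))
```

Fetching that URL returns a JSON list of matching CIDs, which can then be fed into the catalog lookups discussed below.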
The next step, I assumed, was to locate these compounds in the catalogs of the many chemical providers using a suitable lookup ID. Naively I assumed this would be the CAS number, the "unique ID" associated with each molecule. An hour of googling later, I woke up to the realization that CAS is a closed, subscription-based service which has fought many political battles against the PubChem database. Also, while PubChem fortunately (and, I guess, surprisingly) allows lookups of its data by CAS numbers, it sadly does not spit out CAS numbers for the molecules it identifies as related (at least as far as I could tell).
I am glad for the Entrez-provided services that help look up CIDs (PubChem's IDs) for CAS numbers, and I now wish I could go the other way, i.e. CID to CAS.
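The CAS-to-CID direction can be sketched with the E-utilities `esearch` service, searching the `pccompound` database with the CAS number as the term. The CAS number below (50-78-2, aspirin) is just an illustrative example; note there is no equivalent call for the reverse, CID-to-CAS, direction.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(cas_number):
    """Build an esearch query against pccompound using the CAS number as the term."""
    return EUTILS + "?" + urlencode({"db": "pccompound", "term": cas_number})

def parse_cids(esearch_xml):
    """Pull the list of CIDs out of an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [id_elem.text for id_elem in root.findall(".//IdList/Id")]

def cids_for_cas(cas_number):
    """Fetch and parse the CIDs matching a CAS number, e.g. cids_for_cas('50-78-2')."""
    with urlopen(esearch_url(cas_number)) as resp:
        return parse_cids(resp.read())
```

Splitting the URL-building and XML-parsing steps apart makes the lookup easy to test without hitting the NCBI servers.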
It's been almost ten years since I have used the CAS abstracts, since I mostly use the literature search available for free at PubMed. I guess I am finally waking up to the closed world of the chemical abstracts offered by the CAS service of the American Chemical Society. For a non-profit service to be this closed makes me thankful that Entrez and the NCBI are this open. With all this talk of open-source drug discovery, I would think that the least we can do is make our unique-ID lookups freely interconvertible and public.
refs: The Ridiculous Battles (my words) of PubChem vs. CAS
Who has got the Bottle
I am writing to describe my opinions after reading a very insightful commentary by Gregory Petsko in the September issue of Genome Biology (doi 10.1186/gb-2008-9-9-110), titled "When Bubbles Burst".
In that article, Petsko analyzes the parallels between the current economic bubble and the big-science bubble (my words). Just as we can attribute the financial bubble to the unregulated growth of the financial industry, we can possibly attribute the many problems ailing the research establishment to the unregulated growth of the "omics" bubble.
We have all witnessed the move of all science into the genomic age. We have witnessed the gradual shift of federal research dollars to consortium-based science. Whether it is the cancer genome or structural genomics, there has been a pronounced shift in the way we all do science: bigger, it seems, is better, and data gathering has taken precedence over hypothesis testing.
The argument often made is that from all this data will come better hypotheses, which can then be tested in the future. When the big data prevents us from arriving at any cogent and testable hypothesis, our answer seems to be more big data.
We have all seen good researchers get caught in their respective "omics" bubbles. And with every such bubble, small labs that don't jump onto the bandwagon tend to suffer. Of course, all of this would be useless talk if funding were increasing, but as Petsko states, the "pie is finite".
I think the time has come for us to rethink the way we treat fundamental research. When funding is tight, it makes sense to postpone our big-data projects and use our infrastructure to concentrate on "smaller" science with more manageable research projects. Give individual labs the funding they need to probe the hypotheses that we have built up based on the available data.
Disband the consortia (or leave them to industry) and divert funding back to our research labs. There is no better way, in my opinion, to survive the current funding crisis.
Disclaimer: these opinions are heavily influenced by the fact that I am in an academic establishment and have never directly worked on any genome-level project.
I remember being particularly amazed at Jonathan Harris and Sepandar Kamvar's "We Feel Fine" visualization, based on statements extracted from blog posts around the world. Reading about the re-emergence of dengue fever, chikungunya, West Nile, and other viruses across South Asia, I started wondering if there are ways of keeping track of emerging pathogens using the many social networks that span the globe.
In many countries there is no paranoia associated with sharing health information like there is in the developed world. But even where the paranoia exists, as it does in the US, we are curiously caught in a world where people reading my blog are more likely to know I have contracted the flu than any two of my healthcare providers who need that information to treat me better.
While the debate on the best way to handle health information online continues, I was wondering how open I would be to sharing information about what afflicts me if there were a societal benefit to be derived from it. It could be something as simple as monitoring allergy symptoms around where I live, or something fancy like tracking an emerging pathogen.
Imagine all of us updating a common channel with "de-personalized" information on what afflicts us, globally. I imagine the system working something like this: I could "submit" to this service information about what ails me, and the machine could obfuscate my details, preserving only things like my approximate geographical area, my age, and my sex, and add the result to this health-information social network.
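The "obfuscate my details" step could be as simple as the sketch below: reduce a detailed report to coarse, aggregate-friendly fields. All the field names and the sample report here are invented for illustration, not part of any real system.

```python
def depersonalize(report):
    """Reduce a detailed health report to region, age bracket, sex, and ailment."""
    decade = (report["age"] // 10) * 10
    return {
        "region": report["city"],                    # approximate geography only
        "age_bracket": f"{decade}-{decade + 9}",     # coarsen exact age
        "sex": report["sex"],
        "ailment": report["ailment"],
    }

# Hypothetical submission: name and address are dropped entirely.
sample = {
    "name": "A. Blogger", "address": "12 Example St", "city": "Chennai",
    "age": 34, "sex": "M", "ailment": "dengue fever",
}
print(depersonalize(sample))
# {'region': 'Chennai', 'age_bracket': '30-39', 'sex': 'M', 'ailment': 'dengue fever'}
```

The point of bracketing the age and keeping only a city-level region is that the record stays useful for visualization while no longer identifying the submitter.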
If implemented well, it could then power daily visualizations along the lines of "We Feel Fine", or possibly something like "We Feel Chikungunya".
Just yesterday I was reading Tiago's blog, where he requested hosting for a computationally intensive bioinformatics web app that he wrote. The application queries and systematizes mitochondrial genome information from the Entrez databases, and I assume it would be quite useful to animal geneticists and ecologists. Tiago is physically moving institutes, and his blog post talks of his fears that the app might die if his personal computer goes down.
In one of my personal projects, I have been wrestling with cloning kappa light chains from several monoclonal antibodies that I generated. The cloning required a good knowledge of the antibody light and heavy chain leader sequences. Several papers I was reading reference the Kabat and Wu database, which catalogs thousands of sequences of antibodies and other immunological proteins from mouse and human. Sadly, the links to the Kabat and Wu database in some of these papers do not point to any meaningful location. The resulting Google and PubMed searches to find this lost data greatly increased the time and effort required to design my cloning experiments.
Which brings me to my question.
In an era when we have free wiki hosting, 4 GB of free email, supercomputers that power maps, gigabyte-scale free image-sharing applications, and $6-per-month, terabyte-bandwidth web hosting, why are we still so far from an advertisement-supported "free" app host for meaningful scientific data?
Perhaps it's because only a few thousand people who are saving a rare turtle species somewhere on this planet will find Tiago's web app useful. Surely that's not yet worth enterprise-level attention. Or maybe we should all just write our web apps to run off Facebook!
A lot of you have heard me complain (sometimes unfairly) about how hard it is to tie up sequence data from NCBI with protein data from Swiss-Prot and UniProt.
I just saw this on the gene-announce mailing list:
In collaboration with UniProtKB (http://www.pir.uniprot.org/) , the RefSeq group is now reporting explicit cross-references to Swiss-Prot and TrEMBL proteins that correspond to a RefSeq protein. These correspondences are being calculated by the UniProtKB group, and will be updated every three weeks to correspond to UniProt’s release cycle. The data are being made available from several sites within NCBI:
This is a very nice development. I have always tended to look at the cross-references within NCBI records for information on Swiss-Prot IDs. But now I can easily link out to the wealth of protein information provided at UniProt from my NCBI search results.
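Once the cross-references appear in the records themselves, picking them out is easy to script. The sketch below assumes a RefSeq protein record fetched in GenPept flat-file format (e.g. via efetch with `db=protein`, `rettype=gp`) and scans its `/db_xref` qualifiers; the record excerpt and the accession in it are fabricated for illustration.

```python
import re

def uniprot_xrefs(genpept_text):
    """Return UniProtKB accessions found in /db_xref qualifiers of a
    GenPept-format record."""
    return re.findall(r'/db_xref="UniProtKB/[^:]+:([A-Z0-9]+)"', genpept_text)

# Fabricated excerpt showing the qualifier format, not real record data.
record = '''
     CDS             1..386
                     /db_xref="UniProtKB/Swiss-Prot:P99999"
'''
print(uniprot_xrefs(record))  # ['P99999']
```

The same pattern matches both `UniProtKB/Swiss-Prot` and `UniProtKB/TrEMBL` qualifiers, which covers both databases named in the announcement.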
This simple announcement also brings to the fore, once again, the complex inter-relationships between a lot of life-science data, and why I don't think there will ever be a single Google-style life-science database.
A while back I caught the video on the PDB site explaining all the functionality of its search interface. Thanks to that screencast I became a much more efficient querier of the PDB, especially after they adopted the new (now almost three years old) interface.
I strongly believe that screencasting can play a role in helping us all search better.
Since I work on crystallizing membrane proteins, I found the MPDB very useful and decided to screencast its features.
I sincerely hope that database creators, and users alike, take to this effective medium and screencast their tips and tricks for us all to benefit from.
Why are our bioinformatics workflows so complicated!
Last week, to answer one question, I had to resort to information from several sources. Some contributed immense value to my "workflow"; some were very easy to use and others quite difficult. So I have ranked them in terms of both value (1 for no value, 10 for a lot of value) and ease of use (1 for very complicated, 10 for very easy).
# Assembling my sequences in DNAstar (Value 10 : Ease 7)
# Compiling my sequences and pulling them into Jalview. Jalview's excellent CLUSTALW web-service interface allowed me to quickly edit the 32 sequences, align them interactively, and realize that all of my clones had basically two sequences for their CDRs, i.e. they belonged to two types. This got me thinking that the primers I used to clone the CDRs from my mouse kappa light chains were probably mis-priming. (Value 10 : Ease 9)
# Using PubMed to look at precedents, i.e. analyze all possible papers that had sequenced the mouse antibody kappa light chain CDR region as I had attempted to do, and derive the sequences of the primers they had used. It took forever to get the right keywords to query, and I still have only three kappa light chain primer sequences. And they are all different! (Value 10 : Ease 1)
# Taking my primer sequences, comparing them with the literature, and figuring out how I had mis-primed and why my sequences were all of two types. (Still in progress. Value immense : Ease 1, i.e. still difficult to do)
# Using PubMed / NCBI Genome to understand the sequence space for mouse kappa light chains (Value 10 : Ease 4)
# Using EBI to get the same sequence data (Value 10 : Ease 8)
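The NCBI sequence-space step above can at least be partly scripted with an E-utilities `esearch` query against the protein database. The search terms below are illustrative only; as I note above, finding the right keywords was the hard part, so treat this query string as a starting point, not the answer.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def kappa_chain_query_url(max_results=100):
    """Build an esearch URL for mouse kappa light chain protein records.
    The term is a guessed starting point, not a validated query."""
    term = ('immunoglobulin kappa light chain[Title] '
            'AND "Mus musculus"[Organism]')
    return EUTILS + "?" + urlencode(
        {"db": "protein", "term": term, "retmax": max_results})

print(kappa_chain_query_url())
```

Fetching that URL returns an XML list of protein IDs that can be fed to efetch, which is far more reproducible than re-deriving keywords by hand each time.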
This is still work in progress. But to summarize:
The PubMed steps were the most painful. PubMed search has to improve!
Jalview contributed the most value. For a free app, it's a must-have in any bioinformatics toolkit! DNAstar played its role, but for its cost (a few thousand dollars) it gave a lot less value than Jalview.
All of this begs the question: why are bioinformatics workflows so difficult? We are a long way from making these things easy for everyone!