Category Archives: BioIT

A naive biochemist wakes up to the closed world of chemical abstracts and such

We have a project in the lab that involves screening small-molecule inhibitors of the transport activity of a membrane protein on a "lab scale". Having identified one such inhibitor, we intended to look for similar molecules that share the same substructure. Substructure querying is a standard procedure in chemical informatics. In the past I have screencast the use of the Sigma-Aldrich service to identify molecules from Sigma's catalog based on similarity. However, considering the wealth of biochemically relevant information PubChem offers, I was curious to try out the substructure query at PubChem. The PubChem service works great, is very feature rich (screencast coming soon), and gave me several molecules that could be of interest in my screen.
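
As an aside for anyone who prefers scripting: the same kind of lookup can, I believe, be done programmatically through PubChem's PUG REST interface. A minimal sketch, assuming the fastsubstructure endpoint and using aspirin's SMILES purely as a placeholder query:

    import json
    import urllib.parse
    import urllib.request

    # A minimal sketch of a programmatic substructure query against PubChem,
    # assuming the PUG REST "fastsubstructure" endpoint. The SMILES string
    # (aspirin) is only a placeholder for a real query substructure.
    smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/"
           "smiles/" + urllib.parse.quote(smiles) + "/cids/JSON?MaxRecords=25")
    with urllib.request.urlopen(url) as response:
        cids = json.load(response)["IdentifierList"]["CID"]
    print(len(cids), "matching compounds, e.g. CIDs", cids[:5])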

The next step, I assumed, was to locate these compounds in the catalogs of the many chemical providers using a suitable lookup id. Naively I assumed this would be the CAS id, the "unique id" associated with each molecule. An hour of googling later, I woke up to the realization that CAS is a closed, subscription-based service which has fought many political battles against the PubChem database. Also, while PubChem fortunately (and, I guess, surprisingly) allows lookups of its data by CAS ids, it sadly does not spit out CAS ids for the molecules it identifies as related (at least as far as I could tell).

I am glad for the Entrez-provided services that help look up CIDs (PubChem's ids) for CAS ids, and am now wishing I could go the other way, i.e. CID to CAS.
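
For the record, the CAS-to-CID direction can be scripted against Entrez's esearch. A rough sketch, with aspirin's CAS number 50-78-2 standing in as an example:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Look up PubChem CIDs for a CAS registry number via Entrez esearch;
    # CAS numbers are indexed as synonyms in the pccompound database.
    cas = "50-78-2"  # aspirin, as an example
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
           "?db=pccompound&term=" + cas)
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    cids = [elem.text for elem in tree.findall(".//Id")]
    print("CAS", cas, "-> CIDs", cids)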

It's been almost 10 years since I last used the CAS abstracts, since I mostly use the literature search available for free at PubMed. I guess I am finally waking up to the closed world of the chemical abstracts offered by the CAS service of the American Chemical Society. For a non-profit service to be this closed makes me thankful that Entrez and the NCBI are this open. With all this talk of open-source drug discovery, I would think that the least we can do is make our unique-id lookups freely interconvertible and public.

Refs: The Ridiculous Battles (my words) of PubChem vs CAS

Who has got the Bottle 

Of Bubbles and funding

This post is an attempt to describe my opinions after reading a very insightful commentary by Gregory Petsko in the September 2008 issue of Genome Biology (doi: 10.1186/gb-2008-9-9-110), titled "When Bubbles Burst".

In that article Petsko analyzes the parallels between the current economic bubble and the big-science bubble (my words). Just as we can attribute the financial bubble to the unregulated growth of the financial industry, we can possibly attribute the many problems ailing the research establishment to the unregulated growth of the "omics" bubble.

We have all witnessed the move of all science into the genomic age. We have witnessed the gradual shift of federal research dollars to consortium-based science. Whether it is the cancer genome or structural genomics, there has been a pronounced shift in the way we all do science: bigger, it seems, is better, and data gathering has taken precedence over hypothesis testing.

The argument often made is that from all this data will come better hypotheses, which can then be tested in the future. Yet when the big data prevents us from arriving at any cogent and testable hypothesis, our answer seems to be more big data.

We have all seen good researchers get caught in their respective "omics" bubbles. And with every such bubble, small labs that don't jump onto the bandwagon tend to suffer. Of course, all of this would be useless talk if funding were increasing, but as Greg Petsko states, the "pie is finite".

I think the time has come for us to rethink the way we treat fundamental research. When funding is tight, it makes sense to postpone our big-data projects and use our infrastructure for "smaller" science with more manageable research projects. Give individual labs the funding they need to probe the hypotheses we have built up based on the available data.

Disband the consortia (or leave them to industry) and divert funding back to our research labs. There is no better way, in my opinion, to survive the current funding crisis.

Disclaimer: These opinions are heavily influenced by the fact that I am in an academic establishment and have never directly worked on any genome-level project.

I feel chikungunya? Can social networks help track emerging diseases?

I remember being particularly amazed at Jonathan Harris and Sepandar Kamvar's "We Feel Fine" visualization, based on statements extracted from blog posts around the world. Reading about the re-emergence of dengue fever, chikungunya, West Nile and other viruses across South Asia, I started wondering if there are ways of keeping track of emerging pathogens using the many social networks that span the globe.

In many countries there is no paranoia associated with sharing health information like there is in the developed world. And even where the paranoia exists, as it does in the US, we are curiously caught in a world where people reading my blog are more likely to know I have contracted the flu than any two of my healthcare providers who need that information to treat me better.

While the debate on the best way to handle health information online continues, I was wondering how open I would be to sharing information about what afflicts me if there were a societal benefit to be derived from it. It could be something as simple as monitoring allergy symptoms around where I live, or something fancy like tracking an emerging pathogen.

Imagine all of us updating a common channel with "de-personalized" information on what afflicts us globally. I can imagine the system being something like this: I could "submit" to this service information about what ails me, and the machine could obfuscate my details, preserving only things like my approximate geographical area, my age and my sex, and add it to this health-information social network.
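
To make the obfuscation step concrete, here is a rough sketch. Every field name and the bucketing scheme below are inventions of mine, not any real service:

    from dataclasses import dataclass

    @dataclass
    class Report:
        name: str     # dropped entirely during obfuscation
        city: str     # coarsened to a broad region
        age: int      # coarsened to a decade bucket
        sex: str
        symptom: str

    # Invented lookup: exact locations collapse to broad regions.
    REGIONS = {"Mumbai": "South Asia", "Boston": "US Northeast"}

    def depersonalize(report: Report) -> dict:
        """Strip identifying detail, keep only coarse epidemiological fields."""
        return {
            "region": REGIONS.get(report.city, "unknown"),
            "age_bucket": str((report.age // 10) * 10) + "s",  # 34 -> "30s"
            "sex": report.sex,
            "symptom": report.symptom,
        }

    print(depersonalize(Report("A. Patient", "Mumbai", 34, "F", "chikungunya")))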

If implemented well, we could then have daily visualizations along the lines of "We Feel Fine", possibly something like "We feel chikungunya".

Of $6 web hosts and dying web apps

Just yesterday I was reading Tiago's blog, where he requested hosting for a computationally intensive bioinformatics web app that he wrote. The application queries and systematizes mitochondrial genome information from the Entrez databases, and I assume it would be quite useful to animal geneticists and ecologists. Tiago is physically moving institutes, and his blog post talks of his fears of how the app might die if his personal computer goes down.

In one of my personal projects, I have been wrestling with cloning kappa light chains from several monoclonal antibodies that I generated. The cloning required a good knowledge of the antibody light- and heavy-chain leader sequences. Several papers I was reading reference the Kabat and Wu database, which catalogs thousands of sequences of antibodies and other immunological proteins from mouse and human. Sadly, the links to the Kabat and Wu database in some of these papers do not point to any meaningful location. The resulting Google and PubMed searches to find this lost data greatly increased the time and effort required to design my cloning experiments.

Which brings me to my question.

In an era when we have free wiki hosting, 4 GB of free email, supercomputers that power maps, gigabyte-scale free image-sharing applications, and $6-per-month, terabyte-bandwidth web hosting, why are we still so far from an advertisement-supported "free" app host for meaningful scientific data?

Perhaps it's because only a few thousand people who are saving a rare turtle species somewhere on this planet will find Tiago's web app useful. Surely that's not yet worth enterprise-level attention. Or maybe we should all just write our web apps to run off Facebook!

RefSeq and UniProtKB groups collaborate

A lot of you have heard me complain (sometimes unfairly) about how hard it is to tie up sequence data from NCBI with protein data from Swiss-Prot and UniProt.
I just saw this on the gene-announce mailing list:

In collaboration with UniProtKB (http://www.pir.uniprot.org/), the RefSeq group is now reporting explicit cross-references to Swiss-Prot and TrEMBL proteins that correspond to a RefSeq protein. These correspondences are being calculated by the UniProtKB group, and will be updated every three weeks to correspond to UniProt's release cycle. The data are being made available from several sites within NCBI:

This is a very nice development. I have always tended to look at the cross-references within NCBI records for information on Swiss-Prot ids. But now I can easily link out to the wealth of protein information provided at UniProt from my NCBI search results.
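
One way to see the new cross-references programmatically, I would guess, is to pull a RefSeq protein record with efetch and scan it for the UniProtKB identifier. A sketch; the accession is an arbitrary example, and exactly which field the cross-reference lands in is my assumption:

    import urllib.request

    # Fetch a RefSeq protein record in GenPept flat-file format via efetch,
    # then scan it for UniProtKB/Swiss-Prot cross-references. NP_000207 is
    # an arbitrary example accession.
    accession = "NP_000207"
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=protein&id=" + accession + "&rettype=gp&retmode=text")
    with urllib.request.urlopen(url) as response:
        record = response.read().decode()
    for line in record.splitlines():
        if "UniProt" in line or "Swiss-Prot" in line:
            print(line.strip())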

This simple announcement also brings to the fore, once again, the complex inter-relationships between a lot of life-science data, and why I don't think there will ever be a single Google-styled life-science database.

The Rich PDB interface explained

It was a while back that I caught the video on the PDB site which explained all the functionality of its search interface. Thanks to that screencast I became a much more efficient querier of the PDB, especially after they adopted the new (now almost three-year-old) interface.

I strongly believe that screencasting can play a role in helping us all search better.

Since I work on crystallizing membrane proteins, I found the MPDB very useful and decided to screencast its features.

I sincerely hope that database creators and users alike take to this effective medium and screencast their tips and tricks for us all to benefit from.


Why are our bioinformatics workflows so complicated!


Last week, to answer one question, I had to resort to information from several sources. Each contributed value to my "workflow", and each was either difficult to perform or very easy. For a start, I have ranked them in terms of both value (1 for no value to 10 for a lot of value) and ease of use (1 for very complicated to 10 for very easy):

# Assembling my sequences in DNAstar (Value 10 : Ease 7)

# Compiling my sequences and pulling them into Jalview. I ran the CLUSTALW web service on the edited alignments and realized that all of my clones had basically two sequences for their CDRs. Jalview's excellent CLUSTALW web-service interface allowed me to quickly edit the 32 sequences, align them interactively, and see that they belonged to two types. This got me thinking that the primers I used to clone my CDRs from my mouse kappa light chains were probably mis-priming. (Value 10 : Ease 9)

# Use PubMed to look at precedents, i.e. analyze all the papers which had sequenced the mouse antibody kappa light-chain CDR region as I had attempted to do, and derive the sequences of the primers they had used. It took forever to get the right keywords to query, and I still have only three kappa light-chain primer sequences. And they are all different! (Value 10 : Ease 1)

# Use my primer sequences, compare them with the literature, and figure out how I had misprimed and why my sequences were all one of two types. (Still in progress. Value immense : Ease 1, i.e. still difficult to do)

# Use PubMed / NCBI Genome to understand the sequence space for mouse kappa light chains (Value 10 : Ease 4)

# Use EBI to get the same sequence data; see the dbfetch sketch below (Value 10 : Ease 8)
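
Since the EBI step was the easiest of the lot, here is roughly what it amounts to in script form, using EBI's dbfetch service. The accession, a mouse immunoglobulin kappa constant region, is just an example:

    import urllib.request

    # Pull a sequence from EBI's dbfetch service in FASTA format.
    # P01837 (mouse Ig kappa chain constant region) is an example accession.
    url = ("https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"
           "?db=uniprotkb&id=P01837&format=fasta&style=raw")
    with urllib.request.urlopen(url) as response:
        print(response.read().decode())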

This is still a work in progress. But to summarize:

The PubMed steps were the most painful. PubMed search has to improve!

Jalview contributed the most value. For a free app, it's a must-have in any bioinformatics toolkit! DNAstar played its role, but for its cost (a few thousand dollars) it sure gave a lot less value than Jalview.

All of this begs the question: why are bioinformatics workflows so difficult? We are a long way from making these things easy for everyone!

The RESTful NCBI query

I first caught this on Pierre's blog.

NCBI, it turns out, can be queried along REST principles (hence the "RESTful" in the title). Ever since learning about REST-based URLs, I have wished that more web APIs implemented the ideology in their design. I was excited to learn how easy and intuitive it becomes to query a database using REST principles.

Gone are queries that looked like this:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene&cmd=search&term=dystrophin

And here come queries that look like this:

http://view.ncbi.nlm.nih.gov/homologene/search/dystrophin 

which look for genes that have homology to dystrophin.

Several other web APIs, like those for Connotea and del.icio.us, are also implemented RESTfully, making them very easy to query. For example, to get all entries on Connotea or del.icio.us with the tag metagenomics, you would query the URL

http://www.connotea.org/tag/metagenomics

or on del.icio.us the URL

http://del.icio.us/tag/metagenomics
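
The pattern is simple enough to wrap in a few lines of code. A sketch (whether these particular hosts still answer is another question, so treat the URLs as illustrations):

    import urllib.request

    # RESTful URLs compose by path rather than by CGI parameters, so a
    # generic fetcher is little more than string joining. The hosts are
    # the ones mentioned above and may not respond forever.
    def rest_get(base, *path):
        url = "/".join([base.rstrip("/")] + list(path))
        with urllib.request.urlopen(url) as response:
            return response.read().decode()

    homologs = rest_get("http://view.ncbi.nlm.nih.gov",
                        "homologene", "search", "dystrophin")
    bookmarks = rest_get("http://del.icio.us", "tag", "metagenomics")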

I don't yet know how extensive the possibilities of such querying of the NCBI are, but it looks so much easier than understanding equery.

Ref: NCBI resource locator.

NCBI oddities

I have often blogged about my trials and tribulations with the NCBI databases. This morning I was trying to locate all the kappa light-chain genes in the NCBI database.

I tried the following search:

"immunoglobulin kappa mouse" in the Genome database subsection.

The results I got were a curious mix of microbial genomes ranging from Aspergillus niger to Salmonella enterica. Maybe I left my search skills at home, or my eyes are playing tricks on me.

Addendum: Eric Jain from UniProt showed me how to do the same query on UniProt beta. UniProt really rocks. Not only could I do the query, but I also downloaded the results in batch mode as FASTA sequences and in XML format. Thanks, Eric; I would definitely recommend UniProt beta to everyone. Isabelle Phan from UniProt posted an excellent screencast detailing the features of UniProt beta at this link on Bioscreencast.com. Do check it out, as well as Eric's comments below.
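
For anyone who would rather script the batch download than click through the site, a sketch against UniProt's present-day REST endpoint (not the "beta" interface of this post); the query string itself is illustrative:

    import urllib.parse
    import urllib.request

    # Run a UniProt search and download all hits as FASTA in one request.
    # The endpoint is UniProt's current REST API; the query is illustrative.
    query = "immunoglobulin kappa mouse"
    url = ("https://rest.uniprot.org/uniprotkb/search"
           "?query=" + urllib.parse.quote(query) + "&format=fasta&size=25")
    with urllib.request.urlopen(url) as response:
        print(response.read().decode()[:500])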

Sequence first, ask questions later?

I am a little confused after reading about the metagenomics approach that identified the causative agent of colony collapse disorder, which Deepak and I blogged about.

After trawling through PubMed, it seems that a number of the potential honeybee pathogens were already quite well known. The Kashmir bee virus and the Israeli acute paralysis virus were already lurking among bee populations. Was it not then possible to query this with a quick microarray designed following some text and sequence mining?

Or maybe it's just faster to sequence the whole bee and then perform the in vitro RT-PCR experiments, which are a little more targeted.

Maybe this does say something about the difficulty of on-the-fly, bioinformatics-driven microarray fabrication. Since the closest I have come to a microarray experiment is seeing the images on the web, I was just wondering aloud. I am hardly an expert.

Addendum: There is of course no denying the added benefits of the metagenomic approach, like the many other conclusions the paper made possible: that mite levels in both CCD and non-CCD samples were similar, and that the microflora (like the bacteria in the bee gut) of Australian and American bees are similar. So I guess the question then is: maybe metagenomics is just so much more direct that it is going to be the first choice for all such open-ended questions like "What causes infectious disease X?"