A naive biochemist wakes up to the closed world of chemical abstracts and such

We have a project in the lab that involves screening small molecule inhibitors of the transport activity of a membrane protein on a “lab scale”. Having identified one such inhibitor, we intended to look for similar molecules that share the same substructure. Substructure querying is a standard procedure in chemical informatics. In the past I have screencast the use of the Sigma-Aldrich service to identify molecules from Sigma's catalog based on similarity. However, considering the wealth of biochemically relevant information PubChem offers, I was curious to try out the substructure query at PubChem. The PubChem service works great, is very feature rich (screencast coming soon), and gave me several molecules that could be of interest in my screen.

The next step, I assumed, was to locate these compounds in the catalogs of the many chemical providers using a suitable lookup id. Naively, I assumed this would be the CAS id, the “unique id” associated with each molecule. An hour of googling later, I woke up to the realization that CAS is a closed, subscription-based service that has fought many political battles against the PubChem database. Also, while PubChem fortunately, and I guess surprisingly, allows lookups of its data by CAS ids, it sadly does not spit out CAS ids for the molecules it identifies as related (at least as far as I could tell).

I am glad for the Entrez-provided services that help look up CIDs (PubChem's ids) for CAS ids, and I now wish I could go the other way, i.e. CID to CAS.
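
For the curious, the forward (CAS to CID) lookup can be scripted. Here is a minimal sketch, assuming the CAS registry number is indexed as a compound synonym in PubChem's pccompound database; the example number (50-00-0, formaldehyde) is just an illustration:

```python
# A minimal sketch of the CAS -> CID lookup via the Entrez E-utilities.
# Assumes the CAS registry number is indexed as a synonym in pccompound;
# the reverse (CID -> CAS) lookup is the part that seems to be missing.
import urllib.request
import xml.etree.ElementTree as ET

cas = "50-00-0"
url = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
       "?db=pccompound&term=" + cas)

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# esearch returns matching CIDs inside <IdList><Id>...</Id></IdList>
cids = [e.text for e in tree.findall(".//Id")]
print(cas, "->", cids)
```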

It's been almost 10 years since I last used the CAS abstracts, since I mostly use the literature search available for free at PubMed. I guess I am finally waking up to the closed world of the chemical abstracts offered by the CAS service of the American Chemical Society. For a non-profit service to be this closed makes me thankful that Entrez and the NCBI are this open. With all this talk of open source drug discovery, I would think the least we can do is make our unique id lookups freely interconvertible and public.

Refs: The Ridiculous Battles (my words) of PubChem vs CAS

Who has got the Bottle 

Of Bubbles and funding

I am writing to describe my opinions after reading a very insightful commentary by Gregory Petsko in the September issue of Genome Biology (doi: 10.1186/gb-2008-9-9-110), titled “When Bubbles Burst”.

In that article Greg Petsko analyses the parallels between the current economic bubble and the big-science bubble (my words). Just as we can attribute the financial bubble to the unregulated growth of the financial industry, we can possibly attribute the many problems ailing the research establishment to the unregulated growth of the “omics” bubble.

We have all witnessed the move of science into the genomic age. We have witnessed the gradual shift of federal research dollars to consortium-based science. Whether it is the cancer genome or structural genomics, there has been a pronounced shift in the way we all do science: bigger, it seems, is better, and data gathering has taken precedence over hypothesis testing.

The argument often made is that from all this data will come better hypotheses, which can then be tested in the future. Yet when the big data prevents us from arriving at any cogent and testable hypothesis, our answer seems to be more big data.

We have all seen good researchers get caught in their respective “omics” bubbles. And with every such bubble, small labs that don't jump onto the bandwagon tend to suffer. Of course, all of this would be useless talk if funding were increasing, but as Greg Petsko states, the “pie is finite”.

I think the time has come for us to rethink the way we treat fundamental research. When funding is tight, it makes sense to postpone our big-data projects and use our infrastructure for “smaller” science that pursues more manageable projects. Give individual labs the funding they need to probe the hypotheses we have built up based on the available data.

Disband the consortia (or leave them to industry) and divert funding back to our research labs. There is, in my opinion, no better way to survive the current funding crisis.

Disclaimer: These opinions are heavily influenced by the fact that I am in an academic establishment and have never directly worked on any genome-level project.

My attempts at explaining personal genomics

The other day I was talking to my parents about the fascinating world of personal genomics, and I ventured to write up my description of some of the ideas behind the HapMap and personal genomics in an email.

I am reproducing the email here in the hope that I can get pointers on explaining things better (I am, after all, only a biochemist/structural biologist).

The world of personal genomics is upon us. Companies like 23andMe and deCODE will run your genetic sample against a known list of variation markers and tell you things about yourself as suggested by your genes, or they will tell you what markers you share with what groups of people. Although this sounds amazing, a lot of it is very nuanced, and understanding it is a fun exercise. Also, all of this will change everything, or at least it has the potential to.

Let's start at the beginning: when you ask to be “typed”, or ask “what do my genes have?”, what is this stuff all about?
We are all quite different; you and I probably have several hundred thousand differences between the two of us. To estimate exactly how different you and I are would require fully sequencing both our genomes, which is quite expensive and takes a lot of time.
Instead, imagine if I told you that scientists have figured out that these differences occur in groups, i.e. they are linked together. Very crudely: if one of these jumps from you to your child, it will take a few thousand of its neighbors along for the ride. So now, instead of getting information about the several hundred thousand actual differences, we can learn a lot by just looking at tens of thousands of these labels. In each case, for a particular label (or marker, or SNP), we can look at all the variation determined thus far; e.g., at position 59, all known human variation has either an A or a T, so you can belong to one of those two groups. Once you get this or any such label, you can infer the rest of what that label is tied to. Collecting information on these labels is what projects like the HapMap do (see hapmap.org), and it is exactly the identity of these labels that a genotyping service will provide you with (for an example see http://www.snpedia.com/files/promethease/outputs/promethease-ngnomics.html).

So what's the big deal? All of science is trying to figure out what makes person A die of a heart attack before age 20, while person B lives to 80. As we all know, there are two parts to this story: “nature, or your genes” and “nurture, or your lifestyle”. Science can more readily attempt to understand your genes, because that is “hard” information. And in the case of persons A and B, science asks: what in their genes might have led to the outcome?
So, coming back to the point I made in the previous few paragraphs: instead of looking at the entire genome for differences between persons A and B, we can start by asking which markers or labels they share and which they do not. Then, taking the markers they don't share, we ask which among those are common to people who had heart attacks early. Looking at this information may lead to some clues about which genes led A and B to their resulting life expectancies.

Let's take another example: say you want to test in advance “what genes cause allergies to sulfa drugs”. What you would do is give many people sulfa drugs, then check all the people who were allergic and look at what groups their genes belong to. At the end you ask the obvious question: all those people who came down with severe allergies to sulfa, what group (or markers) did they have in common? In most cases the answer is not a single gene or a single number, but for simplicity, the answer you get is something like: if you have marker A59C (a single nucleotide polymorphism, or SNP), then you have a 20% chance of being allergic. Also, since most traits don't depend on only one label or marker, the answers are quite diffuse and are given in terms of probabilities. Say the 20% chance with A59C may become an 80% chance in combination with label G456A, and a 0.2% chance if you had T555A. Do you get the point? If you don't, don't worry; as you can tell, it is quite complicated!

Regardless, the chances are that the more we ask such questions, the more we learn about these probabilities, and that is what most genetic research is looking to do.
So instead of studying, say, 10,000 mostly white Americans, things become more meaningful if you study 1,000,000 people from every corner of this earth. Then the numbers may all add up and give us a clearer label to associate with any given outcome, like a heart attack. That's what the HapMap is all about.

Anyway, getting back to the point: such studies are what give rise to the field of personal genomics, i.e. look at what markers you have, then compare them with known marker-outcome associations or known drug-effect associations.

As I have hopefully already convinced you, these studies are very nuanced. People expect cut-and-dried answers, and many may come away disappointed. Also, people prone to hypochondria should definitely stay away, lest that 2% chance of a heart attack be converted into a very high chance of actually getting one, because of the stress you put yourself through as a result of this knowledge (the nurture, or lifestyle, part).

This also points to how these studies, if carried out correctly, will change a lot of things: medicine, health care, the very nature of how we view ourselves.
These studies can attempt to answer questions like how similar south and north Indians are genetically. And as I just told you, nature is not cut and dried, and neither is human history! So interpreting these results, especially the ones with social or political implications, is a double-edged sword!

Before I end, you can check out one of these reports, as given by companies like 23andMe:
http://www.snpedia.com/files/promethease/outputs/promethease-ngnomics.html

So, are you excited? Do you want to find out whether you have a 10% chance of arthritis, or a 40% chance that you descend from a Caucasian lineage?

More reading on this topic:
http://www.hapmap.org/

For the medically inclined, there is an article in Science magazine that discusses the implications of these studies for health care.

Under the hood of Web 2.0: PubMed and RSS

Snap poll: did you know that PubMed, and much of the NCBI, gives you the ability to create a custom RSS feed around just about any query?

My answer: yes I do, but I learned about this feature only recently, and I have been using PubMed for years. You can watch how this feature works in this screencast.
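
For the script-minded, consuming such a feed is also straightforward. A minimal sketch, where the rss_guid URL is a placeholder for the one PubMed hands you when you create the feed, and the parsing assumes a standard RSS 2.0 layout:

```python
# Poll a PubMed RSS feed from a script. FEED_URL is a placeholder:
# PubMed generates the real rss_guid URL when you create the feed
# from a search results page.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi"
            "?rss_guid=YOUR_GUID_HERE")

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

# RSS 2.0 wraps each new article in a channel/item element
for item in tree.findall(".//item"):
    print(item.findtext("title"))
```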

The reason I bring this up is my observation that the NCBI and the EBI have been putting a lot of energy into staying current while preserving their original interfaces. Under the hood at the NCBI you can find features like My NCBI, which has some killer personalization options (to borrow a Web 2.0 buzzword). The NCBI forms have started to use more and more JavaScript, and things have started to look “more Ajaxy”. Even the Collections option just about ensures that I don't use Connotea as often (other than when I want to share collections on the web).

The point is that the bio-search space has seen, and continues to see, variations on the theme of search. A lot of these offerings seem to be mere duplications of functionality already present at the EBI and NCBI. Still, this duplication is a good thing: only through such attempts will we arrive at a “Google” (in the verb sense) for bio-search.

I sometimes wonder if it would be more worthwhile for at least the career scientists among us to first learn how to use the NCBI and EBI. Then maybe private research dollars could be diverted to tackling the harder problems in bio-search.

References: Helia searches PubMed/MEDLINE

I feel Chicken Gunya? Can social networks help track emerging diseases?

I remember being particularly amazed at Jonathan Harris and Sepandar Kamvar's “We Feel Fine” visualization, based on statements extracted from blog posts around the world. Reading about the re-emergence of dengue fever, chikungunya, West Nile, and other viruses across South Asia, I started wondering if there are ways of keeping track of emerging pathogens using the many social networks that span the globe.

In many countries there is no paranoia associated with sharing health information like there is in the developed world. And even where the paranoia exists, as it does in the US, we are curiously caught in a world where people reading my blog are more likely to know I have contracted the flu than any two of my healthcare providers who need that information to treat me better.

While the debate on the best way to handle health information online continues, I was wondering how open I would be to sharing information about what afflicts me if there were a societal benefit to be derived from it. It could be something as simple as monitoring allergy symptoms around where I live, or something fancy like tracking an emerging pathogen.

Imagine all of us updating a common channel with “de-personalized” information on what afflicts us globally. I imagine the system would work something like this: I “submit” to this service information about what ails me, and the machine obfuscates my details, preserving only things like my approximate geographical area, my age, and my sex, and adds the report to this health information social network.
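
Just to make the idea concrete, here is a hypothetical sketch of that obfuscation step; every field name and coarsening rule below is invented for illustration, not a description of any real service:

```python
# A hypothetical sketch of the de-personalization step described above.
# All names and rules here are made up for illustration.
from dataclasses import dataclass

@dataclass
class Report:
    name: str
    symptom: str
    latitude: float
    longitude: float
    age: int
    sex: str

def depersonalize(report: Report) -> dict:
    """Strip identity, keep only coarse epidemiological signal."""
    return {
        "symptom": report.symptom,
        # round coordinates to whole degrees (~100 km) so no address survives
        "region": (round(report.latitude), round(report.longitude)),
        # bucket age into decades
        "age_group": f"{report.age // 10 * 10}s",
        "sex": report.sex,
    }

submission = depersonalize(
    Report("Deepak", "fever and joint pain", 12.97, 77.59, 34, "M")
)
print(submission)  # safe to post to the shared channel
```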

If implemented well, we could then have daily visualizations along the lines of “We Feel Fine”, or possibly something like “We feel chicken Gunya”.

Of $6 web hosts and dying web apps

Just yesterday I was reading Tiago's blog, where he requested hosting for a computationally intensive bioinformatics web app that he wrote. The application queries and systematizes mitochondrial genome information from the Entrez databases, and I assume it would be quite useful to animal geneticists and ecologists. Tiago is physically moving institutes, and his blog post talks of his fears of how the app might die if his personal computer goes down.

In one of my personal projects, I have been wrestling with cloning kappa light chains from several monoclonal antibodies that I generated. The cloning required a good knowledge of the antibody light and heavy chain leader sequences. Several papers I was reading reference the Kabat and Wu database, which catalogs thousands of sequences of antibodies and other immunological proteins from mouse and human. Sadly, the links to the Kabat and Wu database in some of these papers do not point to any meaningful location. The resulting Google and PubMed searches to find this lost data greatly increased the time and effort required to design my cloning experiments.

Which brings me to my question.

In an era when we have free wiki hosting, 4 GB of free email, supercomputers that power maps, gigabyte-scale free image-sharing applications, and $6-per-month terabyte-bandwidth web hosting, why are we still so far from an advertisement-supported “free” app host for meaningful scientific data?

Perhaps it's because only a few thousand people saving a rare turtle species somewhere on this planet will find Tiago's web app useful. Surely that's not yet worth enterprise-level attention; or maybe we should all just write our web apps to run off Facebook!

RefSeq and UniProtKB groups collaborate

A lot of you have heard me complain (sometimes unfairly) about how hard it is to tie up sequence data from the NCBI with protein data from Swiss-Prot and UniProt.
I just saw this on the gene-announce mailing list:

In collaboration with UniProtKB (http://www.pir.uniprot.org/), the RefSeq group is now reporting explicit cross-references to Swiss-Prot and TrEMBL proteins that correspond to a RefSeq protein. These correspondences are being calculated by the UniProtKB group, and will be updated every three weeks to correspond to UniProt's release cycle. The data are being made available from several sites within NCBI:

This is a very nice development. I have always tended to look at the cross-references within NCBI records for information on Swiss-Prot ids. But now I can easily link out to the wealth of protein information provided at UniProt from my NCBI search results.
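
Out of curiosity, here is a sketch of pulling those cross-references out of a RefSeq protein record with the E-utilities. NP_000509 (human beta globin) is just an example accession, and the exact db_xref spelling is my assumption; the announcement only says the links will appear in the records:

```python
# Fetch a RefSeq protein record in GenPept flatfile format and list
# any UniProtKB cross-references it carries. The db_xref pattern is
# an assumption; adjust it to whatever the records actually contain.
import re
import urllib.request

accession = "NP_000509"
url = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
       "?db=protein&id=" + accession + "&rettype=gp&retmode=text")

with urllib.request.urlopen(url) as response:
    record = response.read().decode()

for xref in re.findall(r'db_xref="(UniProtKB[^"]+)"', record):
    print(accession, "->", xref)
```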

This simple announcement also brings to the fore, once again, the complex inter-relationships between a lot of life-science data, and why I don't think there will ever be a single Google-styled life-science database.

The Rich PDB interface explained

A while back I caught the video on the PDB site that explained all the functionality its search interface has. Thanks to that screencast I became a much more efficient querier of the PDB, especially after they adopted the new (now almost three years old) interface.

I strongly believe that screencasting can play a role in helping us all search better.

Since I work on crystallizing membrane proteins, I found the MPDB very useful and decided to screencast its features.

I sincerely hope that database creators and users alike take to this effective medium and screencast their tips and tricks for us all to benefit from.


Why are our bioinformatics workflows so complicated!

Last week, to answer one question, I had to resort to information from several sources. A lot of them contributed immense value to my “workflow”, and each was either difficult or easy to use. Below I have ranked them in terms of both value (1 for no value, 10 for a lot of value) and ease of use (1 for very complicated, 10 for very easy).

# Assembling my sequences in DNAstar (Value 10 : Ease 7)

# Compiling my sequences and pulling them into Jalview. Jalview's excellent CLUSTALW web-service interface allowed me to quickly edit the 32 sequences, align them interactively, and realize that all of my clones had basically two sequences for their CDRs. This got me thinking that the primers I used to clone my CDRs from my mouse kappa light chains were probably mis-priming. (Value 10 : Ease 9)

# Using PubMed to look at precedents, i.e. analyzing all papers that had sequenced the mouse antibody kappa light chain CDR region as I had attempted to do, and deriving the sequences of the primers they used. It took forever to get the right keywords to query, and I still have only three kappa light chain primer sequences. And they are all different! (Value 10 : Ease 1)

# Using my primer sequences, comparing them with the literature, and figuring out how I had misprimed and why my sequences were all of two types. (Still in progress. Value immense : Ease 1, i.e. still difficult to do)

# Using PubMed / NCBI Genome to understand the sequence space for mouse kappa light chains (Value 10 : Ease 4; see the sketch after this list)

# Using the EBI to get the same sequence data (Value 10 : Ease 8)
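
As promised above, here is a rough sketch of the NCBI sequence-space step, assuming the E-utilities and a free-text query; finding the right term is, as the rankings suggest, the hard part:

```python
# Search the protein database for mouse kappa light chain sequences
# and pull the hits down as FASTA. The query term is a guess; tuning
# it is exactly the painful part ranked above.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
term = urllib.parse.quote(
    '"Mus musculus"[Organism] AND immunoglobulin kappa light chain')

# step 1: esearch returns the ids of matching protein records
with urllib.request.urlopen(
        BASE + "esearch.fcgi?db=protein&term=" + term + "&retmax=20") as r:
    ids = [e.text for e in ET.parse(r).findall(".//Id")]

# step 2: efetch pulls those records down as FASTA
with urllib.request.urlopen(
        BASE + "efetch.fcgi?db=protein&id=" + ",".join(ids)
        + "&rettype=fasta&retmode=text") as r:
    print(r.read().decode())
```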

This is still a work in progress. But to summarize:

The PubMed steps were the most painful. PubMed search has to improve!

Jalview contributed the most value. For a free app, it's a must-have in any bioinformatics toolkit! DNAstar played its role, but given its cost (a few thousand dollars), it sure gave a lot less value than Jalview.

All of this begs the question: why are bioinformatics workflows so difficult? We are a long way from making these things easy for everyone!

The RESTful NCBI query

I first caught this on Pierre's blog.

NCBI, it turns out, can be queried along REST principles (hence the “RESTful” in the title). Ever since learning about REST-based URLs, I have wished that more web APIs implemented the ideology in their design. I was excited to learn how easy and intuitive it becomes to query a database using REST principles.

Gone are queries that looked like

http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene&cmd=search&term=dystrophin

And here come queries that look like this

http://view.ncbi.nlm.nih.gov/homologene/search/dystrophin 

which looks for genes homologous to dystrophin.
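
The charm of the REST style is that a query is just path segments, so building one is plain string concatenation. A small sketch, assuming the view.ncbi.nlm.nih.gov pattern shown above generalizes to other search terms:

```python
# Build and fetch a REST-style HomoloGene search URL. Assumes the
# /homologene/search/<term> pattern shown above works for any term.
import urllib.request

def homologene_search_url(term: str) -> str:
    return "http://view.ncbi.nlm.nih.gov/homologene/search/" + term

url = homologene_search_url("dystrophin")
print(url)
with urllib.request.urlopen(url) as response:
    page = response.read()
print(len(page), "bytes of results page")
```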

Several web APIs, like those for Connotea and del.icio.us, are also implemented RESTfully, making them very easy to query. For example, to get all entries tagged metagenomics on Connotea you would query the URL

http://www.connotea.org/tag/metagenomics

or, on del.icio.us, the URL

http://del.icio.us/tag/metagenomics

I don't yet know how extensive the possibilities of such querying of the NCBI are, but it looks so much easier than understanding equery.

Ref: NCBI resource locator.