Category Archives: Science

My attempts at explaining personal genomics

The other day I was talking to my parents about the fascinating world of personal genomics, and I ventured to write up my description of some of the ideas behind the HapMap and personal genomics in an email.

I am reproducing the email here, hoping that I can get pointers on explaining things better (I am, after all, only a biochemist/structural biologist).

The world of personal genomics is upon us. Companies like 23andMe and deCODE will run your genetic sample against a known list of variation markers and tell you things about yourself as suggested by your genes, or they will tell you which markers you share with which groups of people. Although this sounds amazing, a lot of it is very nuanced, and understanding it is a fun exercise. All of this will also change everything, or at least it has the potential to.

Let's start at the beginning, when you ask to be “typed”, or ask: what do my genes have? What is this stuff all about?
We are all quite different, i.e. you and I will probably have several hundred thousand differences between the two of us. Estimating exactly how different you and I are would require full sequencing of our genomes. This is quite expensive and takes a lot of time.
Instead, imagine if I told you that scientists have figured out that these differences occur in groups, i.e. they are linked together. Very crudely: if one of these passes from a parent to a child, it takes a few thousand of its neighbors along for the ride. So now, instead of getting information about the several hundred thousand actual differences, we can learn a lot by just looking at tens of thousands of these labels. For a particular label (or marker, or SNP) we can look at all the variation determined thus far, e.g. at position 59 all known human variation has either an A or a T, so you can belong to one of those two groups. Once you get this or any such label, you can infer the rest of the differences that the label is tied to. Collecting information on these labels is what projects like the HapMap do, and it is exactly the identity of these labels that a genotyping service will provide you with.
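
For the programmers in the family, the label idea can be sketched in a few lines of Python. This is only a toy: the position names and the linked neighbours below are invented for illustration and are not real HapMap data, and real linkage is statistical rather than a fixed lookup.

```python
# Hypothetical table: a typed label (position, letter) implies the letters
# at a handful of linked neighbouring positions. All values are made up.
LINKED_BLOCKS = {
    ("pos59", "A"): {"pos60": "G", "pos75": "C", "pos91": "T"},
    ("pos59", "T"): {"pos60": "A", "pos75": "C", "pos91": "G"},
}


def infer_neighbours(marker, allele):
    """Return the letters implied at linked positions for one typed label."""
    # Unknown labels imply nothing, so we return an empty dict for them.
    return dict(LINKED_BLOCKS.get((marker, allele), {}))


# Typing a single position tells us about its linked neighbours as well.
print(infer_neighbours("pos59", "A"))
```

The point of the sketch is only that one measured label stands in for many unmeasured differences.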

So what's the big deal? All of science is trying to figure out what makes person A die of a heart attack before he is 20, while person B lives till he is 80. As we all know, there are two parts to this story: “Nature”, or your genes, and “Nurture”, or your lifestyle. Science can more easily attempt to understand your genes, because that is “hard” information. In the case of persons A and B, science asks the question: what is in their genes that might have led to the outcome?
So, coming back to the point I made in the previous few paragraphs: instead of looking at the entire genome for differences between person A and person B, we can start by asking which markers or labels they share and which they do not. Then, taking the markers they don't share, we ask which among those are common in people who had heart attacks early. Looking at this information may lead to some clues about which genes led to A's and B's respective life expectancies.
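
The comparison described above is essentially set arithmetic. Here is a minimal Python sketch, assuming each person is reduced to a set of marker names; the names m1, m2 and so on are hypothetical placeholders, not real markers.

```python
# Hypothetical marker sets for the two people in the example.
person_a = {"m1", "m2", "m3", "m5"}
person_b = {"m1", "m4", "m5"}

# Hypothetical markers that are common among people with early heart attacks.
early_heart_attack_common = {"m2", "m3", "m7"}

# Markers both people carry.
shared = person_a & person_b

# Markers carried by exactly one of the two.
differing = person_a ^ person_b

# Of the differing markers, those also seen in the early heart-attack group.
candidates = differing & early_heart_attack_common

print(sorted(shared), sorted(differing), sorted(candidates))
```

A real study would compare thousands of people rather than two, but the filtering step has the same shape.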

Let's take another example. Say you want to test in advance “what genes cause allergies to sulfa drugs”. What you would do is give many people sulfa drugs, then check all the people who were allergic and look at which groups their genes belong to. At the end you ask the obvious question: all those people who came up with severe allergies to sulfa drugs, what group (or markers) did they have in common? In most cases this is not a single gene or single number, but for simplicity, the answer you get is something like: if you have marker A59C (a single nucleotide polymorphism, or SNP) then you have a 20% chance of being allergic. Also, since most traits don't depend on only one label or marker, the answers are quite diffuse and are given in terms of probabilities. Say, a 20% chance with A59C may become an 80% chance in combination with label G456A, and a 0.2% chance if you also had F555A. Do you get the point? If you don't, don't worry; as you can tell, it is quite complicated!
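
The made-up numbers in this example can be written down as a small lookup table. The Python sketch below assumes the three labels and percentages from the paragraph above, which are illustrative and not real associations. Any combination the example does not mention returns None instead of a guess.

```python
# The three labels used in the illustration (hypothetical, not real SNPs).
KNOWN_LABELS = {"A59C", "G456A", "F555A"}

# Chance of allergy for each combination mentioned in the example.
RISK_TABLE = {
    frozenset({"A59C"}): 0.20,
    frozenset({"A59C", "G456A"}): 0.80,
    frozenset({"A59C", "F555A"}): 0.002,
}


def allergy_risk(markers):
    """Look up the illustrative risk for a person's set of labels."""
    # Ignore labels that play no part in this toy example.
    relevant = frozenset(markers) & KNOWN_LABELS
    # Combinations the example does not cover give None, not a made-up number.
    return RISK_TABLE.get(relevant)


print(allergy_risk(["A59C"]))
print(allergy_risk(["A59C", "G456A"]))
print(allergy_risk(["A59C", "F555A"]))
```

Real studies estimate such numbers statistically from many people, and the combinations rarely reduce to a table this small.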

Regardless, the chances are that the more we ask such questions, the more we learn about these probabilities, and that is what most genetic research is looking to do.
So instead of studying, say, 10,000 mostly white Americans, things become more meaningful if you study 1,000,000 people from every corner of this earth. Then the numbers may all add up and give us a clearer label to associate with any given outcome, like a heart attack. That's what the HapMap is all about.

Anyway, getting back to the point: such studies are what give rise to the field of personal genomics, i.e. look at what markers you have and then compare them with known marker-result associations or known drug-effect associations.

As I have hopefully already convinced you, these studies are very nuanced. People expect cut-and-dried answers, and many may return disappointed. Also, people prone to hypochondria should definitely stay away, lest that 2% chance of a heart attack be converted to a very high chance of getting one because of the stress you put yourself through as a result of this knowledge (the nurture or lifestyle part).

This also points to how these studies, if carried out correctly, will change a lot of things: medicine, health care, the very nature of how we view ourselves.
These studies can attempt to answer questions like how similar South and North Indians are genetically. And as I just told you, nature is not cut and dried, and neither is human history! So interpreting these results, especially those with social or political implications, is a double-edged sword!

Before I end, you can check out one of these reports as given by companies like 23andMe.

So are you excited? Do you want to find out whether you have a 10% chance of arthritis? Or a 40% chance that you have descended from a Caucasian lineage?

More reading on this topic:

For the medically inclined: you can read an article published in Science magazine that talks about the implications of these studies for healthcare.


Sequence first, ask questions later?

I am a little confused after reading about the metagenomics approach that identified the causative agent for colony collapse disorder, which Deepak and I blogged about.

After trawling through PubMed, it seems that a number of the potential honeybee pathogens were already quite well known. The Kashmir bee virus and the Israeli acute paralysis virus were also lurking among bee populations. Was it not then possible to query this with a quick microarray designed following some text and sequence mining?

Or maybe it's just faster to sequence the whole bee and then perform the in vitro RT-PCR experiments, which are a little more targeted.

Maybe this does say something about the difficulty of on-the-fly, bioinformatics-driven microarray fabrication. Since the closest I have come to a microarray experiment is seeing the images on the web, I was just wondering aloud. I am hardly an expert.

Addendum: There is of course no denying the added benefits of the metagenomic approach, like the many other conclusions the paper made possible: that mite levels in CCD and non-CCD samples were similar, and that the microflora (like the bacteria in the bee gut) of Australian and American bees are similar. So I guess the question then is whether metagenomics is just so much more direct that it's going to be the first choice in all such open-ended questions, like “What causes infectious disease X?”

Back on the NCBI horse

I have been working a lot with alignments in Jalview and had blogged about how Google can find UniProt IDs better than NCBI. Well, it turns out that NCBI did indeed have most of the UniProt sequences I was looking for. The fault was mine, for not using the correct form of the UniProt ID.

The catch:
I had to say just Q57T52 instead of Q57T52_SALCH, and Q325Y4 instead of Q325Y4_SHIBS.
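
In code, the fix is a one-liner. Below is a minimal Python sketch, assuming every ID in the list has the ACCESSION_ORGANISM shape of the two examples above; the function name is my own invention. Note that curated Swiss-Prot entry names can have a different prefix that is not an accession, so this trick would not apply to those.

```python
def to_accession(identifier):
    """Drop the organism suffix, e.g. 'Q57T52_SALCH' becomes 'Q57T52'."""
    # Split at the first underscore only; IDs without one pass through unchanged.
    return identifier.split("_", 1)[0]


old_style_ids = ["Q57T52_SALCH", "Q325Y4_SHIBS"]
print([to_accession(i) for i in old_style_ids])
```

Running the whole list of IDs through a helper like this before querying would have saved me the confusion.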

Which brings me to one incredible thing about Google: Google Suggest and spelling correction. NCBI recently added a spelling correction feature, but still does not have something that would have told me to try Q57T52 instead of the old-style UniProt ID query Q57T52_SALCH.

So all in all, out of the 742 sequences that the manually curated Pfam database had used in its voltage_clc family alignment, I could find almost 640 at the NCBI using the NCBI web service. All it took was understanding the existence of the deprecated UniProt ID.

When I similarly tested the EBI web service for the same 742 sequences, only 582 sequences were obtainable in the uniprotxml format from the uniprotkb database.

As a final try, I looked for some of the sequences that were missing from the better-performing NCBI database by doing a Google search, which returned a match in the first few results. So Google is still quite amazing in its ability to index even words that are probably poorly page-ranked, like Q40LF7_DESAC. Surely the day they take on bioinformatics in a formal way will be a fun day to look forward to.

References: bbgm on a Google for Bioinformatics

Powered by ScribeFire.

Exciting times on the science web : Timo Hannay on Nascent

I was very excited to read Timo Hannay's post on the Nature Nascent blog, where he reproduced an excerpt from his piece for STM News on “how O'Reilly and the alpha-geek crowd have influenced Nature magazine”. Titled “Web opportunity”, the post talks about the great opportunities that lie in the web for all of science and science publishing.

In this very interesting post, Timo talks about the democratization of audio and video and Nature's experiments with the Nature podcast. The Nature podcast apparently started off as just an experiment and then grew to almost 30,000 downloads at the end of its first year.
The article talks about scientists who listen to the podcast when they are at the microscope, commuting, or exercising. In my own case, I find that thanks to the Nature podcast I am now even more inclined to pick up my print copy to follow up on something exciting I heard.

Apart from the ability of audio and video to organize and nucleate communities, Timo also talks about databases as the conduits that enable collaborations, and the role that publishers have in building communities. To this end, Nature's several Gateways are database-driven community resources that aggregate content from both the community and NPG journals in several areas.

The article makes good reading, and I will not paraphrase it any further.

If I were to rank the web offerings from Nature in terms of their value to my current scientific work, the ranking would go thus:
1) The Nature podcast
2) Connotea
3) The Nature Omics gateway

Screencasting for the Life Sciences

For the last many months, five of us have been toiling away at building a tool for the life sciences. I am very excited to announce our attempt at building a community for the life sciences.

As biology gets more and more complex, we all found that we were forced to wear many hats. The march of genomics into every area of life science forces us to learn new skills every day. There is no denying that every life scientist has to become well versed in computational data analysis. The lines between the former-day computational biologists, bioinformaticians, statisticians, crystallographers, theoreticians and bench life scientists are blurring every day. This is our attempt at building a community that can share its knowledge through the powerful medium of screencasting.

This is just a beta and, needless to say, we need you to give us all the feedback you can. So dive in, check out our blog, and let us know what you think.

References: Blog

Another post from one of us (Deepak Singh)

And last but not least, the website.

Nature Precedings: a great offering from the NPG stable

Nature just launched Nature Precedings, a home for pre-publication research and preliminary findings. Within seconds of browsing through its very intuitive interface, I got the purpose of this offering from Nature Publishing Group.

The way it works is simple: you can upload content in the form of Word documents, PowerPoint files and PDF files, and it gets released to the community after a preliminary check for appropriateness of content and suitability for the Nature Precedings audience. Signed-up users can then vote on the content (à la Digg) and it gets moved up or down in its category. All of the content is also searchable, linkable and citeable.

As the help pages suggest, I hope the site serves, at the least, as a repository of supplementary material and science findings related to work published anywhere, which can then be commented on and discussed.

More interestingly, the FAQ page informs us that the Nature Publishing Group journals do accept material that has appeared in un-peer-reviewed preprint form. So if I get together a manuscript, I can first post it on Nature Precedings and then separately send it to Nature for review, and Nature would still consider it (if it meets its other criteria, of course).

So this site could be a great place to establish the provenance of ideas, i.e. I have a great new finding, I am gutsy enough to write it up in some form and post it on Nature Precedings, and then a few months later I send the finished work to a print journal, like a Nature Publishing Group journal, that would accept it.

With all of this, Nature Precedings has great potential to become an online repository of pre-print findings, supplementary material and other content of use to the science community. I really can't wait for the first paper to make it from Nature Precedings to the real thing, Nature itself, with a citation that first appeared online!


Why Google may be better to find Uniprot sequences than the NCBI

My good friend Deepak had a quote on his blog from Lincoln Stein about making bioinformatics as much an everyday tool for the practicing biologist as a pipettor (a device used by experimental biologists and chemists to dispense liquids).

I totally agree, but I think we are quite far away. For example, this morning I had to obtain the sequences of 772 Swiss-Prot entries that were part of an alignment, for some downstream analysis. Of course my first choice was to query the NCBI Entrez database. I soon realized that the NCBI query box did not return any results for the first few queries I tried, all of which were probably new UniProt/Swiss-Prot IDs (e.g. sequence IDs Q57T52_SALCH, Q325Y4_SHIBS).

Disappointed, I turned to the EBI search engine. Within seconds I realized that the EBI does indeed serve up all of the entries. So there is a subset of UniProt entries that the NCBI does not have in its database.

Out of sheer curiosity I entered the queries that drew a blank at the NCBI into Google.

Wonder of wonders, Google pulled up all of the hard-to-find UniProt entries as the very first match.
Thanks to the increasing use of publicly accessible web service APIs , Google is becoming more and more aware of a lot of very specific sequence data.

I will be very happy when I can type Q57T52_SALCH calc=MW and get an answer back right inside Google. Maybe that day bioinformatics will move one step closer to becoming just another tool.

Till then I am stuck with learning about Equery and WSDL and SOAP and so on.
