Back on the NCBI horse

I have been working a lot with alignments in Jalview, and I had blogged about how Google can find UniProt IDs better than NCBI. Well, it turns out that NCBI did indeed have most of the UniProt sequences I was looking for. The fault was mine, for not using the correct form of the UniProt ID.

The catch
I had to query for just Q57T52 instead of Q57T52_SALCH, and Q325Y4 instead of Q325Y4_SHIBS.
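
Here is a rough sketch of how that conversion could be scripted for a whole list of IDs. The function name `to_accession` is my own invention, and the regular expression is the accession pattern as I understand UniProt documents it, so treat both as assumptions. It only helps when the part before the underscore is itself an accession, which seems to be the case for unreviewed entries; for a name like ARNA_SALCH it gives up and returns None rather than guess.

```python
import re

# UniProt accession pattern (assumed from the UniProt documentation).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def to_accession(identifier):
    """Return the bare accession for IDs like Q57T52_SALCH, else None.

    Hypothetical helper: keeps the text before the first underscore
    only if it looks like a UniProt accession.
    """
    prefix = identifier.strip().upper().split("_", 1)[0]
    return prefix if ACCESSION_RE.fullmatch(prefix) else None


if __name__ == "__main__":
    for name in ["Q57T52_SALCH", "Q325Y4_SHIBS", "ARNA_SALCH"]:
        print(name, "->", to_accession(name))
```

Anything that comes back as None would still need a proper lookup to find its accession.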

Which brings me to one incredible thing about Google: Google Suggest and spelling correction. NCBI recently added a spelling correction feature, but it still has nothing that would have told me to try Q57T52 instead of the old-style Q57T52_SALCH UniProt ID query.

So all in all, out of the 742 sequences that the manually curated Pfam database used in its voltage_clc family alignment, I could find almost 640 at the NCBI using the NCBI web service. All it took was understanding the existence of the deprecated UniProt ID.

When I similarly tested the EBI web service for the same 742 sequences, only 582 sequences were obtainable in the uniprotxml format from the uniprotkb database.

As a final try, I looked for some of the sequences that were missing from the better-performing NCBI database with a Google search, which returned a match in the first few results. So Google is still quite amazing in its ability to index even words that are probably poorly page-ranked, like Q40LF7_DESAC. Surely the day they take on bioinformatics in a formal way will be a fun day to look forward to.

References: bbgm on a Google for Bioinformatics


2 responses to “Back on the NCBI horse”

  1. Q57T52_SALCH is not a “deprecated” identifier: It’s the mnemonic (or “entry name”) for Q57T52. The mnemonic system makes more sense for curated entries such as P0C0R6 (ARNA_SALCH). For unreviewed entries the first part of the mnemonic is simply the identifier… It’s still strange that NCBI indexes the mnemonics for reviewed but not unreviewed entries. In any case if you are dealing with UniProt entries it’s probably best to go to the official UniProt web site, or better, to which has a pretty good batch retrieval function!

  2. Hi Eric.
    Thanks for your pointer. The batch retrieval tool looks very easy to use. Now I can conceivably go from my list to a whole bunch of sequences on file.
    I'll give this a try and blog back about it once I do.
