I have been working a lot with alignments in Jalview, and I had blogged about how Google can find UniProt IDs better than NCBI. Well, it turns out that NCBI did indeed have most of the UniProt sequences I was looking for. The fault was mine, for not using the correct form of the UniProt ID.
I had to query for just Q57T52 instead of Q57T52_SALCH, and Q325Y4 instead of Q325Y4_SHIBS.
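The fix is mechanical, so here is a minimal Python sketch of it. The function name `to_accession` is my own (hypothetical), and the pattern is the accession format as I understand UniProt documents it. It assumes the part before the underscore is the accession, which holds for IDs like the ones above but not for every UniProt name, so anything that does not look like an accession comes back as `None`.

```python
import re

# Accession format as I understand UniProt documents it (an assumption here).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def to_accession(identifier):
    """Return the bare accession for IDs like 'Q57T52_SALCH', else None."""
    # Keep only the part before the first underscore, if there is one.
    prefix = identifier.strip().split("_", 1)[0]
    # Only return it when it really looks like an accession.
    return prefix if ACCESSION_RE.fullmatch(prefix) else None

for old_id in ["Q57T52_SALCH", "Q325Y4_SHIBS", "Q40LF7_DESAC"]:
    print(old_id, "->", to_accession(old_id))
```

Running a list of IDs through something like this before querying would have saved me the detour.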
Which brings me to one incredible thing about Google: Google Suggest and its spelling correction. NCBI recently added a spelling correction feature, but it still has nothing that would have told me to try Q57T52 instead of the old-style Q57T52_SALCH UniProt ID query.
So, all in all, out of the 742 sequences that the manually curated Pfam database used in its voltage_clc family alignment, I could find almost 640 at NCBI using the NCBI web service. All it took was understanding the existence of the deprecated UniProt ID form.
When I similarly tested the EBI web service for the same 742 sequences, only 582 were obtainable in the uniprotxml format from the uniprotkb database.
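To put the two counts side by side, here is a small Python sketch of the bookkeeping. The helper names `coverage` and `missing_ids` are hypothetical, the NCBI figure is the rounded "almost 640" from above, and the ID lists in the last lines are just illustrative placeholders, not my actual results.

```python
def coverage(found, total):
    """Percentage of requested sequences that a service returned."""
    return 100.0 * found / total

def missing_ids(requested, retrieved):
    """IDs that were asked for but not returned, in the original order."""
    retrieved = set(retrieved)
    return [seq_id for seq_id in requested if seq_id not in retrieved]

TOTAL = 742  # sequences in the Pfam alignment
for service, found in [("NCBI", 640), ("EBI", 582)]:
    print(f"{service}: {found}/{TOTAL} = {coverage(found, TOTAL):.1f}%")

# Illustrative placeholder lists only: which requested IDs still need a manual search.
print(missing_ids(["Q57T52", "Q325Y4", "Q40LF7"], ["Q57T52", "Q325Y4"]))
```

With these numbers NCBI comes out at roughly 86% and EBI at roughly 78% of the 742 sequences.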
As a final try, I did a Google search for some of the sequences that were missing even from the better-performing NCBI database, and a match came back in the first few results. So Google is still quite amazing in its ability to index even what are probably poorly page-ranked words like Q40LF7_DESAC. Surely the day they take on bioinformatics in a formal way will be a fun day to look forward to.
References: bbgm on a Google for Bioinformatics