NCBI Entrez Interface
NCBI makes a huge amount of data available via its Entrez interface and API. Documentation for using the API can be scarce, however, and using Entrez is frequently frustrating. Please add your tips and tricks for using Entrez here!
The current examples use BioPython as the Entrez interface. NCBI has also released command-line tools with similar functionality. If you can provide examples of how to implement similar scripts with the command-line tools, please share here!
Setup
The Entrez APIs require an email address. If you are using BioPython, set your email address before accessing Entrez like so:
from Bio import Entrez
Entrez.email = 'username@host'  # replace with your own address
Searching Entrez
A Basic Search
Suppose we want to search Entrez for the locus tag "CLL_A2397".
handle = Entrez.esearch(db='protein', term='CLL_A2397')
results = Entrez.read(handle)
handle.close()
Specify A Database
Let's specify that we want to search in RefSeq.
handle = Entrez.esearch(db='protein', term='refseq[FILTER] AND CLL_A2397')
results = Entrez.read(handle)
handle.close()
Specify An Organism
Suppose we want all GI numbers for humans from RefSeq. The NCBI tax id for Homo sapiens is 9606, so we will include txid9606[Organism] in the search term.
handle = Entrez.esearch(db='protein', term='refseq[FILTER] AND txid9606[Organism]')
results = Entrez.read(handle)
handle.close()
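The searches above differ only in the search term, so the term can be assembled from its parts. The following is a minimal sketch: build_term is a hypothetical helper, not part of BioPython, and it only builds the string that is passed as term to Entrez.esearch.

```python
def build_term(text=None, taxid=None, refseq_only=False):
    """Assemble an Entrez search term from optional parts (hypothetical helper)."""
    parts = []
    if refseq_only:
        parts.append('refseq[FILTER]')
    if taxid is not None:
        parts.append('txid{}[Organism]'.format(taxid))
    if text:
        parts.append(text)
    if not parts:
        raise ValueError('at least one search criterion is required')
    # Entrez combines criteria with the boolean operator AND.
    return ' AND '.join(parts)


print(build_term(text='CLL_A2397', refseq_only=True))
print(build_term(taxid=9606, refseq_only=True))
```

The two printed strings are the same terms used in the RefSeq and organism examples above.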
Getting Accession Numbers
Once you have parsed the results of a search with Entrez.read() as above, BioPython stores them as a dictionary. The GIs retrieved are listed under the 'IdList' key. Suppose we searched RefSeq for the locus tag 'CLL_A2397' as in an example above:
from Bio import Entrez
Entrez.email = 'username@host'  # replace with your own address
handle = Entrez.esearch(db='protein', term='refseq[FILTER] AND CLL_A2397')
results = Entrez.read(handle)
handle.close()
print(results['IdList'])
would print:
['187932364']
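A search that matches many records (such as the Homo sapiens example) returns only one page of IDs per request; ESearch returns 20 by default, and the 'Count' key holds the total number of matches. The sketch below pages through a search with the retstart and retmax parameters. collect_ids and entrez_page are hypothetical helper names, and the page size of 500 is an arbitrary assumption.

```python
def collect_ids(search_page, page_size=500):
    """Gather every ID for a search by requesting it page by page.

    search_page(retstart, retmax) must return a dictionary with 'Count'
    and 'IdList' keys, like the parsed result of Entrez.esearch.
    """
    ids = []
    retstart = 0
    while True:
        results = search_page(retstart, page_size)
        page = list(results['IdList'])
        ids.extend(page)
        retstart += page_size
        # Stop when the last page has been read or the server returns nothing.
        if not page or retstart >= int(results['Count']):
            return ids


def entrez_page(term, db='protein'):
    """Build a search_page function backed by BioPython (needs network access)."""
    from Bio import Entrez  # imported here so collect_ids works without BioPython

    def search_page(retstart, retmax):
        handle = Entrez.esearch(db=db, term=term, retstart=retstart, retmax=retmax)
        try:
            return Entrez.read(handle)
        finally:
            handle.close()

    return search_page
```

After setting Entrez.email, a call such as collect_ids(entrez_page('refseq[FILTER] AND CLL_A2397')) would return the full list of IDs for that term.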
Other Resources For Searching Entrez
List Of Filters For Entrez Searches
Fetching Records Using Entrez
Given a comma-separated list of GI numbers, you can retrieve the records using Entrez eFetch.
Retrieving Multiple Genbank Records At Once
I have had trouble parsing multiple records fetched at once with a comma-separated list when retrieving in Genbank format. Note that BioPython's SeqIO.parse(handle, 'genbank') function returns an iterator over the records rather than a list, so loop over it or wrap it in list(); SeqIO.read() raises an error if the handle contains more than one record. If you still run into trouble, or can confirm that this resolves it, please update this warning.
Suppose we want to get the Genbank record for the protein with GI number '187932364':
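from Bio import Entrez, SeqIO
# retmode='text' requests the flat-file format that SeqIO expects
handle = Entrez.efetch(db='protein', id='187932364', rettype='gb', retmode='text')
record = SeqIO.read(handle, 'genbank')  # read() expects exactly one record
handle.close()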
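For more than one GI number, a possible approach is to fetch the IDs in comma-separated batches and iterate over SeqIO.parse, which yields one record at a time. This is a sketch under assumptions rather than a tested recipe: batches and fetch_genbank_records are hypothetical helper names, and the batch size of 100 is arbitrary.

```python
def batches(ids, size=100):
    """Yield comma-separated ID strings containing at most `size` IDs each."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ','.join(ids[start:start + size])


def fetch_genbank_records(ids, db='protein', batch_size=100):
    """Yield one SeqRecord per fetched record (needs BioPython and network access)."""
    from Bio import Entrez, SeqIO  # imported here so batches() works without BioPython

    for id_string in batches(ids, batch_size):
        handle = Entrez.efetch(db=db, id=id_string, rettype='gb', retmode='text')
        try:
            # SeqIO.parse returns an iterator, so loop over it (or wrap it in list()).
            for record in SeqIO.parse(handle, 'genbank'):
                yield record
        finally:
            handle.close()
```

With Entrez.email set, `for record in fetch_genbank_records(results['IdList']): print(record.id)` would print the identifier of each fetched record.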
Parsing Genbank Files
The record that was parsed using SeqIO above is a Python object. Its features attribute is a list of BioPython SeqFeature objects. Each feature has a dictionary of "qualifiers", which contain information such as the locus tag, gene (if available), db_xrefs, etc. To see the qualifiers for a CDS feature in the Genbank record, we can run this code:
for feature in record.features:
    if feature.type == 'CDS':
        print(feature.qualifiers)
which will print the following dictionary:
{'locus_tag': ['CLL_A2397'], 'coded_by': ['complement(NC_010674.1:2488690..2490378)'], 'transl_table': ['11'], 'note': ['identified by match to protein family HMM PF02463; match to ...'], 'db_xref': ['GeneID:19966649'], 'gene': ['recN']}
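Each qualifier value is a list, and a key such as 'gene' may be missing altogether, so it helps to read qualifiers defensively. A minimal sketch, where first_qualifier is a hypothetical helper and the dictionary is a shortened copy of the output above:

```python
def first_qualifier(qualifiers, key, default='-'):
    """Return the first value stored under key, or default if it is absent or empty."""
    values = qualifiers.get(key)
    return values[0] if values else default


qualifiers = {'locus_tag': ['CLL_A2397'], 'transl_table': ['11'], 'gene': ['recN']}
print(first_qualifier(qualifiers, 'gene'))     # recN
print(first_qualifier(qualifiers, 'product'))  # - (no such qualifier here)
```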
Other Resources For Parsing Genbank Files
Great page from the Wilke Lab on parsing Genbank files
Dealing with GenBank files in Biopython by Peter Cock
eFetch And Parsing Results With BioPython [PDF]
An Example Of Finding RefSeq Gene By Locus Tag
This example script takes a tab-delimited file, where one column contains the locus tags, and prints the gene name in RefSeq for that locus tag (if the gene name is specified). If the gene name isn't specified, it simply prints -.
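The script itself is not reproduced on this page, so the following is only a sketch of how such a script could look, built from the search and fetch steps shown earlier. All function names are hypothetical, it assumes the first search hit is the record of interest, and it assumes the gene name is stored in the 'gene' qualifier of the CDS feature.

```python
import csv


def gene_name(record):
    """Return the gene name from the first CDS feature of a record, or '-'."""
    for feature in record.features:
        if feature.type == 'CDS':
            return feature.qualifiers.get('gene', ['-'])[0]
    return '-'


def lookup_gene(locus_tag):
    """Search RefSeq proteins for a locus tag and return its gene name, or '-'.

    Needs BioPython, network access, and Entrez.email to be set beforehand.
    """
    from Bio import Entrez, SeqIO  # imported here so the rest runs without BioPython

    handle = Entrez.esearch(db='protein', term='refseq[FILTER] AND ' + locus_tag)
    results = Entrez.read(handle)
    handle.close()
    if not results['IdList']:
        return '-'
    # Assumption: the first hit is the record we want.
    handle = Entrez.efetch(db='protein', id=results['IdList'][0],
                           rettype='gb', retmode='text')
    record = SeqIO.read(handle, 'genbank')
    handle.close()
    return gene_name(record)


def annotate(lines, column=0, lookup=lookup_gene):
    """Yield (locus_tag, gene_name) for each non-empty row of tab-delimited input."""
    for row in csv.reader(lines, delimiter='\t'):
        if len(row) > column and row[column]:
            yield row[column], lookup(row[column])


def main(path, column=0):
    """Print one 'locus_tag<TAB>gene_name' line per row of the file at path."""
    with open(path) as tsv:
        for tag, gene in annotate(tsv, column):
            print('{}\t{}'.format(tag, gene))
```

After setting Entrez.email, calling main('tags.tsv', column=1) would process a file (the name here is a placeholder) whose second column holds the locus tags. Passing a different lookup function to annotate makes it possible to try the file handling without contacting NCBI.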