Question

Retrieve all FASTA RefSeq files for a given entry in NCBI gene?

1

Entering edit mode

9.4 years ago

Nancy Ouyang ▴ 170

I would like to retrieve all the FASTA RefSeq files for a given Gene entry in NCBI, e.g. http://www.ncbi.nlm.nih.gov/gene/3803#reference-sequences

How should I go about this?

[EDIT: In more detail, I would like 1) the refseq independent of genome build 2) the hg38 version if it exists 3) the hg38 alt_loci versions if they exist.]

My inclination is to do some python web scraping (collect as "FASTA" links under that div, then scrape that page and write it out to a FASTA file), but possibly there an easier way to do this through NCBI Entrez Batch Entry or biopython interface to Entrez, or through filtering results in nuccore, which allows me to download all results as a single FASTA file.

My manual way is to click each "FASTA" link, click send > File > FASTA > Create File and save the file, which is not reasonable for >100 sequences (the KIR genes on hg38 and hg38 alt loci).

I tried NCBI nucleotide, but it returns more results than match NCBI gene, e.g. http://www.ncbi.nlm.nih.gov/nuccore/?term=KIR2DL2, specifically "KIR2DL2[All Fields] AND ("Homo sapiens"[Organism] AND biomol_genomic[PROP] AND refseq[filter])"

returns 38 results instead of 6. Regardless, using nuccore is still a little cumbersome, though it would reduce my work by an order of magnitude.

ncbi • 5.1k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Nancy Ouyang ▴ 170

0

Entering edit mode

duplicate of Get Fasta File With Protein Sequences Given Entrez Gene Ids

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

it's not an exact duplicate -- that's asking for the protein sequences, I'm looking for the genomic reference seqs

ADD REPLY • link 9.4 years ago by Nancy Ouyang ▴ 170

0

Entering edit mode

Note: Turns out "(KIR2DL2[All Fields] AND ("Homo sapiens"[Organism] AND biomol_genomic[PROP] AND refseq[filter])) AND KIR2DL2[Gene Name]" gives me what I expect (7 sequences) and is hand for double-checking. The accepted answer + bash script linked in the comments really allow me to batch the process, though.

ADD REPLY • link 9.4 years ago by Nancy Ouyang ▴ 170

Ram · Answer 1 · 2014-12-05

1

Entering edit mode

9.4 years ago

5heikki 11k

With Entrez Direct, is this the output you are expecting (esl-seqstat ships with HMMER)?

epost -db gene -id 3803 | elink -target nuccore | efilter -query "Homo sapiens[Organism] AND biomol_genomic[PROP] AND refseq[filter]" | efetch -format fasta | esl-seqstat -a -
= gi|568815557|ref|NT_187686.1|   215732 Homo sapiens chromosome 19 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_33 HSCHR19KIR_FH13_BA2_HAP_CTG3_1
= gi|568815548|ref|NW_003571056.2|  1064304 Homo sapiens chromosome 19 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_3 HSCHR19LRC_LRC_I_CTG3_1
= gi|568815544|ref|NT_187675.1|   282224 Homo sapiens chromosome 19 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_27 HSCHR19KIR_FH05_B_HAP_CTG3_1
= gi|568815537|ref|NT_187668.1|   205194 Homo sapiens chromosome 19 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_20 HSCHR19KIR_RSH_BA2_HAP_CTG3_1
= gi|568815507|ref|NT_187640.1|   204239 Homo sapiens chromosome 19 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_14 HSCHR19KIR_G248_BA2_HAP_CTG3_1
= gi|568815503|ref|NT_187636.1|   248807 Homo sapiens chromosome 19 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_10 HSCHR19KIR_FH15_B_HAP_CTG3_1
= gi|148389938|ref|NG_005994.1|   196471 Homo sapiens KIR gene cluster, haplotype B (KIRB@) on chromosome 19
Format:              FASTA
Alphabet type:       DNA
Number of sequences: 7
Total # residues:    2416971
Smallest:            196471
Largest:             1064304
Average length:      345281.6

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by 5heikki 11k

1

Entering edit mode

Yes! Thanks a bunch, I was about to start on a web scraping script, now I can just write a simple bash script that is much more reliable.

ADD REPLY • link 9.4 years ago by Nancy Ouyang ▴ 170

0

Entering edit mode

Entrez Direct is truly a godsend. Just don't abuse NCBI's serves by sending numerous requests per second by

for ID in $(cat list)
do
epost -db gene -id $ID | ..
done

Instead, include like 500 IDs in the $ID string (separated by commas).

I posted A: How to get protein ID from gene ID (batch entrez) yesterday that you could easily modify for your needs.

ADD REPLY • link 9.4 years ago by 5heikki 11k

0

Entering edit mode

Thanks for the tip, I was about to do exactly that ^^;

ADD REPLY • link 9.4 years ago by Nancy Ouyang ▴ 170

0

Entering edit mode

Hmm :( Seems like this pulls out the whole sequence, e.g. 190kbp, instead of the region, e.g. 19kbp. Will keep playing around, but may have to go back to web scraping idea.

ADD REPLY • link 9.4 years ago by Nancy Ouyang ▴ 170

0

Entering edit mode

I have a feeling that you just need to change your efilter query..

ADD REPLY • link 9.4 years ago by 5heikki 11k

0

Entering edit mode

I think I have a working solution via Entrez xtract "unrecognized argument '-match'" error parsing the XML of the gene entry -> running efetch with the seq start and end. I'll report back when I get all the steps working together. Thanks for showing me how to work with entrez direct!

ADD REPLY • link 9.4 years ago by Nancy Ouyang ▴ 170