How to download only a certain interval of a sequence from NCBI rather than the whole FASTA file?
4.2 years ago
hazirliver ▴ 10

Hi! I have a list of accession numbers from NCBI and the intervals I need to download. Is there a way to download only a certain interval rather than the whole FASTA file, using Python/Biopython or R (or some other software)? An example list is provided below.

id            acc_no         start   stop
10000002717 NZ_GG774949.1   1662245 1662896
10000003767 NZ_GG774949.1   1678553 1679990
10000003783 NZ_GG774796.1   257545  258028
FASTA GenBank Biopython R • 1.6k views

It might help to say where you want to download them from, and whether you want to use any specific software or tool.

Please go through "Please read before posting a question: How To Ask A Good Question" and then consider editing your question.


I mean FASTA sequences that can be downloaded from NCBI. As for software, I am willing to try anything, but Python/Biopython or R would be most convenient.

4.2 years ago

Use NCBI efetch:

$ wget -O - -q "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NZ_GG774949.1&retmode=text&rettype=fasta&seq_start=1662245&seq_stop=1662896"

>NZ_GG774949.1:1662245-1662896 Bacteroides sp. 3_1_23 supercont1.1, whole genome shotgun sequence
ACAAGTACTTGTTTTATAGTTACCCGTCCTAAATTACGGGAAGTGTTTGCTGCAAATTTCCGTGACCGTA
TTGTACAACATTGGTTGTGCCTACGCTTAGAGCCACTGTTTGAGGCACGTTTTGTTGAACACGGAAATGT
ATCATTTAACTGTCGAAAGGGTTTTGGAACATTTGCATGTATTGATCAGTTGACAAAAAATACAATTGAA
GTCTCTGATAATTATTCGCACGAGGCTTGGTATGCTCAATTTGATATTAAAGGATTTTTTATGTCAATTG
ATTGCGAACGATTATTAGAACACTTATTACCATTTATCAAAGAAAAATGGAATTATTGGAAAGGGACCAT
ATATGAACAAGATTTAGATTTAGTGCTATGGCTTACAGAAATAATTGTACGACATCGACCACAAGATGAT
TGTATACGTCAAGGAAATTTAAAATTATGGAGAATACTGCCTAAAAACAAAAGCCTGTTTTACAATGAAT
GGATGAAAGGCGAACCAATAGGAAACCTAACTAGTCAATTATTTGCCAATTTTTACATGTCATTTTTTGA
TGAATGGGCTATTAAAGCAGCAGAAGAAAGAGGAGCCAAATATGTACGTTTTGTAGATGATTTTAGCTTT
GTGTGCAAAACTAAGGAAGATG
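
Since the question asks for Python, the same request can be sketched with the standard library. This is only a minimal sketch: the helper names `efetch_url` and `download_intervals` are made up for illustration, the URL parameters are the ones from the wget call above, and the half-second pause is an assumption chosen to stay under NCBI's limit of three requests per second without an API key.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Rows copied from the question: (id, acc_no, start, stop).
ROWS = [
    ("10000002717", "NZ_GG774949.1", 1662245, 1662896),
    ("10000003767", "NZ_GG774949.1", 1678553, 1679990),
    ("10000003783", "NZ_GG774796.1", 257545, 258028),
]


def efetch_url(acc_no, start, stop):
    """Build the efetch URL for the interval start..stop (1-based, inclusive)."""
    params = {
        "db": "nucleotide",
        "id": acc_no,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def download_intervals(rows, pause=0.5):
    """Yield the FASTA text of each interval (hypothetical helper, needs network)."""
    for _row_id, acc_no, start, stop in rows:
        with urlopen(efetch_url(acc_no, start, stop)) as response:
            yield response.read().decode()
        time.sleep(pause)  # assumed pause to respect NCBI's request rate limit
```

Calling `"".join(download_intervals(ROWS))` should give one multi-FASTA string with a record per row of the table.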
4.2 years ago
GenoMax 141k

Using Entrez Direct (EDirect) on the command line:

$ efetch -db nuccore -id "NZ_GG774949.1" -seq_start 1662245 -seq_stop 1662896 -format fasta
4.2 years ago
vkkodali_ncbi ★ 3.7k

EDirect works well for a relatively small number of sequences. If the total number of accessions (column 2 of your table) you are dealing with is small, you may want to first download the entire FASTA for those accessions and then use something like bedtools or seqkit to extract specific ranges from the complete sequences. This approach will be quicker than fetching the sequence for each range directly from NCBI servers.
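
To illustrate the extraction step, here is a minimal sketch in plain Python. It stands in for bedtools or seqkit rather than reproducing them, and the function names `read_fasta` and `extract_interval` are hypothetical. It assumes the start/stop columns are 1-based and inclusive, which matches efetch's seq_start/seq_stop; if you switch to bedtools, note that BED coordinates are 0-based and half-open, so start would need to be reduced by one.

```python
def read_fasta(lines):
    """Parse FASTA lines into {accession: sequence}, keyed by the first header word."""
    sequences = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def extract_interval(sequences, acc_no, start, stop):
    """Return the 1-based, inclusive interval start..stop of one sequence."""
    return sequences[acc_no][start - 1:stop]


# Tiny made-up record, only to show the coordinate convention.
demo = read_fasta([">demo_acc.1 made-up example", "ACGTAC", "GTTTGA"])
print(extract_interval(demo, "demo_acc.1", 3, 8))  # prints GTACGT
```

With real data you would pass an open file of the downloaded FASTA to `read_fasta` and call `extract_interval` once per row of the table.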


Since this is an important distinction, I have moved this to a separate answer.

2.3 years ago
jackson9 ▴ 10

Fetch a specific region of a FASTA record from NCBI with Biopython:

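from Bio import Entrez
# NCBI asks for a contact address with every E-utilities request
Entrez.email = 'yourmail@gmail.com'
# seq_start and seq_stop are 1-based and inclusive
handle = Entrez.efetch(db='nucleotide', id='NC_051849.1', rettype='fasta', retmode='text', seq_start=33845728, seq_stop=33848021)
fasta_text = handle.read()
handle.close()
# Length of the returned FASTA text, including the header line and newlines
print(len(fasta_text))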
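
To apply this to the whole table from the question, something along these lines might work. It is a sketch under assumptions: `parse_table` and `fetch_intervals` are hypothetical helper names, the output file name `intervals.fasta` is arbitrary, and Biopython is imported inside the function only so that the parsing helper can be tried without Biopython installed.

```python
TABLE = """\
id            acc_no         start   stop
10000002717 NZ_GG774949.1   1662245 1662896
10000003767 NZ_GG774949.1   1678553 1679990
10000003783 NZ_GG774796.1   257545  258028
"""


def parse_table(text):
    """Turn the whitespace-separated table into (id, acc_no, start, stop) tuples."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        if line.strip():
            row_id, acc_no, start, stop = line.split()
            rows.append((row_id, acc_no, int(start), int(stop)))
    return rows


def fetch_intervals(rows, email, out_path="intervals.fasta"):
    """Fetch each interval with Entrez.efetch and write all of them to one FASTA file."""
    from Bio import Entrez  # imported here so parse_table works without Biopython

    Entrez.email = email
    with open(out_path, "w") as out:
        for _row_id, acc_no, start, stop in rows:
            handle = Entrez.efetch(db="nucleotide", id=acc_no, rettype="fasta",
                                   retmode="text", seq_start=start, seq_stop=stop)
            out.write(handle.read())
            handle.close()
```

A call such as `fetch_intervals(parse_table(TABLE), email="yourmail@gmail.com")` would then write one FASTA record per row.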