Retrieving specific subsequences from multiple references on NCBI?
2.1 years ago
cxr • 0


I have a set of about a thousand sequence regions in TSV form (Accession_Number, Strand, Start, End), and I'm struggling to figure out the best way to retrieve them from NCBI's databases. I've tried Batch Entrez, but it only returns the entire record for each entry rather than just my regions of interest. Does anyone have insight into how best to retrieve multiple subsequences across multiple references from NCBI's databases?

An example of the data I'm working with for context:

BA000007.3  1   5017083 5018620  
CP053370.1  1   10266   11819  
CP053370.1  1   106369  107922  
CP053370.1  1   112532  114085  
CP053370.1  1   216122  217675
ncbi reference databases dna rna • 699 views
2.1 years ago
GenoMax 143k

You can use Entrez Direct (EDirect) for this. Loop over your list and call efetch with -seq_start and -seq_stop for each region.

$ efetch -db nuccore -id CP053370.1 -seq_start 10266 -seq_stop 10300 -format fasta
>CP053370.1:10266-10300 Lysinibacillus sphaericus strain NEU 1003 chromosome
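
The loop mentioned above could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: the function name `build_efetch_cmds` and the file name `regions.tsv` are hypothetical, the column order (accession, strand, start, end) is taken from the question, and passing the strand column straight to efetch's `-strand` option assumes 1 means plus and 2 means minus. The function only prints the commands, so they can be inspected before anything is sent to NCBI.

```shell
# Sketch: print one efetch command per region (dry run; nothing is downloaded here).
# Assumes columns: accession, strand, start, end, separated by tabs or spaces.
build_efetch_cmds() {
  while read -r acc strand start end || [ -n "$acc" ]; do
    # skip blank lines
    [ -z "$acc" ] && continue
    # assumption: the strand column uses efetch's 1 = plus, 2 = minus convention
    printf 'efetch -db nuccore -id %s -seq_start %s -seq_stop %s -strand %s -format fasta\n' \
      "$acc" "$start" "$end" "$strand"
  done < "$1"
}
```

Once the printed commands look right, something like `build_efetch_cmds regions.tsv | sh > subseqs.fasta` would run them in order. For a thousand regions it may be worth pausing between requests to stay within NCBI's request-rate limits.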

I see that the 5 rows posted comprise only 2 unique seq-ids. The proposed efetch method will download the entire sequence for the seq-id every time. While this won't be an issue for a few sequences, it can become slow for a large number of them. Perhaps something like this will help:

## make a bed-like file with regions
## (BED starts are 0-based; subtract 1 from 1-based starts for exact coordinates)
$ cat regions.txt 
BA000007.3      5017083 5018620
CP053370.1      10266   11819
CP053370.1      106369  107922
CP053370.1      112532  114085
CP053370.1      216122  217675
## use efetch to download sequences for each seq-id only once
## then use seqkit subseq to extract sequences 
$ cut -f1 regions.txt | sort -u | epost -db nuccore | efetch -format fasta | seqkit subseq --bed regions.txt
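
Since the question's table is 4-column (accession, strand, start, end) with 1-based coordinates, the region file for `seqkit subseq --bed` could be derived from it along these lines. This is a sketch under stated assumptions: `tsv_to_bed`, `regions.tsv`, and `regions.bed` are hypothetical names, the start is shifted by one because BED uses 0-based starts and 1-based inclusive ends, and the strand column is dropped, so minus-strand regions would need separate handling.

```shell
# Sketch: convert "accession strand start end" (1-based) to 3-column BED (0-based start).
# The strand column ($2) is ignored here; every region is treated as plus strand.
tsv_to_bed() {
  awk 'NF >= 4 { printf "%s\t%d\t%s\n", $1, $3 - 1, $4 }' "$1"
}
```

For example, `tsv_to_bed regions.tsv > regions.bed` would write the file that the pipeline above reads in place of `regions.txt`.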
