Question: Batch extracting various coordinates of various FASTA IDs directly from NCBI?
gravatar for yotiao
2.1 years ago by
United Kingdom
yotiao0 wrote:


here is a problem: I am running a tblastn from my command line (Mac) querying SRA WGS NCBI database with a number of protein sequences. My goal is to get aligned subject sequences from the database (in nucleotide form) and use them in the downstream analysis. However, tblastn may have a bug (or it may be that I misunderstand what it's doing) because when I try to get output with the parameters -outfmt "6 sseqid sstart send evalue pident sseq I am getting an aligned protein sequence, not nucleotide (the sseq parameter there). -outfmt "6 sseqid sstart send evalue pident qseq correctly outputs aligned query sequence (which is protein). Interestingly enough, if I repeat the procedure in a browser and download the hit, it correctly downloads the nucleotide subject sequence (even though the alignment displayed is for a protein sequence) ¯_(ツ)_/¯.

But this post is not about a potential bug in tblastn. Because of the above behaviour, I have to find a workaround. My ideal scenario is that I have a bunch of sequence IDs, position start, position end in a file (just like a BED file):

XM_006538500.2 2456 8733
NM_001136130.2 23 99
NM_000484.3 430 776

and submit it to NCBI to only get the relevant bits of each sequence.

My question: is there an easy way of doing it (ie single command input -> multi FASTA output file)? I know about BEDtools, but they require me to download the sequences first, which I'd rather not do (some of them are huge as they are sequencing contigs). R package rentrez (which uses Eutils tools, specifically entrez_fetch) does not have an option of providing start-stop coordinates to extraction.

I would be very grateful for hints on whether this is doable the easy way of whether I do have to download the sequences first and extract their coordinates second.

Thanks! yot

bed rentrez tblastn fasta ncbi • 801 views
ADD COMMENTlink modified 2.1 years ago by Pierre Lindenbaum127k • written 2.1 years ago by yotiao0
gravatar for Pierre Lindenbaum
2.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum127k wrote:

ncbi efetch has seq_start and seq_end params:

$ echo -e "XM_006538500.2 2456 8733\nNM_001136130.2 23 99" | while read -a F ; do wget -q -O - "${F[0]}&seq_start=${F[1]}&seq_stop=${F[2]}&rettype=fasta" ; done
>XM_006538500.2:2456-2813 PREDICTED: Mus musculus alkaline phosphatase, liver/bone/kidney (Alpl), transcript variant X5, mRNA

>NM_001136130.2:23-99 Homo sapiens amyloid beta precursor protein (APP), transcript variant 6, mRNA
ADD COMMENTlink written 2.1 years ago by Pierre Lindenbaum127k

Ha, good ol' command line! I knew about efetch but assumed it would only work on single IDs. while read is news to me, thanks a lot! I'll check it out and then give you a thumb up.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by yotiao0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 788 users visited in the last hour