Question

Batch extracting various coordinates of various FASTA IDs directly from NCBI?

0

Entering edit mode

6.2 years ago

yotiao • 0

Hello,

here is a problem: I am running a tblastn from my command line (Mac) querying SRA WGS NCBI database with a number of protein sequences. My goal is to get aligned subject sequences from the database (in nucleotide form) and use them in the downstream analysis. However, tblastn may have a bug (or it may be that I misunderstand what it's doing) because when I try to get output with the parameters -outfmt "6 sseqid sstart send evalue pident sseq I am getting an aligned protein sequence, not nucleotide (the sseq parameter there). -outfmt "6 sseqid sstart send evalue pident qseq correctly outputs aligned query sequence (which is protein). Interestingly enough, if I repeat the procedure in a browser and download the hit, it correctly downloads the nucleotide subject sequence (even though the alignment displayed is for a protein sequence) ¯_(ツ)_/¯.

But this post is not about a potential bug in tblastn. Because of the above behaviour, I have to find a workaround. My ideal scenario is that I have a bunch of sequence IDs, position start, position end in a file (just like a BED file):

XM_006538500.2 2456 8733
NM_001136130.2 23 99
NM_000484.3 430 776

and submit it to NCBI to only get the relevant bits of each sequence.

My question: is there an easy way of doing it (ie single command input -> multi FASTA output file)? I know about BEDtools, but they require me to download the sequences first, which I'd rather not do (some of them are huge as they are sequencing contigs). R package rentrez (which uses Eutils tools, specifically entrez_fetch) does not have an option of providing start-stop coordinates to extraction.

I would be very grateful for hints on whether this is doable the easy way of whether I do have to download the sequences first and extract their coordinates second.

Thanks! yot

ncbi fasta rentrez bed tblastn • 1.9k views

ADD COMMENT • link updated 6.2 years ago by Pierre Lindenbaum 161k • written 6.2 years ago by yotiao • 0

score 5 · Accepted Answer · 2018-02-28

ncbi efetch has seq_start and seq_end params: https://www.ncbi.nlm.nih.gov/books/NBK25499/

$ echo -e "XM_006538500.2 2456 8733\nNM_001136130.2 23 99" | while read -a F ; do wget -q -O - "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${F[0]}&seq_start=${F[1]}&seq_stop=${F[2]}&rettype=fasta" ; done
>XM_006538500.2:2456-2813 PREDICTED: Mus musculus alkaline phosphatase, liver/bone/kidney (Alpl), transcript variant X5, mRNA
CTCTGGAACACTGGGCCCCATAGTCACGGCCAGTCCTCAAGCCCAACCCTCCCTGGGGGGAAGACCAGGT
CTGCTCAGGATGAGACTCCCAGGAAGCCACCTCCGGGGTTGGCTGTCTACCCAGGGTTGCCAAGCTGGGA
AGAACACTCCAGCCGGACAGGACACACACACACACTCCCCACCCAATTGCAGAGACTCGCCAACCCTTCA
CTGAAGTGGCTCTCCTGTTTGGAATAGCGGGGTGGGGTGGGGGAGAAGAAAGAAAGAAAGAAAAAAAATT
TTTAATTTCTCTTTTTGGTGTTGGTTAAAAGGGAACACAAGACATTTAAATAAAACATCCCAAATATTTC
TGAGGCCA

>NM_001136130.2:23-99 Homo sapiens amyloid beta precursor protein (APP), transcript variant 6, mRNA
CTGAGCCCCGCCGCCGCGCTCGGGCTCCGTCAGTTTCCTCGGCAGCGGTAGGCGAGAGCACGCGGAGGAG
CGTGCGC