Batch extracting various coordinates of various FASTA IDs directly from NCBI?
1
0
Entering edit mode
4.6 years ago
yotiao • 0

Hello,

here is a problem: I am running a tblastn from my command line (Mac) querying SRA WGS NCBI database with a number of protein sequences. My goal is to get aligned subject sequences from the database (in nucleotide form) and use them in the downstream analysis. However, tblastn may have a bug (or it may be that I misunderstand what it's doing) because when I try to get output with the parameters -outfmt "6 sseqid sstart send evalue pident sseq I am getting an aligned protein sequence, not nucleotide (the sseq parameter there). -outfmt "6 sseqid sstart send evalue pident qseq correctly outputs aligned query sequence (which is protein). Interestingly enough, if I repeat the procedure in a browser and download the hit, it correctly downloads the nucleotide subject sequence (even though the alignment displayed is for a protein sequence) ¯_(ツ)_/¯.

But this post is not about a potential bug in tblastn. Because of the above behaviour, I have to find a workaround. My ideal scenario is that I have a bunch of sequence IDs, position start, position end in a file (just like a BED file):

XM_006538500.2 2456 8733
NM_001136130.2 23 99
NM_000484.3 430 776


and submit it to NCBI to only get the relevant bits of each sequence.

My question: is there an easy way of doing it (ie single command input -> multi FASTA output file)? I know about BEDtools, but they require me to download the sequences first, which I'd rather not do (some of them are huge as they are sequencing contigs). R package rentrez (which uses Eutils tools, specifically entrez_fetch) does not have an option of providing start-stop coordinates to extraction.

I would be very grateful for hints on whether this is doable the easy way of whether I do have to download the sequences first and extract their coordinates second.

Thanks! yot

ncbi fasta rentrez bed tblastn • 1.6k views
5
Entering edit mode
4.6 years ago

ncbi efetch has seq_start and seq_end params: https://www.ncbi.nlm.nih.gov/books/NBK25499/

$echo -e "XM_006538500.2 2456 8733\nNM_001136130.2 23 99" | while read -a F ; do wget -q -O - "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${F[0]}&seq_start=${F[1]}&seq_stop=${F[2]}&rettype=fasta" ; done
>XM_006538500.2:2456-2813 PREDICTED: Mus musculus alkaline phosphatase, liver/bone/kidney (Alpl), transcript variant X5, mRNA
CTCTGGAACACTGGGCCCCATAGTCACGGCCAGTCCTCAAGCCCAACCCTCCCTGGGGGGAAGACCAGGT
CTGCTCAGGATGAGACTCCCAGGAAGCCACCTCCGGGGTTGGCTGTCTACCCAGGGTTGCCAAGCTGGGA
AGAACACTCCAGCCGGACAGGACACACACACACACTCCCCACCCAATTGCAGAGACTCGCCAACCCTTCA
CTGAAGTGGCTCTCCTGTTTGGAATAGCGGGGTGGGGTGGGGGAGAAGAAAGAAAGAAAGAAAAAAAATT
TTTAATTTCTCTTTTTGGTGTTGGTTAAAAGGGAACACAAGACATTTAAATAAAACATCCCAAATATTTC
TGAGGCCA

>NM_001136130.2:23-99 Homo sapiens amyloid beta precursor protein (APP), transcript variant 6, mRNA
CTGAGCCCCGCCGCCGCGCTCGGGCTCCGTCAGTTTCCTCGGCAGCGGTAGGCGAGAGCACGCGGAGGAG
CGTGCGC

0
Entering edit mode

Ha, good ol' command line! I knew about efetch but assumed it would only work on single IDs. while read is news to me, thanks a lot! I'll check it out and then give you a thumb up.