Batch extracting various coordinates of various FASTA IDs directly from NCBI?
1
0
Entering edit mode
6.2 years ago
yotiao • 0

Hello,

here is a problem: I am running a tblastn from my command line (Mac) querying SRA WGS NCBI database with a number of protein sequences. My goal is to get aligned subject sequences from the database (in nucleotide form) and use them in the downstream analysis. However, tblastn may have a bug (or it may be that I misunderstand what it's doing) because when I try to get output with the parameters -outfmt "6 sseqid sstart send evalue pident sseq I am getting an aligned protein sequence, not nucleotide (the sseq parameter there). -outfmt "6 sseqid sstart send evalue pident qseq correctly outputs aligned query sequence (which is protein). Interestingly enough, if I repeat the procedure in a browser and download the hit, it correctly downloads the nucleotide subject sequence (even though the alignment displayed is for a protein sequence) ¯_(ツ)_/¯.

But this post is not about a potential bug in tblastn. Because of the above behaviour, I have to find a workaround. My ideal scenario is that I have a bunch of sequence IDs, position start, position end in a file (just like a BED file):

XM_006538500.2 2456 8733
NM_001136130.2 23 99
NM_000484.3 430 776

and submit it to NCBI to only get the relevant bits of each sequence.

My question: is there an easy way of doing it (ie single command input -> multi FASTA output file)? I know about BEDtools, but they require me to download the sequences first, which I'd rather not do (some of them are huge as they are sequencing contigs). R package rentrez (which uses Eutils tools, specifically entrez_fetch) does not have an option of providing start-stop coordinates to extraction.

I would be very grateful for hints on whether this is doable the easy way of whether I do have to download the sequences first and extract their coordinates second.

Thanks! yot

ncbi fasta rentrez bed tblastn • 1.9k views
ADD COMMENT
5
Entering edit mode
6.2 years ago

ncbi efetch has seq_start and seq_end params: https://www.ncbi.nlm.nih.gov/books/NBK25499/

$ echo -e "XM_006538500.2 2456 8733\nNM_001136130.2 23 99" | while read -a F ; do wget -q -O - "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${F[0]}&seq_start=${F[1]}&seq_stop=${F[2]}&rettype=fasta" ; done
>XM_006538500.2:2456-2813 PREDICTED: Mus musculus alkaline phosphatase, liver/bone/kidney (Alpl), transcript variant X5, mRNA
CTCTGGAACACTGGGCCCCATAGTCACGGCCAGTCCTCAAGCCCAACCCTCCCTGGGGGGAAGACCAGGT
CTGCTCAGGATGAGACTCCCAGGAAGCCACCTCCGGGGTTGGCTGTCTACCCAGGGTTGCCAAGCTGGGA
AGAACACTCCAGCCGGACAGGACACACACACACACTCCCCACCCAATTGCAGAGACTCGCCAACCCTTCA
CTGAAGTGGCTCTCCTGTTTGGAATAGCGGGGTGGGGTGGGGGAGAAGAAAGAAAGAAAGAAAAAAAATT
TTTAATTTCTCTTTTTGGTGTTGGTTAAAAGGGAACACAAGACATTTAAATAAAACATCCCAAATATTTC
TGAGGCCA

>NM_001136130.2:23-99 Homo sapiens amyloid beta precursor protein (APP), transcript variant 6, mRNA
CTGAGCCCCGCCGCCGCGCTCGGGCTCCGTCAGTTTCCTCGGCAGCGGTAGGCGAGAGCACGCGGAGGAG
CGTGCGC
ADD COMMENT
0
Entering edit mode

Ha, good ol' command line! I knew about efetch but assumed it would only work on single IDs. while read is news to me, thanks a lot! I'll check it out and then give you a thumb up.

ADD REPLY

Login before adding your answer.

Traffic: 2371 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6