here is a problem: I am running a tblastn from my command line (Mac) querying SRA WGS NCBI database with a number of protein sequences. My goal is to get aligned subject sequences from the database (in nucleotide form) and use them in the downstream analysis. However, tblastn may have a bug (or it may be that I misunderstand what it's doing) because when I try to get output with the parameters
-outfmt "6 sseqid sstart send evalue pident sseq I am getting an aligned protein sequence, not nucleotide (the
sseq parameter there).
-outfmt "6 sseqid sstart send evalue pident qseq correctly outputs aligned query sequence (which is protein). Interestingly enough, if I repeat the procedure in a browser and download the hit, it correctly downloads the nucleotide subject sequence (even though the alignment displayed is for a protein sequence) ¯_(ツ)_/¯.
But this post is not about a potential bug in tblastn. Because of the above behaviour, I have to find a workaround. My ideal scenario is that I have a bunch of sequence IDs, position start, position end in a file (just like a BED file):
XM_006538500.2 2456 8733 NM_001136130.2 23 99 NM_000484.3 430 776
and submit it to NCBI to only get the relevant bits of each sequence.
My question: is there an easy way of doing it (ie single command input -> multi FASTA output file)? I know about
BEDtools, but they require me to download the sequences first, which I'd rather not do (some of them are huge as they are sequencing contigs). R package
rentrez (which uses
Eutils tools, specifically
entrez_fetch) does not have an option of providing start-stop coordinates to extraction.
I would be very grateful for hints on whether this is doable the easy way of whether I do have to download the sequences first and extract their coordinates second.