Hello,
here is a problem: I am running a tblastn from my command line (Mac) querying SRA WGS NCBI database with a number of protein sequences. My goal is to get aligned subject sequences from the database (in nucleotide form) and use them in the downstream analysis. However, tblastn may have a bug (or it may be that I misunderstand what it's doing) because when I try to get output with the parameters -outfmt "6 sseqid sstart send evalue pident sseq
I am getting an aligned protein sequence, not nucleotide (the sseq
parameter there). -outfmt "6 sseqid sstart send evalue pident qseq
correctly outputs aligned query sequence (which is protein). Interestingly enough, if I repeat the procedure in a browser and download the hit, it correctly downloads the nucleotide subject sequence (even though the alignment displayed is for a protein sequence) ¯_(ツ)_/¯.
But this post is not about a potential bug in tblastn. Because of the above behaviour, I have to find a workaround. My ideal scenario is that I have a bunch of sequence IDs, position start, position end in a file (just like a BED file):
XM_006538500.2 2456 8733
NM_001136130.2 23 99
NM_000484.3 430 776
and submit it to NCBI to only get the relevant bits of each sequence.
My question: is there an easy way of doing it (ie single command input -> multi FASTA output file)? I know about BEDtools
, but they require me to download the sequences first, which I'd rather not do (some of them are huge as they are sequencing contigs). R package rentrez
(which uses Eutils
tools, specifically entrez_fetch
) does not have an option of providing start-stop coordinates to extraction.
I would be very grateful for hints on whether this is doable the easy way of whether I do have to download the sequences first and extract their coordinates second.
Thanks! yot
Ha, good ol' command line! I knew about
efetch
but assumed it would only work on single IDs.while read
is news to me, thanks a lot! I'll check it out and then give you a thumb up.