I ran salmon with the --writeUnmappedNames option; now I want to extract and analyze those unmapped reads, so I'd like to learn the best way to do it.
First of all, which tool should I use for the extraction? For now I use something like seqtk subseq $fN1z $fnLst > $fnOut1 (where $fN1z is the filename of the fasta file and $fnLst is aux_info/unmapped_names.txt from the salmon output). This works but is a bit slow, so I am wondering whether there is a better tool than seqtk for this task.
Second, I want to make sure I correctly understand the output for paired-end reads. In the manual, I see different cases of reads being unmapped, like "u = entire pair", "m1 = left orphan", etc. As I understand it, I should extract separately from the two fastq files: from *_1.fq I extract reads marked u and m1, and from *_2.fq I extract reads marked u and m2 (let's ignore the m12 case for now). Is that correct?
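To make the plan concrete, here is a minimal sketch of how the names file could be split into one list per mate. It assumes each line of unmapped_names.txt is a read name followed by its flag, separated by whitespace. The function name and the two output files are hypothetical, and sending m1 to the *_1.fq list simply mirrors my current understanding, which is exactly the part I am unsure about.

```shell
# Split a salmon unmapped-names file into one name list per mate.
# Assumption: every line looks like "<read name> <flag>", flag in {u, m1, m2, m12}.
# split_unmapped_names and its output files are illustrative names, not part of salmon.
split_unmapped_names() {
    names_file=$1   # e.g. aux_info/unmapped_names.txt
    list_1=$2       # names to extract from *_1.fq
    list_2=$3       # names to extract from *_2.fq

    # Create both lists up front so they exist even if no flag matches.
    : > "$list_1"
    : > "$list_2"

    awk -v o1="$list_1" -v o2="$list_2" '
        $2 == "u"  { print $1 > o1; print $1 > o2 }  # whole pair unmapped
        $2 == "m1" { print $1 > o1 }                 # routed as described in the question
        $2 == "m2" { print $1 > o2 }                 # routed as described in the question
        # m12 is ignored for now, as in the question
    ' "$names_file"
}
```

Each resulting list could then be passed to seqtk subseq in place of $fnLst, once for each fastq file.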
P.S. Lastly, a slightly off-topic (purely Unix) question: what would be the fastest way to sort those extracted reads by abundance? I run the following: sed -n 'n;p;n;n' $fnOut1 | sort | uniq -c | sort -k1 -r -n > $fnSort - it does the job but takes quite some time...
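For comparison, a hash-based count avoids the first full sort, so only the distinct sequences get sorted at the end. This is a minimal sketch, assuming the fastq file has exactly four lines per record and that the set of distinct sequences fits in memory. The function name count_seqs is hypothetical.

```shell
# Count identical sequences in a 4-line-per-record FASTQ and list them
# by decreasing abundance as "<count> <sequence>".
# Assumption: no wrapped sequence lines, so the sequence is line 2 of every 4.
count_seqs() {
    awk 'NR % 4 == 2 { n[$0]++ }                 # tally each sequence line
         END { for (s in n) print n[s], s }' "$1" |
        LC_ALL=C sort -k1,1nr                    # numeric, descending, byte-wise locale
}
```

It would be called as count_seqs $fnOut1 > $fnSort. Note that the output format differs slightly from uniq -c, which pads the counts with leading spaces.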