Question

Closed: Extract unmapped reads from salmon output

5.3 years ago

vaushev ▴ 20

I ran salmon with the --writeUnmappedNames option. Now I want to extract and analyze those unmapped reads, and I would like to know the best way to do it.
First, which tool should I use for the extraction? Currently I use something like seqtk subseq $fN1z $fnLst > $fnOut1 (where $fN1z is the FASTQ file of reads and $fnLst is aux_info/unmapped_names.txt from the salmon output). This works but is a bit slow, so I am wondering whether there is a better tool than seqtk for this task.
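
For comparison with the seqtk call above, here is a minimal awk-only sketch of the same name-based extraction. The function name extract_by_name is hypothetical. The sketch assumes strict 4-line FASTQ records and that the read ID in the names file exactly matches the first whitespace-delimited token of the FASTQ header (without the leading "@"). No claim is made that it is faster than seqtk; it would need to be timed on real data.

```shell
# extract_by_name NAMES < in.fastq > out.fastq
# Hypothetical helper: prints the FASTQ records whose read ID is listed
# in the first column of NAMES. Assumes 4 lines per FASTQ record.
extract_by_name() {
    awk -v names="$1" '
        BEGIN {
            # Load the wanted read IDs; only column 1 is used,
            # so a trailing flag column (u, m1, ...) is ignored.
            while ((getline line < names) > 0) {
                if (split(line, f, " ") > 0) keep[f[1]] = 1
            }
            close(names)
        }
        NR % 4 == 1 {
            # Header line: drop the leading "@" and look the ID up.
            id = substr($1, 2)
            hit = (id in keep)
        }
        hit   # print every line of a record whose header matched
    '
}
```

Usage would look like extract_by_name "$fnLst" < reads_1.fq > "$fnOut1", or with gzip -dc piped in if the reads file is compressed.
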
Second, I want to make sure I correctly understand the output for paired-end reads. The manual lists different cases of unmapped reads, such as "u = entire pair", "m1 = left orphan", etc. As I understand it, I should extract from the two FASTQ files separately: from *_1.fq I extract the reads marked u and m1, and from *_2.fq I extract the reads marked u and m2 (ignoring the m12 case for now). Is that correct?
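
To make the proposed split concrete, here is a minimal sketch that builds one name list per mate. The function name names_for_mate is hypothetical, and the sketch assumes unmapped_names.txt has two whitespace-separated columns (read name, then flag). It encodes the interpretation described above (u plus m1 for mate 1, u plus m2 for mate 2), which is exactly the point in question and should be checked against the salmon documentation; m12 and any other flags are skipped.

```shell
# names_for_mate MATE NAMES
# Hypothetical helper: prints the read names flagged "u" or "m<MATE>".
# Assumes column 1 is the read name and column 2 is the flag.
names_for_mate() {
    awk -v want="m$1" '$2 == "u" || $2 == want { print $1 }' "$2"
}
```

For example, names_for_mate 1 aux_info/unmapped_names.txt > names_1.txt would give the list to use with *_1.fq, and names_for_mate 2 the list for *_2.fq.
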

P.S. Lastly, a slightly off-topic (purely Unix) question: what would be the fastest way to sort the extracted reads by abundance? I run the following: sed -n 'n;p;n;n' $fnOut1 | sort | uniq -c | sort -k1 -r -n > $fnSort - it does the job but takes quite some time.
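
One possible variant of the pipeline above, as a sketch only: count the sequences in an awk hash so that only the distinct sequences need to be sorted, and run sort in the C locale. The function name count_seqs is hypothetical. It assumes 4-line FASTQ records and enough memory to hold every distinct sequence; whether it is actually faster would need to be timed on the real data.

```shell
# count_seqs < reads.fastq
# Hypothetical helper: prints "count sequence" lines, most abundant first.
# Assumes 4 lines per FASTQ record, with the sequence on line 2.
count_seqs() {
    # Tally each sequence line in a hash, then sort only the tallies.
    awk 'NR % 4 == 2 { n[$0]++ } END { for (s in n) print n[s], s }' |
        LC_ALL=C sort -k1,1nr
}
```

Usage would look like count_seqs < "$fnOut1" > "$fnSort".
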

RNA-Seq rna-seq salmon
