extract one-side unmapped reads from sam
1
0
Entering edit mode
3.4 years ago
liorglic ▴ 440

Hello,
I've ran TopHat and aligned paired end (PE) reads to a reference genome.
TopHat outputs two bam files: accepted_hits.bam and unmapped.bam.
I'd like to extract from unmapped.bam only those reads where only one side of the pair (R1 or R2) was not mapped, i.e. exclude records where both R1 and R2 are unmapped.
An interesting fact that may have to do with the specifics of TopHat, is that the only sam flags in the file are 69 and 133. According to this nice tool, this means:
69 - read paired (0x1), read unmapped (0x4), first in pair (0x40)
133 - read paired (0x1), read unmapped (0x4), second in pair (0x80)
Pairs where both reads are unmapped have two records, one with flag 69 and one with flag 133. For example:

SOLEXA1:1:1:10:1400     69      *       0       255     *       *       0       0       GGTGCATCTACTTCCCTAATTTTGGCTGAAACTTCAATCCGAANGGTGTTNNTCCGAGATC   AB8CBBBBBBBBBABBBBABBBBBB@ABABA@AAABBB@AA?8%+=7:>=%%<?A=@>@;;
SOLEXA1:1:1:10:1400     133     *       0       255     *       *       0       0       NGGAACAANACCNTTCGGATTNAAGTTTNAGNCNAANTTAGGGNAGTT        %/;9;;;/%-;0%0;9;;;#############################  ZT:A:N


While pairs where only one read is unmapped have only one record with either 69 or 133:

SOLEXA1:1:1:10:141      133     *       0       255     *       *       0       0       NGGTACTACAAAAATTGAAGTNGACTATTACNGTCACTCTCCTGTTTC        %.;8;;9;;;9;;9;;;;;9,%/99;;;;;/%.8;;9;##########


Is there any way, using samtools (or any other tool) to extract the pairs I'm interested in? I guess I could just script this, but I suspect there might be a better way to go.
Thanks a lot!

sam bam samtools tophat • 1.1k views
0
Entering edit mode

You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

0
Entering edit mode

Aligners will write un-mapped reads to separate files (saving you this step). BBMap (outu=) and HISTAT2 (--un option) are a couple of options. STAR (--outReadsUnmapped Fastx) can also do this. So consider using a newer aligner.

0
Entering edit mode
3.4 years ago

I'd like to extract from unmapped.bam only those reads where only one side of the pair (R1 or R2) was not mapped, i.e. exclude records where both R1 and R2 are unmapped

That would be :

samtools view -f 5 -F 8


Which mean that you take all read that are paired and unmapped (-f 5) whose mates are not unmapped (-F 8).

PS: if your mate-related flags are uncorrectly set, you can use samtools fixmate before.