Question: extract one-side unmapped reads from sam
0
gravatar for liorglic
10 months ago by
liorglic50
liorglic50 wrote:

Hello,
I've ran TopHat and aligned paired end (PE) reads to a reference genome.
TopHat outputs two bam files: accepted_hits.bam and unmapped.bam.
I'd like to extract from unmapped.bam only those reads where only one side of the pair (R1 or R2) was not mapped, i.e. exclude records where both R1 and R2 are unmapped.
An interesting fact that may have to do with the specifics of TopHat, is that the only sam flags in the file are 69 and 133. According to this nice tool, this means:
69 - read paired (0x1), read unmapped (0x4), first in pair (0x40)
133 - read paired (0x1), read unmapped (0x4), second in pair (0x80)
Pairs where both reads are unmapped have two records, one with flag 69 and one with flag 133. For example:

SOLEXA1:1:1:10:1400     69      *       0       255     *       *       0       0       GGTGCATCTACTTCCCTAATTTTGGCTGAAACTTCAATCCGAANGGTGTTNNTCCGAGATC   AB8CBBBBBBBBBABBBBABBBBBB@ABABA@AAABBB@AA?8%+=7:>=%%<?A=@>@;;
SOLEXA1:1:1:10:1400     133     *       0       255     *       *       0       0       NGGAACAANACCNTTCGGATTNAAGTTTNAGNCNAANTTAGGGNAGTT        %/;9;;;/%-;0%0;9;;;#############################  ZT:A:N

While pairs where only one read is unmapped have only one record with either 69 or 133:

SOLEXA1:1:1:10:141      133     *       0       255     *       *       0       0       NGGTACTACAAAAATTGAAGTNGACTATTACNGTCACTCTCCTGTTTC        %.;8;;9;;;9;;9;;;;;9,%/99;;;;;/%.8;;9;##########

Is there any way, using samtools (or any other tool) to extract the pairs I'm interested in? I guess I could just script this, but I suspect there might be a better way to go.
Thanks a lot!

sam samtools bam tophat • 362 views
ADD COMMENTlink modified 10 months ago by Carlo Yague4.4k • written 10 months ago by liorglic50

You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

ADD REPLYlink written 10 months ago by WouterDeCoster38k

Aligners will write un-mapped reads to separate files (saving you this step). BBMap (outu=) and HISTAT2 (--un option) are a couple of options. STAR (--outReadsUnmapped Fastx) can also do this. So consider using a newer aligner.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax65k
0
gravatar for Carlo Yague
10 months ago by
Carlo Yague4.4k
Belgium
Carlo Yague4.4k wrote:

I'd like to extract from unmapped.bam only those reads where only one side of the pair (R1 or R2) was not mapped, i.e. exclude records where both R1 and R2 are unmapped

That would be :

samtools view -f 5 -F 8

Which mean that you take all read that are paired and unmapped (-f 5) whose mates are not unmapped (-F 8).

PS: if your mate-related flags are uncorrectly set, you can use samtools fixmate before.

ADD COMMENTlink modified 10 months ago • written 10 months ago by Carlo Yague4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2071 users visited in the last hour