Question: extract one-side unmapped reads from sam
0
gravatar for liorglic
2.1 years ago by
liorglic310
liorglic310 wrote:

Hello,
I've ran TopHat and aligned paired end (PE) reads to a reference genome.
TopHat outputs two bam files: accepted_hits.bam and unmapped.bam.
I'd like to extract from unmapped.bam only those reads where only one side of the pair (R1 or R2) was not mapped, i.e. exclude records where both R1 and R2 are unmapped.
An interesting fact that may have to do with the specifics of TopHat, is that the only sam flags in the file are 69 and 133. According to this nice tool, this means:
69 - read paired (0x1), read unmapped (0x4), first in pair (0x40)
133 - read paired (0x1), read unmapped (0x4), second in pair (0x80)
Pairs where both reads are unmapped have two records, one with flag 69 and one with flag 133. For example:

SOLEXA1:1:1:10:1400     69      *       0       255     *       *       0       0       GGTGCATCTACTTCCCTAATTTTGGCTGAAACTTCAATCCGAANGGTGTTNNTCCGAGATC   AB8CBBBBBBBBBABBBBABBBBBB@ABABA@AAABBB@AA?8%+=7:>=%%<?A=@>@;;
SOLEXA1:1:1:10:1400     133     *       0       255     *       *       0       0       NGGAACAANACCNTTCGGATTNAAGTTTNAGNCNAANTTAGGGNAGTT        %/;9;;;/%-;0%0;9;;;#############################  ZT:A:N

While pairs where only one read is unmapped have only one record with either 69 or 133:

SOLEXA1:1:1:10:141      133     *       0       255     *       *       0       0       NGGTACTACAAAAATTGAAGTNGACTATTACNGTCACTCTCCTGTTTC        %.;8;;9;;;9;;9;;;;;9,%/99;;;;;/%.8;;9;##########

Is there any way, using samtools (or any other tool) to extract the pairs I'm interested in? I guess I could just script this, but I suspect there might be a better way to go.
Thanks a lot!

sam samtools bam tophat • 733 views
ADD COMMENTlink modified 2.1 years ago by Carlo Yague5.0k • written 2.1 years ago by liorglic310

You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

ADD REPLYlink written 2.1 years ago by WouterDeCoster44k

Aligners will write un-mapped reads to separate files (saving you this step). BBMap (outu=) and HISTAT2 (--un option) are a couple of options. STAR (--outReadsUnmapped Fastx) can also do this. So consider using a newer aligner.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax85k
0
gravatar for Carlo Yague
2.1 years ago by
Carlo Yague5.0k
Canada
Carlo Yague5.0k wrote:

I'd like to extract from unmapped.bam only those reads where only one side of the pair (R1 or R2) was not mapped, i.e. exclude records where both R1 and R2 are unmapped

That would be :

samtools view -f 5 -F 8

Which mean that you take all read that are paired and unmapped (-f 5) whose mates are not unmapped (-F 8).

PS: if your mate-related flags are uncorrectly set, you can use samtools fixmate before.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Carlo Yague5.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 809 users visited in the last hour