Hi, I am new to bioinformatics and I am struggling with filtering out my bam file from STAR alignment (RNA-Seq).
After going through a few previous filtering process (properly paired, and uniquely mapping), the current bam file head looks like:
A00521:147:H2H22DSXY:1:2140:19777:35430 99 chr1 4691088 255 100M = 4691193 205 CCTTCTCTCTCCTTGGCCATTGACTCTTTAACAACCCTGGAATGGTCTATATAAAAGCAACACTCCCCAGCTAAGGCAGCACGTAGCCCTCCTATTCTGC :FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:198 nM:i:0
A00521:147:H2H22DSXY:1:2140:19777:35430 147 chr1 4691193 255 100M = 4691088 -205 TACTTCTGACAATGAAGTCCGTGATTCTTGAAGGTGACTGACGGAAGTTTCTATTTTCTCTACATCTAAATCTATGGCTGCCCTCAGACTGTCATAATTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF: NH:i:1 HI:i:1 AS:i:198 nM:i:0
A00521:147:H2H22DSXY:1:2202:32841:28307 99 chr1 4692954 255 1S100M = 4693013 158 GTCCTCTTCGAACTGCAACTTTTCTACGTTATCCCTTTTTTTTCTCTTCCCTAATCTCAGTTGGTGTTTTTTCTTGAAAGGCACAGCCATTCCAGGACTGG F,FF,F:,FF:,FFFF,,F,,,::,,F,,F,FFFFFFF,,F:FFFFF:FFF,F::F:FFF,,::,FFF,,:FF::,F:F:FF,F:,FF,FF:,,FF,,F:F NH:i:1 HI:i:1 AS:i:185 nM:i:6
A00521:147:H2H22DSXY:1:2202:32841:28307 147 chr1 4693013 255 99M = 4692954 -158 TTGGTGTTTTTTCGTGAAAGGCACAGCCTGTCCAGGAATGGAAGCTGCCATTCCCGTGTCGTTGGTTTCCTGTACTACCTCCACTATGGTGGCTCATAC :::F::FFFF,:F,F:FFFFFF:FFFFF,,:FFF:FF,:F,FFFF,:F,F,,FFFF:::F::,::F,:FF,:,:F,FF:,FF:F:,,F,,F,FFFF,:F NH:i:1 HI:i:1 AS:i:185 nM:i:6
A00521:147:H2H22DSXY:1:2419:5240:32737 99 chr1 4724721 255 1S98M = 4724765 144 ATCCCCCGTTGCACCCAGAATTTCTCTGAACTTAAGTCAGCTCAGCAAAAATGGAAAGTGAATGTCAGGGCAAAACCTGTACTAGAACCGTGGTAATAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:196 nM:i:0
A00521:147:H2H22DSXY:1:2419:5240:32737 147 chr1 4724765 255 100M = 4724721 -144 CAAAAATGGAAAGTGAATGTCAGGGCAAAACCTGTACTAGAACCGTGGTAATACAGAGTAAATGAGTCCCTGACTCTTACGGGTCTTTCATTCTAGTGAG FFFFFFFFFFFFFFFF,FFFF:FFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFF:FFFFF NH:i:1 HI:i:1 AS:i:196 nM:i:0
A00521:147:H2H22DSXY:1:2453:29595:28166 99 chr1 4740099 255 100M = 4740159 160 GTGTCTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTGTGACATATACATGAGTATCCACAGAAACCAGAAGAGGGTCTTGAATACGTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF,FFF::FF::,FFF:FFFFFF:FFFFFFFF:F::FF:F,:FF: NH:i:1 HI:i:1 AS:i:198 nM:i:0
A00521:147:H2H22DSXY:1:1428:4065:14011 99 chr1 4740101 255 101M = 4740159 158 GTGTCTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTGTGACATATACATGAGTATCCACAGAAACCAGAAGAGGGTCTTGAATACGTGGAGT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,F,FF:FF:FF:FFF,,F:FFFFFFFFFFFFFF:F,FFFFFFF NH:i:1 HI:i:1 AS:i:195 nM:i:2
A00521:147:H2H22DSXY:1:1428:4065:14011 147 chr1 4740159 255 100M = 4740101 -158 GAGTATCCACAGAAACCAGAAGAGGGTCTTGAATACGTGGAGTGGGAAGGAGCGGAGGTTGAGAGTTTCCTGCGGTAGGAACTGGGAACTGAAGTTGGGT FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:195 nM:i:2
A00521:147:H2H22DSXY:1:2453:29595:28166 147 chr1 4740159 255 100M = 4740099 -160 GAGTATCCACAGAAACCAGAAGAGGGTCTTGAATACGTGGAGTGGGAAGGAGCGGAGGTTGAGAGTTTCCTGCGGTAGGAACTGGGAACTGAAGTTGGGT FFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFF:FF:FFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:198
And the current bam file flagstat looks like:
32151192 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
32151192 + 0 mapped (100.00% : N/A)
32151192 + 0 paired in sequencing
16075596 + 0 read1
16075596 + 0 read2
32151192 + 0 properly paired (100.00% : N/A)
32151192 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
From here, I am now trying to only keep reads based on following condition: For a read on forward strand, the start position of mate should greater than or equal to the start position of read, and conversely for reads on reverse strand.
For forward strand, I tried
samtools view -F 16 original.bam | awk '{if ($4<$8 || $4 ==$8) print $0}'
For reverse strand, I tried
samtools view -f 16 original.bam | awk '{if ($4>$8 || $4 ==$8) print $0}'
This works fine separately, but in this case, I need to merge two bam files at the end, since the original bam file was filtered separately.
Is there any way that I can apply those conditions all at once to the original bam file and output the new filtered bam file??
Or, is merging the two separated filtered bam files the best way?
Your help/assistance will be greatly appreciated.. Thank you so much.
Are you really finding a significant number of reads that are "properly paired" that fail your awk filter?
Yeah, I checked that there are still many reads that don't meet the conditions after filtered with "properly-paired".. I want to know how to pipe (combine) the two codes to generate one filtered bam file..