I am using BWA to align RNA-seq data from a miRNA experiment.
I want to compare myself to another article - so I am using the same techniques its authors used.
That is using fastq-mcf to clip the adapters, and BWA (Burrows Wheeler Aligner) to align the reads.
The reads are not aligned to the entire genome. Instead they are aligned to the mature miRNA forms, downloaded as fasta files from miRBase (example below).
After doing the alignment I get a SAM file (example below).
I want to get the count of each miRNA from that SAM file.
The way I thought of doing the count, is to count the number of rows that map to each miRNA in the SAM file. I would only take the rows in which FLAG (second column) & 4 == 0. That is - the read is mapped. The miRNA identity will be computed from the RNAME (third column).
Is this the right way to do the counting?
To begin with, I thought of using htseq-count, but the SAM file doesn't really contain the coordinates of the miRNAs in the reference sequence, so htseq cannot work.
Samples lines from the fasta file of the mature miRNA forms:
>mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p
TGAGGTAGTAGTTTGTACAGTT
ACTGTACAGGCCACTGCCTTGC
>mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p
Sample lines from the SAM assembly file:
HISEQ:47:C6FUWANXX:1:1101:2080:1992 4 * 0 0 * * 0 0 TTGGCAATGGTAGAACTCACACCGAAA <BBBFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:8020:1999 0 mmu-miR-182-5p 2 37 24M * 0 0 TTGGCAATGGTAGAACTCACACCG <BBBFFBFF<FBFFF/</F<F<FF XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:
24
HISEQ:47:C6FUWANXX:1:1101:8660:1996 4 * 0 0 * * 0 0 CGGATCCGTCTGAGCTTGGCTTT <BBBFBFFFFFFFFF<FBBBFFB
HISEQ:47:C6FUWANXX:1:1101:16161:2000 4 * 0 0 * * 0 0 TTTCCGTAGTGTAGTGGTTATCACGTTCGCCTC <BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:19163:2000 0 mmu-miR-182-5p 2 37 23M * 0 0 TTGGCAATGGTAGAACTCACACC <BBBFFFFFFFFFFFFFFFFFFF XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:23
HISEQ:47:C6FUWANXX:1:1101:20578:2000 0 mmu-miR-182-5p 2 37 22M * 0 0 TTGGCAATGGTAGAACTCACAC <BBBFFFFFFFFFFFFFFFFFF XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:22
HISEQ:47:C6FUWANXX:1:1101:1200:2043 4 * 0 0 * * 0 0 CACAAATGATGAACCTTTTGACGGGCGGACAGAAACTCTGTGCTGAATGTC BBBBBFFFFFFFFFFFFFFB<<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:1212:2073 4 mmu-miR-92a-3p 1 25 22M * 0 0 TATTGCACTTGTCCCGGCCTGT BBBBBFFFFFFFFFFFFFFFFF XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:21C0
HISEQ:47:C6FUWANXX:1:1101:1139:2100 4 mmu-miR-92a-3p 1 25 22M * 0 0 TATTGCACTTGTCCCGGCCTGT BBBBBFFFFFFFFFFFF/FFFF XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:21C0
Devon keeps beating me to it today. Using datamash and assuming it's a sorted bam (else add '-s' to the datamash command) ...
I'm an hour ahead, so I have a bit of an advantage :P
Hi Devon, I have small RNAs reads from 18 - 30 nt. How can I count number of mapped reads according to their length? Thanks!
Please post things like this as new questions in the future. However, one possible method would be:
I know you can do arrays in awk and that that'd be faster than sorting, but for a one off sort of thing this is quick enough.
Thanks a lot Devon! I'll post another relative question in a new one.
Shouldn't
samtools flagstat
give the same result? I just run this and see that it gives exact counts in the third row from 'flagstat' output.e.g. "919853 + 0 mapped (96.88%:-nan%)"When I run the above command, I get an error: