Counting mapped reads from a SAM file
3
6
Entering edit mode
9.4 years ago
kobipe3 ▴ 90

I am using BWA to align RNA-seq data from a miRNA experiment.

I want to compare myself to another article - so I am using the same techniques its authors used.

That is using fastq-mcf to clip the adapters, and BWA (Burrows Wheeler Aligner) to align the reads.

The reads are not aligned to the entire genome. Instead they are aligned to the mature miRNA forms, downloaded as fasta files from miRBase (example below).

After doing the alignment I get a SAM file (example below).

I want to get the count of each miRNA from that SAM file.

The way I thought of doing the count, is to count the number of rows that map to each miRNA in the SAM file. I would only take the rows in which FLAG (second column) & 4 == 0. That is - the read is mapped. The miRNA identity will be computed from the RNAME (third column).

Is this the right way to do the counting?

To begin with, I thought of using htseq-count, but the SAM file doesn't really contain the coordinates of the miRNAs in the reference sequence, so htseq cannot work.

Samples lines from the fasta file of the mature miRNA forms:

>mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p
TGAGGTAGTAGTTTGTACAGTT
ACTGTACAGGCCACTGCCTTGC
>mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p

Sample lines from the SAM assembly file:

HISEQ:47:C6FUWANXX:1:1101:2080:1992     4       *       0       0       *       *       0       0       TTGGCAATGGTAGAACTCACACCGAAA     <BBBFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:8020:1999     0       mmu-miR-182-5p  2       37      24M     *       0       0       TTGGCAATGGTAGAACTCACACCG        <BBBFFBFF<FBFFF/</F<F<FF        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:
24
HISEQ:47:C6FUWANXX:1:1101:8660:1996     4       *       0       0       *       *       0       0       CGGATCCGTCTGAGCTTGGCTTT <BBBFBFFFFFFFFF<FBBBFFB
HISEQ:47:C6FUWANXX:1:1101:16161:2000    4       *       0       0       *       *       0       0       TTTCCGTAGTGTAGTGGTTATCACGTTCGCCTC       <BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:19163:2000    0       mmu-miR-182-5p  2       37      23M     *       0       0       TTGGCAATGGTAGAACTCACACC <BBBFFFFFFFFFFFFFFFFFFF XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:23
HISEQ:47:C6FUWANXX:1:1101:20578:2000    0       mmu-miR-182-5p  2       37      22M     *       0       0       TTGGCAATGGTAGAACTCACAC  <BBBFFFFFFFFFFFFFFFFFF  XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:22
HISEQ:47:C6FUWANXX:1:1101:1200:2043     4       *       0       0       *       *       0       0       CACAAATGATGAACCTTTTGACGGGCGGACAGAAACTCTGTGCTGAATGTC     BBBBBFFFFFFFFFFFFFFB<<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:1212:2073     4       mmu-miR-92a-3p  1       25      22M     *       0       0       TATTGCACTTGTCCCGGCCTGT  BBBBBFFFFFFFFFFFFFFFFF  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:21C0
HISEQ:47:C6FUWANXX:1:1101:1139:2100     4       mmu-miR-92a-3p  1       25      22M     *       0       0       TATTGCACTTGTCCCGGCCTGT  BBBBBFFFFFFFFFFFF/FFFF  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:21C0
RNA-Seq • 21k views
ADD COMMENT
6
Entering edit mode
9.4 years ago

You probably want to ignore unmapped and secondary alignments:

samtools view -F 260 foo.bam | cut -f 3 | sort | uniq -c | awk '{printf("%s\t%s\n", $2, $1)}' > counts.txt

or something along those lines.

ADD COMMENT
1
Entering edit mode

Devon keeps beating me to it today. Using datamash and assuming it's a sorted bam (else add '-s' to the datamash command) ...

samtools view -F260 <yourbam> | cut -f3 | datamash -g 1 count 1 > counts
ADD REPLY
1
Entering edit mode

I'm an hour ahead, so I have a bit of an advantage :P

ADD REPLY
0
Entering edit mode

Hi Devon, I have small RNAs reads from 18 - 30 nt. How can I count number of mapped reads according to their length? Thanks!

ADD REPLY
0
Entering edit mode

Please post things like this as new questions in the future. However, one possible method would be:

samtools view -F 260 foo.bam | awk '{print length($10)}' | sort | uniq -c

I know you can do arrays in awk and that that'd be faster than sorting, but for a one off sort of thing this is quick enough.

ADD REPLY
0
Entering edit mode

Thanks a lot Devon! I'll post another relative question in a new one.

ADD REPLY
0
Entering edit mode

Shouldn't samtools flagstat give the same result? I just run this and see that it gives exact counts in the third row from 'flagstat' output.e.g. "919853 + 0 mapped (96.88%:-nan%)"

ADD REPLY
0
Entering edit mode

When I run the above command, I get an error:

[E::sam_parse1] SEQ and QUAL are of different length 
[W::sam_read1] Parse error at line 3054 
[main_samview] truncated file.
ADD REPLY
2
Entering edit mode
9.4 years ago
michael.ante ★ 3.9k

Since you mapped to the miRNA-nome, a simple samtools idxstats call should be enough:

samtools index foo.bam
samtools idxstats foo.bam
ADD COMMENT
2
Entering edit mode
9.4 years ago

You can also do this with pileup from the BBMap package, which accepts unsorted sam files:

pileup.sh in=mapped.sam out=coverage.txt rpkm=rpkm.txt
ADD COMMENT

Login before adding your answer.

Traffic: 826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6