Question: Counting mapped reads from a SAM file
6
gravatar for kobipe3
4.0 years ago by
kobipe380
Israel
kobipe380 wrote:

I am using BWA to align RNA-seq data from a miRNA experiment.

I want to compare myself to another article - so I am using the same techniques its authors used.

That is using fastq-mcf to clip the adapters, and BWA (Burrows Wheeler Aligner) to align the reads.

The reads are not aligned to the entire genome. Instead they are aligned to the mature miRNA forms, downloaded as fasta files from miRBase (example below).

After doing the alignment I get a SAM file (example below).

I want to get the count of each miRNA from that SAM file.

The way I thought of doing the count, is to count the number of rows that map to each miRNA in the SAM file. I would only take the rows in which FLAG (second column) & 4 == 0. That is - the read is mapped. The miRNA identity will be computed from the RNAME (third column).

Is this the right way to do the counting?

To begin with, I thought of using htseq-count, but the SAM file doesn't really contain the coordinates of the miRNAs in the reference sequence, so htseq cannot work.

 

Samples lines from the fasta file of the mature miRNA forms:

>mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p

TGAGGTAGTAGTTTGTACAGTT

ACTGTACAGGCCACTGCCTTGC

>mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p

 

Sample lines from the SAM assembly file:

HISEQ:47:C6FUWANXX:1:1101:2080:1992     4       *       0       0       *       *       0       0       TTGGCAATGGTAGAACTCACACCGAAA     <BBBFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:8020:1999     0       mmu-miR-182-5p  2       37      24M     *       0       0       TTGGCAATGGTAGAACTCACACCG        <BBBFFBFF<FBFFF/</F<F<FF        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:
24
HISEQ:47:C6FUWANXX:1:1101:8660:1996     4       *       0       0       *       *       0       0       CGGATCCGTCTGAGCTTGGCTTT <BBBFBFFFFFFFFF<FBBBFFB
HISEQ:47:C6FUWANXX:1:1101:16161:2000    4       *       0       0       *       *       0       0       TTTCCGTAGTGTAGTGGTTATCACGTTCGCCTC       <BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:19163:2000    0       mmu-miR-182-5p  2       37      23M     *       0       0       TTGGCAATGGTAGAACTCACACC <BBBFFFFFFFFFFFFFFFFFFF XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:23
HISEQ:47:C6FUWANXX:1:1101:20578:2000    0       mmu-miR-182-5p  2       37      22M     *       0       0       TTGGCAATGGTAGAACTCACAC  <BBBFFFFFFFFFFFFFFFFFF  XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:22
HISEQ:47:C6FUWANXX:1:1101:1200:2043     4       *       0       0       *       *       0       0       CACAAATGATGAACCTTTTGACGGGCGGACAGAAACTCTGTGCTGAATGTC     BBBBBFFFFFFFFFFFFFFB<<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:1212:2073     4       mmu-miR-92a-3p  1       25      22M     *       0       0       TATTGCACTTGTCCCGGCCTGT  BBBBBFFFFFFFFFFFFFFFFF  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:21C0
HISEQ:47:C6FUWANXX:1:1101:1139:2100     4       mmu-miR-92a-3p  1       25      22M     *       0       0       TATTGCACTTGTCCCGGCCTGT  BBBBBFFFFFFFFFFFF/FFFF  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:21C0

 

rna-seq • 9.5k views
ADD COMMENTlink modified 4.0 years ago by Brian Bushnell16k • written 4.0 years ago by kobipe380
5
gravatar for Devon Ryan
4.0 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

You probably want to ignore unmapped and secondary alignments:

samtools view -F 260 foo.bam | cut -f 3 | sort | uniq -c | awk '{printf("%s\t%s\n", $2, $1)}' > counts.txt

or something along those lines.

ADD COMMENTlink written 4.0 years ago by Devon Ryan90k
1

Devon keeps beating me to it today.  Using datamash and assuming it's a sorted bam (else add '-s' to the datamash command) ...

samtools view -F260 <yourbam> | cut -f3 | datamash -g 1 count 1 > counts
ADD REPLYlink written 4.0 years ago by george.ry1.1k
1

I'm an hour ahead, so I have a bit of an advantage :P

ADD REPLYlink written 4.0 years ago by Devon Ryan90k

Hi Devon, I have small RNAs reads from 18 - 30 nt. How can I count number of mapped reads according to their length? Thanks!

ADD REPLYlink written 2.6 years ago by SpreeFU0

Please post things like this as new questions in the future. However, one possible method would be:

samtools view -F 260 foo.bam | awk '{print length($10)}' | sort | uniq -c

I know you can do arrays in awk and that that'd be faster than sorting, but for a one off sort of thing this is quick enough.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Devon Ryan90k

Thanks a lot Devon! I'll post another relative question in a new one.

ADD REPLYlink written 2.6 years ago by SpreeFU0

Shouldn't samtools flagstat give the same result? I just run this and see that it gives exact counts in the third row from 'flagstat' output.e.g. "919853 + 0 mapped (96.88%:-nan%)"

ADD REPLYlink written 22 months ago by bioinfo710

When I run the above command, I get an error:

[E::sam_parse1] SEQ and QUAL are of different length 
[W::sam_read1] Parse error at line 3054 
[main_samview] truncated file.
ADD REPLYlink modified 7 months ago • written 7 months ago by glady250
2
gravatar for michael.ante
4.0 years ago by
michael.ante3.3k
Austria/Vienna
michael.ante3.3k wrote:

Since you mapped to the miRNA-nome, a simple samtools idxstats call should be enough:

samtools index foo.bam
samtools idxstats foo.bam
ADD COMMENTlink written 4.0 years ago by michael.ante3.3k
2
gravatar for Brian Bushnell
4.0 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can also do this with pileup from the BBMap package, which accepts unsorted sam files:

pileup.sh in=mapped.sam out=coverage.txt rpkm=rpkm.txt

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Brian Bushnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1487 users visited in the last hour