How to interpret reads mapped to genome from samtools flagstat vs TopHat align_summary.txt ?
0
1
Entering edit mode
5.0 years ago
Vasu ▴ 530

Hi all,

I am now working on bam files of paired-end RNA-Seq samples. And I want to extract paired end reads mapped to genome.

Basically, from the summary of tophat below, I want to get the mapped pairs reads.

Left reads:
Input     :  41017921
Mapped   :  39931054 (97.4% of input)
of these:   3803778 ( 9.5%) have multiple alignments (71292 have >20)
Input     :  41017921
Mapped   :  39409647 (96.1% of input)
of these:   3751917 ( 9.5%) have multiple alignments (71338 have >20)

Aligned pairs:  38811976
of these:   3700400 ( 9.5%) have multiple alignments
2438200 ( 6.3%) are discordant alignments
88.7% concordant pair alignment rate.


Is "Aligned pairs" is number of reads mapped to genome? Because when I did the same with "samtools" I got a different number.

samtools flagstat accepted_hits.bam

111314375 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
111314375 + 0 mapped (100.00%:-nan%)
111314375 + 0 paired in sequencing
47776282 + 0 properly paired (42.92%:-nan%)
109061240 + 0 with itself and mate mapped
2253135 + 0 singletons (2.02%:-nan%)
15869532 + 0 with mate mapped to a different chr
2586750 + 0 with mate mapped to a different chr (mapQ>=5)


Please tell me Which is the right one and also give some suggestions how to extract reads mapped to rRNA regions.

Thank you!

RNA-Seq • 2.9k views
0
Entering edit mode

Basically, from the summary of tophat below, I want to get the mapped pairs reads.

That is what should be in the accepted_hits.bam file that TopHat generated. Do you want to extract just those reads from the original raw data file? Do you expect reads just from rRNA and nothing else to be present?

BTW: There is some disconnect between the title of your post and what you are asking in the body of the post.

0
Entering edit mode

0
Entering edit mode

You could take the accepted_hits.bam files and use samtools fastq to generate the fastq files.

0
Entering edit mode

I have fastq files....Please check my question. The summary of tophat shows alignedpairs (mapped reads) and with the bame file I used samtools to check mapped reads. Both number are different. Which is the right one?

1
Entering edit mode

0
Entering edit mode

Thank you very much !!

0
Entering edit mode

Could you please tell me how to count the reads mapped to rRNA regions?

0
Entering edit mode

This will depend on the species. Some reference genomes (e.g., that from mouse) lack rRNA sequences. Others, such as the fruit fly, have a separate contig just for rRNA.

0
Entering edit mode

This is aligned against human genome.

0
Entering edit mode

For human samples I separately align against Rn45S. There's sequence in the GRCh38 reference, but I think there are multiple copies or various integrity, so quantification is difficult otherwise.

0
Entering edit mode

As @Devon said, if you have a GTF file with annotation for rRNA you could use that to count. Otherwise you may need to align separately to rRNA repeat sequence for your genome.
If these samples were ribo-depleted hopefully you don't have many reads that will fall into this category.

0
Entering edit mode

Yes the samples are rRNA depleted but still there may be some present. So, how can I count the reads aligned to rRNA regions?

0
Entering edit mode

If you wish to be absolutely sure then align the reads to human rDNA repeat.

If you had aligned your data using Ensembl version of human genome then that may have rRNA annotated in the GTF file. In that case you can use featureCounts (and that GTF file) to count the reads aligning to rRNA. UCSC version of the human genome does not have rRNA annotated.

0
Entering edit mode

yes I have GTF file and also bam file. what should be the parameters I need to give to use featurecounts? Or Can I use bedtools?

0
Entering edit mode

Here is a link for how to use featureCounts. Remember this will only work if your GTF file has entries for rRNA. Otherwise you will get read counts for other genes.