Question: How to interpret reads mapped to genome from samtools flagstat vs TopHat align_summary.txt ?
1
gravatar for Vasu
3.6 years ago by
Vasu410
Vasu410 wrote:

Hi all,

I am now working on bam files of paired-end RNA-Seq samples. And I want to extract paired end reads mapped to genome.

Basically, from the summary of tophat below, I want to get the mapped pairs reads.

Left reads:
      Input     :  41017921
       Mapped   :  39931054 (97.4% of input)
        of these:   3803778 ( 9.5%) have multiple alignments (71292 have >20)
Right reads:
      Input     :  41017921
       Mapped   :  39409647 (96.1% of input)
        of these:   3751917 ( 9.5%) have multiple alignments (71338 have >20)
96.7% overall read mapping rate.

Aligned pairs:  38811976
 of these:   3700400 ( 9.5%) have multiple alignments
             2438200 ( 6.3%) are discordant alignments
88.7% concordant pair alignment rate.

Is "Aligned pairs" is number of reads mapped to genome? Because when I did the same with "samtools" I got a different number.

samtools flagstat accepted_hits.bam

111314375 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
111314375 + 0 mapped (100.00%:-nan%)
111314375 + 0 paired in sequencing
55986752 + 0 read1
55327623 + 0 read2
47776282 + 0 properly paired (42.92%:-nan%)
109061240 + 0 with itself and mate mapped
2253135 + 0 singletons (2.02%:-nan%)
15869532 + 0 with mate mapped to a different chr
2586750 + 0 with mate mapped to a different chr (mapQ>=5)

Please tell me Which is the right one and also give some suggestions how to extract reads mapped to rRNA regions.

Thank you!

rna-seq • 2.3k views
ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Vasu410

Basically, from the summary of tophat below, I want to get the mapped pairs reads.

That is what should be in the accepted_hits.bam file that TopHat generated. Do you want to extract just those reads from the original raw data file? Do you expect reads just from rRNA and nothing else to be present?

BTW: There is some disconnect between the title of your post and what you are asking in the body of the post.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by genomax75k

Sorry I had updated now.

ADD REPLYlink written 3.6 years ago by Vasu410

You could take the accepted_hits.bam files and use samtools fastq to generate the fastq files.

ADD REPLYlink written 3.6 years ago by genomax75k

I have fastq files....Please check my question. The summary of tophat shows alignedpairs (mapped reads) and with the bame file I used samtools to check mapped reads. Both number are different. Which is the right one?

ADD REPLYlink written 3.6 years ago by Vasu410
1

See this thread for additional explanation: http://seqanswers.com/forums/showthread.php?t=44433

Also this: How To Interpret/Reconcile The Counts Obtained Via Samtools Flagstat

ADD REPLYlink written 3.6 years ago by genomax75k

Thank you very much !!

ADD REPLYlink written 3.6 years ago by Vasu410

Could you please tell me how to count the reads mapped to rRNA regions?

ADD REPLYlink written 3.6 years ago by Vasu410

This will depend on the species. Some reference genomes (e.g., that from mouse) lack rRNA sequences. Others, such as the fruit fly, have a separate contig just for rRNA.

ADD REPLYlink written 3.6 years ago by Devon Ryan93k

This is aligned against human genome.

ADD REPLYlink written 3.6 years ago by Vasu410

For human samples I separately align against Rn45S. There's sequence in the GRCh38 reference, but I think there are multiple copies or various integrity, so quantification is difficult otherwise.

ADD REPLYlink written 3.6 years ago by Devon Ryan93k

As @Devon said, if you have a GTF file with annotation for rRNA you could use that to count. Otherwise you may need to align separately to rRNA repeat sequence for your genome.
If these samples were ribo-depleted hopefully you don't have many reads that will fall into this category.

ADD REPLYlink written 3.6 years ago by genomax75k

Yes the samples are rRNA depleted but still there may be some present. So, how can I count the reads aligned to rRNA regions?

ADD REPLYlink written 3.6 years ago by Vasu410

If you wish to be absolutely sure then align the reads to human rDNA repeat.

If you had aligned your data using Ensembl version of human genome then that may have rRNA annotated in the GTF file. In that case you can use featureCounts (and that GTF file) to count the reads aligning to rRNA. UCSC version of the human genome does not have rRNA annotated.

ADD REPLYlink written 3.6 years ago by genomax75k

yes I have GTF file and also bam file. what should be the parameters I need to give to use featurecounts? Or Can I use bedtools?

ADD REPLYlink written 3.6 years ago by Vasu410

Here is a link for how to use featureCounts. Remember this will only work if your GTF file has entries for rRNA. Otherwise you will get read counts for other genes.

ADD REPLYlink written 3.6 years ago by genomax75k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1617 users visited in the last hour