Hello,
I am working with a paired-end data set of rhesus macaque reads aligned with STAR against a current rhesus .fasta reference and .gtf annotation. I am interested in doing some quality control measures on my .bam file with RNAseqC to determine numbers of reads that map to non-genic, exonic, or intronic regions.
For now, I'm trying to bypass the command-line execution and opting to perform the run on the web-based GenePattern interface offered by Broad institute. For those familiar, there are a number of inputs required for RNAseqC, all of which I believe that I have provided:
-indexed .bam with read groups created with Picard tools -.gtf file -indexed .fasta -dictionary of .fasta created with Picard tools
Unfortunately there is no accompanying 18S rRNA information for rhesus I am able to provide.
Anyways, the program seems to complete successfully, however for all of my transcripts I get this message in the output:
WARNING: Transcript has no coverage: ACTB_transcript_01
This is peculiar to me, because if I load the .bam, .gtf, and .fasta in IGV, there are clearly an abundant number of reads mapping to the ACTB gene.
I am wondering why RNAseqC is not picking up any reads to exons, or anything really. If anyone has encountered similar issues it would be greatly appreciated. From what I have seen, the chromosomes are annotated the same in both the .gtf and .bam file.
thank you!
For what it's worth, I'm getting the following ERRORs in the output log when running the GATK depth of coverage analysis:
What is the result of Picard ValidateSamFile?
INFO 2016-11-01 15:31:26 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:31s. Time for last 10,000,000: 31s. Last read position: chr03:117,226,161
INFO 2016-11-01 15:31:56 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:01s. Time for last 10,000,000: 29s. Last read position: chr06:167,478,910
INFO 2016-11-01 15:32:26 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:01:31s. Time for last 10,000,000: 29s. Last read position: chr11:54,297,193
INFO 2016-11-01 15:32:55 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:01:59s. Time for last 10,000,000: 28s. Last read position: chr16:52,185,704
*HISTOGRAM java.lang.String Error Type Count ERROR:MATE_NOT_FOUND 3717657 WARNING:MISSING_TAG_NM 45041887*
I haven't a clue why RNAseqC is doing that, but you might instead just use
plotEnrichment
from deepTools. It's available on the main Galaxy site. Just upload your BAM and GTF files. The plots won't include "introns" and "non-genic", but the former is essentially "genes" - "exons" and the later 100% - "genes".OK...thank you so much for the reply.
What I'm actually interested in is where the intronic or non-genic reads are mapping to. The reason being, although my alignment stats were good (~90% reads mapped with STAR), performing read counts on the .bam file with HT-seq count yielded a high number of reads mapping to _no_feature. This is in spite of correctly nam-sorting the .bam and also making sure the stranded information is correct.
This is about 70% of the aligned reads, and where might they be going? This is a total RNAseq data set, so one might expect some no feature reads if the annotation is only strong for gene-coding regions. But still, I am just interested in understanding my data set, and I do not know if this is a red flag or not. If you know of any other alternative analysis to determine where the _no_feature pairs are mapping to, it would be very helpful. Thank you again!
If this is total RNAseq then likely you have a lot of rRNA in your data. Only way to conclusively prove this would be to align the data to rDNA repeat for Rhesus which I assume must be available in GenBank.
Thank you. Pardon my naivety but if I can get the rDNA repeat sequence from Rhesus, how might I align an entire .fastq sample to it.
Same as you would to anything else, index the fasta file and align against it.
Why not look at the reads in IGV (maybe make a bigWig file so you don't have to zoom in). As genomax mentioned, it's likely that you have a LOT of rRNA or something like that.
Thanks for the reply. I have suspected 18S contamination. Unfortunately the gtf I am using is not annotated for 18S, although retrieving the Rhesus 18S sequence, BLATing it to find genomic coordindates, and then going to the location in IGV doesn't seem to yield any reads mapping to putative 18S sites. I think the question is still unresolved.
I thought 18S genes were present at multiple sites of the genome. I don't know if you are familiar with htseq-count, but wouldn't htseq count any 18S reads as "ambiguous" rather than "no_feature", given that they would map to more than one location?
Just find the regions of high non-genic alignment in IGV and then look at those regions in UCSC or somewhere else that has a repeatmasker track (or load that in IGV).