Question: RNAseqC: WARNING: Transcript has no coverage
0
gravatar for jcm136
3.6 years ago by
jcm1360
jcm1360 wrote:

Hello,

I am working with a paired-end data set of rhesus macaque reads aligned with STAR against a current rhesus .fasta reference and .gtf annotation. I am interested in doing some quality control measures on my .bam file with RNAseqC to determine numbers of reads that map to non-genic, exonic, or intronic regions.

For now, I'm trying to bypass the command-line execution and opting to perform the run on the web-based GenePattern interface offered by Broad institute. For those familiar, there are a number of inputs required for RNAseqC, all of which I believe that I have provided:

-indexed .bam with read groups created with Picard tools -.gtf file -indexed .fasta -dictionary of .fasta created with Picard tools

Unfortunately there is no accompanying 18S rRNA information for rhesus I am able to provide.

Anyways, the program seems to complete successfully, however for all of my transcripts I get this message in the output:

WARNING: Transcript has no coverage: ACTB_transcript_01

This is peculiar to me, because if I load the .bam, .gtf, and .fasta in IGV, there are clearly an abundant number of reads mapping to the ACTB gene.

I am wondering why RNAseqC is not picking up any reads to exons, or anything really. If anyone has encountered similar issues it would be greatly appreciated. From what I have seen, the chromosomes are annotated the same in both the .gtf and .bam file.

thank you!

rna-seq software error • 1.1k views
ADD COMMENTlink written 3.6 years ago by jcm1360

For what it's worth, I'm getting the following ERRORs in the output log when running the GATK depth of coverage analysis:

Running GATK Depth of Coverage Analysis ....
Arguments:  -T DepthOfCoverage -R ref_input/MacaM_Rhesus_Genome_v7.fasta -I bam_input/JB37_sorted_ReadGroup.bam -o .//JB37_sorted_ReadGroup.bam/lowexpr//perBaseDoC.out -L .//JB37_sorted_ReadGroup.bam/lowexpr/intervals.list -l ERROR
Arguments Array:    [-T, DepthOfCoverage, -R, ref_input/MacaM_Rhesus_Genome_v7.fasta, -I, bam_input/JB37_sorted_ReadGroup.bam, -o, .//JB37_sorted_ReadGroup.bam/lowexpr//perBaseDoC.out, -L, .//JB37_sorted_ReadGroup.bam/lowexpr/intervals.list, -l, ERROR]
GATK command result code: 0
Depth of Coverage run time: 1 min
     ... GATK Depth of Coverage Analysis DONE
Running GATK Depth of Coverage Analysis ....
Arguments:  -T DepthOfCoverage -R ref_input/MacaM_Rhesus_Genome_v7.fasta -I bam_input/JB37_sorted_ReadGroup.bam -o .//JB37_sorted_ReadGroup.bam/medexpr//perBaseDoC.out -L .//JB37_sorted_ReadGroup.bam/medexpr/intervals.list -l ERROR
Arguments Array:    [-T, DepthOfCoverage, -R, ref_input/MacaM_Rhesus_Genome_v7.fasta, -I, bam_input/JB37_sorted_ReadGroup.bam, -o, .//JB37_sorted_ReadGroup.bam/medexpr//perBaseDoC.out, -L, .//JB37_sorted_ReadGroup.bam/medexpr/intervals.list, -l, ERROR]
GATK command result code: 0
Depth of Coverage run time: 2 min
     ... GATK Depth of Coverage Analysis DONE
Running GATK Depth of Coverage Analysis ....
Arguments:  -T DepthOfCoverage -R ref_input/MacaM_Rhesus_Genome_v7.fasta -I bam_input/JB37_sorted_ReadGroup.bam -o .//JB37_sorted_ReadGroup.bam/highexpr//perBaseDoC.out -L .//JB37_sorted_ReadGroup.bam/highexpr/intervals.list -l ERROR
Arguments Array:    [-T, DepthOfCoverage, -R, ref_input/MacaM_Rhesus_Genome_v7.fasta, -I, bam_input/JB37_sorted_ReadGroup.bam, -o, .//JB37_sorted_ReadGroup.bam/highexpr//perBaseDoC.out, -L, .//JB37_sorted_ReadGroup.bam/highexpr/intervals.list, -l, ERROR]
GATK command result code: 0
Depth of Coverage run time: 7 min
     ... GATK Depth of Coverage Analysis DONE
ADD REPLYlink modified 3.6 years ago by Devon Ryan95k • written 3.6 years ago by jcm1360

What is the result of Picard ValidateSamFile?

ADD REPLYlink written 3.6 years ago by WouterDeCoster43k

INFO 2016-11-01 15:31:26 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:31s. Time for last 10,000,000: 31s. Last read position: chr03:117,226,161

INFO 2016-11-01 15:31:56 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:01s. Time for last 10,000,000: 29s. Last read position: chr06:167,478,910

INFO 2016-11-01 15:32:26 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:01:31s. Time for last 10,000,000: 29s. Last read position: chr11:54,297,193

INFO 2016-11-01 15:32:55 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:01:59s. Time for last 10,000,000: 28s. Last read position: chr16:52,185,704

*HISTOGRAM java.lang.String Error Type Count ERROR:MATE_NOT_FOUND 3717657 WARNING:MISSING_TAG_NM 45041887*

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by jcm1360

I haven't a clue why RNAseqC is doing that, but you might instead just use plotEnrichment from deepTools. It's available on the main Galaxy site. Just upload your BAM and GTF files. The plots won't include "introns" and "non-genic", but the former is essentially "genes" - "exons" and the later 100% - "genes".

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Devon Ryan95k

OK...thank you so much for the reply.

What I'm actually interested in is where the intronic or non-genic reads are mapping to. The reason being, although my alignment stats were good (~90% reads mapped with STAR), performing read counts on the .bam file with HT-seq count yielded a high number of reads mapping to _no_feature. This is in spite of correctly nam-sorting the .bam and also making sure the stranded information is correct.

28177302 SAM alignment pairs processed.
__no_feature    17724101
__ambiguous 67176
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  0

This is about 70% of the aligned reads, and where might they be going? This is a total RNAseq data set, so one might expect some no feature reads if the annotation is only strong for gene-coding regions. But still, I am just interested in understanding my data set, and I do not know if this is a red flag or not. If you know of any other alternative analysis to determine where the _no_feature pairs are mapping to, it would be very helpful. Thank you again!

ADD REPLYlink modified 3.6 years ago by Devon Ryan95k • written 3.6 years ago by jcm1360

If this is total RNAseq then likely you have a lot of rRNA in your data. Only way to conclusively prove this would be to align the data to rDNA repeat for Rhesus which I assume must be available in GenBank.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by genomax83k

Thank you. Pardon my naivety but if I can get the rDNA repeat sequence from Rhesus, how might I align an entire .fastq sample to it.

ADD REPLYlink written 3.6 years ago by jcm1360

Same as you would to anything else, index the fasta file and align against it.

ADD REPLYlink written 3.6 years ago by Devon Ryan95k

Why not look at the reads in IGV (maybe make a bigWig file so you don't have to zoom in). As genomax mentioned, it's likely that you have a LOT of rRNA or something like that.

ADD REPLYlink written 3.6 years ago by Devon Ryan95k

Thanks for the reply. I have suspected 18S contamination. Unfortunately the gtf I am using is not annotated for 18S, although retrieving the Rhesus 18S sequence, BLATing it to find genomic coordindates, and then going to the location in IGV doesn't seem to yield any reads mapping to putative 18S sites. I think the question is still unresolved.

I thought 18S genes were present at multiple sites of the genome. I don't know if you are familiar with htseq-count, but wouldn't htseq count any 18S reads as "ambiguous" rather than "no_feature", given that they would map to more than one location?

ADD REPLYlink written 3.6 years ago by jcm1360

Just find the regions of high non-genic alignment in IGV and then look at those regions in UCSC or somewhere else that has a repeatmasker track (or load that in IGV).

ADD REPLYlink written 3.6 years ago by Devon Ryan95k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1004 users visited in the last hour