Easiest way to compute RNA-Seq mapping stats (exons, introns, intergenic)?
3
3
Entering edit mode
8.1 years ago
Nick ▴ 290

I posted this question a few days ago on Seqanswers but wasn't able to get an answer that solved my problem.

Here is my use case:

I aligned RNA-Seq data to a genome using tophat2. Now I have the bam file and I am looking for a straight-forward way to produce mapping statistics - how many reads map onto exons, introns and intergenic regions.

So my input consists of a bam file and the gtf file which I used with tophat2. What's the easiest way to get from this input to the output I need?

The suggestions I got so far involved tools that were not taking GTF file as input (CollectRnaSeqMetrics from Piccard toolset) and were not computing the statistics on the intergenic intervals (RSeQC). CollectRnaSeqMetrics seems a good candidate but so far I could not find information/manual on converting GTF to RefFlat format.

So I would welcome suggestions that specify the processing steps that start with a BAM file and a GTF file and lead to statistics on the number of reads mapping to exons, introns and intergenic intervals.

RNA-Seq • 13k views
1
Entering edit mode

I just replied on your other SEQanswers thread on how to use gtfToGenePred. That'll be the simplest solution.

0
Entering edit mode

Thanks, Devon - I just replied to your post. It seems I am not able to get gtfToGenePred running.

0
Entering edit mode

I was able to get gtfToGenePred running. It took the Ensembl GTF as input and produced a refFlat file which I tried to use with CollectRnaSeqMetrics. However, the latter produced an error message:

Exception in thread "main" picard.annotation.AnnotationException: Wrong number of fields in refFlat file mm75.genePred at line 1
at picard.analysis.CollectRnaSeqMetrics.setup(CollectRnaSeqMetrics.java:105)
at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:98)
at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:53)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)


Apparently, the refFlat file produced is not valid. Have you, actually, used gtfToGenePred successfully?

2
Entering edit mode

Apparently gtfToGenePred leaves out the first (or second) column. You can fix that with:

awk 'BEGIN{OFS="\t"}{print $1,$0}' mm75.genePred > mm75.genePred2

0
Entering edit mode

Yay! This seems to have fixed it. Thanks!

0
Entering edit mode

I have the same problem

The fastq files are paired end, aligned using STAR and GRCh38 index.

I need the REF_FLAT file compatible for GRCh38. I know there is an inconsistency with the chromosome naming, between the ensembl build and picard tools.

Please don't provide an answer unless you actually know how to solve this problem. I have read >100 answers to this problem and most are not useful at all.

0
Entering edit mode

I moved this to a comment because it's not an answer to this thread

Please don't provide an answer unless you actually know how to solve this problem

Okay, won't try to help then. Good luck.

2
Entering edit mode
8.1 years ago

Another great tool is RSeQC

2
Entering edit mode
8.1 years ago
Manvendra Singh ★ 2.2k

I agree with Ashutosh Pandey

RSeQC works good for me

read-distribution-py -i your bam -r Refseq.bed


refseq.bed also you can download from their site according to the genome version you have

hth

0
Entering edit mode

I wanna try this but I don't understand how to get the Refseq.bed file? I have a fasta file and a gtf file for a de-novo assembled organism.

1
Entering edit mode
8.1 years ago
A. Domingues ★ 2.6k

I use RNA-SeQC from the Broad Institute.

As you pointed out, this being a Broad tool it is quite finicky with its requirements of what a Bam should look like. So this is what I do to prepare bams from Tophat for RNA-SeQC:

1. Index Fasta:

## I need to do this in the cluster I use to add the libraries to path
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/jre
export PATH=/usr/lib/jvm/java-6-openjdk-amd64/jre/bin:$PATH fasta="coolgenome.fa" java -jar ~/bin/picard-tools-1.119/CreateSequenceDictionary.jar R=$fasta O=${fasta/.fa/.dict}  2. Add read groups and index Bam aln="somemapping.bam" exp=$(basename ${aln%%.bam}) java -Xmx6g -jar ~/bin/picard-tools-1.119/AddOrReplaceReadGroups.jar \ I=$aln \
O= ${aln%%.bam}.so.rg.bam \ SO= coordinate \ PL=illumina \ PU= sample \ LB=$exp \
RGSM= $exp 2>$aln.sort.err

# index bam
samtools index ${aln%%.bam}.so.rg.bam  3. Pre-run Checklist # Are the contig names consistent between your BAM, Reference, and the GTF file? # Is your BAM indexed? (use samtools index) # Is your reference Indexed? (use samtools faidx) # Does your reference have a dict (dictionary) file? (use CreateSequenceDictionary.jar)</pre>  4. prepare sampleFile.txt printf "Sample ID \t Bam File \t Notes\n" > sampleFile.txt for b in */*.so.rg.bam; do sample=$(basename echo $b | cut -f1 -d".") echo$samplef
printf "$sample \t$b \t \n"  >> sampleFile.txt
done

5. Run the thing

gtfFile="thegeneannotation.gtf"

java -jar -Xmx5g ~/bin/RNA-SeQC_v1.1.8.jar -s sampleFile.txt -t $gtfFile -ttype 2 -r$fasta -o ./qcReport -gatkFlags "-S SILENT"


Some of these steps are needed only once the others I have included in my mapping scripts - I always output sorted and indexed bams for instance. I hope it works.

0
Entering edit mode

This is an interesting tool, however, according to various sites I can't use it directly with my tophat2 BAMs. The various recipes I found differ but they list several pre-processing steps in (roughly) this order:

2. samtools index
3. samtools faidx
4. CreateSequenceDictionary.jar
5. ReorderSam.jar

The problem is that the recipes differ not just in respect to the tools used but also regarding their parameters. What is your recipe?

1
Entering edit mode

0
Entering edit mode

Thanks for the details. I managed to go through all the steps so your recipe almost worked for me. However, the last step (RNA-SeQC) caused an error message. Apparently, RNA-SeQC doesn't like the GTF file (which I downloaded from Ensembl). Do I need to pre-process the GTF, too? Here is the error message:

RNA-SeQC v1.1.8.1 07/11/14
Additional GATK flags provided: -S SILENT
Creating rRNA Interval List based on given GTF annotations
Retriving contig names from reference
contig names in reference: 80
The required transcript_id attribute was not found on line 1    pseudogene    gene    3054233    3054733    .    +    .    gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";


==EDIT:

It seems RNA-SeQC is complaining that some lines in the GTF file do not have transcript_id. As it happens, the first line in the GTF is a pseudogene so it doesn't have a transcript_id. This is a standard GTF file which I used also with tophat2 to get my BAMs. If RNA-SeQC cannot work with standard GTF files this would be a major issue.

Also, it seems, RNA-SeQC is producing its stats based on the top 1000 highly expressed genes. This is a default parameter that can be changed but I am puzzled that it at all exists. The reason I want to use RNA-SeQC is that I want to get stats on reads mapping to introns and intergeneic reasons (I get the exons stat via htseq-count which is what I normally use for differential expression analysis). Any reads mapping onto introns and intergenic regions are not likely to be considered part of genes (duh!) so I wonder what's the point of using the top 1000 most highly expressed genes to compute this statistics. It does not make any sense to me.

0
Entering edit mode

0
Entering edit mode

I went to their website but couldn't find any. Can you provide a link?

0
Entering edit mode

It's here.

0
Entering edit mode
0
Entering edit mode

Thanks, but this is RSeQC, not RNA-SeQC I was talking about in my post above. These are completely different tools. I decided not to use RSeQC as it is not taking into account the whole intergenic regions.

0
Entering edit mode

Bo be honest, I don't recall having problems with the GTFs I use (iGenomes), either some modified version of the ensembl gtf, the original or the output of cuffmerge. They all appear to be 'standard GTF files'. I looked up my genes.gtf from Zv9 and GRCh37, and all lines contain the 'transcript_id' entry. So no idea of what is happening with that gtf.

Also, it seems, RNA-SeQC is producing its stats based on the top 1000 highly expressed genes.

It does calculate some stats with the top expressed genes, for speed I guess, but not for everything. Looking back at the results for intronic/exonic/intergenic reads counts I am not sure what is happening. The table contains mapping rate for these, not total read counts. It makes sense for QC to have an idea of what is happening, and the top XXX expressed genes are probably enough for that. However, if what you want is total alignment counts for introns, intergenic regions, then the best option might be to create a bed file with those regions and do read count with for instance bedtools multicov. I have not used this strategy though.

0
Entering edit mode

Thanks. It seems Ensembl produces GTF files containing non-coding genes (which makes sense) but these don't have transcript_id (which also makes sense). Apparently, such GTFs still conform to the standard format. As you said - I need to produce interval files from the GTF for the three categories (intergenic, exons, introns).

2
Entering edit mode

grep transcript_id ensembl_gtf_file > rnaseqc_gtf_file

0
Entering edit mode

Yes, sadly there seems to be no tool that can be used off the shelf for it.

0
Entering edit mode

hi, do you get your problems resolved? I have the same error to yours.Can you give some advice? thank you for your help in advance.