Question: Easiest way to compute RNA-Seq mapping stats (exons, introns, intergenic)?
3
gravatar for Nick
4.6 years ago by
Nick260
Spain
Nick260 wrote:

I posted this question a few days ago on Seqanswers but wasn't able to get an answer that solved my problem.

Here is my use case:

I aligned RNA-Seq data to a genome using tophat2. Now I have the bam file and I am looking for a straight-forward way to produce mapping statistics - how many reads map onto exons, introns and intergenic regions. 

So my input consists of a bam file and the gtf file which I used with tophat2. What's the easiest way to get from this input to the output I need? 

The suggestions I got so far involved tools that were not taking GTF file as input (CollectRnaSeqMetrics from Piccard toolset) and were not computing the statistics on the intergenic intervals (RSeQC). CollectRnaSeqMetrics seems a good candidate but so far I could not find information/manual on converting GTF to RefFlat format.

So I would welcome suggestions that specify the processing steps that start with a BAM file and a GTF file and lead to statistics on the number of reads mapping to exons, introns and intergenic intervals. 

 

 

 

rna-seq • 9.8k views
ADD COMMENTlink modified 6 months ago by Biostar ♦♦ 20 • written 4.6 years ago by Nick260
1

I just replied on your other SEQanswers thread on how to use gtfToGenePred. That'll be the simplest solution.

ADD REPLYlink written 4.6 years ago by Devon Ryan91k

Thanks, Devon - I just replied to your post. It seems I am not able to get gtfToGenePred running.  

ADD REPLYlink written 4.6 years ago by Nick260

I was able to get gtfToGenePred running. It took the Ensembl GTF as input and produced a refFlat file which I tried to use with CollectRnaSeqMetrics. However, the latter produced an error message:

 

Exception in thread "main" picard.annotation.AnnotationException: Wrong number of fields in refFlat file mm75.genePred at line 1 at picard.annotation.RefFlatReader.load(RefFlatReader.java:80) at picard.annotation.RefFlatReader.load(RefFlatReader.java:66) at picard.annotation.GeneAnnotationReader.loadRefFlat(GeneAnnotationReader.java:37) at picard.analysis.CollectRnaSeqMetrics.setup(CollectRnaSeqMetrics.java:105) at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:98) at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:53) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Apparently, the refFlat file produced is not valid. Have you, actually, used gtfToGenePred successfully?

ADD REPLYlink written 4.6 years ago by Nick260
2

Apparently gtfToGenePred leaves out the first (or second) column. You can fix that with:

awk 'BEGIN{OFS="\t"}{print $1,$0}' mm75.genePred > mm75.genePred2
ADD REPLYlink written 4.6 years ago by Devon Ryan91k

Yay! This seems to have fixed it. Thanks!

ADD REPLYlink written 4.6 years ago by Nick260

I have the same problem the fastq files are paired end, aligned using STAR and GRCh38 index. I need the REF_FLAT file compatible for GRCh38. I know there is an inconsistency with the chromosome naming, between the ensembl build and picard tools. Please don't provide an answer unless you actually know how to solve this problem. I have read >100 answers to this problem and most are not useful at all.

ADD REPLYlink written 2.5 years ago by DataFanatic140

I moved this to a comment because it's not an answer to this thread

Please don't provide an answer unless you actually know how to solve this problem

Okay, won't try to help then. Good luck.

ADD REPLYlink written 2.5 years ago by WouterDeCoster40k
2
gravatar for Ashutosh Pandey
4.6 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

Another great tool is RSeQC: http://rseqc.sourceforge.net/#read-distribution-py

ADD COMMENTlink written 4.6 years ago by Ashutosh Pandey11k
2
gravatar for Manvendra Singh
4.6 years ago by
Manvendra Singh2.1k
Berlin, Germany
Manvendra Singh2.1k wrote:

I agree with Ashutosh Pandey 

RSeQC works good for me

you need to download it run with 

read-distribution-py -i your bam -r Refseq.bed

refseq.bed also you can download from their site according to the genome version you have

 

hth

ADD COMMENTlink written 4.6 years ago by Manvendra Singh2.1k

I wanna try this but I don't understand how to get the Refseq.bed file? I have a fasta file and a gtf file for a de-novo assembled organism.

ADD REPLYlink written 5 months ago by O.rka120
1
gravatar for A. Domingues
4.6 years ago by
A. Domingues2.1k
Dresden, Germany
A. Domingues2.1k wrote:

I use RNA-SeQC from the Broad Institute:

http://www.broadinstitute.org/cancer/cga/rna-seqc


As you pointed out, this being a Broad tool it is quite finicky with its requirements of what a Bam should look like. So this is what I do to prepare bams from Tophat for RNA-SeQC:

1. Index Fasta:

## I need to do this in the cluster I use to add the libraries to path
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/jre
export PATH=/usr/lib/jvm/java-6-openjdk-amd64/jre/bin:$PATH

fasta="coolgenome.fa"

java -jar ~/bin/picard-tools-1.119/CreateSequenceDictionary.jar R=$fasta O=${fasta/.fa/.dict}

 

2. Add read groups and index Bam

aln="somemapping.bam"
exp=$(basename ${aln%%.bam})

java -Xmx6g -jar ~/bin/picard-tools-1.119/AddOrReplaceReadGroups.jar \
  I= $aln \
  O= ${aln%%.bam}.so.rg.bam \
  SO= coordinate \
  PL=illumina \
  PU= sample \
  LB= $exp \
  RGSM= $exp  2> $aln.sort.err

# index bam
samtools index ${aln%%.bam}.so.rg.bam

 

3. Pre-run Checklist

#    Are the contig names consistent between your BAM, Reference, and the GTF file?
#    Is your BAM indexed? (use samtools index)
#    Is your reference Indexed? (use samtools faidx)
#    Does your reference have a dict (dictionary) file? (use CreateSequenceDictionary.jar)

 

3. prepare sampleFile.txt

printf "Sample ID \t  Bam File \t  Notes\n" > sampleFile.txt
for b in */*.so.rg.bam; do
   sample=$(basename `echo $b | cut -f1 -d"."`)
   echo $samplef
   printf "$sample \t  $b \t \n"  >> sampleFile.txt
done

 

4. Run the thing

gtfFile="thegeneannotation.gtf"

java -jar -Xmx5g ~/bin/RNA-SeQC_v1.1.8.jar -s sampleFile.txt -t $gtfFile -ttype 2 -r $fasta -o ./qcReport -gatkFlags "-S SILENT"

 

Some of these steps are needed only once the others I have included in my mapping scripts - I always output sorted and indexed bams for instance. I hope it works.

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by A. Domingues2.1k

This is an interesting tool, however, according to various sites I can't use it directly with my tophat2 BAMs. The various recipes I found differ but they list several pre-processing steps in (roughly) this order:

1. AddOrReplaceReadGroups.jar

2. samtools index

3. samtools faidx

4. CreateSequenceDictionary.jar 

5. ReorderSam.jar

The problem is that the recipes differ not just in respect to the tools used but also regarding their parameters. What is your recipe?

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Nick260
1

Good point. I have edited my answer to add that information.
 

ADD REPLYlink written 4.6 years ago by A. Domingues2.1k

Thanks for the details. I managed to go through all the steps so your recipe almost worked for me. However, the last step (RNA-SeQC) caused an error message. Apparently, RNA-SeQC doesn't like the GTF file (which I downloaded from Ensembl). Do I need to pre-process the GTF, too? Here is the error message:

RNA-SeQC v1.1.8.1 07/11/14
Additional GATK flags provided: -S SILENT
Creating rRNA Interval List based on given GTF annotations
Retriving contig names from reference
     contig names in reference: 80
Loading GTF for Read Counting
The required transcript_id attribute was not found on line 1    pseudogene    gene    3054233    3054733    .    +    .    gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";

==EDIT:

It seems RNA-SeQC is complaining that some lines in the GTF file do not have transcript_id. As it happens, the first line in the GTF is a pseudogene so it doesn't have a transcript_id. This is a standard GTF file which I used also with tophat2 to get my BAMs. If RNA-SeQC cannot work with standard GTF files this would be a major issue. 

Also, it seems, RNA-SeQC is producing its stats based on the top 1000 highly expressed genes. This is a default parameter that can be changed but I am puzzled that it at all exists. The reason I want to use RNA-SeQC is that I want to get stats on reads mapping to introns and intergeneic reasons (I get the exons stat via htseq-count which is what I normally use for differential expression analysis). Any reads mapping onto introns and intergenic regions are not likely to be considered part of genes (duh!) so I wonder  what's the point of using the top 1000 most highly expressed genes to compute this statistics. It does not make any sense to me.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Nick260

bettter to download gtf and bed files from RNA-SeQC website

ADD REPLYlink written 4.6 years ago by Manvendra Singh2.1k

I went to their website but couldn't find any. Can you provide a link?

ADD REPLYlink written 4.6 years ago by Nick260

Its here

http://rseqc.sourceforge.net/

ADD REPLYlink written 4.6 years ago by Manvendra Singh2.1k

To be more precise :-)

http://sourceforge.net/projects/rseqc/files/BED/

ADD REPLYlink written 4.6 years ago by Ashutosh Pandey11k

Thanks, but this is RSeQC, not RNA-SeQC I was talking about in my post above. These are completely different tools. I decided not to use RSeQC as it is not taking into account the whole intergenic regions. 

ADD REPLYlink written 4.6 years ago by Nick260

Bo be honest, I don't recall having problems with the GTFs I use (iGenomes), either some modified version of the ensembl gtf, the original or the output of cuffmerge. They all appear to be 'standard GTF files'. I looked up my genes.gtf from Zv9 and GRCh37, and all lines contain the 'transcript_id' entry. So no idea of what is happening with that gtf.

Also, it seems, RNA-SeQC is producing its stats based on the top 1000 highly expressed genes.

It does calculate some stats with the top expressed genes, for speed I guess, but not for everything. Looking back at the results for intronic/exonic/intergenic reads counts I am not sure what is happening. The table contains mapping rate for these, not total read counts. It makes sense for QC to have an idea of what is happening, and the top XXX expressed genes are probably enough for that. However, if what you want is *total alignment counts* for introns, intergenic regions, then the best option might be to create a bed file with those regions and do read count with for instance bedtools multicov. I have not used this strategy though.

 

 

ADD REPLYlink written 4.6 years ago by A. Domingues2.1k

Thanks. It seems Ensembl produces GTF files containing non-coding genes (which makes sense) but these don't have transcript_id (which also makes sense). Apparently, such GTFs still conform to the standard format. As you said - I need to produce interval files from the GTF for the three categories (intergenic, exons, introns).

ADD REPLYlink written 4.6 years ago by Nick260
2

grep transcript_id ensembl_gtf_file > rnaseqc_gtf_file

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by ablanchetcohen1.2k

Yes, sadly there seems to be no tool that can be used off the shelf for it.

ADD REPLYlink written 4.6 years ago by A. Domingues2.1k

hi, do you get your problems resolved? I have the same error to yours.Can you give some advice? thank you for your help in advance.

ADD REPLYlink written 3.5 years ago by huanglingling0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1157 users visited in the last hour