Question

RNASeQC - Low number of genes detected on run

0

Entering edit mode

10.1 years ago

ivanvishnu • 0

I ran RNASeQC analysis on processed BAM files. These are the steps I followed for processing and the commands for run.

for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
  java -jar ./picard/dist/picard.jar ReorderSam INPUT=$i/accepted_hits.bam OUTPUT=$i/accepted_hits_RO.bam REFERENCE=/work/agron_2/Vishnu/tophat-2.0.14/bin/Zea_mays.AGPv3.26.dna.genome.fa
done

j=1
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
  java -jar picard/dist/picard.jar AddOrReplaceReadGroups INPUT=$i/accepted_hits_RO.bam OUTPUT=$i/accepted_hits_RG.bam SORT_ORDER=coordinate RGID="Sample#$j" RGLB=#$j RGPL=illumina RGPU=#$j RGSM=#$j
  let j+=1
done

for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
  java -jar picard/dist/picard.jar MarkDuplicates INPUT=$i/accepted_hits_RG.bam OUTPUT=$i/accepted_hits_Deduped.bam METRICS_FILE=metrics.txt
done

for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out;
do
  rm $i/accepted_hits_RG.bam
done

for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
  rm $i/accepted_hits_RO.bam
done

for i in tophat-2.0.14/bin/*Out
do
  java -jar picard/dist/picard.jar ValidateSamFile INPUT=$i/accepted_hits_Deduped.bam
done > validateDedup

for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
  java -jar picard/dist/picard.jar BuildBamIndex INPUT=$i/accepted_hits_Deduped.bam
done

j=1
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
  java -jar /work/agron_2/Vishnu/tophat-2.0.14/bin/RNA-SeQC_v1.1.8.jar -r Zea_mays.AGPv3.26.dna.genome.fa -s "Sample1|$i/accepted_hits_Deduped.bam|trial$j" -t ./Zea_mays.AGPv3.26.gtf -o $i/SeqQC;let j+=1
done

While the number of transcripts detected is around 45,267, I consistently get very low gene count around 268 for all the alignment files. I'm not sure where the problem might be... Any suggestion would be helpful..

RNA-Seq RNASeQC • 3.2k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by ivanvishnu • 0

0

Entering edit mode

@ivanvishnu would you be able to edit you question and mark all your code as code. Its a bit hard to reads and follow what's going on. I can't really see it in your code, but I guess you have done since I don't think RNA-SeQC would have run. As check list though, have you:

picard CreateSequenceDictionary on your reference fasta file
index your fasta files (with samtools perhaps)

all must be located in the same directory

Also you can give CREATE_INDEX=true at anytime, but obviously you want it during mark dups run instead of running BuildBamIndex

And finally I think there is some confusion about RNA-SeQC output. I don't think it outputs gene counts. To count how many reads have aligned to gene you need to run htseq-count or featureCounts. Either way samtools flagstat is good tool to run as it gives the same information about mapped reads and unmapped reads and in general you can fish out a lot of info from bam files using different -f option in samtools e.g samtools -c -f 4 will provide all unmapped reads in the bam file

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Kirill Tsyganov ▴ 370

0

Entering edit mode

@Kirill: Thanks for the suggestion. I'd both the .dict file and .fai file in the same directory. The number I mentioned was not gene count. As summary statistic, RNASeqQC reports the number of transcripts and genes detected. The numbers as I mentioned in the post are very different.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by ivanvishnu • 0

Ram · Answer 1 · 2015-06-16

From a bit of googling around and talking to others I think the problem is in you GTF file. I had a pick at your GTF file. This is the head

5       protein_coding  exon    1579    3920    .       -       .        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1"; seqedit "false";  
5       protein_coding  CDS     1684    3903    .       -       0        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1"; protein_id "GRMZM2G356204_P01";  
5       protein_coding  start_codon     3901    3903    .       -       0        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1";  
5       protein_coding  stop_codon      1681    1683    .       -       0        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1";  
5       protein_coding  exon    22933   23193   .       -       .        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1"; seqedit "false";  
5       protein_coding  CDS     22933   23193   .       -       0        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1"; protein_id "GRMZM2G054378_P02";  
5       protein_coding  start_codon     23191   23193   .       -       0        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1";  
5       protein_coding  exon    22723   22798   .       -       .        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "2"; seqedit "false";  
5       protein_coding  CDS     22723   22798   .       -       0        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "2"; protein_id "GRMZM2G054378_P02";  
5       protein_coding  exon    22554   22633   .       -       .        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "3"; seqedit "false";

And this is the head from mouse gtf file for example

#!genome-build GRCm38.p3
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.5
#!genebuild-last-updated 2015-04
1       havana  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC";
1       havana  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; tag "basic";
1       havana  exon    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; exon_number "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; exon_id "ENSMUSE00001343744"; exon_version "1"; tag "basic";
1       ensembl gene    3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
1       ensembl transcript      3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; transcript_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; transcript_biotype "snRNA"; tag "basic"; transcript_support_level "NA";

Here is link to GTF specification in general. However RNA-SeQC do say that that is the format the tools expects. In your GTF column 2 isn't "annotation source" rather I think it is biotype (at least that's what it looks like to me). Also in mouse GTF if you count how many transcripts_id vs gene_id in the transcript feature are there you will find ~100,00 to ~45,000. If RNA-SeQC using those tag to count transcript/gene then the number will be different. But I think the problem is in the GTF format that you have. I suggest you have a closer look at you GTF file.

It would be great to have a look at the source of RNA-SeQC to see what's going on, however I couldn't find source anywhere. Maybe ask them directly. If you do find out how RNA-SeQC detects transcripts/gene I'd love to know, because I do use this tool fair bit.

p.s I do find RNA-SeQC very finicky..

Cheers,