Question: RNASeqQC - Low number of genes detected on run
0
gravatar for ivanvishnu
3.8 years ago by
ivanvishnu0
United States
ivanvishnu0 wrote:

    I ran RNASeqQC analysis on processed BAM files. These are the steps I followed for processing and the commands for run. 

############################################################################################################################ Code ################################################################

 1)  for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out; do java -jar ./picard/dist/picard.jar ReorderSam              INPUT=$i/accepted_hits.bam OUTPUT=$i/accepted_hits_RO.bam REFERENCE=/work/agron_2/Vishnu/tophat-2.0.14/bin/Zea_mays.AGPv3.26.dna.genome.fa;done;


2) j=1;for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out;do java -jar picard/dist/picard.jar AddOrReplaceReadGroups INPUT=$i/accepted_hits_RO.bam OUTPUT=$i/accepted_hits_RG.bam SORT_ORDER=coordinate RGID="Sample#$j" RGLB=#$j RGPL=illumina RGPU=#$j RGSM=#$j;let j+=1;done;

3) for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out; do java -jar picard/dist/picard.jar MarkDuplicates INPUT=$i/accepted_hits_RG.bam OUTPUT=$i/accepted_hits_Deduped.bam METRICS_FILE=metrics.txt; done;

4) for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out; do rm $i/accepted_hits_RG.bam; done;

5) for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out; do rm $i/accepted_hits_RO.bam; done;

6) for i in tophat-2.0.14/bin/*Out; do java -jar picard/dist/picard.jar ValidateSamFile INPUT=$i/accepted_hits_Deduped.bam; done > validateDedup

7) for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out; do java -jar picard/dist/picard.jar BuildBamIndex INPUT=$i/accepted_hits_Deduped.bam; done;

8) j=1;for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out; do java -jar /work/agron_2/Vishnu/tophat-2.0.14/bin/RNA-SeQC_v1.1.8.jar -r Zea_mays.AGPv3.26.dna.genome.fa -s "Sample1|$i/accepted_hits_Deduped.bam|trial$j" -t ./Zea_mays.AGPv3.26.gtf -o $i/SeqQC;let j+=1; done;

##############################################################################################

While the number of transcripts detected is around 45,267, I consistently get very low gene count around 268 for all the alignment files. I'm not sure where the problem might be... Any suggestion would be helpful..

 

 

 

rna-seq rnaseqqc • 1.4k views
ADD COMMENTlink modified 3.8 years ago by Kirill260 • written 3.8 years ago by ivanvishnu0

@ivanvishnu would you be able to edit you question and mark all your code as code. Its a bit hard  to reads and follow what's going on. I can't really see it in your code, but I guess you have done since I don't think RNA-SeQC would have run. As check list though, have you:

- `picard CreateSequenceDictionary​` on your reference fasta file

- index your fasta files (with `samtools` perhaps)

all must be located in the same directory

Also you can give `CREATE_INDEX=true` at anytime, but obviously you want it during mark dups run instead of running BuildBamIndex

And finally I think there is some confusion about RNA-SeQC output. I don't think it outputs gene counts. To count how many reads have aligned to gene you need to run htseq-count or featureCounts. Either way `samtools flagstat` is good tool to run as it gives the same information about mapped reads and unmapped reads and in general you can fish out alot of info from bam files using different -f option in samtools e.g `samtools -c -f 4` will provide all unmapped reads in the bam file

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Kirill260

@Kirill: Thanks for the suggestion. I'd both the .dict file and .fai file in the same directory. The number I mentioned was not gene count. As summary statistic, RNASeqQC reports the number of transcripts and genes detected. The numbers as I mentioned in the post are very different. 

ADD REPLYlink written 3.8 years ago by ivanvishnu0
0
gravatar for Kirill
3.8 years ago by
Kirill260
Australia
Kirill260 wrote:

From a bit of googling around and talking to others I think the problem is in you GTF file. I had a pick at your GTF file. This is the head

5       protein_coding  exon    1579    3920    .       -       .        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1"; seqedit "false";  
5       protein_coding  CDS     1684    3903    .       -       0        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1"; protein_id "GRMZM2G356204_P01";  
5       protein_coding  start_codon     3901    3903    .       -       0        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1";  
5       protein_coding  stop_codon      1681    1683    .       -       0        gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1";  
5       protein_coding  exon    22933   23193   .       -       .        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1"; seqedit "false";  
5       protein_coding  CDS     22933   23193   .       -       0        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1"; protein_id "GRMZM2G054378_P02";  
5       protein_coding  start_codon     23191   23193   .       -       0        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1";  
5       protein_coding  exon    22723   22798   .       -       .        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "2"; seqedit "false";  
5       protein_coding  CDS     22723   22798   .       -       0        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "2"; protein_id "GRMZM2G054378_P02";  
5       protein_coding  exon    22554   22633   .       -       .        gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "3"; seqedit "false";  

And this is the head from mouse gtf file for example

#!genome-build GRCm38.p3
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.5
#!genebuild-last-updated 2015-04
1       havana  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC";
1       havana  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; tag "basic";
1       havana  exon    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; exon_number "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; exon_id "ENSMUSE00001343744"; exon_version "1"; tag "basic";
1       ensembl gene    3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
1       ensembl transcript      3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; transcript_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; transcript_biotype "snRNA"; tag "basic"; transcript_support_level "NA";

Here is link to GTF specification in general http://www.gencodegenes.org/data_format.html. However RNA-SeQC do say that that is the format the tools expects. In your GTF column 2 isn't "annotation source" rather I think it is biotype (at least that's what it looks like to me). Also in mouse GTF if you count how many transcripts_id vs gene_id in the transcript feature are there you will find ~100,00 to ~45,000. If RNA-SeQC using those tag to count transcript/gene then the number will be different. But I think the problem is in the GTF format that you have. I suggest you have a closer look at you GTF file.

It would be great to have a look at the source of RNA-SeQC to see what's going on, however I couldn't find source anywhere. Maybe ask them directly. If you do find out how RNA-SeQC detects transcripts/gene I'd love to know, because I do use this tool fair bit.

p.s I do find RNA-SeQC very finicky..

Cheers, 

 

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Kirill260

@Kirill: Thanks a lot. I'll look into this and let you know how it goes. 

ADD REPLYlink written 3.8 years ago by ivanvishnu0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1766 users visited in the last hour