I ran RNASeQC analysis on processed BAM files. These are the steps I followed for processing and the commands for run.
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
java -jar ./picard/dist/picard.jar ReorderSam INPUT=$i/accepted_hits.bam OUTPUT=$i/accepted_hits_RO.bam REFERENCE=/work/agron_2/Vishnu/tophat-2.0.14/bin/Zea_mays.AGPv3.26.dna.genome.fa
done
j=1
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
java -jar picard/dist/picard.jar AddOrReplaceReadGroups INPUT=$i/accepted_hits_RO.bam OUTPUT=$i/accepted_hits_RG.bam SORT_ORDER=coordinate RGID="Sample#$j" RGLB=#$j RGPL=illumina RGPU=#$j RGSM=#$j
let j+=1
done
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
java -jar picard/dist/picard.jar MarkDuplicates INPUT=$i/accepted_hits_RG.bam OUTPUT=$i/accepted_hits_Deduped.bam METRICS_FILE=metrics.txt
done
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out;
do
rm $i/accepted_hits_RG.bam
done
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
rm $i/accepted_hits_RO.bam
done
for i in tophat-2.0.14/bin/*Out
do
java -jar picard/dist/picard.jar ValidateSamFile INPUT=$i/accepted_hits_Deduped.bam
done > validateDedup
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
java -jar picard/dist/picard.jar BuildBamIndex INPUT=$i/accepted_hits_Deduped.bam
done
j=1
for i in /work/agron_2/Vishnu/tophat-2.0.14/bin/*Out
do
java -jar /work/agron_2/Vishnu/tophat-2.0.14/bin/RNA-SeQC_v1.1.8.jar -r Zea_mays.AGPv3.26.dna.genome.fa -s "Sample1|$i/accepted_hits_Deduped.bam|trial$j" -t ./Zea_mays.AGPv3.26.gtf -o $i/SeqQC;let j+=1
done
While the number of transcripts detected is around 45,267, I consistently get very low gene count around 268 for all the alignment files. I'm not sure where the problem might be... Any suggestion would be helpful..
@ivanvishnu would you be able to edit you question and mark all your code as code. Its a bit hard to reads and follow what's going on. I can't really see it in your code, but I guess you have done since I don't think RNA-SeQC would have run. As check list though, have you:
picard CreateSequenceDictionary​ on your reference fasta file
index your fasta files (with samtools perhaps)
all must be located in the same directory
Also you can give CREATE_INDEX=true at anytime, but obviously you want it during mark dups run instead of running BuildBamIndex
And finally I think there is some confusion about RNA-SeQC output. I don't think it outputs gene counts. To count how many reads have aligned to gene you need to run htseq-count or featureCounts. Either way samtools flagstat is good tool to run as it gives the same information about mapped reads and unmapped reads and in general you can fish out a lot of info from bam files using different -f option in samtools e.g samtools -c -f 4 will provide all unmapped reads in the bam file
@Kirill: Thanks for the suggestion. I'd both the .dict file and .fai file in the same directory. The number I mentioned was not gene count. As summary statistic, RNASeqQC reports the number of transcripts and genes detected. The numbers as I mentioned in the post are very different.
Here is link to GTF specification in general. However RNA-SeQC do say that that is the format the tools expects. In your GTF column 2 isn't "annotation source" rather I think it is biotype (at least that's what it looks like to me). Also in mouse GTF if you count how many transcripts_id vs gene_id in the transcript feature are there you will find ~100,00 to ~45,000. If RNA-SeQC using those tag to count transcript/gene then the number will be different. But I think the problem is in the GTF format that you have. I suggest you have a closer look at you GTF file.
It would be great to have a look at the source of RNA-SeQC to see what's going on, however I couldn't find source anywhere. Maybe ask them directly. If you do find out how RNA-SeQC detects transcripts/gene I'd love to know, because I do use this tool fair bit.
@ivanvishnu would you be able to edit you question and mark all your code as code. Its a bit hard to reads and follow what's going on. I can't really see it in your code, but I guess you have done since I don't think RNA-SeQC would have run. As check list though, have you:
picard CreateSequenceDictionary​on your reference fasta filesamtoolsperhaps)all must be located in the same directory
Also you can give
CREATE_INDEX=trueat anytime, but obviously you want it during mark dups run instead of running BuildBamIndexAnd finally I think there is some confusion about RNA-SeQC output. I don't think it outputs gene counts. To count how many reads have aligned to gene you need to run
htseq-countorfeatureCounts. Either waysamtools flagstatis good tool to run as it gives the same information about mapped reads and unmapped reads and in general you can fish out a lot of info from bam files using different-foption in samtools e.gsamtools -c -f 4will provide all unmapped reads in the bam file@Kirill: Thanks for the suggestion. I'd both the .dict file and .fai file in the same directory. The number I mentioned was not gene count. As summary statistic, RNASeqQC reports the number of transcripts and genes detected. The numbers as I mentioned in the post are very different.