Question

Issues with Rhesus Monkey RNASeq Analysis -- featureCounts has very low assignment rates

0

Entering edit mode

5.4 years ago

harithaa • 0

Hello,

I am working with non-human RNA-seq for the first time and this is with the Rhesus Monkey. I am having very low alignment rates after featureCounts. The out put is like so:

> Load annotation file rheMac10refGene.gtf.sorted ...                     
   Features : 47357                                              
    Meta-features : 6331                                  
Chromosomes/contigs : 46                                                

> Process BAM file *****.bam...              
Single-end reads are included.                                          
    Assign alignments to features...                                        
    Total alignments : 84767064                                             
    Successfully assigned alignments : 8615189 (10.2%)                      
    Running time : 4.04 minutes

So the steps I've done so far:

Downloaded the entire assembly from ftp://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/bigZips/*
Converted the 2bit format to .fa using twoBitToFa tool from rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/

-- this is the file, which after indexing was the reference file for HISAT2

Next, for the GTF file I downloaded the rheMac10.refGene.txt and converted it to a GTF file using genePredToGtf tool. Which after cleaning, supplied as the gtf file for featureCounts.

Is this the way to go? My mapping rates after HISAT2 are high and comparable to the human genome ones. What am I doing wrong?

RNA-Seq rna-seq • 1.1k views

ADD COMMENT • link 5.4 years ago by harithaa • 0

0

Entering edit mode

featureCounts has very low alignment rate

You have low assignment rates.

Can you provide actual featureCounts command you are using? Did you actually use a matched reference and annotation when you did all your analysis? You can't mix and match those from different sources.

ADD REPLY • link 5.4 years ago by GenoMax 152k

0

Entering edit mode

for sortedbam in $(ls *.bam)
do
name=${sortedbam##*/} #removes the filepath name up until the / of the file name
base=${name%.bam}

echo $base && featureCounts -T 12  -a /ref/rheMac10refGene.gtf.sorted \
-o /counts/"${base}".txt $sortedbam

done

ADD REPLY • link updated 5.4 years ago by GenoMax 152k • written 5.4 years ago by harithaa • 0

0

Entering edit mode

Yes, my reference was also from rheMac10 and the annotation (derived from rheMac10.refGene.txt) is also from the same. -- The reference genome (.fa) is only 2.9G and the GTF file 17M. This is also a part that confuses me. Is something wrong with this? This is the reference file I used

rheMac10.2bit - contains the complete rhesus/rheMac10 genome sequence in the 2bit file format. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. The utility program, twoBitToFa (available from the kent src tree), can be used to extract .fa file(s) from this file. A pre-compiled version of the command line tool can be found at: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

ADD REPLY • link 5.4 years ago by harithaa • 0

0

Entering edit mode

A couple of things to check first. If you have rRNA contamination in the data. It is possible that a lot of your reads are multi-mapping. These are not counted by featureCounts by default. You should also add -p (if this was a paired end dataset). You would want to count at exon level, but summarize at gene level (something like -t exon -g gene_id ). I don't recollect immediately if that is the default.

ADD REPLY • link 5.4 years ago by GenoMax 152k