featureCounts does not summarise counts from added/reporter genes
1
1
Entering edit mode
15 months ago
eduardogccm ▴ 10

Hi,

I am currently analysing a RNAseq dataset of human lymphocytes containing a vector (which has mCHERRY as reporter gene and a different gene A).

In order to include the annotation for the new genes I have followed the instructions of Alexander Dobin (https://groups.google.com/forum/#!topic/rna-star/FGQRotrCB1Q) and it seems like STAR (v2.7.3a) is detecting them and including them in the BAM files.

However, when I try summarising the BAM files using featureCounts (I am using Rsubread 1.34.4 in my university's computer cluster and subread 2.0.0 at my local machine) these genes are not listed (not even with 0 counts). I have also tried adding an additional line for each gene changing "exon" for "gene", but it does not solve the issue.

Do you have any suggestions of why this might be?

This is the code I am using in the Rsubread:

feature_counts <- featureCounts( nthreads = 6,
isGTFAnnotationFile = TRUE,
GTF.featureType = "exon",
annot.ext = "genome/gencode.v31.primary_assembly.annotation.gtf",
GTF.attrType = "gene_id",
primaryOnly = TRUE,
isPairedEnd = TRUE,
files =  paste( "alignment/",
list.files( path = "alignment/",
pattern = ".bam\$",
recursive = TRUE,
include.dirs = TRUE )[-1],
sep = "" ) )


featureCounts -T 2 -t exon -a ./gencode.v31.primary_assembly.annotation.gtf -t exon -g gene_id -M --primary -p -o ./count_matrix/counts.txt \./alignment_test/AG*/*.bam


Many thanks!!

Eduardo

0
Entering edit mode

Did you grep for the added "chromosome" to be sure it is in the GTF? Please show the respective line of that file.

0
Entering edit mode

When I check the chrName.txt the name of the chromosome is there (mCHERRY) and when I look at the GTF file I get this:

tail -3 gencode.v31.primary_assembly.annotation.gtf

KI270750.1  ENSEMBL transcript  148668  148843  .   +   .   gene_id "ENSG00000277374.1"; transcript_id "ENST00000612925.1"; gene_type "snRNA"; gene_name "RF00003"; transcript_type "snRNA"; transcript_name "RF00003.38-201"; level 3; transcript_support_level "NA"; tag "basic";
KI270750.1  ENSEMBL exon    148668  148843  .   +   .   gene_id "ENSG00000277374.1"; transcript_id "ENST00000612925.1"; gene_type "snRNA"; gene_name "RF00003"; transcript_type "snRNA"; transcript_name "RF00003.38-201"; exon_number 1; exon_id "ENSE00003728746.1"; level 3; transcript_support_level "NA"; tag "basic";
mCHERRY      AddedGenes      exon    1 708   .   +   .   gene_id "mCHERRY"; transcript_id "mCHERRY";


Edit: only the mCHERRY line, fo it to be clearer:

mCHERRY      AddedGenes      exon    1 708   .   +   .   gene_id "mCHERRY"; transcript_id "mCHERRY";

0
Entering edit mode

Can you try changing AddedGenes to ENSEMBL to be consistent with rest of the file? The entry for that gene does have the same name >mCHERRY in your reference genome fasta and it is present in your alignment file?

0
Entering edit mode

Hi, I am generating now the genome indices changing AddedGenes to ENSEMBL.

As for the FASTA, the name is the same in the FASTA file name, in the first line of the FASTA (>mCHERRY) and in the GTF.

After STAR alignment (without changing AddedGenes) I was getting mCHERRY in the the Log.out file as a chromosome:

192 KI270755.1  36723   3137601536
193 KI270756.1  79590   3137863680
194 KI270757.1  71251   3138125824
195 mCHERRY 708 3138387968


And these are the last two lines when grepping for mCHERRY in the BAM file after alignment:

samtools view alignment/Aligned.sortedByCoord.out.bam | grep "mCHERRY"

J00130:134:HFVFJBBXY:4:1108:6441:20885  147 mCHERRY 623 255 50M28S  =   568 -105    CCCACAACGAGGACTACACCATCGTGGAACAGTACGAACGCGCCGAGGGCCGCCGTCCTTCAGCTTCAGCCTCTGCTT  JJJJJFFJJFJJJJJJJJJJJJJJJFFJJJ<JFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA  NH:i:1  HI:i:1  AS:i:106    nM:i:10
J00130:134:HFVFJBBXY:8:2111:8126:8611   147 mCHERRY 624 255 49M29S  =   544 -129    CCACAACGAGGACTACACCATCGTGGAACAGTACGAACGCGCCGAGGGCCGCCGTCCTTCAGCTTCAGCCTCTGCTTG  FAFJJFJJJJJJJJJJFJFFAJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA  NH:i:1  HI:i:1  AS:i:105    nM:i:10


I will wait to see if changing AddedGenes for ENSEMBL makes a difference.

0
Entering edit mode

Great. Looks like everything else seems to be in order.

0
Entering edit mode

Hi,

I have repeated the genome indexing, the alignment and the summarisation with featureCounts after changing AddedGenes to ENSEMBL and still the additional genes are not summarised in featureCounts.

I also tried to convert my "old"/AddedGenes GTF to SAF using the flattenGTF function (from Rsubread) in case it would make a difference. The genes I included (e.g. mCHERRY) are not in the SAF produced by the function, which makes me think that it is a problem of the Subread package interpreting the GTF file (or at least the modifications I made). Could it be?

1
Entering edit mode
15 months ago
eduardogccm ▴ 10

After checking a bit more carefully at the GTF file I found that even though in the instructions from Alenxander Dobin it suggests adding a space between the start and end of the gene (e.g. 1 and 708) and this seems to work for STAR, this should be a tabulation and not a space. After changing this it seems like featureCounts has no problems with including the new genes in the summary.

1
Entering edit mode

It may help to run GTF files through validators. Not all GTF files are created correctly, it seems: A: BEDOPS gtf2bed conversion error with Ensembl GTF

0
Entering edit mode

It was my first time modifying a GTF file so I'll keep in mind all the advise!