Question

Counting RNA-seq reads for more features than GTF files contain

0

Entering edit mode

9 months ago

bade279 • 0

Hi all,

I have an RNA-seq dataset, and I would like to generate read counts for a wide variety of features, for example, transposable elements, telomeres, centromeres, multiple RNA subtypes, and introns. As far as I can tell, GTF files only permit a very restricted number of features in their third column (Genes, Exons, Start/Stop Codons etc) and only permits me to count one feature at a time (e.g. by specifying 'exon').

I have a GFF file that contains more features, which I have edited to reduce the features down to a subset I am happy with, for example by collapsing centromere elements into one annotation:

awk '$3 == "centromere_DNA_Element_I" || $3 == "centromere_DNA_Element_II" || $3 == "centromere_DNA_Element_III" || $3 == "centromere"' saccharomyces_cerevisiae_R64-3-1_20210421.gff > scere_reduced.gff

Screenshot of GFF file prior to any edits

I also want to count reads which map to unannotated regions. To do this, I used Bedtools compliment to find the inverse of all the coordinates and combined that output with the original gff, so all of the annotation file 'gaps' are covered.

My issue is that FeatureCounts doesn't accept a GFF file that doesn't have gene_ids in column 9, like a GTF file does. How can I use this annotation to get read counts for these features, either using FeatureCounts or another tool?

Additional information:

I have tried manipulating this annotation file in various ways, but these feel like trying to 'game' the tool, so I expect this is not the correct approach.

This involved removing all column 9 information, adding in the feature type for that row as a 'gene_id', changing all features in column three to the word 'Feature', then running featurecounts, counting by feature.

featureCounts -T 2 -f -p --countReadPairs -t Feature -a scere_reduced.gff -o counttable rnaseq_Aligned.sortedByCoord.out.bam

The idea is that this gives an output with gene_id, chr, start, end, read_count, which is what I want:

FeatureCounts Output by Feature

But the vast majority have 0 reads, and there are a huge number of multimapped or ambiguous reads:

FeatureCountsOutputSummary

I then tried variations on this to try to solve the multimapping issue:

Excluding rows in the GFF file that are exact duplicates.
Excluding the complement unannotated regions.
Cycling through Features in column three with 'transcript', 'start_codon', and 'stop_codon', and running FeatureCounts three times, counting each of these, in case having only one 'feature' is problematic (In this case, the column three 'features' are just fake placeholder names - the correct features are as before, in column 9 under Gene_ID).
Aligning with BWA-mem instead of STAR.

All of these led to similar outcomes. I think the multimapping may be due to features commonly having overlapping coordinates in the GFF file, but I'm at a loss how to solve that.

Ideally, I'm hoping there is a simpler solution to count multiple and varied feature types that doesn't involve this gerry-rigging approach.

Many thanks in advance for your advice!

P.s. This is my first post - I have tried to be concise while giving relevant information. If you have constructive feedback please let me know and I will happily edit my post.

gff FeatureCounts RNA-seq gtf • 1.0k views

ADD COMMENT • link 8 months ago by bade279 • 0

1

Entering edit mode

As long as you have some sort of identifier in column 9, you can use that to aggregate reads over a certain type of feature. At least in my version, this is specified with:

  -g <string>         Specify attribute type in GTF annotation. 'gene_id' by
                  default. Meta-features used for read counting will be
                  extracted from annotation using the provided value.

If you want resolution at the individual feature level (e.g. individual exons instead of aggregating all exons per gene), you can add

  -f                  Perform read counting at feature level (eg. counting
                      reads for exons rather than genes).

Alternatively, you can look into the SAF format.

You can have featureCounts count multimappers. Multimappers are defined in the bam file. So if one read maps to 10 genomic loci it will probably have the NH:i:10 tag. featureCounts sees this is more than 1 and discards the read as multimapper unless you specify -M. If you add this option, the featureCounts will count that read for each of the 10 loci. Meaning +1 counts to each of the 10 features. You can add the --fraction option to make it add fractional counts, in our case it would add 1/10 counts to each of the 10 loci (even if all 10 loci are not in your GTF/GFF, it will add 1/10 counts to the loci that are present since it only cares about the NH tag).

Since you are looking at repetitive regions, when there are only multimapping reads, it is impossible to say if they actually come from one region or another. For this reason, you may choose to counts all reads of a repetitive family together. For example, in mouse, there are 3,000+ annotated features for transposon MMERVK10C. Using genomic mapped reads, I treated each of these 3,000 features as an exon and then use the final count to determine expression of the MMERVK10C element. However, I cannot say if regions in chr3 are expressed compared to regions in chr6, but I could determine between conditions if MMERVK10C loci are more active. I don't know if this approach would make sense for you though.

If features are overlapping in the GTF files, these would be the ambiguous reads. You can solve this by counting different sets of features in different runs, or by setting the -O option. This option will work similar to -M and is also affected by the --fraction tag in a similar manner. However, I think you need to sort out your initial problems first, then you may not need to worry about ambiguous reads. For example, due to your attempted workaround, featureCounts did not distinquish between mRNA/gene/coding_region/unannotated. These will all overlap and so will be counted as ambiguous.

If it helps, I was able to successfully use the below annotations in a short test. featureCounts appeared to work as normal

chr1    HAVANA  gene    3073253 3074322 .       +       .       nonsense_id=testGene;
chr1    HAVANA  transcript      3073253 3074322 .       +       .       nonsense_id=testGene;
chr1    HAVANA  exon    3073253 3074322 .       +       .       nonsense_id=testGene;
chr1    ENSEMBL gene    3102016 3102125 .       +       .       nonsense_id=testGene;
chr1    ENSEMBL transcript      3102016 3102125 .       +       .       nonsense_id=testGene;

ADD REPLY • link 9 months ago by rfran010 ▴ 900

0

Entering edit mode

Thanks for this detailed input - It's particularly useful to learn how multimappers and ambiguous reads are defined. My current setup now loops over FeatureCounts for each feature in my GTF as you mentioned in your last paragraph. But you're right, I first have to work on reducing ambiguities in the annotation file. I'm doing that by deleting features with vastly overlapping coordinates and gene IDs like CDS/mRNA/Exon/Gene, retaining only one.

ADD REPLY • link 8 months ago by bade279 • 0

score 1 · Answer 1 · 2023-07-27

1

Entering edit mode

9 months ago

ATpoint 82k

Multimappers are defined during alignment, so in the bam file. It happens when a read matches equally well to > 1 location. There is nothing you can do about that, and it's a common limitation in short-read sequencing for reads from repetitive elements or for short regions as simply by change short elements (such as miRNAs) have a good chance (by combinatorics) to have some regions somewhere in the genome with similar sequence. Changing aligner or modifying GTF will not solve this, as it's fundamental. As for the rest, check the SAF format of featureCounts in case you want to make custom annotations.

ADD COMMENT • link 9 months ago by ATpoint 82k

0

Entering edit mode

That's really good to know and explains why I can't seem to get a different result. My initial assumptions about what to expect were wrong. Thanks for your advice!

ADD REPLY • link 9 months ago by bade279 • 0

score 1 · Answer 2 · 2023-07-27

1

Entering edit mode

9 months ago

Juke34 8.5k

agat_convert_sp_gff2gtf.pl from AGAT can help you, it will create a file that has GFF and GTF relationship specifications.