Question

Non-overlapping exon set for exon counting (Illumina RNA-seq)

0

Entering edit mode

4.1 years ago

andreismolnikov00 ▴ 30

Hi everyone,

I have a set of Illumina short read RNA-seq files which I'm hoping to use to analyse exon usage patterns within a set of specific genes.

The workflow is fastqc -> quality trimming and adapter removal -> alignment with STAR -> sorting and indexing to produce BAM file -> count alignments to annotated exons using featureCounts in subread.

However, I'm running into issues as the GENCODE annotation (GTF) file I used for alignment and feature counting (Release 33 (GRCh38.p13) Comprehensive gene annotation) has:

instances where the same genomic region is annotated with different ENSEMBL exon IDs; and
instances where two different exon annotations overlap one another for the same gene (presumably varying 5' and 3' splice sites??)

In any event, in order to do feature counting properly I need a non-redundant and non-overlapping set of exon annotations. This could possibly be the set of exons in the inferred metatranscript of the gene, or the set of exons where each exon is the largest possible set of overlapping genomic coordinates that are annotated as exons. However, I'm not quite sure how to go about producing something like this, and at which point to incorporate such an annotation into my analyses.

If I attempt to reinterpret the featureCounts table with new coordinates after it's already been produced, it's possible I'll miss some reads that haven't been counted.

I'd really appreciate some help in terms of how to go about doing this. Thanks!

RNA-Seq exon featurecounts subread gtf • 1.7k views

ADD COMMENT • link updated 4.1 years ago by i.sudbery 19k • written 4.1 years ago by andreismolnikov00 ▴ 30

score 1 · Answer 1 · 2020-03-10

A standard way to do this is to collapse all exonic regions for each individual gene and use those as your 'transcriptome' reference. In this case you are assigning these reads to the exonic regions of genes and they are your "meta-features".

Obviously there are cases where you will have multiple genes with overlapping exonic regions. The standard for this is to discard reads in those regions - because you cannot be sure from which gene they originate.

This is done automatically and is clearly stated on the featureCounts "webpage".

Here is what is says:

Overlap between reads and features A read is said to overlap a feature if at least one read base is found to overlap the feature. For paired-end data, a fragment (or template) is said to overlap a feature if any of the two reads from that fragment is found to overlap the feature.

> By default, featureCounts does not count reads overlapping with more than one feature (or more than one meta-feature when summarizing at meta-feature level). Users can use the -O option to instruct featureCounts to count such reads (they will be assigned to all their overlapping features or meta-features).

Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

One additional suggestion is that you could potentially remove genes that you think are not expressed at all - which would potentially allow you to recover some counts.

score 1 · Answer 2 · 2020-03-10

1

Entering edit mode

4.1 years ago

i.sudbery 19k

We use the smallest set of non-overlapping "chunks" that can be combined to generate any transcript of the gene.

you can get this with

$ cgat gtf2gtf --method=genes-to-unique-chunks -I input.gtf.gz > output.gtf

from cgat-apps (installable via bioconda), but the meat of the code is:

def gene_to_blocks(gene):
'''Given a list of all exons in a gene, as a list of CGAT.GTF entries, create a seperate exon
for each unqiue part of a exon, as well as one for introns. '''

    exons = [(e.start, e.end)
             for e in gene if e.feature == "exon"]

    exons = list(set(sum(exons, ())))
    exons.sort()

    entry = GTF.Entry()
    entry = entry.copy(gene[0])
    entry.transcript_id = "merged"
    entry.feature = "exon"
    entry.source = "merged"

    for i in range(len(exons)-1):
        entry.start = exons[i]
        entry.end = exons[i+1]
        entry.attributes["exon_id"] = str(i + 1)
        yield entry

ADD COMMENT • link 4.1 years ago by i.sudbery 19k

0

Entering edit mode

Thanks very much for this! However, I'm struggling to find documentation on the gtf2gtf tool in cgat, and in particular the flags that the tool accepts. --help gives an error:

$ cgat gtf2gtf --help
Traceback (most recent call last):
   File "/home/oates_binf_1/software/miniconda3/bin/cgat", line 11, in <module>
      sys.exit(main())
   File "/home/oates_binf_1/software/miniconda3/lib/python3.7/site-packages/cgat/cgat.py", line 132, in main
      module.main(sys.argv)
   File "/home/oates_binf_1/software/miniconda3/lib/python3.7/site-packages/cgat/tools/gtf2gtf.py", line 362, in main
     parser = E.ArgumentParser(description=__doc__)
AttributeError: module 'cgatcore.experiment' has no attribute 'ArgumentParser'

ADD REPLY • link 4.1 years ago by andreismolnikov00 ▴ 30

0

Entering edit mode

I've pinged the maintainers, but from a quick look at the code, it looks like you've got a mismatch between the version of cgatcore and cgat-apps. Try updating cgat-core with $ conda install -c conda-forge -c biconda cgatcore=0.6.5