Question: Non-overlapping exon set for exon counting (Illumina RNA-seq)
0
gravatar for andreismolnikov00
12 weeks ago by
andreismolnikov000 wrote:

Hi everyone,

I have a set of Illumina short read RNA-seq files which I'm hoping to use to analyse exon usage patterns within a set of specific genes.

The workflow is fastqc -> quality trimming and adapter removal -> alignment with STAR -> sorting and indexing to produce BAM file -> count alignments to annotated exons using featureCounts in subread.

However, I'm running into issues as the GENCODE annotation (GTF) file I used for alignment and feature counting (Release 33 (GRCh38.p13) Comprehensive gene annotation) has:

  • instances where the same genomic region is annotated with different ENSEMBL exon IDs; and
  • instances where two different exon annotations overlap one another for the same gene (presumably varying 5' and 3' splice sites??)

In any event, in order to do feature counting properly I need a non-redundant and non-overlapping set of exon annotations. This could possibly be the set of exons in the inferred metatranscript of the gene, or the set of exons where each exon is the largest possible set of overlapping genomic coordinates that are annotated as exons. However, I'm not quite sure how to go about producing something like this, and at which point to incorporate such an annotation into my analyses.

If I attempt to reinterpret the featureCounts table with new coordinates after it's already been produced, it's possible I'll miss some reads that haven't been counted.

I'd really appreciate some help in terms of how to go about doing this. Thanks!

ADD COMMENTlink modified 12 weeks ago by i.sudbery7.8k • written 12 weeks ago by andreismolnikov000
1
gravatar for benformatics
12 weeks ago by
benformatics1.6k
ETH Zurich
benformatics1.6k wrote:

A standard way to do this is to collapse all exonic regions for each individual gene and use those as your 'transcriptome' reference. In this case you are assigning these reads to the exonic regions of genes and they are your "meta-features".

Obviously there are cases where you will have multiple genes with overlapping exonic regions. The standard for this is to discard reads in those regions - because you cannot be sure from which gene they originate.

This is done automatically and is clearly stated on the featureCounts "webpage".

Here is what is says:

Overlap between reads and features A read is said to overlap a feature if at least one read base is found to overlap the feature. For paired-end data, a fragment (or template) is said to overlap a feature if any of the two reads from that fragment is found to overlap the feature.

> By default, featureCounts does not count reads overlapping with more than one feature (or more than one meta-feature when summarizing at meta-feature level). Users can use the -O option to instruct featureCounts to count such reads (they will be assigned to all their overlapping features or meta-features).

Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

One additional suggestion is that you could potentially remove genes that you think are not expressed at all - which would potentially allow you to recover some counts.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by benformatics1.6k

Thanks for this. What I'm having trouble with is how to do that first part: "collapse all exonic regions for each individual gene and use those as your 'transcriptome' reference." Does this mean editing the GTF file prior to alignment? And if so, how do I do this?

ADD REPLYlink written 12 weeks ago by andreismolnikov000

I believe as long as your GTF file has the right meta-data information this should be the default for the featureCounts function. The first example from their manual:

featureCounts -T 5 -a annotation.gtf -t exon -g gene id -o counts.txt mapping results SE.sam

Above the -t refers to what column to use as the counted features and then -g is what to group them by. In this case -g is your meta-features (i.e. genes) and the exons within each gene group are the regions where counts can come from.

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by benformatics1.6k
1
gravatar for i.sudbery
12 weeks ago by
i.sudbery7.8k
Sheffield, UK
i.sudbery7.8k wrote:

We use the smallest set of non-overlapping "chunks" that can be combined to generate any transcript of the gene.

you can get this with

$ cgat gtf2gtf --method=genes-to-unique-chunks -I input.gtf.gz > output.gtf

from cgat-apps (installable via bioconda), but the meat of the code is:

def gene_to_blocks(gene):
'''Given a list of all exons in a gene, as a list of CGAT.GTF entries, create a seperate exon
for each unqiue part of a exon, as well as one for introns. '''

    exons = [(e.start, e.end)
             for e in gene if e.feature == "exon"]

    exons = list(set(sum(exons, ())))
    exons.sort()

    entry = GTF.Entry()
    entry = entry.copy(gene[0])
    entry.transcript_id = "merged"
    entry.feature = "exon"
    entry.source = "merged"

    for i in range(len(exons)-1):
        entry.start = exons[i]
        entry.end = exons[i+1]
        entry.attributes["exon_id"] = str(i + 1)
        yield entry
ADD COMMENTlink written 12 weeks ago by i.sudbery7.8k

Thanks very much for this! However, I'm struggling to find documentation on the gtf2gtf tool in cgat, and in particular the flags that the tool accepts. --help gives an error:

$ cgat gtf2gtf --help
Traceback (most recent call last):
   File "/home/oates_binf_1/software/miniconda3/bin/cgat", line 11, in <module>
      sys.exit(main())
   File "/home/oates_binf_1/software/miniconda3/lib/python3.7/site-packages/cgat/cgat.py", line 132, in main
      module.main(sys.argv)
   File "/home/oates_binf_1/software/miniconda3/lib/python3.7/site-packages/cgat/tools/gtf2gtf.py", line 362, in main
     parser = E.ArgumentParser(description=__doc__)
AttributeError: module 'cgatcore.experiment' has no attribute 'ArgumentParser'
ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by andreismolnikov000

I've pinged the maintainers, but from a quick look at the code, it looks like you've got a mismatch between the version of cgatcore and cgat-apps. Try updating cgat-core with $ conda install -c conda-forge -c biconda cgatcore=0.6.5

ADD REPLYlink written 12 weeks ago by i.sudbery7.8k

I have now updated cgat-apps so the correct pinning of cgatcore occurs.

ADD REPLYlink written 12 weeks ago by a.p.cribbs0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 838 users visited in the last hour