Mapping sequencing reads to 5-UTR, 3-UTR and CDS using Ensembl GTF files
1
0
Entering edit mode
3.4 years ago
Beginner ▴ 80

Hi,

I want to determine the percentage of reads that map to the 5´UTR, 3´UTR and CDS of my sequencing reads. I have fastq files of my sequencing reads and I also have bam files of those reads mapping to the genome.

I wanted to use bedtools coverage to look for the overlap of my reads with the 5´UTR, 3´UTR and CDS, but any program that does this job would also work for me.

Next, I downloaded 5´UTR, 3´UTR and CDS fasta files from Ensembl. Am I right that fasta does not work with bedtools coverage? (EDIT: Fasta files do not work).

The GTF files from Ensembl contain information about sequence positions, including CDS and UTR. However 3´UTR or 5´UTR seem to not be differentiated. Also I could not find a solution to get gtf files which only contain one feature of interest such as the 5´UTR, which could then be used for bedtools or another program.

Thank you for your help!

RNA-Seq alignment sequencing genome • 2.7k views
ADD COMMENT
0
Entering edit mode

Fasta files have no positional information so yes, you cannot use it. GTF files have the information you want encoded, available from Ensembl, NCBI or GENCODE. From there you could use coverage.

ADD REPLY
0
Entering edit mode

Thank you. I would prefer the files from Ensembl, however in their gtf file the positions to specific sequences appear to be grouped in column 3. Their I can see for example exon, CDS or UTR. Do I need to filter them by CDS or UTR? How can I differentiate between 5´UTR and 3´UTR? Is there an easier way to get gtf files from Ensembl which only include CDS, 5´UTR or 3´UTR?

ADD REPLY
0
Entering edit mode

I can confirm that both the human and mouse gtf files from Ensembl contain the information about the 5'UTR and 3'UTR. The code mentioned by i.sudbery is working.

However I am getting some odd results. As expected approximately 90% of my reads map to the CDS and approximately 5-10% map to the 5'UTR and 3'UTR. However I have more then 100% mapping reads if I compare it to my total reads as my input to featureCounts, for example one file:

total reads: 16,560,149 (100%)

Assigned to CDS: 14,805,695 (89.4%)

Assigned to 5'UTR: 1,638,912 (9.9%)

Assigned to 3'UTR: 1,779,508 (10.7%)

Adding assigned reads together: 18,224,115 (110%)

My bam file was filtered for multimappers, also the FeatureCounts output sais Unassigned_MultiMapping 0. Does someone know why I am getting more Assigned Reads than my Input Reads? Is there a way to get rid of this?

ADD REPLY
3
Entering edit mode

However I have more then 100% mapping reads if I compare it to my total reads as my input to featureCounts, for example one file..

Most genes in mammalian genomes have more than one transcript. Regions that are CDS in one transcript may be UTR in other transcripts, and vice versa. If you want your %ages to add up to 100%, you must decide what to do about sequence that is annotated as both UTR and CDS.

Possibility 1 is to decide that anything that is annotated as CDS, is CDS and not anything else, even if it also holds an annotation as something else as well. This might make sense if you are quality controlling exome sequencing and what to make sure you are mostly mapping to coding regions:

To do this, you would take your UTR annotations and subtract any CDS sequences from them:

$ zcat Homo_sapiens.GRCh38.93.gtf.gz | awk '$3=="three_prime_utr"' > Homo_spaiens.GRCh38.93.3utr.gtf.gz
$ zcat Homo_sapiens.GRCh38.93.gtf.gz | awk '$3=="five_prime_utr"' > Homo_spaiens.GRCh38.93.5utr.gtf.gz
$ zcat Homo_sapiens.GRCh38.93.gtf.gz | awk '$3=="CDS"' > Homo_spaiens.GRCh38.93.CDS.gtf.gz

$ bedtools subtract -a Homo_spaiens.GRCh38.93.3utr.gtf.gz  -b Homo_spaiens.GRCh38.93.5utr.gtf.gz \
   | bedtools subtract -a -  -b Homo_spaiens.GRCh38.93.CDS.gtf.gz > Homo_spaiens.GRCh38.93.3utr.no_overlaps.gtf.gz

And similarly for 5'UTRs. Note you'll also have to decide if sequence that is annotated as the 5' UTR of one transcript, but the 3' UTR of another gets counted as 5' UTR or 3' UTR.

Possibility 2 is to decide that you just want to discard all sequence that is annotated as more than one thing. In which case repeat the above for each region type, subtracting the other two.

Note that as I've written these above, the output is a bed3 file. This is suitable for counting the overall read count, using bedtools coverage, but a) won't give you per gene counts, only overall b) Won't work with feaetureCounts, as feature counts requires a GTF file.

There are ways to propogate the gene/transcript_id through this, if its needed, but its more complex.

ADD REPLY
0
Entering edit mode

How would you feel about posting this as a new question? It contains things that future users may find useful, but at the moment, they are unlikely to find it here?

ADD REPLY
0
Entering edit mode

The code mentioned by i.sudbery is working.

Then please accept that answer (green check mark) to provide closure to this thread.

ADD REPLY
4
Entering edit mode
3.4 years ago

Ensembl GTFs (at least of human) should have separate entries for five_prime_utr and three_prime_utr.

If you could use the featureCounts software from the subread package, then you can specifty the feature type you want to map to:

$ featureCounts -a Homo_sapiens.GRCh38.93.gtf -o ouput_file_name.3utr.tsv -t three_prime_utr my_bamfile.bam
$ featureCounts -a Homo_sapiens.GRCh38.93.gtf -o ouput_file_name.5utr.tsv -t five_prime_utr my_bamfile.bam
$ featureCounts -a Homo_sapiens.GRCh38.93.gtf -o ouput_file_name.CDS.tsv -t CDS my_bamfile.bam

if you wanted to use bedtools coverage, then you will want to subset the GTF file first. E.g:

$ zcat Homo_sapiens.GRCh38.93.gtf.gz | awk '$3=="three_prime_utr"' > Homo_spaiens.GRCh38.93.3utr.gtf.gz
ADD COMMENT

Login before adding your answer.

Traffic: 2689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6