Question

How to get human cDNA sequences together with UTR regions?


13 months ago

Apex92 ▴ 280

Dear all,

I have downloaded the human genome and GTF annotation files from GENCODE. From these two files I want to generate a FASTA file of cDNA sequences, including the 5' and 3' UTRs, for protein-coding genes only. What is the simplest and fastest way to do this?

Would this command work? gffread -w transcripts.fa -g genome.fa transcripts.gtf

Thank you.

transcripts cDNA genome rna-seq • 938 views

updated 13 months ago by Istvan Albert 100k • written 13 months ago by Apex92 ▴ 280


cDNA is the "complementary" DNA to the mRNA transcript. mRNA transcripts include UTRs, so the cDNA sequence should too.

reply • 13 months ago by i.sudbery 19k

score 0 · Answer 1 · 2023-03-16


13 months ago

Istvan Albert 100k

The command you specify will concatenate the exon sequences of each transcript, and as long as the exons contain the UTRs, you will get those too.

To keep only the exons of protein-coding genes, you might need to preprocess the GTF file so that it retains only the records that carry the gene_type "protein_coding" attribute.
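A minimal Python sketch of that preprocessing step, assuming a GENCODE-style GTF in which the ninth column holds attributes such as gene_type "protein_coding". The function name filter_gtf and the toy records are hypothetical, and the attribute key is a parameter so that a transcript-level key can be used instead:

```python
import re

def filter_gtf(lines, key="gene_type", value="protein_coding"):
    """Yield comment lines and the records whose attribute column has: key "value"."""
    pattern = re.compile(r'\b%s "%s"' % (re.escape(key), re.escape(value)))
    for line in lines:
        if line.startswith("#"):
            # Keep header/comment lines unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has exactly 9 tab-separated columns; attributes are the last.
        if len(fields) == 9 and pattern.search(fields[8]):
            yield line

if __name__ == "__main__":
    # Toy records (hypothetical IDs), not real annotation.
    toy = [
        '##description: toy example\n',
        'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; '
        'gene_type "protein_coding"; transcript_type "protein_coding";\n',
        'chr1\tHAVANA\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; '
        'gene_type "protein_coding"; transcript_type "retained_intron";\n',
    ]
    for kept in filter_gtf(toy, key="transcript_type"):
        print(kept, end="")
```

The filtered lines could then be written to a new GTF file and passed to the gffread command from the question.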

In general, though, I would recommend downloading the cDNA files from the same source and filtering those by some other method.
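One way such filtering could look, as a sketch only: it assumes the FASTA headers are pipe-delimited and that one of the fields is the biotype string, which should be checked against the actual file. The function name filter_fasta and the toy headers are hypothetical:

```python
def filter_fasta(lines, wanted="protein_coding"):
    """Yield only the FASTA records whose '|'-delimited header contains `wanted` as a field."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide per record, based on an exact match against one header field.
            fields = line[1:].strip().split("|")
            keep = wanted in fields
        if keep:
            yield line

if __name__ == "__main__":
    # Toy records with a made-up header layout (assumption, not the real file format).
    toy = [
        ">T1|G1|GENE1|protein_coding|\n", "ATGGCC\n", "TTAA\n",
        ">T2|G2|GENE2|lncRNA|\n", "GGGCCC\n",
    ]
    print("".join(filter_fasta(toy)), end="")
```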



This probably wants to be transcript_biotype (transcript_type in GENCODE GTFs) rather than gene_biotype (gene_type), as it's possible to have non-coding transcripts of coding genes.

reply • 13 months ago by i.sudbery 19k

score 0 · Answer 2 · 2023-03-16


13 months ago

Apex92 ▴ 280

Thank you for your comments. In the end I tried the approach below, which might be helpful for others as well. I would also be happy to get your feedback in case I made any mistakes.

1. Downloaded both the genome and the GTF file from GENCODE.
2. Preprocessed the GTF file, keeping only protein-coding transcripts, and converted it to BED format with the columns: chr, start, end, transcript_name, type, strand.
3. Extracted the sequences in FASTA format from the genome file with bedtools:
bedtools getfasta -fi genome.fa -bed gencode_protein_coding.bed -name > hsa_protein_coding_transcripts.fa
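The GTF-to-BED conversion in this approach could be sketched as follows. This is an illustration rather than the script actually used: the function name gtf_exon_to_bed6 is hypothetical, it uses transcript_id as the name column (an assumption), and it places the feature type in the fifth column to match the layout described. The detail worth noting is the coordinate shift, since GTF starts are 1-based inclusive while BED starts are 0-based:

```python
import re

def gtf_exon_to_bed6(line):
    """Convert one GTF exon record to a 6-column BED line; return None for other lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "exon":
        return None
    match = re.search(r'transcript_id "([^"]+)"', fields[8])
    if match is None:
        return None
    chrom, feature, start, end, strand = fields[0], fields[2], fields[3], fields[4], fields[6]
    # GTF: 1-based, end inclusive. BED: 0-based, end exclusive. Only the start shifts.
    bed_start = str(int(start) - 1)
    return "\t".join([chrom, bed_start, end, match.group(1), feature, strand])

if __name__ == "__main__":
    # Toy record with hypothetical IDs.
    record = 'chr1\tHAVANA\texon\t101\t200\t.\t-\t.\tgene_id "G1"; transcript_id "T1";'
    print(gtf_exon_to_bed6(record))
```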



As far as I know, bedtools getfasta can only concatenate exons if the input is in the 12-column BED format with block information (together with the -split option).

If all you had was the 6-column BED you describe, how could it identify the exons that form a transcript?

I believe the method you describe will generate the unspliced transcript.

reply • 13 months ago by Istvan Albert 100k
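To make the spliced versus unspliced distinction concrete, here is a small self-contained sketch, not a replacement for gffread or bedtools. It assumes exons are given as BED-style 0-based, half-open intervals tagged with a transcript name, groups them by transcript, concatenates them in genomic order, and reverse-complements minus-strand transcripts. All names and the toy genome are hypothetical:

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def spliced_transcripts(genome, exons):
    """genome: dict chrom -> sequence.
    exons: iterable of (chrom, start, end, transcript, strand), 0-based half-open.
    Returns dict transcript -> spliced sequence in transcript orientation."""
    by_transcript = defaultdict(list)
    for chrom, start, end, transcript, strand in exons:
        by_transcript[transcript].append((chrom, start, end, strand))
    result = {}
    for transcript, parts in by_transcript.items():
        parts.sort(key=lambda p: p[1])  # genomic order, regardless of strand
        strand = parts[0][3]
        seq = "".join(genome[c][s:e] for c, s, e, _ in parts)
        result[transcript] = revcomp(seq) if strand == "-" else seq
    return result

if __name__ == "__main__":
    genome = {"chr1": "AAACCCGGGTTTAC"}
    exons = [
        ("chr1", 0, 3, "T1", "+"), ("chr1", 6, 9, "T1", "+"),
        ("chr1", 3, 6, "T2", "-"), ("chr1", 12, 14, "T2", "-"),
    ]
    print(spliced_transcripts(genome, exons))
    # The unspliced span of T1 would instead be genome["chr1"][0:9], introns included.
```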