How to get human cDNA sequences together with UTR regions?
11 days ago
Apex92 ▴ 260

Dear all,

I have downloaded the human genome and GTF files from GENCODE. Based on these two files, I want to generate a FASTA file that has cDNA sequences, including 5' and 3' UTRs, for protein-coding genes only. What is the simplest and fastest way to do this?

Would this command work? gffread -w transcripts.fa -g genome.fa transcripts.gtf

Thank you.

transcripts cDNA genome rna-seq • 360 views

cDNA is the "complementary" DNA to the mRNA transcript. mRNA transcripts include UTRs, so the cDNA sequence should too.

11 days ago

The command you specify will concatenate the exon sequences for each transcript, and as long as the exon records cover the UTRs, you will get those too.

To keep only protein-coding genes, you will need to preprocess the GTF file to keep only the records whose gene_type attribute is "protein_coding".

In general, though, I would recommend downloading the cDNA FASTA file from the same source and filtering it down to the transcripts you need.


This probably wants to be the transcript-level attribute (transcript_type in GENCODE GTFs, transcript_biotype in Ensembl ones) rather than the gene-level one, as it's possible to have non-coding transcripts of coding genes.
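To illustrate the difference, here is a minimal sketch on a made-up GTF (the records, the gene G1, and the transcripts T1 and T2 are hypothetical). It assumes GENCODE-style attribute names (gene_type, transcript_type); for an Ensembl GTF the names would be gene_biotype and transcript_biotype. The gene-level filter keeps the retained_intron isoform of a coding gene, while the transcript-level filter drops it.

```shell
# Toy GTF with hypothetical records: gene G1 is protein_coding, but its
# second transcript T2 is a non-coding retained_intron isoform.
tmp=$(mktemp -d)
gtf="$tmp/toy.gtf"
g='gene_id "G1"; gene_type "protein_coding";'
{
  printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\t%s\n' "$g"
  printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t%s transcript_id "T1"; transcript_type "protein_coding";\n' "$g"
  printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\t%s transcript_id "T1"; transcript_type "protein_coding";\n' "$g"
  printf 'chr1\tHAVANA\texon\t800\t900\t.\t+\t.\t%s transcript_id "T1"; transcript_type "protein_coding";\n' "$g"
  printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\t%s transcript_id "T2"; transcript_type "retained_intron";\n' "$g"
  printf 'chr1\tHAVANA\texon\t100\t900\t.\t+\t.\t%s transcript_id "T2"; transcript_type "retained_intron";\n' "$g"
} > "$gtf"

# Gene-level filter: every record of a coding gene passes, including T2.
awk -F'\t' '/^#/ || $9 ~ /gene_type "protein_coding"/' "$gtf" > "$tmp/gene_level.gtf"

# Transcript-level filter: only records of protein-coding transcripts pass.
# Gene records carry no transcript_type, so they are dropped as well.
awk -F'\t' '/^#/ || $9 ~ /transcript_type "protein_coding"/' "$gtf" > "$tmp/tx_level.gtf"

gene_n=$(awk 'END {print NR}' "$tmp/gene_level.gtf")
tx_n=$(awk 'END {print NR}' "$tmp/tx_level.gtf")
echo "gene-level filter keeps $gene_n records; transcript-level filter keeps $tx_n"

# The filtered file would then go to the command from the question, e.g.:
# gffread -w transcripts.fa -g genome.fa "$tmp/tx_level.gtf"
```

On a real annotation, the same awk filter would be applied to the downloaded GTF before running gffread; it is worth checking a few records first to confirm which attribute names your file actually uses.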

11 days ago
Apex92 ▴ 260

Thank you for your comments. In the end, I tried the approach below, which might be helpful for others as well. I would also be happy to get your feedback in case I made any mistakes.

2. Preprocessed the GTF file and converted it to BED format with the structure below (keeping only protein-coding transcripts): chr start end transcript_name type strand.

3. Used bedtools getfasta to extract the sequences in FASTA format from the genome file:
bedtools getfasta -fi genome.fa -bed gencode_protein_coding.bed -name > hsa_protein_coding_transcripts.fa


As far as I know, bedtools getfasta can only concatenate exons if the input is in the 12-column BED format with block information.

If all you had was the 6-column BED you describe, how could it identify the exons that form a transcript?

I believe the method you describe will generate the unspliced transcript.
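A quick way to see how much this matters is to compare the summed exon length with the full genomic span of a transcript. The sketch below uses a made-up 6-column BED (the coordinates and the transcript name T1 are hypothetical) laid out as described above, with one row per exon. It does not extract any sequence; it only shows the length you would expect from a spliced cDNA versus an unspliced one.

```shell
# Hypothetical exon rows for one transcript: chr start end name type strand
tmp=$(mktemp -d)
bed="$tmp/exons.bed"
{
  printf 'chr1\t100\t200\tT1\texon\t+\n'
  printf 'chr1\t500\t650\tT1\texon\t+\n'
  printf 'chr1\t900\t1000\tT1\texon\t+\n'
} > "$bed"

# Per transcript name: sum of exon lengths (spliced) versus the distance
# from the first exon start to the last exon end (unspliced, with introns).
summary=$(awk -F'\t' '
  {
    s = $2 + 0; e = $3 + 0
    spliced[$4] += e - s
    if (!($4 in lo) || s < lo[$4]) lo[$4] = s
    if (!($4 in hi) || e > hi[$4]) hi[$4] = e
  }
  END {
    for (t in spliced)
      printf "%s\tspliced=%d\tunspliced=%d\n", t, spliced[t], hi[t] - lo[t]
  }
' "$bed")
echo "$summary"
```

If the sequence lengths in your FASTA match the unspliced numbers, the introns are included. With one BED row per exon, getfasta would instead be expected to write one record per exon rather than one per transcript. For BED12 input, bedtools getfasta has a -split option to concatenate the blocks, and -s to reverse-complement minus-strand features.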