Amino acid sequence from CDS information of gff file
2.4 years ago
doubleA • 0

Hi all, I have a somewhat basic question.

The inputs I use for analysis are the reference genome (hg38) and my sample VCF file.

I extracted the CDS regions of a gene from the hg38 GFF file.

For example,

128937678-128937818
128936538-128936649
128935037-128935044
128937678-128937818
128936538-128936649
128935591-128935700
128937678-128937818

After that, I extracted the consensus sequence of each CDS region from my sample VCF file.

Ultimately, I want to get the amino acid sequence.

I wonder whether the nucleotide sequences of the CDS regions extracted above can be concatenated and translated into an amino acid sequence.

For example,

128937678-128937818 -> GAAGTG
128936538-128936649 -> GAGGCATCTCTGA
128935037-128935044 -> GAGCGAG
128937678-128937818 -> ATCTTCGG
128936538-128936649 -> CCTTCGATG
128935591-128935700 -> TTGACAACATCT
128937678-128937818 -> AGCATTTCCTC
Combination -> GAAGTGGAGGCATCTCTGAGAGCGAGATCTTCGGCCTTCGATGTTGACAACATCTAGCATTTCCTC -> Convert to amino acid sequence

Can I get the amino acid sequence like this?

Amino acid CDS gff • 1.8k views

If the GFF format is correct, try gffread with -y (-y writes a protein FASTA file with the translation of the CDS for each record):

$ gffread -y proteins.fa -g Homo_sapiens.GRCh38.dna.chromosome.1.fa Homo_sapiens.GRCh38.96.chromosome.1.gff3
$ head proteins.fa
>transcript:ENST00000641515 gene=OR4F5
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLH
SPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAI
CKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLD
IMVIANSGVLTVCSFVLLIISYTIILMTIQHRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKS
LDKFLAVFYSVITPLLNPIIYTLRNKDMKTAIRQLRKWDAHSSVKF.
>transcript:ENST00000335137 gene=OR4F5
MVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLHSPMYFLLANLSLIDLSLSSVT
APKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAV
TWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIIS


Do you have to do this for only a few CDS or for many of them?

If only a few: copy-paste the DNA sequence into a translation tool (e.g. EMBOSS transeq).

Your example, however, does not really look like a valid CDS (it does not start with an ATG, for instance).
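
For more than a handful of sequences, the same join-and-translate step can be scripted. Below is a minimal Python sketch, not a definitive implementation. It assumes that every segment sequence is given on the forward (reference) strand, that all segments belong to a single transcript, and that the standard genetic code applies. The function names `reverse_complement` and `translate_cds` are made up for illustration. For a minus-strand gene the segments are joined in ascending genomic order and the joined string is reverse-complemented before translation.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate_cds(segments, strand="+"):
    """Join CDS segments of ONE transcript and translate them.

    segments: list of (genomic_start, forward_strand_sequence) tuples.
    Returns (protein, problems), where problems lists sanity-check failures.
    """
    # Join in ascending genomic order; flip the whole thing for minus strand.
    ordered = sorted(segments, key=lambda item: item[0])
    cds = "".join(seq.upper() for _, seq in ordered)
    if strand == "-":
        cds = reverse_complement(cds)

    problems = []
    if not cds.startswith("ATG"):
        problems.append("CDS does not start with ATG")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")

    # Translate complete codons only; codons with ambiguous bases become 'X'.
    usable = len(cds) - len(cds) % 3
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )
    return protein, problems


# Toy input (not real hg38 sequence): two segments of a plus-strand CDS.
protein, problems = translate_cds([(100, "ATGGCC"), (200, "TGA")], strand="+")
print(protein, problems)  # MA* []
```

If the list of problems is not empty, the segments were probably joined in the wrong order, taken from the wrong strand, or mixed from several transcripts, so it is worth checking that before trusting the protein.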


This is theoretically not too difficult to do, but I'm guessing that since these are discontinuous ranges, they've had the introns removed?

How do you define where the first real CDS starts and ends, and where the next one begins, if all of your data looks like that?
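
One way to answer that, assuming the annotation is GFF3, is to use the `Parent` attribute in column 9 of each CDS row, which names the transcript the segment belongs to. The sketch below groups CDS rows by that attribute and puts them in transcript order (descending start for the minus strand). The function name `group_cds_by_transcript`, the sequence name `chrN`, the transcript ID `tx1`, the phase values and the minus strand are all placeholders I made up for illustration. Only the three coordinate ranges come from the question.

```python
from collections import defaultdict


def group_cds_by_transcript(gff_lines):
    """Group GFF3 CDS rows by their Parent attribute.

    Returns {parent_id: [(start, end, strand, phase), ...]} with each list
    sorted in transcript order (descending start for minus-strand features).
    """
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # keep CDS features only
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # A CDS row may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                groups[parent].append(
                    (int(cols[3]), int(cols[4]), cols[6], cols[7])
                )
    for rows in groups.values():
        rows.sort(key=lambda row: row[0], reverse=(rows[0][2] == "-"))
    return dict(groups)


# Coordinates from the question; every other field is a placeholder.
example = [
    "chrN\tsrc\tCDS\t128935037\t128935044\t.\t-\t0\tID=cds3;Parent=tx1",
    "chrN\tsrc\tCDS\t128936538\t128936649\t.\t-\t0\tID=cds2;Parent=tx1",
    "chrN\tsrc\tCDS\t128937678\t128937818\t.\t-\t0\tID=cds1;Parent=tx1",
]
grouped = group_cds_by_transcript(example)
print([row[0] for row in grouped["tx1"]])  # [128937678, 128936538, 128935037]
```

Ranges that repeat in the extracted list, as in the question, would then land in separate groups, one per transcript, instead of being concatenated into one sequence.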