Question: Amino acid sequence from CDS information of gff file
Asked 17 months ago by doubleA:

Hi all, I have a somewhat basic question.

The inputs I use for analysis are the reference genome (hg38) and my sample vcf file.

I extracted the CDS regions of a gene from the hg38 GFF file.

For example,

128937678-128937818
128936538-128936649
128935037-128935044
128937678-128937818
128936538-128936649
128935591-128935700
128937678-128937818

After that, I extracted the consensus sequence of each CDS region from my sample VCF file.

Ultimately, I want to get the amino acid sequence.

I wonder whether the nucleotide sequences of the CDS regions extracted above can be concatenated and translated into an amino acid sequence.

For example,

128937678-128937818 -> GAAGTG
128936538-128936649 -> GAGGCATCTCTGA
128935037-128935044 -> GAGCGAG
128937678-128937818 -> ATCTTCGG
128936538-128936649 -> CCTTCGATG
128935591-128935700 -> TTGACAACATCT
128937678-128937818 -> AGCATTTCCTC
Combination -> GAAGTGGAGGCATCTCTGAGAGCGAGATCTTCGGCCTTCGATGTTGACAACATCTAGCATTTCCTC -> Convert to amino acid sequence

Can I get the amino acid sequence like this?

Tags: amino acid, gff, cds

If the GFF file is well-formed, try gffread with -y, which writes a protein FASTA file with the translation of the CDS for each record:

$ gffread -y proteins.fa -g Homo_sapiens.GRCh38.dna.chromosome.1.fa Homo_sapiens.GRCh38.96.chromosome.1.gff3
$ head proteins.fa
>transcript:ENST00000641515 gene=OR4F5
MKKVTAEAISWNESTSETNNSMVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLH
SPMYFLLANLSLIDLSLSSVTAPKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAI
CKPLHYTTIMCGNACVGIMAVTWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLD
IMVIANSGVLTVCSFVLLIISYTIILMTIQHRPLDKSSKALSTLTAHITVVLLFFGPCVFIYAWPFPIKS
LDKFLAVFYSVITPLLNPIIYTLRNKDMKTAIRQLRKWDAHSSVKF.
>transcript:ENST00000335137 gene=OR4F5
MVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLHSPMYFLLANLSLIDLSLSSVT
APKMITDFFSQRKVISFKGCLVQIFLLHFFGGSEMVILIAMGFDRYIAICKPLHYTTIMCGNACVGIMAV
TWGIGFLHSVSQLAFAVHLLFCGPNEVDSFYCDLPRVIKLACTDTYRLDIMVIANSGVLTVCSFVLLIIS
written 17 months ago by AK (modified 17 months ago)

Do you have to do this for only a few CDS, or for many of them?

If only a few: copy-paste the DNA sequence into a translation tool (e.g. EMBOSS transeq).
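For a handful of sequences, the translation step can also be sketched in a few lines of Python instead of a web tool. This is only an illustration, not what transeq does internally: it assumes the standard genetic code and reading frame 0, and the names `translate` and `CODON_TABLE` as well as the toy segments are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a nucleotide string in frame 0.

    Stop codons become '*', codons with ambiguous bases become 'X',
    and a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Hypothetical CDS segments, already in transcript order and orientation.
segments = ["ATGGCC", "AAATAA"]
print(translate("".join(segments)))  # MAK*
```

The segments have to be joined in transcript order before translating, because a codon can span the boundary between two CDS segments.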

Your example, however, does not really look like a valid CDS (it does not start with an ATG, for instance).
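A quick way to run this kind of sanity check on a concatenated sequence is sketched below. The function name `cds_problems` and the particular set of checks are assumptions chosen for illustration; real annotations can legitimately break them (partial CDS records, non-ATG starts), so treat the output as hints rather than a verdict.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds):
    """Return a list of reasons why `cds` does not look like a complete CDS."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    # Scan every complete codon except the last one for premature stops.
    for i in range(0, len(cds) - 3, 3):
        codon = cds[i:i + 3]
        if codon in STOP_CODONS:
            problems.append(f"internal stop codon {codon} at codon {i // 3 + 1}")
    return problems


# The concatenated example sequence from the question.
example = "GAAGTGGAGGCATCTCTGAGAGCGAGATCTTCGGCCTTCGATGTTGACAACATCTAGCATTTCCTC"
for problem in cds_problems(example):
    print(problem)
```

An empty list means the sequence passed these basic checks; it does not prove that the segments were joined in the right order or orientation.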

written 17 months ago by lieven.sterck

This is theoretically not too difficult to do, but I'm guessing that since these are discontinuous ranges, they've had introns removed?

How do you define where the first real CDS starts and ends, and where the next one begins, if all of your data looks like that?
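One way to keep that boundary information is to group the CDS rows by the `Parent` attribute of the GFF3 file rather than working from a bare list of ranges. The sketch below assumes a GFF3 file with nine tab-separated columns and a `Parent=` attribute on each CDS row; the helper names, the `chrN` sequence name, the transcript IDs `tx1`/`tx2`, and the minus strand are all hypothetical. The coordinates in the question decrease from one segment to the next, which would be consistent with a minus-strand gene, but that is only a guess.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def cds_by_transcript(gff_lines):
    """Group CDS rows by Parent, sorted into translation order per strand."""
    groups = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # A GFF3 feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                groups[parent].append((int(cols[3]), int(cols[4]), cols[6]))
    for segs in groups.values():
        minus = segs[0][2] == "-"
        # Plus strand: ascending start; minus strand: descending start.
        segs.sort(key=lambda s: s[0], reverse=minus)
    return dict(groups)


def spliced_cds(chrom_seq, segments):
    """Join segments (1-based, inclusive) taken from one chromosome string."""
    parts = []
    for start, end, strand in segments:
        piece = chrom_seq[start - 1:end]
        if strand == "-":
            piece = piece.translate(COMPLEMENT)[::-1]  # reverse complement
        parts.append(piece)
    return "".join(parts)


def row(start, end, parent):
    """Build a hypothetical minus-strand CDS row for the demo."""
    return "\t".join(
        ["chrN", "example", "CDS", str(start), str(end), ".", "-", "0",
         f"Parent={parent}"]
    )


rows = [
    row(128936538, 128936649, "tx1"),
    row(128937678, 128937818, "tx1"),
    row(128935037, 128935044, "tx1"),
    row(128937678, 128937818, "tx2"),
    row(128935591, 128935700, "tx2"),
]
for transcript, segs in cds_by_transcript(rows).items():
    print(transcript, [(start, end) for start, end, _ in segs])
```

With the rows grouped this way, each transcript gets its own ordered list of segments, so shared first exons (such as the range that appears several times in the question) end up once in every transcript that uses them.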

written 17 months ago by Joe