Question

Consensus sequences producing stop codons when translated

0

Entering edit mode

3 months ago

hemr3 ▴ 10

Hi all!

I have created a consensus sequence from a human reference and a Neanderthal genome (to fill gaps - I know this will drastically reduce the power of my research but this is fine for now), to observe potential changes between Neanderthal and human genes. I did this using BCFTOOLS.

I made sure the human alignment and Neanderthal alignment matched.

bcftools consensus -f human_reffasta_chr[].fa.gz -o consen_chr[].fa chromosome_neanderthal.vcf.gz

I then cut the required exons using SAMTOOLS.

The coordinates of the exons were drawn from https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=GV&DATA=92322&BUILDS=ALLBUILDS for each gene.

samtools faidx consen_chr[].fa chr:pos-pos > gene_consen_exon_no.fa

                #cat-ting all the exons together for translation
cat ~/Research/*.fa > gene_total_exons.fa

However, when using the ENSEMBL translate tool online, the translations produced had stop codons within the amino acid sequence and/or did not have a methionine present in the beginning of the sequence; eg.

>124936775-124936906_1
VWLNSPIWTHLEKSPRAITIVETASITQSRG*SRTIGTAF*ETQ

Given the similarity between Neanderthals and humans in general, and the additional similarities created by generating a consensus sequence, I didn't expect this at all. This leads me to think that something has gone wrong in one of these steps. I triple-checked the exon coordinates, and they were correct. I ran BLAST, and the sequences had a high level of identity, but not to the degree that would be expected.

I'm not sure what I did wrong, and I would be grateful for some pointers!

bcftools ENSEMBL gene samtools fasta • 541 views

ADD COMMENT • link updated 12 weeks ago by Michael 54k • written 3 months ago by hemr3 ▴ 10

score 0 · Answer 1 · 2024-02-04

0

Entering edit mode

12 weeks ago

Michael 54k

However, when using the ENSEMBL translate tool online, the translations produced had stop codons within the amino acid sequence and/or did not have a methionine present in the beginning of the sequence; eg.

This is to be expected. The exons are not equal to the coding sequence. You need to take into account the UTR's and extract the CDS sequence for any coding change.

I have created a consensus sequence from a human reference and a Neanderthal genome. ... I did this using BCFTOOLS.

Unlike the name, bcftools consensus (they should change the name of that function) doesn't compute a consensus sequence. It is an Alt-genome really, with all variant positions replaced by the ALT allele. I am not sure if I would take that approach. Possibly, using VEP or SnpEff might be better. You would need to convert the Neanderthal coordinates into hg19 coordinates though.

ADD COMMENT • link 12 weeks ago by Michael 54k

0

Entering edit mode

Thank you so much! This makes a lot more sense. But if I'm going on the CCDS website (which says it only has the CDS regions) and am getting these results, how do I strictly get the protein coding regions only?

ADD REPLY • link 12 weeks ago by hemr3 ▴ 10

0

Entering edit mode

Have a look at this part of the web site. The database indeed contains only CDS. The whole Nucleotide sequence is the coding sequence, and when you translate it, you get the exact AA sequence starting with MANA... below. The black and blue nucleotide sequence parts both indicate exons, the color is just for marking alternating exons. The same is not the case if you extract the exon coordinates from another annotation e.g. in GFF format without removing UTR's or extract annotated CDS.

Does this clarify things?

enter image description here

ADD REPLY • link 12 weeks ago by Michael 54k

0

Entering edit mode

Not entirely, sorry - and apologies if I'm not really understanding your explanation! If you look on the same page you have coordinates above with the title 'Chromosomal Locations for CCDS [etc]' (see below image). Would these be the CDS sequence? The exon coordinates are often longer than the nucleotide sequence in the image you showed. I used these coordinates from the table for each target gene and got the issues with the results above.

enter image description here

AKA the 45278966-45279184 exon is 218 places/coordinates long, but the corresponding nucleotide sequence in the CDS sequence data is only 54 nucleotides long. How does this compare?

enter image description here

How would you get the exact coordinates for the CDS sequence so I could extract it from a FASTA file?

Thank you so much for all your help!

ADD REPLY • link 12 weeks ago by hemr3 ▴ 10

1

Entering edit mode

You can download all the CDS sequences from the genome data directory directly. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/ for the human RefSeq genome.

ADD REPLY • link 12 weeks ago by Michael 54k