Extracting CDS Region of FFAR4 Gene Using Samtools from GRCh38 Reference Genome
1
0
Entering edit mode
3 months ago
Rohan ▴ 40

I am working on extracting the CDS sequence for the FFAR4 gene from the GRCh38 reference genome using samtools faidx. Specifically, I used the following command to extract the sequence data:

samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa 10:93566721-93587609

However, this range covers the entire gene, including introns, based on the genomic positions available in the NCBI CCDS database: CCDS:622181

My goal is to extract only the coding sequence (CDS) corresponding to the FFAR4 protein (NP_859529.2). Unfortunately, the CCDS database only provides the full gene coordinates and not the precise CDS start and end positions directly.

I have also checked the NCBI Genome Data Viewer: Genome Data Viewer, where the CDS is clearly marked but the positions aren't provided.

Questions:

  1. How can I accurately identify the specific CDS start and end coordinates for FFAR4 from available genomic data?
  2. Is there a recommended approach or tool to extract just the CDS region using the existing CCDS or NCBI resources?
  3. Has anyone encountered a similar challenge, and how did you resolve it?

I appreciate any insights or suggestions for streamlining this process.

Thank you!

samtools ccds grch38 genome cds • 822 views
ADD COMMENT
0
Entering edit mode

Samtools simply isn't designed for this sort of thing. The use of fasta is primarily for purposes of supplying a reference sequence, not for general purpose genome analysis.

You're better off doing a query against GenBank or EMBL databases and directly pulling out the gene data from that, or using the various genome browsers like Ensembl or UCSC. (I see others have already covered these options far better than I could.)

ADD REPLY
1
Entering edit mode
3 months ago
GenoMax 150k

You can get the coordinate ranges using NCBI EnterzDirect (LINK) and then use samtools range.

Basic query looks like this. There are multiple transcripts.

$ esearch -db gene -query "FFAR4 AND human [orgn]" | efetch -format gene_table

Since you are interested in NP_859529.2 here is the information for that particular variant

Reference GRCh38.p14 Primary Assembly NC_000010.11  from: 93566665 to: 93590072
mRNA transcript variant 1 NM_181745.4, 4 exons,  total annotated spliced exon length: 3653
protein isoform GPR120-L NP_859529.2 (CCDS31248.1), 4 coding  exons,  annotated AA length: 377

Exon table for  mRNA  NM_181745.4 and protein NP_859529.2
Genomic Interval Exon           Genomic Interval Coding         Gene Interval Exon              Gene Interval Coding       Exon Length      Coding Length   Intron Length
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
93566665-93567287               93566721-93567287               1-623           57-623          623             567        8803
93576091-93576219               93576091-93576219               9427-9555               9427-9555               129        129              2939
93579159-93579206               93579159-93579206               12495-12542             12495-12542             48         48               8013
93587220-93590072               93587220-93587609               20556-23408             20556-20945             2853       390

If you just need CDS sequence then you could use NCBI datasets utility.

FFAR4 gene has a gene ID of https://www.ncbi.nlm.nih.gov/gene/338557

$ datasets download gene gene-id 338557 --include cds

Unzip the archive that is produced. You will see a cds.fna file in the folder.

ADD COMMENT
0
Entering edit mode

Hi! Thank you for your response!

I successfully extracted the gene information for my gene of interest NP_001182684.1:

mRNA transcript variant 2 NM_001195755.2, 3 exons,  total annotated spliced exon length: 3605
protein isoform GPR120-S NP_001182684.1 (CCDS55720.1), 3 coding  exons,  annotated AA length: 361

Exon table for  mRNA  NM_001195755.2 and protein NP_001182684.1
Genomic Interval Exon           Genomic Interval Coding         Gene Interval Exon              Gene Interval Coding            Exon Length     Coding Length   Intron Length
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
93566665-93567287               93566721-93567287               1-623           57-623          623             567             8803
93576091-93576219               93576091-93576219               9427-9555               9427-9555               129             129             11000
93587220-93590072               93587220-93587609               20556-23408             20556-20945             2853            390

But I have a few concerns. In the CCDS database CCDS55720.1, the CCDS sequence data section shows the exact sequence I want.

Nucleotide Sequence (1086 nt):
ATGTCCCCTGAATGCGCGCGGGCAGCGGGCGACGCGCCCTTGCGCAGCCTGGAGCAAGCCAACCGCACCC
GCTTTCCCTTCTTCTCCGACGTCAAGGGCGACCACCGGCTGGTGCTGGCCGCGGTGGAGACAACCGTGCT
GGTGCTCATCTTTGCAGTGTCGCTGCTGGGCAACGTGTGCGCCCTGGTGCTGGTGGCGCGCCGACGACGC
CGCGGCGCGACTGCCTGCCTGGTACTCAACCTCTTCTGCGCGGACCTGCTCTTCATCAGCGCTATCCCTC
TGGTGCTGGCCGTGCGCTGGACTGAGGCCTGGCTGCTGGGCCCCGTTGCCTGCCACCTGCTCTTCTACGT
GATGACCCTGAGCGGCAGCGTCACCATCCTCACGCTGGCCGCGGTCAGCCTGGAGCGCATGGTGTGCATC
GTGCACCTGCAGCGCGGCGTGCGGGGTCCTGGGCGGCGGGCGCGGGCAGTGCTGCTGGCGCTCATCTGGG
GCTATTCGGCGGTCGCCGCTCTGCCTCTCTGCGTCTTCTTCCGAGTCGTCCCGCAACGGCTCCCCGGCGC
CGACCAGGAAATTTCGATTTGCACACTGATTTGGCCCACCATTCCTGGAGAGATCTCGTGGGATGTCTCT
TTTGTTACTTTGAACTTCTTGGTGCCAGGACTGGTCATTGTGATCAGTTACTCCAAAATTTTACAGATCA
CAAAGGCATCAAGGAAGAGGCTCACGGTAAGCCTGGCCTACTCGGAGAGCCACCAGATCCGCGTGTCCCA
GCAGGACTTCCGGCTCTTCCGCACCCTCTTCCTCCTCATGGTCTCCTTCTTCATCATGTGGAGCCCCATC
ATCATCACCATCCTCCTCATCCTGATCCAGAACTTCAAGCAAGACCTGGTCATCTGGCCGTCCCTCTTCT
TCTGGGTGGTGGCCTTCACATTTGCTAATTCAGCCCTAAACCCCATCCTCTACAACATGACACTGTGCAG
GAATGAGTGGAAGAAAATTTTTTGCTGCTTCTGGTTCCCAGAAAAGGGAGCCATTTTAACAGACACATCT
GTCAAAAGAAATGACTTGTCGATTATTTCTGGCTAA
Translation (361 aa):
MSPECARAAGDAPLRSLEQANRTRFPFFSDVKGDHRLVLAAVETTVLVLIFAVSLLGNVCALVLVARRRR
RGATACLVLNLFCADLLFISAIPLVLAVRWTEAWLLGPVACHLLFYVMTLSGSVTILTLAAVSLERMVCI
VHLQRGVRGPGRRARAVLLALIWGYSAVAALPLCVFFRVVPQRLPGADQEISICTLIWPTIPGEISWDVS
FVTLNFLVPGLVIVISYSKILQITKASRKRLTVSLAYSESHQIRVSQQDFRLFRTLFLLMVSFFIMWSPI
IITILLILIQNFKQDLVIWPSLFFWVVAFTFANSALNPILYNMTLCRNEWKKIFCCFWFPEKGAILTDTS
VKRNDLSIISG

However, using any of the provided coordinates results in a truncated nucleotide FASTA sequence.

When I extracted the sequence from chr10:93566721-93587609 as given:

Chromosome  Start   Stop
10  93566721    93567287    
10  93576091    93576219    
10  93587220    93587609

I obtained the correct sequence, but there is a long intervening sequence in the middle. How can I extract just the coding sequence (CDS) as shown in the CCDS sequence data section directly?

ADD REPLY
0
Entering edit mode

Use the datasets solution I had posted above. Since you only need the CDS sequence, you will find it in the cds.fna file in data folder. Sequence retrieved there seems to match the CCDS sequence.

but there is a long intervening sequence in the middle

Perhaps that is the alternating exon noted in blue in CCDS record. You could create another interval to exclude it when you do samtools with region specification.

ADD REPLY
0
Entering edit mode

The datasets solution isn't quite suitable for my needs, as I'm focused on extracting the CDS sequence directly from a vcf file using bcftools. Thank you for your help! Please let me know if you come across a solution to this issue.

ADD REPLY

Login before adding your answer.

Traffic: 1219 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6