I am working on extracting the CDS sequence for the FFAR4 gene from the GRCh38 reference genome using samtools faidx. Specifically, I used the following command to extract the sequence data:
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa 10:93566721-93587609
However, this range covers the entire gene, including introns, based on the genomic positions available in the NCBI CCDS database: CCDS:622181
My goal is to extract only the coding sequence (CDS) corresponding to the FFAR4 protein (NP_859529.2). Unfortunately, the CCDS database only provides the full gene coordinates and not the precise CDS start and end positions directly.
I have also checked the NCBI Genome Data Viewer, where the CDS is clearly marked but the positions aren't provided.
Questions:
- How can I accurately identify the specific CDS start and end coordinates for FFAR4 from available genomic data?
- Is there a recommended approach or tool to extract just the CDS region using the existing CCDS or NCBI resources?
- Has anyone encountered a similar challenge, and how did you resolve it?
I appreciate any insights or suggestions for streamlining this process.
Thank you!
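Samtools simply isn't designed for this sort of thing. A FASTA file is primarily a way of supplying a reference sequence, not a basis for general-purpose genome analysis: it carries no annotation, so samtools faidx can only return whatever coordinate range you hand it and has no notion of where a CDS starts or ends.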
You're better off querying the GenBank or EMBL databases and pulling the gene data directly from there, or using one of the genome browsers such as Ensembl or UCSC. (I see others have already covered these options far better than I could.)
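If you do go the annotation route, the general pattern is: take the CDS rows for your transcript from an annotation file (for example a GFF3 download), turn each row into a region string, and join the extracted pieces in genomic order, reverse-complementing if the gene is on the minus strand. Here is a minimal Python sketch of that idea, not a definitive implementation. Everything in it is a toy assumption: the contig "toy1", the transcript ID "tx1", the coordinates and the sequence are made up and are not FFAR4's, and the function names are my own. Real GFF3 files use their own Parent IDs, and the contig names have to match the ones in your FASTA index.

```python
from typing import List, Tuple

# (seqid, start, end, strand); coordinates are 1-based and inclusive, as in GFF3
Interval = Tuple[str, int, int, str]


def cds_intervals_from_gff(gff_text: str, transcript_id: str) -> List[Interval]:
    """Collect the CDS rows whose Parent attribute equals transcript_id."""
    intervals = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by semicolons
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if attrs.get("Parent") != transcript_id:
            continue
        intervals.append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return sorted(intervals, key=lambda iv: iv[1])


def faidx_regions(intervals: List[Interval]) -> List[str]:
    """Format intervals as 'seqid:start-end' region strings."""
    return [f"{seqid}:{start}-{end}" for seqid, start, end, _ in intervals]


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def splice_cds(contig_seq: str, intervals: List[Interval]) -> str:
    """Join the CDS pieces in genomic order; reverse-complement for '-' strand."""
    cds = "".join(contig_seq[start - 1:end] for _, start, end, _ in intervals)
    if intervals and intervals[0][3] == "-":
        cds = cds.translate(_COMPLEMENT)[::-1]
    return cds


# Toy data (hypothetical): two CDS segments separated by a 6 bp "intron"
TOY_CONTIG = "TT" + "ATGGCC" + "GTAAGT" + "AAATGA" + "CC"
TOY_GFF = "\n".join([
    "##gff-version 3",
    "\t".join(["toy1", "example", "exon", "1", "22", ".", "+", ".", "ID=e1;Parent=tx1"]),
    "\t".join(["toy1", "example", "CDS", "15", "20", ".", "+", "0", "ID=c1;Parent=tx1"]),
    "\t".join(["toy1", "example", "CDS", "3", "8", ".", "+", "0", "ID=c1;Parent=tx1"]),
])

toy_intervals = cds_intervals_from_gff(TOY_GFF, "tx1")
print(faidx_regions(toy_intervals))           # ['toy1:3-8', 'toy1:15-20']
print(splice_cds(TOY_CONTIG, toy_intervals))  # ATGGCCAAATGA
```

The region strings can all be given to samtools faidx in one call after the file name, but it writes one FASTA record per region, so the pieces still have to be joined (and reverse-complemented where needed) afterwards.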