Dear Biostar Community
I'm currently trying to generate a protein FASTA containing all known variants from HeLa (from Cosmic CellLinesProject) for variant detection in proteomics measurements.
For this, I've downloaded the variants file (VCF) and the human genome FASTA and GFF or GTF from NCBI. My plan was to extract the CDS of all genes from the genome using the GTF and then apply the VCF to generate the corresponding variant sequences. The resulting nucleotide FASTA, with entries for each wildtype gene and its variants, would then be translated to a protein FASTA, which could ultimately be used for a proteomics experiment.
Unfortunately, I am struggling to generate a nucleotide FASTA containing both the variant and the wildtype versions of all CDS. So far I've tried GATK FastaAlternateReferenceMaker and BSgenome injectSNPs, but neither solution worked as expected, since both tools rely only on the VCF.
Can someone point me to the correct workflow to successfully create such a protein FASTA?
Thank you very much! Best, chscho
I'm not sure I understand. Explain what your output should be: a FASTA? A list of mRNAs? etc.
Hi Pierre
The VCF file from Cosmic refers to the full human genome (GRCh38), which has one FASTA entry per chromosome. The extraction of the CDS (from the GFF) and the introduction of the variants (from the VCF) therefore somehow have to happen at the same time, so that the VCF positions are mapped correctly onto the extracted CDS. The result should be a nucleotide FASTA with one entry for each CDS, which can then be translated to a protein FASTA containing separate entries for each protein and its variants. I hope this clarifies the intended workflow.
Best, chscho
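To make the intended workflow concrete, here is a minimal Python sketch of the core step: applying variants in genomic coordinates to the CDS exons of one transcript before splicing and translating. It is only an illustration under several assumptions: the chromosome sequence, the CDS intervals (1-based, inclusive, as in a GTF) and the variants (position, REF, ALT, as in a VCF) are already loaded into memory; only single-base substitutions are handled (indels are skipped); the CDS is complete and starts in frame; and the standard genetic code applies. The function names and the toy gene `GENE1` are hypothetical and not part of any existing tool.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(cds):
    """Translate a CDS up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for codons with N etc.
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def spliced_cds(chrom_seq, exons, strand, snvs=()):
    """Splice CDS exons from a chromosome, optionally applying SNVs.

    exons: list of (start, end), 1-based inclusive genomic coordinates.
    snvs:  iterable of (pos, ref, alt) on the forward strand, 1-based.
    """
    pieces = []
    for start, end in sorted(exons):
        piece = list(chrom_seq[start - 1:end])
        for pos, ref, alt in snvs:
            if len(ref) != 1 or len(alt) != 1:
                continue  # assumption: indels are ignored in this sketch
            if start <= pos <= end:
                if piece[pos - start] != ref:
                    raise ValueError(f"REF mismatch at position {pos}")
                piece[pos - start] = alt
        pieces.append("".join(piece))
    cds = "".join(pieces)
    # Variants are applied on the forward strand, then the CDS is flipped.
    return revcomp(cds) if strand == "-" else cds


# Toy example: two CDS exons separated by a 4 bp intron.
chrom = "GG" + "ATGGCC" + "TTTT" + "AAATAG" + "CC"
exons = [(3, 8), (13, 18)]
variants = [(6, "G", "A")]  # hypothetical SNV in the second codon

wt_protein = translate(spliced_cds(chrom, exons, "+"))
var_protein = translate(spliced_cds(chrom, exons, "+", variants))

print(f">GENE1_wt\n{wt_protein}")
print(f">GENE1_var\n{var_protein}")
```

In this sketch each transcript would yield one wildtype entry and one entry with all of its variants applied together. Whether variants should instead be emitted one per entry, and how to treat indels or frameshifts, are design choices the sketch leaves open.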