GATK: vcf to fasta
9.9 years ago
natasha.sernova ★ 4.0k

Dear colleagues, I have a few questions:

Looking at the GATK script at the end of this post, I want to pick the options I need, but I am afraid I don't have enough data.

As ref.fasta, I have just a fasta file for the whole chromosome.

But in the end I would like to have the corresponding fasta file for each known gene,

given the set of known vcf-files for this particular chromosome. Is that possible, and how should I do it?

What does this line mean: -T FastaAlternateReferenceMaker?

I don't know any input intervals for my genes on this chromosome. I hope it is possible to extract that information from the gene vcf-file. Is that correct? If not, where can I find the intervals?

And how do I run the whole process as a loop, selecting the vcf-files in the correct order? By sorting them, or by using cat beforehand?

Thank you very much! Natasha

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T FastaAlternateReferenceMaker \
   -o output.fasta \
   -L input.intervals \
   --variant input.vcf \
   [--snpmask mask.vcf]
9.9 years ago

FastaAlternateReferenceMaker is the tool that creates a reference genome incorporating the changes specified by the given --variant input.vcf. It works on the whole chromosome or whole genome and does not produce individual files per gene.

You should run FastaAlternateReferenceMaker first, then take the resulting fasta file and cut out the parts you are interested in. You will need another tool for that second step.

RSEM is a tool made to quantify RNA-Seq gene expression. It needs the kind of gene fasta you're looking for, so it comes with a pre-processing tool that takes a reference genome and a gene annotation file and creates the gene fasta. Use this tool on the alternate reference and you'll achieve your goal.

You will need a GTF annotation to tell the tool where the genes lie within the reference chromosomes.
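To make the role of the GTF concrete, here is a minimal Python sketch of how gene coordinates could be read out of it and written as chr:start-end strings, the form GATK accepts for -L. The function names and the two example genes are made up for illustration, and it assumes a standard 9-column, tab-separated GTF with exon records and a gene_id attribute; treat it as a starting point, not a tested pipeline.

```python
import re

def gene_intervals(gtf_lines):
    """Return {gene_id: (chrom, start, end)} spanning all exon records of each gene.

    GTF coordinates are 1-based and inclusive, so they are kept unchanged.
    """
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are used here
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue
        gene = match.group(1)
        if gene in spans:
            # widen the span to cover this exon as well
            _, old_start, old_end = spans[gene]
            start, end = min(start, old_start), max(end, old_end)
        spans[gene] = (chrom, start, end)
    return spans

def to_interval_strings(spans):
    """Format spans as chrom:start-end strings."""
    return [f"{chrom}:{start}-{end}" for chrom, start, end in spans.values()]

# Hypothetical annotation lines, for illustration only.
example_gtf = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\tdemo\texon\t900\t950\t.\t-\t.\tgene_id "geneB"; transcript_id "geneB.1";',
]

example_spans = gene_intervals(example_gtf)
example_intervals = to_interval_strings(example_spans)
```

Each gene span here runs from its first exon start to its last exon end, so introns are included; whether that is what you want depends on the kind of gene fasta you are after.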


Dear Karl,

Thank you for your recommendations!

But unfortunately I have not quite understood how to find a list of known exons. I need their coordinates on the chromosomes to run the GATK part.

The genome is an annotated one, so the exons are known. But then how do I make the alignment in fasta format that I am looking for?

I didn't quite understand how the program you recommend would help me with these tasks.

I have the chromosomal sequences, the sets of vcf-files for each chromosome, and the reference genome.

Is that enough, or do I need something else? You wrote about a GTF annotation. Where can I find one, or can I make it myself?

What will I need for it, and are there instructions somewhere?

Thank you very much!

N.


Well, there are a few different problems here. To make a chromosome sequence carrying the variants from the VCF, you need the reference, the VCF, and the GATK tool. Genes have nothing to do with this step. You will end up with a collection of mutated chromosomes, possibly with ambiguity codes at heterozygous sites (a non-ACGT letter indicating that one site contains, for example, both A and T).
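For reference, the ambiguity letters follow the standard IUPAC nucleotide code. A small Python lookup for the two-base case is sketched below; the function name is made up, and this only shows what the letters mean, not how or whether GATK writes them in your version.

```python
# Standard IUPAC codes for one base or a pair of different bases.
IUPAC = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}

def ambiguity_code(allele1, allele2):
    """Return the IUPAC letter for a diploid SNP genotype (hypothetical helper)."""
    return IUPAC[frozenset((allele1.upper(), allele2.upper()))]

het_at = ambiguity_code("A", "T")   # heterozygous A/T site
het_ga = ambiguity_code("G", "A")   # allele order does not matter
hom_cc = ambiguity_code("C", "C")   # homozygous sites keep the plain base
```

So a heterozygous A/T site would appear as W in a sequence that uses these codes.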

Because you asked for gene fastas, there is a separate problem that can be handled next. With the mutated chromosomes, you need to cut out each gene's exons and compile them into a coding-sequence fasta. This requires a GTF, a file that specifies where on each chromosome each exon lies and which gene it belongs to. This annotation is very important for creating the gene fasta: it determines how many sequences you end up with. How many genes do you want to have? Are non-protein-coding genes relevant? Are microRNAs? What about alternate splice forms or alternate transcripts of the same gene? You can start getting this information from the UCSC Table Browser, which has a RefSeq database that can produce the GTF.

With the GTF you can begin cutting the chromosome, which may be a lot of programming. So here I recommend the tool RSEM, which is intended for gene expression quantification but comes bundled with side tools that do what you need (cut the chromosome with a GTF into gene fasta).

This is probably way too much work, so you may want to rethink your goals.
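To give a sense of what the cutting involves if you did it by hand, here is a minimal Python sketch that joins the exons of each transcript and reverse-complements the minus-strand ones. It is not what RSEM does internally; the function names and the toy sequence are made up, and it assumes the chromosome is already loaded as a string and that the GTF has exon records with a transcript_id attribute.

```python
import re
from collections import defaultdict

# Complement table covering A/C/G/T plus IUPAC ambiguity codes, both cases.
_BASES = "ACGTRYSWKMBDHVN"
_COMPS = "TGCAYRSWMKVHDBN"
COMPLEMENT = str.maketrans(_BASES + _BASES.lower(), _COMPS + _COMPS.lower())

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def spliced_transcripts(chrom_seqs, gtf_lines):
    """Return {transcript_id: sequence} built by joining each transcript's exons."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )
    result = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda part: part[1])  # genomic order along the chromosome
        chrom, strand = parts[0][0], parts[0][3]
        # GTF is 1-based inclusive; Python slices are 0-based, end-exclusive.
        seq = "".join(chrom_seqs[chrom][start - 1:end] for _, start, end, _ in parts)
        result[transcript_id] = reverse_complement(seq) if strand == "-" else seq
    return result

# Toy data, for illustration only.
toy_chromosomes = {"chr1": "AACCGGTTAACC"}
toy_gtf = [
    'chr1\tdemo\texon\t1\t3\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";',
    'chr1\tdemo\texon\t7\t9\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";',
    'chr1\tdemo\texon\t4\t6\t.\t-\t.\tgene_id "g2"; transcript_id "tx2";',
]
toy_result = spliced_transcripts(toy_chromosomes, toy_gtf)
```

Note that this joins whole exons, UTRs included. For strictly coding sequence you would use the CDS records instead, and every alternate transcript of a gene becomes its own entry, which is exactly why the questions above about which genes and transcripts you want matter.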


Ok, perfect.

Let's discuss only the vcf-to-fasta conversion with GATK.

I need the interval data for each vcf-file.

But is there an unambiguous coordinate range on the chromosome

for each particular vcf-file? Otherwise, how can I select an interval for the GATK script?

If the interval for a vcf-file exists somewhere, that's OK, but if not, where and how can I find it? Or have I mixed everything up again?

Thank you very much!

Natasha
