Question

Extract every CDS sequences from a VCF file

1

Entering edit mode

4.6 years ago

maxime.policarpo ▴ 200

Hi everyone,

I have an alignment of genome sequencing reads mapped to a reference genome. I created a VCF file from this alignment and filtered it depending on the PHRED score (>20). Now, I want to extract every CDS sequences that are annotated on the reference genome but with the variants present on my individual mapped to this genome.

I have a gff3 file with annotations, the fasta file of the reference genome and a VCF file of my individual variants.

I have seen similar questions on which people were using bedtools getfasta to extract sequences but it only returns sequences exons by exons and it does not concatenate them in a full CDS sequence (This tool seems nice to extract transcripts sequences but not CDS).

Does anyone have an idea how to do it ? Should i first create a whole genome consensus sequence from the alignment and then use a tool that extract CDS sequences using this consensus sequence as reference ? (And which tool can do it properly ?)

Thanks a lot,

Maxime Policarpo

alignment VCF CDS • 4.7k views

ADD COMMENT • link updated 4.6 years ago by Pierre Lindenbaum 161k • written 4.6 years ago by maxime.policarpo ▴ 200

0

Entering edit mode

just curious, why would you need to do this ?

ADD REPLY • link 4.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I would like to make multiple alignments of CDS with the reference sequence and the alternate sequence (For downstream analysis such as dn/ds, analysis of non-syn mutations...)

ADD REPLY • link 4.6 years ago by maxime.policarpo ▴ 200

0

Entering edit mode

I'm not sure it is what you want but I have a perl script to extract CDS properly (concatenate them). Is is called gff3_sp_extract_sequences.pl and you can find in the GAAS repository.

gff3_sp_extract_sequences.pl --fasta genome.fa --gff annotation.gff -o cds.fa

It extracts CDS per default.

ADD REPLY • link 4.6 years ago by Juke34 8.5k

0

Entering edit mode

Well this is not very convenient because any indel of the individual mapped to the genome will cause the gff3 to not be phased anymore

ADD REPLY • link 4.6 years ago by maxime.policarpo ▴ 200

score 4 · Answer 1 · 2019-09-19

I found a way to extract the sequences i wanted. For those who wonder how I did :

I first extracted the exon strucrure from the gff3 file and placed it on a file called "Gene.structure"

I then used bedtools getfasta to extract the corresponding sequence on the Reference genome

I then used bcftools mpileup and bcftools consensus to extract the corresponding sequence on the individual mapped.

It worked perfectly fine for every genes that I needed !

score 2 · Answer 2 · 2019-09-19

2

Entering edit mode

4.6 years ago

Pierre Lindenbaum 161k

i'm late for this one; I wrote something: http://lindenb.github.io/jvarkit/Biostar398854.html

it uses a vcf+gtf+ref fasta to generate all the transcripts.

not fully tested, I'm not even sure it generates what you needed :-D

 java -jar ${JVARKIT_DIST}/biostar398854.jar \
    --gtf input.gtf.gz \
    -R ref.fasta input.vcf > out.fasta

ADD COMMENT • link 4.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre !

I will try that if I ever need to make this kind of analyses later and I will tell you if it works ! Thanks a lot,

Max

ADD REPLY • link 4.6 years ago by maxime.policarpo ▴ 200

0

Entering edit mode

Thanks for sharing this tools. But i failed in generate the sequence, everything looks well include compile and installation, but I got an empty output.

ADD REPLY • link 4.4 years ago by sunhaoyun2 • 0

0

Entering edit mode

sorry for that, feel free to submit an issue if you have time: https://github.com/lindenb/jvarkit/issues/

ADD REPLY • link 4.4 years ago by Pierre Lindenbaum 161k