Question: Extract every CDS sequences from a VCF file
0
gravatar for maxime.policarpo
5 weeks ago by
France, Paris
maxime.policarpo90 wrote:

Hi everyone,

I have an alignment of genome sequencing reads mapped to a reference genome. I created a VCF file from this alignment and filtered it depending on the PHRED score (>20). Now, I want to extract every CDS sequences that are annotated on the reference genome but with the variants present on my individual mapped to this genome.

I have a gff3 file with annotations, the fasta file of the reference genome and a VCF file of my individual variants.

I have seen similar questions on which people were using bedtools getfasta to extract sequences but it only returns sequences exons by exons and it does not concatenate them in a full CDS sequence (This tool seems nice to extract transcripts sequences but not CDS).

Does anyone have an idea how to do it ? Should i first create a whole genome consensus sequence from the alignment and then use a tool that extract CDS sequences using this consensus sequence as reference ? (And which tool can do it properly ?)

Thanks a lot,

Maxime Policarpo

alignment cds vcf • 172 views
ADD COMMENTlink modified 4 weeks ago by Pierre Lindenbaum123k • written 5 weeks ago by maxime.policarpo90

just curious, why would you need to do this ?

ADD REPLYlink written 5 weeks ago by Pierre Lindenbaum123k

I would like to make multiple alignments of CDS with the reference sequence and the alternate sequence (For downstream analysis such as dn/ds, analysis of non-syn mutations...)

ADD REPLYlink written 5 weeks ago by maxime.policarpo90

I'm not sure it is what you want but I have a perl script to extract CDS properly (concatenate them). Is is called gff3_sp_extract_sequences.pl and you can find in the GAAS repository.

gff3_sp_extract_sequences.pl --fasta genome.fa --gff annotation.gff -o cds.fa

It extracts CDS per default.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Juke-342.8k

Well this is not very convenient because any indel of the individual mapped to the genome will cause the gff3 to not be phased anymore

ADD REPLYlink written 4 weeks ago by maxime.policarpo90
1
gravatar for maxime.policarpo
4 weeks ago by
France, Paris
maxime.policarpo90 wrote:

I found a way to extract the sequences i wanted. For those who wonder how I did :

I first extracted the exon strucrure from the gff3 file and placed it on a file called "Gene.structure"

I then used bedtools getfasta to extract the corresponding sequence on the Reference genome

I then used bcftools mpileup and bcftools consensus to extract the corresponding sequence on the individual mapped.

It worked perfectly fine for every genes that I needed !

ADD COMMENTlink written 4 weeks ago by maxime.policarpo90
1
gravatar for Pierre Lindenbaum
4 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

i'm late for this one; I wrote something: http://lindenb.github.io/jvarkit/Biostar398854.html

it uses a vcf+gtf+ref fasta to generate all the transcripts.

not fully tested, I'm not even sure it generates what you needed :-D

 java -jar ${JVARKIT_DIST}/biostar398854.jar \
    --gtf input.gtf.gz \
    -R ref.fasta input.vcf > out.fasta
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Pierre Lindenbaum123k

Hi Pierre !

I will try that if I ever need to make this kind of analyses later and I will tell you if it works ! Thanks a lot,

Max

ADD REPLYlink written 4 weeks ago by maxime.policarpo90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2259 users visited in the last hour