Question: Pipeline design: from reads to consensus sequences of multiple individuals
0
gravatar for agathejouet
10 weeks ago by
agathejouet0 wrote:

Hi all,

I am having an issue with a pipeline design. I have a reference genome in fasta format as well as a gff file containing the position of annotated genes in that genome. I also have reads from several individuals of the reference genome species. My aim is to get the consensus sequence of all genes in the reference genome, from those individuals. In term, I want to run paml to identify genes that may be under selection.

The first thing I tried is to: - align reads from all individuals to the reference genome with samtools bwa mem - call variants with samtools mpileup - get the consensus sequence of all reference scaffolds with vccftools vcf-consensus - converting my gff to bed using bedops gff2bed - extracting the genes from the consensus scaffolds using bedtools getfasta

The problem with this method is that the consensus scaffolds I get from my individuals are of various lengths due to indels. When I extract the genes, I am using a bed with the positions of the genes from the reference genome, which may be different in my individuals. The software even complains that some of the genes end outside of scaffolds...

I then thought that the solution would be to extract the genes from bams before I call variants and consensus but I am having issues with (1) splitting scaffolds into multiple sequences/genes in the bam and (2) making sure that the gene names are the same in my bed, fasta, vcf and/or bam to avoid incompatibilities...

What would be great would be to have something like "samtools view -bL inbam > outbam". However, this command does not appear to do what I want meaning, going from bam with scaffolds: scaffold1 info_about_mapping_on_scaffold_1 scaffold2 info_about_mapping_on_scaffold_2...

to a bam with genes

gene1 info_about_mapping_on_gene_1 gene2 info_about_mapping_on_gene_2...

It is probably not the best way to do this and my bam may get quite big of I do this... I am sure this is often done but I am no clue how. The simple thing would be to align reads to the reference genes instead of the reference scaffolds but I am afraid this will decrease mapping accuracy...

Any help appreciated!

Thanks,

Agathe

ADD COMMENTlink modified 9 weeks ago • written 10 weeks ago by agathejouet0

It sounds like a very interesting experiment and there are probably many ways to get it done. You've done whole genome sequencing on your samples, right, or whole exome? If whole genome, one possibility (although computationally expensive) is to do a de novo genome assembly using all of your samples combined using Velvet (https://www.ebi.ac.uk/~zerbino/velvet/). Velvet will produce a consensus reference FASTA based on the sample data that you provide. With this de novo assembled reference, you can then identify genes within that genome using whatever existing information is available. There's a review here on this process: http://www.nature.com/nrg/journal/v13/n5/full/nrg3174.html

Another, somewhat cheeky, way would be to merge all of your reads together into a single file (if single-end) or pair of files (if paired end) and to then do the alignment, variant calling, and produce a single new consensus sequence in this way. You may still run into issues later on though.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Kevin Blighe6.5k

Thanks for your answer Kevin. It is indeed whole-genome sequencing although sequencing was done a while ago, with short reads (86 bp PE). That is why I would prefer a pipeline based on mapping instead of assembly.

I am not sure what you mean with the second option, merging all reads together. How would I retrieve the consensus sequences from my different individuals if all is mixed?

It is very strange that I can't seem to find a way to map reads to a reference genome and then extract the consensus of genes from multiple individuals. I am now trying to use bedtools intersect to go from a bam with scaffolds to a bam with genes. I will let you know how it goes.

ADD REPLYlink written 9 weeks ago by agathejouet0

Bedtools intersect removes reads aligned outside of my genes but keep the scaffolds as they are in my bam, which is not what I want... Will keep looking.

ADD REPLYlink written 9 weeks ago by agathejouet0

Hi Agathe, my idea was to create a new reference genome FASTA using Velvet. Velvet allows you to use, as input, multiple paired FASTQ files to construct the new genome. You would then identify genes/transcripts in this new genome and then perform re-alignment of your samples to it. The difficult part, I believe would be identifying the new transcript boundaries in the new genome.

I recently did a similar experiment, as I describe here, with a bacterial genome (1,000 times smaller than human!).

ADD REPLYlink written 9 weeks ago by Kevin Blighe6.5k
0
gravatar for agathejouet
9 weeks ago by
agathejouet0 wrote:

Hi all,

In the end, I decided to extract annotated genes from the reference fasta to produce a new fasta and align the reads to those. Not what I really wanted but good enough...

Agathe

ADD COMMENTlink written 9 weeks ago by agathejouet0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1421 users visited in the last hour