Question

What kind of sequences do I have, haploid or diploid?

0

6.2 years ago

Enrique López ▴ 10

Hello everybody! Sorry for the post, but I have a silly question...

I am using FreeBayes and Manta to detect SVs in the genome of an individual woman. First of all, a coworker passed me the reads aligned with BWA. So my question is this: does the BAM file obtained with BWA contain any information about whether the sample is haploid or diploid?

Explanation: I have the diploid genome of a woman. I sequenced it and obtained many reads that contain sequences from both alleles (diploid), then I aligned the reads with BWA to the hg19 genome (which is haploid). So when I obtained the BAM file, I have only the haploid genome, correct? However, when I use FreeBayes to detect SVs, I obtain a VCF file that reports the SVs with GT = 0/0, GT = 0/1 or GT = 1/1. I found that 0/0 means the sample is homozygous for the reference allele; 0/1 means it is heterozygous, with only one allele equal to the reference genome; and 1/1 means it is homozygous for the alternate allele. So... how does FreeBayes know that information? Then, if I try to annotate the SVs, do I need to specify whether the sample is haploid or diploid? Because I thought it was haploid.
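As a small illustration of the GT values described above, here is a minimal Python sketch that turns a VCF genotype string into a plain-language label. The function name `describe_genotype` is hypothetical and not part of FreeBayes or any VCF library; it only assumes the usual VCF convention that allele index 0 is the reference and that alleles are separated by `/` (unphased) or `|` (phased).

```python
import re


def describe_genotype(gt: str) -> str:
    """Classify a VCF GT string such as '0/1' or '1|1' (hypothetical helper)."""
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    indices = [int(a) for a in alleles]
    if len(indices) == 1:
        # A single allele index means the caller reported a haploid genotype.
        return "haploid reference" if indices[0] == 0 else "haploid alternate"
    if len(set(indices)) > 1:
        return "heterozygous"
    return "homozygous reference" if indices[0] == 0 else "homozygous alternate"


for gt in ("0/0", "0/1", "1/1"):
    print(gt, "->", describe_genotype(gt))
```

The number of allele indices in the GT field reflects the ploidy the caller assumed for that call, not anything stored in the BAM file.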

Thank you very much!

alignment genome sequence sequencing • 3.6k views

ADD COMMENT • link updated 6.2 years ago by theobroma22 ★ 1.2k • written 6.2 years ago by Enrique López ▴ 10

2

Just a point on terminology:

SNV: single-nucleotide variant

SV: structural variant

These are very different variant types.

ADD REPLY • link 6.2 years ago by d-cameron ★ 2.9k

0

Good point; I assumed structural variant but you're likely correct the question was really about SNPs (or maybe small indels).

ADD REPLY • link 6.2 years ago by Chris Fields ★ 2.2k

0

You are correct: I used FreeBayes to detect small indels. I also used Manta to detect SVs, so I confused the concepts.

ADD REPLY • link 6.2 years ago by Enrique López ▴ 10

0

Hi Enrique,

If you are working with sequence files in FASTA format, you may be interested in our SEDA application (http://www.sing-group.org/seda/) for easily processing FASTA files (filtering, merging, modifying headers, and so on).

Regards,

Hugo.

ADD REPLY • link 6.2 years ago by Hugo ▴ 380

0

Thank you Hugo, now I am using SeqTrimNext, but I will keep it in mind in the future!

ADD REPLY • link 6.2 years ago by Enrique López ▴ 10

0

6.2 years ago

theobroma22 ★ 1.2k

You will also know the ploidy by knowing which tissue you are working with. For example, human sperm and egg cells are certainly haploid, but liver or skin cells are more likely to be diploid.

ADD COMMENT • link 6.2 years ago by theobroma22 ★ 1.2k

0

Yes, it is human blood, so it is diploid. Thank you!

ADD REPLY • link 6.2 years ago by Enrique López ▴ 10

score 4 · Accepted Answer · 2018-01-27

4

6.2 years ago

Chris Fields ★ 2.2k

You will have reads from a diploid sample aligned to a haploid reference, so the information needed to determine whether variants exist is present in the alignment. Variant callers extract it primarily by comparing read information (position of alignment, quality scores, strand information, CIGAR string, etc.) and biological information (ploidy, known variants, etc.) to the reference. Tools like GATK, FreeBayes, samtools + bcftools, etc. primarily differ in how they determine this.

EDIT - most of these tools will report genotype information, allele frequency, and so forth based on the evidence in the BAM. Roughly speaking, if you look only at the reads aligning to a region and approximately half of them show evidence of an SV, this might be reported as GT = 0/1 (het). If they all show such evidence, it would be GT = 1/1 (homozygous alt). It's quite a bit more complicated than that, depending on the tool and how it determines whether the evidence is actually an SV or a false positive due to artifacts from alignment, sequencing, the reference used, etc.
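To make the "roughly half the reads" intuition concrete, here is a minimal Python sketch of a toy diploid genotype caller. It is not the model used by FreeBayes, GATK, or bcftools; it simply assumes reads are drawn independently, that a heterozygous site yields about 50% alternate reads, and that a hypothetical per-read error rate (`error_rate`, default 0.01) explains stray reads at homozygous sites. The function name `call_diploid_genotype` is made up for this example.

```python
from math import comb


def call_diploid_genotype(ref_reads: int, alt_reads: int, error_rate: float = 0.01) -> str:
    """Pick the diploid genotype with the highest binomial likelihood (toy model)."""
    n = ref_reads + alt_reads
    if n == 0:
        return "./."  # no coverage, so no call can be made

    # Expected fraction of alt-supporting reads under each genotype (assumed values).
    expected_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}

    # Binomial likelihood of seeing alt_reads alternate reads out of n.
    likelihoods = {
        gt: comb(n, alt_reads) * p**alt_reads * (1 - p) ** ref_reads
        for gt, p in expected_alt.items()
    }
    return max(likelihoods, key=likelihoods.get)


print(call_diploid_genotype(30, 0))   # all reads match the reference
print(call_diploid_genotype(14, 16))  # about half the reads carry the variant
print(call_diploid_genotype(1, 29))   # nearly all reads carry the variant
```

Real callers add base and mapping qualities, priors, and artifact filters on top of this kind of comparison, which is where most of the differences between tools come from.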

ADD COMMENT • link 6.2 years ago by Chris Fields ★ 2.2k

0

So... the information that says I have a diploid sample is inside the BAM file that I obtained, no? And for this reason, FreeBayes can determine the genotype, correct? Thanks.

ADD REPLY • link 6.2 years ago by Enrique López ▴ 10

0

I just added an edit that might help.

ADD REPLY • link 6.2 years ago by Chris Fields ★ 2.2k

0

Thank you very much! It is all that I needed to know!

ADD REPLY • link 6.2 years ago by Enrique López ▴ 10

0

Just to add to this...

During alignment, the alignment software is not aware of ploidy. Each read is essentially mapped independently of the other reads. As such, a BAM file does not contain any explicit information about ploidy.

Most variant callers make an assumption about the ploidy of the samples they are working on. Humans are diploid (with the exception of chrX, chrY and chrMT), so --ploidy=2, which is the FreeBayes default (see also the FreeBayes GitHub page). You can change this if that makes sense for the organism you are working on.
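As a sketch of how the ploidy assumption can be made explicit rather than left to the default, the Python snippet below assembles a FreeBayes command line with `--ploidy` set. The helper name `freebayes_command` is hypothetical, and the file names `ref.fa` and `aln.bam` are just the placeholders used elsewhere in this thread. The snippet only builds and prints the command; it does not run FreeBayes.

```python
import shlex


def freebayes_command(reference: str, bam: str, ploidy: int = 2) -> list[str]:
    """Build a FreeBayes argument list with an explicit ploidy (hypothetical helper)."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    # -f gives the reference FASTA; --ploidy sets the assumed sample ploidy.
    return ["freebayes", "-f", reference, "--ploidy", str(ploidy), bam]


cmd = freebayes_command("ref.fa", "aln.bam", ploidy=2)
# Show the equivalent shell command, with the VCF redirected to a file.
print(shlex.join(cmd), "> var.vcf")
```

The argument list could be passed to `subprocess.run` with standard output redirected to a file, if FreeBayes is installed and on the PATH.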

Obviously, a variant caller is most accurate if you specify the ploidy upfront. Theoretically, it's possible to estimate the ploidy from the data: if, for most variants, the reads supporting the reference and the variant allele are in roughly a 50:50 ratio, the sample is likely diploid. If every site shows either all reference or all variant reads, it is likely haploid. If you see variant allele ratios of 33%, 67% or 100%, it is likely triploid. And so on.
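The allele-ratio reasoning above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the input is a list of variant allele fractions at variant sites (one value per site), a ploidy p is expected to produce fractions k/p for k = 1..p, and because any ploidy also fits the data of its divisors, the smallest ploidy whose fit is within a hypothetical `tolerance` of the best fit is chosen. The function name `estimate_ploidy` and its defaults are made up for this example.

```python
def estimate_ploidy(alt_fractions, max_ploidy=4, tolerance=0.002):
    """Guess ploidy from variant allele fractions at variant sites (toy heuristic)."""
    if not alt_fractions:
        raise ValueError("need at least one allele fraction")

    errors = {}
    for ploidy in range(1, max_ploidy + 1):
        # Fractions a variant can take if k of the `ploidy` copies carry it.
        expected = [k / ploidy for k in range(1, ploidy + 1)]
        # Mean squared distance from each observation to its nearest expected value.
        errors[ploidy] = sum(
            min((f - e) ** 2 for e in expected) for f in alt_fractions
        ) / len(alt_fractions)

    best = min(errors.values())
    # Prefer the simplest (lowest) ploidy that fits about as well as the best one.
    return min(p for p, err in errors.items() if err <= best + tolerance)


print(estimate_ploidy([1.0, 0.98, 1.0, 0.99]))                 # all-variant sites
print(estimate_ploidy([0.48, 0.52, 1.0, 0.5, 0.97, 0.51]))     # ~50% and ~100%
print(estimate_ploidy([0.33, 0.67, 1.0, 0.35, 0.66, 0.31]))    # ~33%, ~67%, ~100%
```

In practice such an estimate would need many sites with good depth, and it would be confounded by copy-number changes, mapping bias, and contamination, so specifying the known ploidy upfront remains the safer choice.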

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Thank you. When running FreeBayes I used this: freebayes -f ref.fa aln.bam >var.vcf, which assumes a diploid sample.

ADD REPLY • link 6.2 years ago by Enrique López ▴ 10