Question: What kind of sequences do I have, haploid or diploid?
0
10 months ago by
Málaga
Enrique López10 wrote:

Hello everybody! Sorry for the post, but I have a silly question...

I am using FreeBayes and Manta to detect SVs in the genome of an individual woman. First of all, a coworker passed me the reads aligned with BWA. So my question is this: does the BAM file obtained with BWA contain any information about whether the sample is haploid or diploid?

Explanation: I have the diploid genome of a woman. I sequenced it and obtained many reads that contain sequences from both alleles (diploid), then I aligned the reads with BWA to the hg19 genome (which is haploid). So when I obtain the BAM file, I have only the haploid genome, correct? However, when I use FreeBayes to detect SVs, I obtain a VCF file that reports the SVs with GT = 0/0, GT = 0/1 or GT = 1/1. I found that 0/0 means the sample is homozygous for the reference allele; 0/1 means it is heterozygous, with only one allele equal to the reference genome; and 1/1 means it is homozygous for the alternate allele. So... how does FreeBayes know that information? Then, if I try to annotate the SVs, do I need to specify whether the sample is haploid or diploid? I thought it was haploid.

Thank you very much!

modified 10 months ago by theobroma221.1k • written 10 months ago by Enrique López10
2

Just a point on terminology:

SNV: single-nucleotide variant

SV: structural variant

These are very different variant types.

modified 10 months ago • written 10 months ago by d-cameron1.9k

Good point; I assumed structural variant, but you're likely correct that the question was really about SNPs (or maybe small indels).

written 10 months ago by Chris Fields2.0k

You are correct: I used FreeBayes to detect small indels. I also used Manta to detect SVs, so I confused the concepts.

written 10 months ago by Enrique López10

Hi Enrique,

If you are working with sequence files in FASTA format, you may be interested in our SEDA (http://www.sing-group.org/seda/) application for easily processing FASTA files (filtering, merging, modifying headers, and so on).

Regards,

Hugo.

written 10 months ago by Hugo140

Thank you Hugo, now I am using SeqTrimNext, but I will keep it in mind in the future!

written 10 months ago by Enrique López10
4
10 months ago by
Chris Fields2.0k
University of Illinois Urbana-Champaign
Chris Fields2.0k wrote:

You will have reads from a diploid sample aligned to a haploid reference, so the information needed to determine whether variants exist is present in the alignment. Callers determine this primarily by comparing read information (position of alignment, quality scores, strand information, CIGAR string, etc.) and biological information (ploidy, known variants, etc.) to the reference. Tools like GATK, FreeBayes, samtools + bcftools, etc. primarily differ in how they determine this.

EDIT - most of these tools will report genotype information, allele frequency, and so forth based on the evidence in the BAM. In a rough manner of speaking, if you were to look at only the reads aligning to a region, and if approx. half of the reads in that region have evidence of a SV, this might be represented as GT = 0/1 (het). If they all have such evidence, then this would be GT = 1/1 (homozygous alt). It's quite a bit more complicated than that depending on the tool and how they specifically determine whether the evidence is actually a SV or a false positive due to artifacts from alignment, sequencing, the reference used, etc.
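To make the "rough manner of speaking" above concrete, here is a minimal sketch that picks the most likely diploid genotype from read counts at one site using a simple binomial model. It is only an illustration and not how FreeBayes or any other caller actually works; the function names and the `error_rate` value are assumptions made for this example.

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Binomial likelihood of the observed read counts under each diploid genotype."""
    n = ref_reads + alt_reads
    # Expected fraction of variant-supporting reads for each genotype.
    # error_rate is an assumed per-read error rate, chosen for illustration.
    expected_alt = {
        "0/0": error_rate,        # variant reads arise only from errors
        "0/1": 0.5,               # one of the two alleles carries the variant
        "1/1": 1.0 - error_rate,  # reference reads arise only from errors
    }
    return {
        gt: comb(n, alt_reads) * p**alt_reads * (1.0 - p)**ref_reads
        for gt, p in expected_alt.items()
    }

def call_genotype(ref_reads, alt_reads, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)

print(call_genotype(ref_reads=14, alt_reads=16))  # about half the reads support the variant
print(call_genotype(ref_reads=0, alt_reads=30))   # all reads support the variant
print(call_genotype(ref_reads=29, alt_reads=1))   # a single, probably erroneous, variant read
```

Real callers also weigh base and mapping qualities, strand bias, priors and more, so treat this only as a way to see why read ratios carry genotype information.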

modified 10 months ago • written 10 months ago by Chris Fields2.0k

So... the information that says I have a diploid sample is inside the BAM file that I obtained, no? And for this reason, FreeBayes can determine the genotype, correct? Thanks.

written 10 months ago by Enrique López10

I just added an edit that might help.

written 10 months ago by Chris Fields2.0k

Thank you very much! It is all that I needed to know!

written 10 months ago by Enrique López10

Just to add to this...

During alignment, the alignment software is not aware of ploidy. Each read is essentially mapped independently of the other reads. As such, a BAM file does not contain any explicit information about ploidy.

Most variant callers make an assumption about the ploidy of the samples they're working on. Humans are diploid (with the exception of chrX and chrY in males, and chrMT), so the FreeBayes default of --ploidy=2 applies (see also the FreeBayes GitHub page). You could change this if that makes sense for the organism you are working on.

Obviously, a variant caller is most accurate if you specify the ploidy upfront. Theoretically, it's possible to estimate the ploidy: if, for most variants, the supporting reads are in a roughly 50:50 ratio with the reference reads, the sample is likely diploid. If you have either all reference or all variant reads, it is likely haploid. If you have 33%, 67% or 100% allele ratios, it is likely triploid. And so on.
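The estimation idea above can be sketched as follows: for each candidate ploidy p, the possible variant-allele fractions are 1/p, 2/p, ..., p/p, and we ask which candidate best matches the fractions observed at variant sites. This is a toy illustration under stated assumptions, not a published method; the function names, the candidate list and the `tolerance` value are all made up for this example. Because a higher ploidy can never fit worse than a lower one it contains (tetraploid includes 0.5 and 1.0), the sketch returns the lowest ploidy whose fit is within `tolerance` of the best.

```python
def expected_fractions(ploidy):
    """Variant-allele fractions possible at a variant site: 1/p, 2/p, ..., p/p."""
    return [k / ploidy for k in range(1, ploidy + 1)]

def ploidy_fit(fractions, ploidy):
    """Mean squared distance from each observed fraction to the nearest expected one."""
    if not fractions:
        raise ValueError("need at least one observed allele fraction")
    expected = expected_fractions(ploidy)
    return sum(min((f - e) ** 2 for e in expected) for f in fractions) / len(fractions)

def estimate_ploidy(fractions, candidates=(1, 2, 3, 4), tolerance=1e-3):
    """Lowest candidate ploidy whose fit is within `tolerance` of the best fit."""
    fits = {p: ploidy_fit(fractions, p) for p in candidates}
    best = min(fits.values())
    return min(p for p, fit in fits.items() if fit <= best + tolerance)

# Hypothetical variant-allele fractions (variant reads / total reads) per site.
print(estimate_ploidy([0.48, 0.52, 1.0, 0.47, 0.55, 1.0]))  # mix of ~50% and 100%
print(estimate_ploidy([1.0, 0.98, 1.0, 0.99]))              # everything near 100%
print(estimate_ploidy([0.33, 0.67, 0.35, 1.0, 0.64]))       # thirds
```

With real data the fractions are much noisier (coverage, mapping bias, copy-number changes), so this only shows the reasoning, and specifying the known ploidy remains the safer choice.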

written 10 months ago by WouterDeCoster35k

Thank you. When using FreeBayes I ran this: freebayes -f ref.fa aln.bam >var.vcf, which assumes a diploid sample.
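For readers who prefer to state the ploidy explicitly rather than rely on the default, the sketch below builds the same command with the `--ploidy` option mentioned above. The helper name `freebayes_command` is hypothetical, and the file names are the ones from the command in this thread; the script only prints the command and does not run FreeBayes.

```python
import shlex

def freebayes_command(reference, bam, ploidy=2):
    """Build a FreeBayes argument list with the ploidy stated explicitly."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    # -f gives the reference FASTA; --ploidy sets the assumed ploidy.
    return ["freebayes", "-f", reference, "--ploidy", str(ploidy), bam]

cmd = freebayes_command("ref.fa", "aln.bam", ploidy=2)
# Print a shell-ready line; the VCF is written to standard output, hence the redirect.
print(shlex.join(cmd) + " > var.vcf")
```

Passing the list to `subprocess.run` with standard output redirected to a file would execute it, assuming FreeBayes is installed and on the PATH.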

written 10 months ago by Enrique López10
0
10 months ago by
theobroma221.1k
theobroma221.1k wrote:

You will also know the ploidy by knowing which tissue you are working with. For example, human sperm and egg cells are certainly haploid, but if it's liver or skin cells, it is more likely to be diploid.

modified 10 months ago • written 10 months ago by theobroma221.1k

Yes, it is human blood, so it is diploid. Thank you!

written 10 months ago by Enrique López10