sample mixing when demultiplexing
4.4 years ago
lait ▴ 160

How can I detect whether samples from different humans were mixed during demultiplexing? We have 4 samples per lane and 8 lanes in total. After demultiplexing, it turned out that some samples are double the size of others. The average size per sample is normally 10 GB, but for our last run we got samples with the following sizes:

• 10 GB (which is normal)
• 5 GB
• 15 GB

It looks as if some reads from certain samples were assigned to the wrong sample during demultiplexing.

I already have the FASTQ, BAM and VCF files.

How can I verify computationally that read mixing happened?

Edit: each sample is sequenced twice, on two different lanes, so after demultiplexing there is a step that collects each sample's reads across the lanes.

demultiplex WES human samples NGS BAM • 1.5k views
4.4 years ago
GenoMax 118k

You can't judge whether a sample is 'normal' from the yield of the data alone. Libraries can behave differently and generate more or less data. Hopefully that is the explanation here and you simply have unbalanced libraries in the pool. Errors in demultiplexing alone can't explain the huge differences you are observing.


OK, thanks. I added an important point in the Edit section. Could the error have happened in the step after demultiplexing, when the samples are collected across lanes? If so, I come back to my original question: is there a way to check whether samples were mixed?


I assume this is Illumina sequencing? Are your samples part of different pools, or was the same pool run on multiple lanes? If the latter, it is simple to collect all reads belonging to one sample into single files by using the --no-lane-splitting option with bcl2fastq. That way there can be no post-processing errors due to wrong sample merges. If the former, it is still possible that the pools were quantitatively unbalanced to begin with, which would explain the yield differences.
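If the per-lane files already exist and were merged by hand, the merge itself can be reproduced and checked. The following is only a minimal sketch, assuming the files follow the usual bcl2fastq naming pattern `SampleName_S1_L001_R1_001.fastq.gz`; the function names `group_by_sample` and `merge_lanes` are hypothetical helpers, not part of any tool.

```python
import os
import re
import shutil
from collections import defaultdict

# Assumed bcl2fastq-style file name: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
LANE_FILE = re.compile(
    r"^(?P<sample>.+)_S(?P<snum>\d+)_L(?P<lane>\d{3})_(?P<read>[RI]\d)_001\.fastq\.gz$"
)

def group_by_sample(paths):
    """Group per-lane FASTQ paths by (sample, read), ordered by lane number."""
    groups = defaultdict(list)
    for path in paths:
        match = LANE_FILE.match(os.path.basename(path))
        if match:  # files that do not fit the pattern are ignored
            key = (match["sample"], match["read"])
            groups[key].append((int(match["lane"]), path))
    return {key: [p for _, p in sorted(items)] for key, items in groups.items()}

def merge_lanes(paths, out_path):
    """Concatenate gzipped FASTQ files; concatenated gzip members stay valid."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Printing the result of `group_by_sample` before merging makes it easy to see whether every sample has the expected number of lanes and whether R1 and R2 list the same lanes in the same order.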

Note: If you suspect that two samples (with different indexes) were incorrectly mixed in a post-processing step, isolate the index sequences and check whether more than one appears in each file (use the code here: C: Demultiplexing reads with index present in the labels).
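For illustration, the index check can also be sketched in Python. This is not the linked script, just a minimal example that assumes Illumina (CASAVA 1.8+) style headers, where the index is the last colon-separated field after the space, e.g. `@M1:1:FC:1:1101:1:1 1:N:0:ACGTAC`. The function names are made up for this example.

```python
import gzip
from collections import Counter

def index_from_header(header):
    """Return the index field of an Illumina-style read header, or None."""
    parts = header.split()
    if len(parts) < 2:
        return None
    # Assumed comment layout: <read>:<is filtered>:<control number>:<index>
    fields = parts[1].split(":")
    return fields[3] if len(fields) >= 4 and fields[3] else None

def count_indexes(lines):
    """Count index sequences over FASTQ lines (header is every 4th line)."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:
            counts[index_from_header(line)] += 1
    return counts

def count_indexes_in_file(path):
    """Count index sequences in a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_indexes(handle)
```

Calling `count_indexes_in_file("sample.fastq.gz").most_common(5)` (the file name is a placeholder) lists the most frequent indexes. A second index with a large share of the reads would point to a wrong merge.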


Thanks a lot. Using your script, I am now sure that there was no mixing between the sequences. In this regard, do you have an explanation for the following?

I processed the VCF files and calculated the B-allele frequency for the heterozygous variants. When plotting the frequency distribution, most of the plots (especially those for the samples with unusual file sizes) showed three peaks: one large peak at 0.5 and two smaller ones at 0.4 and 0.6. Does this suggest contamination, or something else?
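For reference, a B-allele frequency distribution of this kind could be computed roughly as follows. This is a minimal sketch, not the exact procedure used above: it assumes a VCF whose FORMAT column carries `GT` and an `AD` field with reference and alternate read depths (as written by callers such as GATK), and the function names and the `min_depth` cut-off are illustrative choices.

```python
def het_allele_fractions(vcf_lines, sample_index=0, min_depth=20):
    """Return alt / (ref + alt) for heterozygous calls with enough depth."""
    fractions = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10 + sample_index:
            continue  # no genotype column for the requested sample
        fmt = dict(zip(cols[8].split(":"), cols[9 + sample_index].split(":")))
        genotype = fmt.get("GT", "").replace("|", "/")
        if genotype not in ("0/1", "1/0"):
            continue  # keep simple biallelic heterozygous calls only
        try:
            ref, alt = [int(x) for x in fmt.get("AD", "").split(",")[:2]]
        except ValueError:
            continue  # AD missing or not numeric
        depth = ref + alt
        if depth >= min_depth:
            fractions.append(alt / depth)
    return fractions

def histogram(fractions, n_bins=20):
    """Count fractions in n_bins equal-width bins covering 0 to 1."""
    counts = [0] * n_bins
    for value in fractions:
        counts[min(int(value * n_bins), n_bins - 1)] += 1
    return counts
```

Comparing the histograms of the unusually sized samples with those of samples that look normal shows whether the side peaks are specific to the suspicious samples.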