Mate reads are aligned to different chromosomes
20 months ago
wormball ▴ 10

Hello!

I am a complete noob, so sorry for the dumb questions.

I am trying to call mutations from tumor-only human samples (enriched with a gene panel) sequenced on Illumina.

I looked at the result of MarkDuplicates and was disappointed. It threw away most of the good reads in the regions of interest, but kept a lot of duplicate reads. When I clicked on these reads, I found that their mate reads were aligned to other chromosomes or not aligned at all, e.g. (see below):

So my questions are:

  • How can I distinguish between a library preparation artifact and an actual translocation? I suspect most of them are the former.

  • If this is a preparation artifact, what could cause it, and how can I avoid it in the future?

  • How can I tell MarkDuplicates (or maybe bwa or something else) to discard such reads? And should I?

  • Maybe I should not use MarkDuplicates at all in this case? Because it throws away many reads that could actually be fragmentation coincidences.

  • How can I distinguish a fragmentation coincidence from a PCR artifact?

Thanks in advance.

Read name = NB552223:8:H57YJBGXG:2:22304:5136:14363
Sample = _S3
Library = _S3
Read group = _S3
Read length = 151bp
Flags = 81
----------------------
Mapping = Primary @ MAPQ 60
Reference span = chr1:155 904 594-155 904 744 (-) = 151bp
Cigar = 151M
Clipping = None
----------------------
Mate is mapped = yes
Mate start = chr2:32916517 (+)
First in pair
----------------------
MC = 39M112S
PG = MarkDuplicates
NM = 0
MQ = 0
UQ = 0
AS = 151
XS = 19
Hidden tags: MD, RG
Location = chr1:155 904 628
Base = A @ QV 32

Read name = NB552223:8:H57YJBGXG:1:23201:17738:8530
Sample = _S3
Library = _S3
Read group = _S3
Read length = 151bp
Flags = 81
----------------------
Mapping = Primary @ MAPQ 60
Reference span = chr1:155 904 594-155 904 744 (-) = 151bp
Cigar = 151M
Clipping = None
----------------------
Mate is mapped = yes
Mate start = chr2:32916360 (+)
First in pair
----------------------
MC = 40M111S
PG = MarkDuplicates
NM = 0
MQ = 0
UQ = 0
AS = 151
XS = 19
Hidden tags: MD, RG
Location = chr1:155 904 661
Base = T @ QV 36

Read name = NB552223:8:H57YJBGXG:4:13410:2304:6795
Sample = _S3
Library = _S3
Read group = _S3
Read length = 151bp
Flags = 81
----------------------
Mapping = Primary @ MAPQ 60
Reference span = chr1:155 904 594-155 904 744 (-) = 151bp
Cigar = 151M
Clipping = None
----------------------
Mate is mapped = yes
Mate start = chrX:122774463 (+)
First in pair
----------------------
MC = 55S46M50S
PG = MarkDuplicates
NM = 0
MQ = 0
UQ = 0
AS = 151
XS = 19
Hidden tags: MD, RG
Location = chr1:155 904 685
Base = G @ QV 36

Read name = NB552223:8:H57YJBGXG:4:11507:4304:10003
Sample = _S3
Library = _S3
Read group = _S3
Read length = 151bp
Flags = 409
----------------------
Mapping = Secondary @ MAPQ 60
Reference span = chr1:155 904 804-155 904 914 (-) = 111bp
Cigar = 111M40S
Clipping = Right 40 soft
----------------------
Mate is mapped = no
Second in pair
----------------------
Supplementary Alignments
chr1:155 904 678-155 904 722 (+) = 44bp @MAPQ 60 NM0
----------------------
PG = MarkDuplicates
NM = 0
UQ = 0
AS = 111
XS = 20
Hidden tags: SA, MD, RG
Location = chr1:155 904 877
Base = C @ QV 36

Some enriched region after and before MarkDuplicates (without read downsampling): [image]

Some region where mate reads align (with read downsampling): [image]

illumina gatk bwa picard MarkDuplicates
----------------------

The images are way too large and the stats are too verbose, so it is difficult to get a good sense of the problem.

But it sounds like you have identified the problem: MarkDuplicates is not working quite right.

Filter for reads in a proper pair; that should remove the mates on the wrong chromosomes.

If you are enriching for certain templates, it does not make sense to remove duplicates: the enriched segments will produce duplicates, so you would basically be both enriching for and getting rid of the same templates.
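A minimal sketch of that filter with samtools (the input/output file names are placeholders):

```shell
# Keep only alignments whose proper-pair flag bit (0x2) is set; -b writes BAM.
samtools view -b -f 2 -o proper_pairs.bam input.bam

# Sanity check: "properly paired" should now cover all remaining reads.
samtools flagstat proper_pairs.bam
```

Note that this also drops reads whose mates are unmapped, since such pairs are never flagged as proper.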

----------------------

Thanks!

filter for reads in proper pair

So I should use samtools view -f 2?
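(For reference: `-f 2` keeps only reads whose flag has the proper-pair bit, 0x2, set. A quick sketch checking the flag values quoted in the read records above:)

```python
PROPER_PAIR = 0x2          # the SAM flag bit tested by `samtools view -f 2`

# Flag values copied from the read records above.
for flag in (81, 409):
    # 81  = paired + reverse strand + first in pair;
    # 409 = paired + mate unmapped + reverse strand + second in pair + secondary.
    # Neither has 0x2 set, so `-f 2` would discard both reads.
    print(flag, bool(flag & PROPER_PAIR))   # prints: 81 False, then 409 False
```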

I looked at the mate read nest shown in the second picture and saw that it is a poly-G region, so basically trash. I also looked at some other mate read locations, and some of them are poly-A/T regions. How can such pairing arise during preparation/sequencing?

if you are enriching for certain templates it does not make sense to remove duplicates

On the other hand, if they were fragmentation coincidences, they should generally share only one coinciding end of the two, which is not the case here. So I think they are actually PCR artifacts.

I looked at another sample in the same run, and it has about 3 times more "good" deduplicated read pairs and about 1.5x fewer trash pairs.

----------------------

As the saying goes "Happy families are all alike; every unhappy family is unhappy in its own way."

So when it comes to weird data like yours, there can be many different causes behind the various problems you see.

Now, when it comes to your data: you seem to have coverage of over two hundred thousand X over a region that appears to be about 400 bp.

How could one even distribute so many reads over 400bp without creating many "duplicated" reads?

When you reach a certain level of coverage, "duplication" is a natural consequence of covering the same region beyond saturation. At that point it is not duplication at all; the word is incorrect.

It is not "duplicated" in the classical sense of copying an existing fragment. It is "duplicated" by fragmenting in the same position.
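A back-of-envelope sketch of those numbers (assuming ~200,000x depth over a ~400 bp target with 151 bp reads, the figures mentioned in this thread):

```python
coverage, region_bp, read_len = 200_000, 400, 151   # figures from the thread

n_reads = coverage * region_bp // read_len   # reads needed to reach that depth
min_dup = n_reads / region_bp                # mean reads per start position, even if
                                             # every single base served as a start
print(n_reads, min_dup)                      # ~530k reads, >1300 per start position
```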

----------------------

How could one even distribute so many reads over 400bp without creating many "duplicated" reads?

We can get a rough estimate by modeling.

Let L be the length of the interval from which we get our fragments, so the first and second ends of a fragment are evenly distributed over this interval. This gives L * L possible fragment locations.

import numpy as np
import matplotlib.pyplot as plt

L = 100
r = np.random.randint(L ** 2, size = 20000)   # each fragment is one of L*L (start, end) combinations
h = np.histogram(r, bins = range(L ** 2))     # count fragments per combination
plt.figure(); plt.hist(h[0], bins = range(20), log = True); plt.xticks(range(20)); plt.show()

[image: histogram of simulated duplication counts]

So if we have 20000 fragments distributed over 100 bp, the most duplicated fragment occurs in only 9 copies, and most fragments are repeated only 1-3 times.
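That maximum of ~9 agrees with a Poisson sanity check: N = 20000 fragments over L² = 10000 equally likely (start, end) cells means each cell receives Poisson(λ = 2) fragments:

```python
from math import exp, factorial

N, cells = 20000, 100 ** 2
lam = N / cells                                    # 2 fragments per cell on average
for k in (3, 6, 9, 12):
    expected = cells * exp(-lam) * lam ** k / factorial(k)
    # expected number of cells holding exactly k fragments;
    # this drops below 1 around k = 10, matching the simulated maximum of ~9
    print(k, round(expected, 2))
```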

However, in my case most fragments are repeated about 100 times, so I conclude that even fragmentation cannot produce such a distribution. Further evidence: in another sample I have 3x more unique reads over exactly the same region at about the same read depth.

There is still the possibility that the fragmentation is not exactly even and prefers some sites over others. I know the fragmentation was mechanical, but I do not know how selective it is.

Or maybe the DNA had been fragmented before/during storage by native enzymes present in the tissue.

----------------------

In my opinion, the formula has an unrealistic size distribution expectation, and the results won't match the scenario.

The fragments are selected to be the same size. Ideally, if you were to have a fixed length, the fragment can only "land" on coordinates 1, 2, ... L; thus after placing 2*L reads, your average duplication is 2.

Of course, in reality there is variability in the fragment size, but with a strong bias toward certain sizes; thus the expected duplication will always be smaller than the theoretical N/L. Still, the number of possible fragment placements is, in my opinion, much, much lower than L^2.

I think you should rerun your simulation selecting lengths from a Gaussian distribution, varying the standard deviation, and seeing what you get.
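A pure-Python sketch of that experiment; the mean of 150 bp is a placeholder, and the standard deviation is the knob to vary:

```python
import random
from collections import Counter

random.seed(0)
L, n = 100, 20000                # region length and fragment count, as in the model above
counts = Counter()
for _ in range(n):
    start = random.randrange(L)                  # uniform start inside the region
    length = round(random.gauss(150, 20))        # Gaussian fragment size (sd = 20 here)
    counts[(start, start + length)] += 1         # identical (start, end) = "duplicate"

print(max(counts.values()), len(counts))
```

With a tighter size distribution (smaller sd), the possible (start, end) combinations shrink and the maximum duplication climbs above the ~9 seen with uniform L*L placement.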

----------------------

Here is my fragment length distribution:

[image: fragment length distributions per sample]

The discussed sample is S3 (brown). So the size variability is about 100 bp, and the length of the region is about 300 bp, which is much larger than the numbers used in the model, so the theoretical duplication level should be even lower.

Moreover, if we take L = 50, the maximum duplication level will be only 18.

And here is my actual duplication histogram:

[image: per-sample duplication histograms]

So most of my samples show much less duplication.

----------------------

Just to keep the numbers straight: you seem to be starting out with 200K coverage, not 20K as in the formula. And if you use the whole region, the read number is even larger, since 200K is the coverage of a single base and the region is wider than a read length. It is pushing closer to 1 million reads.

It is just a general observation. If you generate a million reads over a four-hundred-basepair region, you are going to have a lot of duplication. Even the slightest bias in fragmentation etc. can lead to apparent duplicates. A fragmentation bias is still a valid measurement; a PCR bias is not.

That's really all I am saying: as the coverage rises, things that are not PCR duplicates may look like PCR duplicates.

Though here, I would agree that the brown line does look way off the rest and indicates more problems.

----------------------

It is also curious that the brown line is worse than the teal one, which seems to be the undetermined data.
