Read counts an order of magnitude higher on one chromosome
4
0
Entering edit mode
5 weeks ago
Samuel • 0

Hi,

I am having an issue with a sequencing run that when demultiplexed, aligned, and filtered each individual has 1-2 million reads, but these reads are predominantly on one chromosome. For background these are oncorhynchus mykiss and o. clarki samples. The 6th chromosome has an order of magnitude higher reads than the other chromosomes. The libraries were prepared for rad-seq (sbf1) with 19 plates on one novaseq 6000 lane.

The demultiplexing was done using deML. Aligning to the O. mykiss reference genome used bwa and filtering/sorting in samtools.

I can't figure out if the issue is a library prep issue, sequencing issue, or something that I am doing wrong in the splitting and creation of the bam files? If anyone has had a similar problem please let me know.

Thanks, Sam

novaseq alignment Rad-seq sequencing demultiplex • 717 views
0
Entering edit mode

I can't figure out if the issue is a library prep issue, sequencing issue, or something that I am doing wrong in the splitting and creation of the bam files?

You should also consider possible biological reasons for this observation, like aneuploidy, repeat sequences, nucleotide composition biases, etc.

I strongly suggest visualizing the read coverage of this chromosome. Is the increase in read coverage uniform or are there specific regions with lots of reads?

0
Entering edit mode

Hi jv,

I selected two individuals at random from the plate to look at the depth across chromosome 6. It seems that there is one small region that is accounting for the majority of the reads. The individuals are colored differently. Others have sequenced these species using Rad-seq without this issue so not sure it is biological, but I am new to this. I appreciate your insight.

1
Entering edit mode

I mean it's clearly not the whole chromosome it's specific locus. How long is this region? What does the underlying sequence look like?

0
Entering edit mode

These regions seem to be ~60 bases long. The sequence for the omy06 region is "ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT". When I do a blast search it comes up as H. pylori, COVID-19, or synthetic construct. Seems to be pointing toward contamination.

0
Entering edit mode

Check my answer. These are adapters and should be trimmed prior to alignment. Sometimes one round of trimming isn't enough (in the case that you already trimmed your reads).

0
Entering edit mode

Could you do this plot for other chromosomes as well? The peak might be there, too.

0
Entering edit mode

2
Entering edit mode
4 weeks ago
benformatics ★ 2.9k

ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT is the read2 adapter for some types of RAD-seq: https://github.com/MikkelSchubert/adapterremoval/issues/54 and https://patents.google.com/patent/US20160122753A1/en

I'm guessing this is the same for your case. Why the adapter sequence exists in the genome... who knows.

0
Entering edit mode

Good catch. @OP, did you use bwa (or any local aligner) for alignment?

0
Entering edit mode

Yes, I used bwa for alignment.

1
Entering edit mode
5 weeks ago
jv ▴ 430

Others have sequenced these species using Rad-seq without this issue so not sure it is biological

Others who? Other researchers in your group? How do you know this isn't something others have observed?

Presumably you are doing research, and now you've made an observation, one that you assume varies from others' observations and that is ok, that is research, now you need to follow up on the observation and get a better understanding as to the source of these over-represented reads in your Rad-seq data. Only once you investigate can you know if it's artifact or something real.

Did you do quality control of your raw sequencing data? It is very important that you do. With QC you'll get a better sense of whether library prep issues or sequencing issues exist. Currently I wouldn't expect library prep or sequencing issues to cause this result but I've never worked with rad-seq data before so can't say for sure.

Some additional questions to consider: What genes or sequences are located in this region of chromosome 6? Do all of your samples have it?

1
Entering edit mode
5 weeks ago

Possibly this is not a complete explanation but keep in mind that salmonids have undergone several rounds of whole-genome duplications, see the genome paper (Berthelot et al 2014). In particular, see Fig. 2b of the same. Chromosome 6, 11, and partially sex, 6, 12, and 5, were all derived from a single ancestral karyotype by Ts3R and Ss4R duplication events. A collapsed and conserved repetitive region might be sufficient to attract a lot of reads.

1
Entering edit mode
5 weeks ago
ATpoint 62k

I am neither a RAD-seq person, nor familiar with these organisms but regions that are prone to accumulate excessive read counts are not uncommon for NGS experiments. For human and mouse there are even established blacklists to remove known artifact regions. These could be some kind of low-complexity regions that accumulate false alignments that then pile up or these regions have sequence similarities with regions that are not properly represented in the reference genome, yet present in the genome, hence also prone to accumulate false alignments. Did you check for the alignment scores of the reads in these regions? Are these low? Are there other datasets you could check to see whether this is a common issue?

0
Entering edit mode

The alignment seems good, 98% mapped. No it is not a common problem with other datasets I've checked.

607841 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 591688 + 0 supplementary 0 + 0 duplicates 597497 + 0 mapped (98.30% : N/A) 16153 + 0 paired in sequencing 8449 + 0 read1 7704 + 0 read2 0 + 0 properly paired (0.00% : N/A) 5809 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 5538 + 0 with mate mapped to a different chr 5343 + 0 with mate mapped to a different chr (mapQ>=5)