5.9 years ago
cpad0112
21k
I received paired-end sequencing results from our core facility. While checking for duplicated reads as part of QC, I found read entries like these in the R1 file (headers):
@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC
@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC
@NB501140:187:HHVLCBGX9:2:23206:9999:7430 1:N:0:CACCTTAC
@NB501140:187:HHVLCBGX9:2:23206:9999:7430 1:N:0:CACCTTAC
Is this duplication possible?
Querying the file further printed the following read information:
zgrep -iw -A 3 "NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC" R1.fastq.gz
@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC
AGCATCAGAGGCACCCACCTGAGGAAAGTCCTCGCTGTCCATGGCCTGCAGAGTCTGGTTGGCTGTCTCCAGAAGCTCCTCGGAGCTCTCCAGGGCCCGCGTGCAGGCAGCCAGCTGGTTCTGTGGATACCAGGCACCAAAGGAGGGGACA
+
AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAAEEEAAEEEEAEEEAEE<E/
--
@NB501140:187:HHVLCBGX9:2:23206:9999:2086 1:N:0:CACCTTAC
AGCATCAGAGGCACCCACCTGAGGAAAGTCCTCGCTGTCCATGGCCTGCAGAGTCTGGTTGGCTGTCTCCAGAAGCTCCTCGGAGCTCTCCAGGGCCCGCGTGCAGGCAGCCAGCTGGTTCTGTGGATACCAGGCACCAAAGGAGGGGACA
+
AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAAEEEAAEEEEAEEEAEE<E/
How many are there? Just check using seqkit.
Source : https://bioinf.shenwei.me/seqkit/usage/#duplicate
~ 25k sequences are like that. @ Vijay Lakhujani
So, what you are concerned about is identical headers + identical sequence or ANY ONE of these 2?
All of them. All three are exactly identical (read name, read sequence and base qualities) between the duplicated sequences, which is kind of hard to believe. I wanted to know whether this is fairly common, as I haven't come across identically duplicated reads in the past. @ Vijay Lakhujani
Well, I for one have never encountered this before
looks very dodgy to me as well
Thanks Vijay Lakhujani. I have used this for duplicate read identification. Since I had duplicate read names, I used '-n' instead of '-s'.
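For anyone who wants to cross-check the count of duplicated read names without seqkit, here is a minimal Python sketch. It assumes strict 4-line FASTQ records (no wrapped sequences) and treats everything up to the first space in the header as the read name; `duplicated_read_names` and `open_fastq` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def duplicated_read_names(path):
    """Return {read_name: count} for every read name seen more than once.

    Assumes 4 lines per record, so only every 4th line is a header.
    """
    counts = Counter()
    with open_fastq(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:
                # Drop the leading '@' and the comment after the first space.
                fields = line[1:].split()
                if fields:
                    counts[fields[0]] += 1
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    import sys

    dups = duplicated_read_names(sys.argv[1])
    print(f"{len(dups)} read names occur more than once")
```

Run it as `python count_dup_names.py R1.fastq.gz` (the script name is arbitrary). All read names are held in memory, so a very large file may need a machine with enough RAM.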
Hi!
I also used the same command, however, it's producing a truncated fastq file for some reason.
You very likely have a corrupted data file to begin with. See if you can download a fresh copy. Ask for md5sum values to confirm you have a complete copy.
Hi Genomax. Thanks for your input. We have sent a request to check whether there was an issue at the demultiplexing step. However, I suspect I may be missing something. Isn't it the job of MarkDuplicates to mark such duplicates rather than throw an error? Aren't PCR duplicates assigned the same read name? If not, how and why do the read names of PCR duplicates differ from those of their primary alignment?
There should be no duplicate read names in a normal Illumina read file. Each sequence that passed QC came from a unique cluster that originated from a single library fragment, so every read gets its own name. PCR duplicates are generally identified by identical end-to-end sequence in both reads of a pair, not by read name.
I've never seen this. Are the duplicated reads randomly placed, or do the duplicates immediately follow the original? Are the reads repeated in R2 as well?
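Those questions can be answered with a short script. The sketch below is one possible way, assuming strict 4-line FASTQ records: for every repeated read name it reports whether the repeat directly follows the previous copy and whether sequence and qualities are identical. `read_records` and `classify_duplicates` are hypothetical names. It keeps every record in memory, so it is meant for a subset or a machine with plenty of RAM.

```python
import gzip


def read_records(path):
    """Yield (header, sequence, quality) for each 4-line FASTQ record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip("\n")
            yield header.rstrip("\n"), seq, qual


def classify_duplicates(path):
    """Describe each repeated read name relative to its previous copy."""
    last_seen = {}  # read name -> (record index, (sequence, quality))
    report = []
    for index, (header, seq, qual) in enumerate(read_records(path)):
        name = header[1:].split()[0]
        if name in last_seen:
            prev_index, prev_record = last_seen[name]
            report.append({
                "name": name,
                "adjacent": index - prev_index == 1,
                "identical": prev_record == (seq, qual),
            })
        last_seen[name] = (index, (seq, qual))
    return report
```

Running `classify_duplicates` on both R1 and R2 and comparing the reported names shows whether the same reads are repeated in both files. Adjacent, fully identical repeats in both files would fit a file-writing or demultiplexing problem better than a sequencing artifact.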
I guess this thread can be closed. I talked with core and they agreed to relook at the data. Thanks @ Vijay Lakhujani lieven.sterck h.mon
This definitely looks like data/file corruption. If the original files don't have this problem then you would need to get a fresh copy (ask for md5sums to compare). Otherwise the core will have to re-run demultiplexing and regenerate the data.
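If the `md5sum` command is not available on your system, the same check can be done in Python with the standard-library `hashlib` module. This is a minimal sketch; `md5_of_file` is a hypothetical helper name, and the checksum to compare against must come from the core.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Print in the same "digest  filename" layout that md5sum uses.
    for fastq_path in sys.argv[1:]:
        print(f"{md5_of_file(fastq_path)}  {fastq_path}")
```

A matching digest only shows that your copy equals the core's copy. If the core's file already contains the duplicated records, the checksums will agree and the data will still need to be regenerated.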
Hi, I am facing the same issue. Is there a workaround for this? I have not been able to resolve it.
Because two reads have the same header, the MarkDuplicates step fails.