Remove all entries with duplicate names from fastq file?
5 weeks ago
wormball • 0

Hello!

I have some paired-end fastq files supposedly coming straight from Illumina. However, they contain a number of records with duplicate names (but different sequences), which makes MergeBamAlignment fail. So I need a tool to remove all such duplicates. I saw the advice to use seqtk (Duplicate/identical reads in fastq file), but seqtk leaves one copy of each duplicate in place. That may lead to wrong results, because there is no guarantee that the surviving reads still form proper pairs.

Is there a tool that removes all reads that have duplicate names?

fastq illumina duplicate

But they contain some number of records with duplicate names (but different sequences)

With normal Illumina sequence data that should not happen. If at all possible, I advise you to go back and find the original data. This indicates that someone has fiddled with the file in some way, and you have no way of knowing what else may have happened to it.

That said, you may be able to use dedupe.sh from the BBMap suite. Take a look at the in-line help, especially the rmn= parameter.

5 weeks ago
wormball • 0

Thanks! However, it seemed too complicated to me, and I could not make it do what I wanted, so I wrote the desired script myself:

#!/usr/bin/python3

import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("""rmdup.py - removes all occurrences of entries with duplicate names from a fastq file
usage: rmdup.py file.fastq > file_rmdup.fastq""")
        exit()
    with open(sys.argv[1]) as f:
        l = f.readlines()  # read the whole fastq; records are 4 lines each
    d = {}   # every read name seen
    dd = {}  # read names seen more than once
    # first pass: record which names occur multiple times
    for i in range(0, len(l), 4):
        s = l[i].split()[0]
        if s in d:
            dd[s] = 1
        d[s] = 1
    # second pass: print only records whose name is unique
    for i in range(0, len(l), 4):
        s = l[i].split()[0]
        if s not in dd:
            for a in range(4):
                print(l[i + a], end="")
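The same two-pass idea can also be expressed with collections.Counter, which makes the "count first, then keep only names seen once" logic explicit. This is a minimal sketch, not from the original post; the function name and the toy records are mine:

```python
from collections import Counter

def remove_duplicate_names(lines):
    """Drop every 4-line FASTQ record whose name occurs more than once."""
    names = [lines[i].split()[0] for i in range(0, len(lines), 4)]
    counts = Counter(names)
    kept = []
    for i in range(0, len(lines), 4):
        if counts[lines[i].split()[0]] == 1:
            kept.extend(lines[i:i + 4])
    return kept

# tiny example: @r1 appears twice (different sequences), @r2 is unique
fastq = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "TTTT\n", "+\n", "IIII\n",
    "@r1\n", "GGGG\n", "+\n", "IIII\n",
]
print(remove_duplicate_names(fastq))  # only the @r2 record survives
```

Note that both @r1 records are removed entirely, unlike seqtk, which would keep one of them.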


Since you have a solution that works, I moved your comment to an answer. You can go ahead and accept this answer to provide closure to this thread.