Question: Remove/Delete unique reads from a DNA fasta file
saanasum wrote, 4 months ago:

I have millions of FASTA files, each containing many DNA sequences.

I want to batch-remove reads (DNA sequences) that are unique within each FASTA file; that is, I only want to keep reads whose sequence occurs at least twice in a given file.

Does anybody know a command-line solution or a program/script that does this?

Many thanks!

Tags: dna, remove, reads, unique, fasta

seqkit common is what you are looking for.
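For reference, a minimal example of that command (hypothetical filenames; -s compares records by sequence rather than by ID):

    seqkit common -s file1.fasta file2.fasta > common.fasta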

written 4 months ago by finswimmer

I guess this is an excellent solution for finding common sequences between two or more files (I tried it). However, I want to extract sequences that occur at least twice within ONE file containing many FASTA sequences (or, even better, to remove the unique sequences from ONE FASTA file). Finally, this should be done for millions of FASTA files.

written 4 months ago by saanasum

However, seqkit might still be the right choice, using the rmdup command with the -d parameter: seqkit rmdup

written 4 months ago by saanasum
  • Create a for loop in bash over your files
  • In this loop, call a Python script with one of your files as the argument
  • In this Python script, create a dictionary that you fill with each sequence as a key
  • For each sequence, if it is already a key in your dictionary, output the sequence (it is a duplicate); a minimal sketch follows below
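
A minimal sketch along these lines (hypothetical script name keep_duplicates.py; it counts sequences in a first pass and prints in a second, so the first copy of each duplicated read is kept as well; assumes plain, uncompressed FASTA):

    #!/usr/bin/env python3
    # keep_duplicates.py: print only the FASTA records whose sequence
    # occurs at least twice in the input file (first argument).
    import sys
    from collections import Counter

    def read_fasta(path):
        """Yield (header, sequence) tuples from a plain FASTA file."""
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line, []
                elif line:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    path = sys.argv[1]

    # First pass: count how often each sequence occurs in the file.
    counts = Counter(seq for _, seq in read_fasta(path))

    # Second pass: emit only records whose sequence is duplicated.
    for header, seq in read_fasta(path):
        if counts[seq] >= 2:
            print(header)
            print(seq)

Wrapped in the bash loop from the first step (hypothetical output naming):

    for f in *.fasta; do python keep_duplicates.py "$f" > "${f%.fasta}_dups.fasta"; done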
written 4 months ago by Bastien Hervé
saanasum wrote, 4 months ago:

Thanks @finswimmer for the suggestion of using seqkit.

It is possible as described here: seqkit rmdup. When using -d, a file to collect the duplicated reads can be specified. Using a for loop in bash should enable automation for millions of FASTA files; a minimal single-file example follows below.
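
For a single file, the call could look like this (hypothetical filenames; -s compares records by sequence, and the deduplicated stdout output is discarded since only the duplicates file is needed here):

    seqkit rmdup -s -d duplicated-reads.fasta input.fasta > /dev/null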

written 4 months ago by saanasum

Isn't this the opposite thing to what you asked?

You've asked to keep only reads that appear at least twice. This will remove all the duplicates instead... perhaps I'm missing something...

written 4 months ago by jrj.healey
-d, --dup-seqs-file string   file to save duplicated seqs.

The way I am reading this: the duplicated reads will be removed from the output and captured in the new file specified (seqkit writes the deduplicated records to stdout; the original file itself is not modified).

written 4 months ago by genomax

Ah yep, I see; knew it was too late in the evening...

written 4 months ago by jrj.healey

Exactly: using -d, a new file containing only the duplicate reads will be generated, so in this file all unique reads are deleted. I can proceed working with this new file. As a batch in bash:

for i in *.fasta; do
    # save only the duplicated reads to "${i}_unique-reads-removed";
    # the deduplicated records go to stdout, which is not needed here
    seqkit rmdup -s -i -m -d "${i}_unique-reads-removed" "$i" > /dev/null
done
written 4 months ago by saanasum
genomax wrote, 4 months ago:

dedupe.sh from the BBMap suite should also work; outd= will collect the duplicated sequences (example below).
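
A minimal sketch (hypothetical filenames; out= receives the deduplicated records, outd= the duplicated ones):

    dedupe.sh in=input.fasta out=deduplicated.fasta outd=duplicated-reads.fasta

Note that by default dedupe also absorbs contained sequences, not only exact duplicates, so it is worth checking the flags in the tool's help for this use case.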

written 4 months ago by genomax