Question: Remove/Delete unique reads from a DNA fasta file
saanasum wrote, 4 months ago:

I have millions of FASTA files, each containing many DNA sequences.

I want to batch-remove reads (DNA sequences) that are unique within each FASTA file; that is, I only want to keep reads whose sequence occurs at least twice in a given file.

Does anybody know a command-line solution or a program/script that does this?

Many thanks!

Tags: dna, remove, reads, unique, fasta

seqkit common is what you are looking for.
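For reference, a minimal example of that command (hypothetical filenames; -s compares records by sequence rather than by ID):

    seqkit common -s file1.fasta file2.fasta > common.fasta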

written 4 months ago by finswimmer

I guess this is an excellent solution for finding common sequences between two or more files (I tried it). However, I want to extract sequences that occur at least twice within ONE file containing many FASTA sequences (or, even better, to remove the unique sequences from ONE FASTA file). Finally, this should be done for millions of FASTA files.

written 4 months ago by saanasum

However, seqkit might still be the right choice, using the rmdup command with the -d parameter: seqkit rmdup

written 4 months ago by saanasum
  • Create a for loop in bash over your files
  • In this loop, call a Python script with one of your files as the argument
  • In this Python script, create a dictionary that you fill with each sequence as a key
  • For each sequence, if it is already a key in your dictionary, output the sequence (it is a duplicate); a minimal sketch follows below
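
A minimal sketch along these lines (hypothetical script name keep_duplicates.py; it counts sequences in a first pass and prints in a second, so the first copy of each duplicated read is kept as well; assumes plain, uncompressed FASTA):

    #!/usr/bin/env python3
    # keep_duplicates.py: print only the FASTA records whose sequence
    # occurs at least twice in the input file (first argument).
    import sys
    from collections import Counter

    def read_fasta(path):
        """Yield (header, sequence) tuples from a plain FASTA file."""
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line, []
                elif line:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    path = sys.argv[1]

    # First pass: count how often each sequence occurs in the file.
    counts = Counter(seq for _, seq in read_fasta(path))

    # Second pass: emit only records whose sequence is duplicated.
    for header, seq in read_fasta(path):
        if counts[seq] >= 2:
            print(header)
            print(seq)

Wrapped in the bash loop from the first step (hypothetical output naming):

    for f in *.fasta; do python keep_duplicates.py "$f" > "${f%.fasta}_dups.fasta"; done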
written 4 months ago by Bastien Hervé
saanasum wrote, 4 months ago:

Thanks @finswimmer for the suggestion of using seqkit.

It is possible as described here: seqkit rmdup. When using -d, a file to collect the duplicated reads can be specified. Using a for loop in bash should enable automation for millions of FASTA files; a minimal single-file example follows below.
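
For a single file, the call could look like this (hypothetical filenames; -s compares records by sequence, and the deduplicated stdout output is discarded since only the duplicates file is needed here):

    seqkit rmdup -s -d duplicated-reads.fasta input.fasta > /dev/null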

written 4 months ago by saanasum

Isn't this the opposite thing to what you asked?

You've asked to keep only reads that appear at least twice. This will remove all the duplicates instead... perhaps I'm missing something...

written 4 months ago by jrj.healey
-d, --dup-seqs-file string   file to save duplicated seqs.

The way I am reading this: the duplicated reads will be removed from the output and captured in the new file specified (seqkit writes the deduplicated records to stdout; the original file itself is not modified).

written 4 months ago by genomax

Ah yep, I see; knew it was too late in the evening...

written 4 months ago by jrj.healey

Exactly: using -d, a new file containing only the duplicate reads will be generated, so in this file all unique reads are deleted. I can proceed working with this new file. As a batch in bash:

for i in *.fasta; do
    # save only the duplicated reads to "${i}_unique-reads-removed";
    # the deduplicated records go to stdout, which is not needed here
    seqkit rmdup -s -i -m -d "${i}_unique-reads-removed" "$i" > /dev/null
done
written 4 months ago by saanasum
genomax wrote, 4 months ago:

dedupe.sh from the BBMap suite should also work; outd= will collect the duplicated sequences (example below).
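
A minimal sketch (hypothetical filenames; out= receives the deduplicated records, outd= the duplicated ones):

    dedupe.sh in=input.fasta out=deduplicated.fasta outd=duplicated-reads.fasta

Note that by default dedupe also absorbs contained sequences, not only exact duplicates, so it is worth checking the flags in the tool's help for this use case.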

written 4 months ago by genomax