Question

How can I find identical sequences in a genome FASTA file (with Python or any other program)?

3 months ago

Sony ▴ 10

Hello everyone,

I have a genome FASTA file that contains 16,941 sequences. Here is an example from my "genome.fasta":

>scf7180000026027
GAATGCATACTGCATCGATA

>scf7180000026028
CATAAAACGTCTCCATCGCT

>scf7180000026029
TGCCCAAGTTGTGAAGTGTC

>scf7180000026030
TGCCCAAGTTGTGAAGTGTC

I want to find identical sequences in this genome FASTA file and return their ids. My final goal is to find and remove any identical sequences present in my genome FASTA file.

Thank you, everyone, for any suggestions.

fasta • 310 views

updated 3 months ago by Ram 44k • written 3 months ago by Sony ▴ 10

ADD REPLY • link 3 months ago by Pierre Lindenbaum 163k

score 1 · Answer 1 · 2024-05-20

My final goal is to find and remove any identical sequences present in my genome FASTA file.

You can use clumpify.sh from the BBMap suite for this (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files"). With the dedupe option it removes duplicates, and it accepts FASTA-format sequences.

clumpify.sh -Xmx10g in=your_file.fa out=deduped_file.fa dedupe subs=0

subs=0 keeps only perfect matches as duplicates. Increase that number to allow mismatches (substitutions) between duplicates.

You can add the addcount option to mark each retained header with the number of copies found, like so:

>scf7180000026027
GAATGCATACTGCATCGATA
>scf7180000026029 copies=3
TGCCCAAGTTGTGAAGTGTC
>scf7180000026028
CATAAAACGTCTCCATCGCT
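
If you also want the ids of the identical sequences (the first part of the question), here is a minimal pure-Python sketch. It is independent of clumpify and only an illustration: the function names read_fasta, find_identical and write_unique are made up for this example, and it assumes that "identical" means an exact, case-insensitive match on the same strand (reverse complements are not considered). It groups ids by sequence, so all sequences are held in memory.

```python
from collections import defaultdict


def read_fasta(path):
    """Yield (record_id, sequence) pairs from a FASTA file."""
    record_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                if record_id is not None:
                    yield record_id, "".join(chunks)
                # keep only the first word of the header as the id
                record_id = (line[1:].split() or [""])[0]
                chunks = []
            else:
                chunks.append(line)  # sequences may span several lines
    if record_id is not None:
        yield record_id, "".join(chunks)


def find_identical(path):
    """Return one list of ids per group of exactly identical sequences."""
    ids_by_seq = defaultdict(list)
    for record_id, seq in read_fasta(path):
        ids_by_seq[seq.upper()].append(record_id)
    return [ids for ids in ids_by_seq.values() if len(ids) > 1]


def write_unique(path, out_path):
    """Write only the first record of each distinct sequence to out_path."""
    seen = set()
    with open(out_path, "w") as out:
        for record_id, seq in read_fasta(path):
            key = seq.upper()
            if key not in seen:
                seen.add(key)
                out.write(f">{record_id}\n{seq}\n")
```

Calling find_identical("genome.fasta") returns the groups of duplicate ids, and write_unique("genome.fasta", "deduped.fasta") writes a copy that keeps the first occurrence of each sequence. For very large assemblies, storing a hash of each sequence instead of the sequence itself would reduce memory use.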