I have a list of >100 circular contigs that I would like to remove from my de novo genome assembly.fasta. How can I remove these contigs from the assembly.fasta using a text file with the contig names/numbers? Or is there another way?
If you already know which contigs are circular, you can use the really cool seqkit tool. The grep subcommand is the perfect tool for this job.
seqkit grep assembly.fasta -n -v -f circular_contigs.txt > assembly_clean.fasta
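-n matches against the full header (ID plus description) instead of only the ID, so every entry in the list has to match a header exactly. If your list holds only the IDs (the text before the first whitespace), leave -n out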
-v inverts the match, so only records that are NOT in the list (i.e. the non-circular contigs) are kept
-f specifies the file to read the patterns from (in this case the circular contig header names)
circular_contigs.txt is a list (one header per line) that identifies the circular contigs to be removed
> assembly_clean.fasta seqkit writes to the terminal (stdout) by default, so this last bit redirects the output into a new file
More info here: https://bioinf.shenwei.me/seqkit/
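As for "is there another way": if you can't install seqkit, the same filtering can be done with a few lines of plain Python. This is only a minimal sketch, not how seqkit does it internally. The function name filter_fasta and its by_full_name switch are made up for illustration (the switch loosely mimics using or omitting -n), and it assumes a plain uncompressed FASTA file and a list with one header per line.

```python
def filter_fasta(fasta_path, names_path, out_path, by_full_name=True):
    """Copy every record whose header is NOT in names_path to out_path.

    Hypothetical helper for illustration. Returns (kept, removed) record counts.
    """
    with open(names_path) as fh:
        # One header per line; skip blank lines and drop an optional leading '>'
        unwanted = {line.strip().lstrip(">") for line in fh if line.strip()}

    kept = removed = 0
    keep = False  # whether the record currently being read should be written
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line[1:].strip()
                # Compare the full header, or only the ID before the first whitespace
                key = header if by_full_name else (header.split() or [""])[0]
                keep = key not in unwanted
                if keep:
                    kept += 1
                else:
                    removed += 1
            if keep:
                dst.write(line)
    return kept, removed
```

Calling filter_fasta("assembly.fasta", "circular_contigs.txt", "assembly_clean.fasta") would write the cleaned assembly and return how many contigs were kept and removed. Check that the removed count equals the number of names in your list; if it is lower, some names did not match a header exactly.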
Hope that helps