Cleaning up .fasta files by removing redundant sequences pre-alignment?
4.4 years ago
Tbr • 0

I have a selection of 16S sequences derived from different species, grouped into several FASTA files according to the genus each sequence came from. I would like to align the sequences in order to probe for conserved regions within each genus. However, I have a few shorter sequences that are contained entirely within longer ones. Is there any software that can clean up these data by removing such redundant sequences before alignment? I currently do not have access to much memory, so removing any extraneous data would be of great benefit.

Any advice would be greatly appreciated!

alignment • 1.3k views
4.4 years ago
Mensur Dlakic ★ 27k

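CD-HIT is designed for exactly this purpose. It clusters sequences that are at or above a chosen identity threshold and keeps only the longest sequence of each cluster as its representative. Since 16S sequences are nucleotide data, use the cd-hit-est program from the CD-HIT package (plain cd-hit is meant for proteins, and -n 5 is a protein word size):

cd-hit-est -i input.fas -o input.99 -c 0.99 -n 10

This collapses every sequence that is at least 99% identical to a longer one. You may need to adjust that threshold (-c), and the word size (-n) has to be compatible with the threshold you choose.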
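If you only want to drop sequences that are exact substrings of a longer one (the strictest case described in the question), here is a minimal Python sketch of that idea. It is not how CD-HIT works internally, and the function names `parse_fasta` and `drop_contained` are made up for illustration. It assumes small per-genus files (the comparison is quadratic) and does not check reverse complements.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_contained(records):
    """Drop sequences that are exact substrings of an already kept sequence."""
    kept = []
    # Process longest first, so anything that could contain a given
    # sequence has already been considered when we reach it.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        query = seq.upper()
        if not any(query in other.upper() for _, other in kept):
            kept.append((header, seq))
    return kept


example = """>long
ACGTACGTTTGACC
>short
GTACGTTT
>other
ACGTTTTT
"""

# Print the surviving records in FASTA format.
for header, seq in drop_contained(parse_fasta(example)):
    print(f">{header}\n{seq}")
```

Here `short` is dropped because it occurs verbatim inside `long`, while `other` is kept because it differs. For anything beyond exact containment (near-identical sequences, large files), the CD-HIT approach above is the better choice.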

Oh yes, that is exactly what I was looking for. I knew it must exist somewhere! Thank you.
