I have a task that I'm sure has been done before, but I can't find a simple solution. Given a fasta file with multiple sequences per species/individual name, I want to randomly sample one sequence per species/individual. The number of sequences varies between species/individuals.
I've found a few similar posts (https://www.biostars.org/p/18831/), but they randomly sample a subset of the whole file based on a given percentage. I think I want to do something similar to the post "Help with randomly sampling from a fasta file", but that one was never fully answered.
Here's my example:
>SpeciesA.seq0
CCACTTTA......
>SpeciesA.seq1
CCTCTTTA......
>SpeciesA.seq2
CCGCTTTA......
>SpeciesA.seq3
CCACTTTA......
>SpeciesB.seq0
GCCCTTTA......
>SpeciesB.seq1
GCCCTTTA.....
>SpeciesB.seq2
ACCCTTTA.....
>SpeciesB.seq3
GCCCTTTA.....
>SpeciesC.seq0
GCCCTTTA.....
>SpeciesC.seq1
GCCCTTTA.....
I want to randomly select one sequence per species/individual so the output will look like this:
>SpeciesA.seq3
CCACTTTA......
>SpeciesB.seq1
GCCCTTTA.....
>SpeciesC.seq1
GCCCTTTA.....
Ideally, the resulting sequences will be concatenated/pasted into a new file with the same name as the starting input fasta file (which corresponds to a gene name). I only have limited bash experience so help would be greatly appreciated!
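To make the request concrete, here is a rough Python sketch of the behaviour I'm after. It is not a tested solution. It assumes the species/individual name is everything before the first "." in the header (as in my example), and the function names, the `seed` parameter and the separate output directory are all made up for illustration.

```python
import os
import random
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_one_per_species(records, rng=random):
    """Group records by species and pick one record per group at random.

    Assumption: the species name is the header text before the first ".".
    """
    groups = defaultdict(list)
    for header, seq in records:
        species = header.split(".", 1)[0]
        groups[species].append((header, seq))
    # Groups come back in order of first appearance in the input.
    return [rng.choice(group) for group in groups.values()]


def sample_fasta_file(in_path, out_dir, seed=None):
    """Write one random sequence per species to out_dir, keeping the file name."""
    rng = random.Random(seed)
    with open(in_path) as handle:
        picked = sample_one_per_species(read_fasta(handle), rng)
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, os.path.basename(in_path))
    with open(out_path, "w") as out:
        for header, seq in picked:
            out.write(f">{header}\n{seq}\n")
    return out_path
```

Calling something like `sample_fasta_file("geneX.fasta", "sampled")` (both names hypothetical) would then write `sampled/geneX.fasta`, so the output keeps the gene name without overwriting the input. Passing a `seed` would make the random choice repeatable.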