26 days ago

Hi,

I have a FASTA file with sequences like the ones below. The sequences come in pairs with similar headers. I want to generate a file containing only the sequences whose header does not contain "shuffled". How can I do that in bash?

>AABR03119176.1/72910-72785
UCCCCCAGAGUCUGGGCUUGGUGCUUUGCAGUGCUGGCGACCUAUUCCCUUUGACGAUCCCUAGGUGGAGAUGGGGCAUGAGGAUCCUCCAGGGGAAUAGCUCACCGCCACUGGGCAACAGGCCUA
>AABR03119176.1/72910-72785-shuffled
CCGCUAGCGUGAUUGGGGACGGGAUCGACCGGUGGCCCGCCGACGCCUCACCUCAUACUCGUAUGUGAUGCCGAGGGCUAGGUAAGAUGGUUGAACGCUCUAGAGUGCCCUCUGAACUUAGCCUCU
>AANN01820944.1/1549-1423
UUUCCCUCAGAAUAGGCUUGUUGCUUUACAGUACUGGUGAUCCAUUCUCUUUGAUGAUCCCcUAGGUGGAGAUGGGGCAUGAGGAUCCUCCAAGGGAAAGACUCAUCAUCACUGGGCAACAGCCUUA
>AANN01820944.1/1549-1423-shuffled
AGGCUCUGACAUAGACUCUUCUUUAGUGGGCGCGCCGACACAUACCUGUcUGAGGAGAUCGAAAUGUGUAGUCCGACAGAACUAAACAAGACUCGUCGGUGCUUAGACUUCUUUCCUGUUUGCGAUU

Try these:

$ sed '/^>/ s/-shuffled$//' test.fa

or

$ awk -F "-shuffled" '{print $1}' test.fa

or

$ awk -v RS=">" -v OFS="\n" 'NR>1 {sub("-shuffled$","",$1); print ">"$1,$2}' test.fa

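Note that these commands only strip the "-shuffled" suffix from the headers; they do not remove any sequences. You will therefore end up with pairs of sequences that have identical headers, which could be a problem in downstream tools.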
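If you want to see which headers would collide after stripping the suffix, something like the following might help. It is only a rough sketch: the file names dup_demo.fa and dup_headers.txt and the toy sequences are made up for illustration, so substitute your own FASTA file.

```shell
# Hypothetical demo input (made-up sequences); replace with your own file.
cat > dup_demo.fa <<'EOF'
>seq1/1-10
ACGUACGUAC
>seq1/1-10-shuffled
CAUGCAUGCA
>seq2/5-14
UUUUCCCCAA
EOF

# Strip the suffix from header lines, keep only the headers,
# then list every header that occurs more than once.
sed '/^>/ s/-shuffled$//' dup_demo.fa | grep '^>' | sort | uniq -d > dup_headers.txt

cat dup_headers.txt
```

An empty dup_headers.txt would mean no two records share a header after the suffix is removed.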

26 days ago
cat <yourFile> | paste - - | grep -v 'shuffled' | sed 's/\t/\n/g' > new_file


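cat your file, put each header and its sequence on one line (paste - -), keep only the lines that do not match 'shuffled' (grep -v), then split the data back into two lines, header + sequence (sed). This assumes every sequence sits on a single line, as in your example.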
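If the sequences were wrapped over several lines, the paste approach would not pair headers and sequences correctly. A possible alternative, offered as a sketch rather than a tested solution for your data, is to let awk decide on each header line whether to keep the record. The file names demo.fa and demo.filtered.fa and the toy sequences are hypothetical.

```shell
# Hypothetical demo input with one wrapped sequence; replace with your own file.
cat > demo.fa <<'EOF'
>seq1/1-10
ACGUACGUAC
GGCC
>seq1/1-10-shuffled
CAUGCAUGCA
CCGG
>seq2/5-14
UUUUCCCCAA
EOF

# On every header line, set keep to 1 unless the header ends in "-shuffled".
# The bare pattern "keep" then prints each line while keep is non-zero.
awk '/^>/ { keep = ($0 !~ /-shuffled$/) } keep' demo.fa > demo.filtered.fa

cat demo.filtered.fa
```

Because the decision is made per header, this works the same whether a sequence takes up one line or many.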

As an additional note: I provided a working solution here, but you could have found this yourself with some searching, as this has been asked and answered a number of times before.

Thanks, but I get an error. How do I solve it?

sed: -e expression #1, char 6: unterminated `s' command

Apologies for that, it was missing a trailing `/`; I have fixed it in the command line above.