Delete character from sequence id
3.1 years ago

Hi, I am trying to delete the NCBI accession numbers from the sequence ids in a fasta file.

Sequence IDs look like:

>Elytraria_mexicana_JQ691768.1


I am trying things like

sed 's/_*.*//' myfile.fasta


or

sed 's/_*.*//g' myfile.fasta


They don't work.

Have any of you done this before?

Thanks for any input,

I would try simply sed 's/_[A-Z].[0-9]*.[0-9]//g' myfile.fasta

You're using . as both a metacharacter and a literal .. Are you sure it will work reliably and the . that is supposed to match the literal . won't end up matching something else?

Yes, I agree, Ram. Here . may match any character. To make it more reliable, we can use \. instead. Thanks.
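
To illustrate the concern about the unescaped dot, here is a small sketch. The header >Sample_A1234_JQ691768.1 is made up for this example and does not come from the original file. sed -E is used as the portable spelling of GNU sed's -r, and the C locale is set so that [A-Z] covers only uppercase ASCII letters.

```shell
# Use the C locale so [A-Z] means exactly the uppercase ASCII letters.
export LC_ALL=C

# Hypothetical header (not from the original file) with an extra field.
header='>Sample_A1234_JQ691768.1'

# Unescaped dots: "." matches any character, so "_A1234" matches as well.
loose=$(printf '%s\n' "$header" | sed 's/_[A-Z].[0-9]*.[0-9]//g')

# Literal dot via a bracket expression: only the accession matches.
strict=$(printf '%s\n' "$header" | sed -E 's/_[A-Z0-9]+[.][0-9]+//g')

printf 'loose:  %s\nstrict: %s\n' "$loose" "$strict"
```

On this input the loose pattern should leave >Sample, while the stricter one should leave >Sample_A1234. Whether that difference matters depends on what the real headers look like.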

Thanks!! It works!!

sed -r 's/_[A-Z0-9]+[.][0-9]+//g' aligned_trnG-trnS.fasta > new_trnG-trnS.fasta

just cut:

 cut -d '_' -f 1,2 in.fasta
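
A quick sketch of what the cut approach does, assuming headers of the form Genus_species_accession. The second record below is hypothetical and shows the caveat: cut keeps only the first two underscore-separated fields, so any extra name field is dropped along with the accession. Lines without an underscore (the sequence lines) pass through unchanged.

```shell
# Small test input; the second header is a made-up example with an extra field.
fasta=$(printf '%s\n' \
  '>Elytraria_mexicana_JQ691768.1' \
  'ACGTACGT' \
  '>Genus_species_subsp_AB123456.1' \
  'TTGACA')

# Keep fields 1 and 2, using "_" as the delimiter.
cut_out=$(printf '%s\n' "$fasta" | cut -d '_' -f 1,2)

printf '%s\n' "$cut_out"
```

This should print >Elytraria_mexicana and >Genus_species as headers, with both sequence lines intact. If some names contain more than one underscore, a pattern that targets the accession itself may be the safer choice.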

Thank you so much!!
This command works:

sed -r 's/_[A-Z0-9]+[.][0-9]+//g' aligned_trnG-trnS.fasta > new_trnG-trnS.fasta


=D

3.1 years ago
Ram

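Your sed pattern doesn't describe the accession. _* means zero or more underscores and .* means zero or more of any character, so _*.* matches the whole line starting from its first character, and the substitution wipes out every line instead of just the accession. You need a pattern that spells out what the accession looks like: an underscore, then letters and digits, a literal dot, and the version number.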
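
For reference, the pattern from the question can be checked directly on the example header. This is only a sketch using the single header shown in the question; since _* may match zero underscores and .* matches any run of characters, the match is expected to start at the first character and cover the whole line.

```shell
# Example header taken from the question.
header='>Elytraria_mexicana_JQ691768.1'

# The pattern from the question: "_*" can match nothing, ".*" matches the rest.
emptied=$(printf '%s\n' "$header" | sed 's/_*.*//')

# Show the result between brackets so an empty string is visible.
printf 'result: [%s]\n' "$emptied"
```

The result should be an empty string, which is why the command cannot be used to trim only the accession.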

Try sed 's/_[A-Z0-9]+[.][0-9]+//g' myfile.fasta

Hi Ram, thank you so much for your suggestion. This command

sed 's/_[A-Z0-9]+[.][0-9]+//g' myfile.fasta


doesn't work. I am now trying something like

sed 's/_+[A-Z]+[A-Z]+[0-9]+[0-9]+[0-9]+[0-9]+[0-9]+[0-9]+[.]+[0-9]//g' myfile.fasta


And it also doesn't work. Could you suggest a sed manual? Many thanks!

Try sed -r instead of plain sed with the first command. Without -r, sed uses basic regular expressions, where + is a literal plus sign rather than "one or more". The second command is unnecessarily verbose.
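
A short sketch of the difference, using the example header from the question. It assumes a POSIX-style sed; -E is the portable equivalent of GNU sed's -r, and \{1,\} is the basic-regex way to write "one or more".

```shell
# Use the C locale so [A-Z] means exactly the uppercase ASCII letters.
export LC_ALL=C

header='>Elytraria_mexicana_JQ691768.1'

# Basic regex: "+" is a literal plus sign, so nothing matches.
bre=$(printf '%s\n' "$header" | sed 's/_[A-Z0-9]+[.][0-9]+//g')

# Extended regex: "+" means "one or more".
ere=$(printf '%s\n' "$header" | sed -E 's/_[A-Z0-9]+[.][0-9]+//g')

# Basic regex with an interval expression instead of "+".
interval=$(printf '%s\n' "$header" | sed 's/_[A-Z0-9]\{1,\}[.][0-9]\{1,\}//g')

printf 'bre:      %s\nere:      %s\ninterval: %s\n' "$bre" "$ere" "$interval"
```

The first command should leave the header untouched, while the other two should both give >Elytraria_mexicana.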