FASTA headers contain multiple > signs
3.1 years ago
harry ▴ 30

I have a large FASTA file. As shown below, some of the headers contain an extra > sign, for example:

>exon2_ENST00000218032|>exon2_ENST00000218032
>exon17_ENST00000253024|>exon17_ENST00000253024

I want to remove the extra > sign from these headers, so that they look like this:

>exon2_ENST00000218032|exon2_ENST00000218032
>exon17_ENST00000253024|exon17_ENST00000253024

A proper FASTA header has exactly one > sign, not more. Here is a sample of my file:

>exon2_ENST00000218032|>exon2_ENST00000218032
GAAGCTTTTGGTCTATATTGTTAATTGCCATTGCTGTAAATCTTAAAATGAATGAATAAAAATGTTTCATTTTACAAAAAACATGTTCCTTCAGTCGTCAATGCTGACCTGCATTTTCCTGCTAATATCTGGTTCCTGTGAGTTATGCGC
>exon1_ENST00000218032|>exon1_ENST00000218032
GCTGCTGCAAGTTACGGAATGAAAAATTAGAACAACAGAAACATGGTTTCTCTTCTCGGCCACCTCCTGCATAGAGGGTACCATTCTGC
>exon1_ENST00000218032|exon2_ENST00000218032
AAGCTTTTGGTCTATATTGTTAATTGCCATTGCTGTAAATCTTAAAATGAATGAATAAAAATGTTTCATTTTACAAGTTTCTCTTCTCGGCCACCTCCTGCATAGAGGGTACCATTCTGCGCTGCTGCAAGTTACGGAATGAAAAATTAG
>exon17_ENST00000253024|>exon17_ENST00000253024
TCCCTGGTGGCCCCATCCCCCAGTTCCTCACGATATGGTTTTTACTTCTGTGGATTTAATAAAAACTTCACCAGTTACAAGGCAGACGTGCAGTCCATCATCGGCCTGCAGCGCTTCTTCGAGACGCGCATGAACGAGGCCTTCGGTGAC
>exon16_ENST00000253024|>exon16_ENST00000253024
TACAGCTCCCCACAGGAGTTTGCCCAGGATGTGGGCCGCATGTTCAAGCAATTCAACAAGTTAACTGAGACCAGCCCGGTGGCACCCTGGATCTGACCCTGATCCGTGCCCGCCTCCAGGAGAAGTTGTCACCTCCC

Can anyone tell me how to remove the extra > signs from my FASTA headers? Thanks in advance.

3.1 years ago

Remove every > on the header lines, then insert a new one at the beginning:

sed '/^>/ { s/>//g; s/^/>/ }'  seqs.fasta > result.fasta
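To see what this does, here is a minimal sketch on a tiny made-up sample. The temporary directory, the file names and the shortened sequences are assumptions for illustration only. The sed script is the same as above, with a `;` added before the closing brace, which some non-GNU sed implementations require.

```shell
# Build a small sample file in a temporary directory (hypothetical names).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>exon2_ENST00000218032|>exon2_ENST00000218032' \
  'GAAGCTTTTGG' \
  '>exon1_ENST00000218032|exon2_ENST00000218032' \
  'AAGCTTTTGGT' > "$tmpdir/seqs.fasta"

# On header lines only: delete every '>', then put a single one back at the start.
sed '/^>/ { s/>//g; s/^/>/; }' "$tmpdir/seqs.fasta" > "$tmpdir/result.fasta"

cat "$tmpdir/result.fasta"
```

Headers that already have a single > come out unchanged, and sequence lines are never touched because the address `/^>/` restricts both substitutions to header lines.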

Based on the OP's format, this can be simplified further:

$ sed '/^>/ s/>//2' input.fasta  

Use 2g instead of 2 (GNU sed) in case a header contains more than one extra >.
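The difference between the two flags can be sketched on a made-up header with two extra > characters. The header string and variable names are assumptions, and combining a number with `g` is a GNU sed feature, so the second command may behave differently on other sed implementations.

```shell
# A made-up header with two extra '>' characters.
header='>a|>b|>c'

# Flag 2: remove only the second '>' on the line.
second_only=$(printf '%s\n' "$header" | sed '/^>/ s/>//2')

# Flag 2g (GNU sed): remove the second '>' and every later one.
second_onward=$(printf '%s\n' "$header" | sed '/^>/ s/>//2g')

printf '%s\n%s\n' "$second_only" "$second_onward"
```

With GNU sed this prints `>a|b|>c` for the first command and `>a|b|c` for the second.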

Thanks, it works fine. Could you also help me remove duplicate headers together with their sequences? As you can see in my input, the header ENST00000368129_1|ENST00000368129_3 appears twice. I tried this command to remove the duplicate sequences:

paste -d $'\t' - - <fastaFileWithNoLinebreaksInSeq | sort -t $'\t' -uk1,1 | awk 'BEGIN{FS="\t";OFS="\n"}{print $1,$2}'

But it does not give the right output. The output looks like this:

>_10|_10
AAAGCAAAACACAGAATTGACGGAAAGACTTACGTTATTAAACGTGTTAAATATAATAACGAGTTTGGCATGGATTTTAAAGAAATAGAATTAATTGGCTCAGGTGGATTTGGCCAAGTTTTC
>_10|_11
ATTATGATCCTGAGACCAGTGATGATTCTCTTGAGAGCAGTGATTATGATCCTGAGAACAGCAAAAATAGTTCAAGGTTTGGCATGGATTTTAAAGAAATAGAATTAATTGGCTCAGGTGGATTTGGCCAAGTTTTCAAAGCAAAACACA
>_10|_12
TTTGGAACTCTTTGAACAAATAACAAAAGGGGTGGATTATATACATTCAAAAAAATTAATTCATAGAGATCTTAAGGTTTGGCATGGATTTTAAAGAAATAGAATTAATTGGCTCAGGTGGATTTGGCCAAGTTTTCAAAGCAAAACACA
>_10|_13
ACTTGTAACATCTCTGAAAAATGATGGAAAGCGAACAAGGAGTAAGGGAACTTTGCGATACATGAGCCCAGAACAGGTTTGGCATGGATTTTAAAGAAATAGAATTAATTGGCTCAGGTGGATTTGGCCAAGTTTTCAAAGCAAAACACA
>_10|_14
AACGAATTTCTTCGCAAGACTATGGAAAGGAAGTGGACCTCTACGCTTTGGGGCTAATTCTTGCTGAACTTCTTCATGTTTGGCATGGATTTTAAAGAAATAGAATTAATTGGCTCAGGTGGATTTGGCCAAGTTTTCAAAGCAAAACAC

I don't want to lose any header information, so can you tell me what is wrong with this command? My input looks like this:

>ENST00000368129_1|ENST00000368129_1
GAAAGTCCACAGAGGAGTTTAAAGCAGCCATGCCAAAAGTGCACTTGCACTCTAAGGAAGCTGAGGTGGGGGAGGCGG
>ENST00000368129_1|ENST00000368129_3
CATATTATTTGACCTAAGTGACAACACTGGGAAAATGGAAGTACTGGGGGTTAGAAACGAGGACACAATGAAATGTTGCACTTGCACTCTAAGGAAGCTGAGGTGGGGGAGGCGGGAAAGTCCACAGAGGAGTTTAAAGCAGCCATGCCA
>ENST00000368129_1|ENST00000368129_2
TGAAACCCCGAAGATCAACACGCTTCAAACTCAGCCCCTTGGAACAATTGTGAATGGTTTGTTTGTAGTCCAGAAGTGCACTTGCACTCTAAGGAAGCTGAGGTGGGGGAGGCGGGAAAGTCCACAGAGGAGTTTAAAGCAGCCATGCCA
>ENST00000368129_1|ENST00000368129_3
CATATTATTTGACCTAAGTGACAACACTGGGAAAATGGAAGTACTGGGGGTTAGAAACGAGGACACAATGAAATGTTGCACTTGCACTCTAAGGAAGCTGAGGTGGGGGAGGCGGGAAAGTCCACAGAGGAGTTTAAAGCAGCCATGCCA
>ENST00000368130_6|ENST00000368130_6
TTAGGATAGAATAATTGCTGGATAAACAAATTCAGAATATCAACAGATGATCACAATAAACATCTGTTTCTCATTCAGTTATTAAGGCCAAAAAAAAAACATAGAGAAGTAAAAAGGACCAATTCAAGCCAACTGGTCTAAGCAGCATTT

Thanks in advance for your quick reply.


There is no problem with the command; it does not trim anything. Check whether the headers in your input contain any spaces. Please post this as a separate question, including the headers from fastaFileWithNoLinebreaksInSeq.
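As an alternative to the paste/sort pipeline, duplicate records can also be dropped with awk while keeping the original order. This is only a sketch, not the command discussed above: it assumes that records with identical header lines are true duplicates, and the file names and shortened sequences are made up for illustration.

```shell
# Small sample with one repeated header (hypothetical file names, shortened sequences).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>ENST00000368129_1|ENST00000368129_1' 'GAAAGTCC' \
  '>ENST00000368129_1|ENST00000368129_3' 'CATATTAT' \
  '>ENST00000368129_1|ENST00000368129_2' 'TGAAACCC' \
  '>ENST00000368129_1|ENST00000368129_3' 'CATATTAT' > "$tmpdir/input.fasta"

# On each header line, set keep=1 only the first time that exact header is seen.
# Every line (header or sequence) is printed while keep is true.
awk '/^>/ { keep = !seen[$0]++ } keep' "$tmpdir/input.fasta" > "$tmpdir/dedup.fasta"

cat "$tmpdir/dedup.fasta"
```

Because the whole header line is used as the key, nothing in the header is trimmed, and sequences that span several lines are handled as well since `keep` stays in effect until the next header.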
