Remove part of a multi-FASTA header with sed or another tool
9.6 years ago

Hi!

I have something like this:

>barcodelabel=#ITS2_A_B1_VG6RM_00076_01732;size=12594;
TCGTTCTCGGACTTTGGGTACAAGGGGCAGGGCTGGCTGCTTCCGGCAGGCGGCCCCGCCGGCGGCGGGGGCCGCCAGTC
GCCGAGTCCTGGCCGCGGTTGCAAAGGGTGGGGTGGCGCCCGGGGGCGTGACCCATTAATGATCCTTCCGCAGGTTCACC
TACGGAAACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG
>barcodelabel=#ITS2_A_B2_VG6RM_00466_00157;size=9208;
TTAAGTTTTTTCAGACGCTGATTGCAACTGCAAATGGTTTAAATTGTCCAATCGGCGGGCGGACCCGCCGAGGAAACGTA
AGGTACTTAAAAGACATGGGTAAGAGATAGCAGGCAAAGCCTACAACTCTAGGTAATGATCCTTCCGCAGGTTCACCTAC
GGAAACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG
>barcodelabel=#ITS2_A_B1_VG6RM_00284_01321;size=8857;
TTAATTTGTTACTGACGCTGATTGCAATTACAAAAGGTTTATGTTTGTCCTAGTGGTGGGCGAACCCACCAAGGAAACAA
GAAGTACGCAAAAGACAAGGGTGAATAATTCAGCAAGGCTGTAACCCCGAGAGGTTCCAGCCCGCCTTCATATTTGTGTA
ATGATCCCTCCGCAGGTTCACCTACGGAGACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG

and I would like to remove the read ID part (e.g. VG6RM_00284_01321) with sed.

I do not know how to do so, since the part before it varies (B1, B2, etc.).

Thank you

9.6 years ago
iraun 6.2k

If B occurs only once per header (I mean B1, B2, B3 ..., but never something like B1_B2 in the same read), you can try this to remove the read ID:

sed -r 's/(^>.*B[0-9]\_)[^;]*(;.*)/\1\2/' file >out_file

That seems a bit off the mark to me. Shouldn't it be

sed -r 's/(^>.*B[0-9]\_)[^;]*(;.*)/\1\2/' file >out_file

as the aim is to retain everything except the part between B[0-9]_ and ; ?
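To make the idea concrete, here is a minimal sketch of that substitution applied to one header from the question. The function name `strip_read_id` is hypothetical, `-E` is used as the portable spelling of GNU sed's `-r`, and `B[0-9]+_` is an assumption added so that multi-digit tags such as B12_ would also match.

```shell
# Sample header taken from the question.
header='>barcodelabel=#ITS2_A_B1_VG6RM_00076_01732;size=12594;'

# Group 1: from '>' up to and including the barcode tag (B1_, B2_, ...), kept.
# [^;]*  : the read ID, i.e. everything up to the next ';', dropped.
# Group 2: the first ';' and the rest of the header, kept.
# Assumption: 'B[0-9]+_' instead of 'B[0-9]_' to allow multi-digit tags.
strip_read_id() {
    sed -E 's/^(>.*B[0-9]+_)[^;]*(;.*)$/\1\2/'
}

printf '%s\n' "$header" | strip_read_id
# prints: >barcodelabel=#ITS2_A_B1_;size=12594;
```

Sequence lines do not start with `>`, so they pass through unchanged. To process a whole file, something like `strip_read_id < file > out_file` should work.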


Oh, I understood the opposite, maybe you're right RamRS, but I'm not sure yet.


But it is a bit strange to keep everything except the read ID, isn't it?


Maybe, but that's what OP wants to do. No idea why.


Also, you might wanna edit your answer so OP can accept it.


RamRS, you won the battle ;)


Is that the correct way to do it? I mean, the correct answer is yours, not mine, so I guess you should post your comment as an answer?


It's OK. Virtual points are a petty thing to compete for. Plus, you misunderstood the question, so it's not like you did not know how to get there :-)

Also, OP has accepted your answer. High time for you to edit the content!


hahaha, OK, thank you :)


Also, to circumvent your "if there is only one B1/B2" caveat, you can use a minimal (non-greedy) match expression. Over here, a greedy match of the \1 expression serves the purpose, but let's say you want the shortest match of a.*b in "aabb": with the non-greedy expression a.*?b, the match is aab, as opposed to the aabb match for the usual greedy a.*b. The ? makes all the difference! Note that sed's regular expressions have no non-greedy quantifiers, so this needs a tool with Perl-style regexes, such as perl itself.
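The greedy versus non-greedy difference can be checked on the toy string from this comment. This is only a sketch: it assumes perl is installed for the lazy case, and the last line shows a negated character class, a common sed-only workaround that is not part of the original thread.

```shell
# Toy input from the comment above.
input='aabb'

# Greedy: '.*' grabs as much as it can, so the match runs to the last 'b'.
greedy=$(printf '%s\n' "$input" | sed -E 's/a.*b/[&]/')

# Lazy: '.*?' stops at the first 'b'. This needs Perl-style regexes.
lazy=$(printf '%s\n' "$input" | perl -pe 's/a.*?b/[$&]/')

# sed-only workaround: a negated class cannot run past the first 'b'.
negated=$(printf '%s\n' "$input" | sed -E 's/a[^b]*b/[&]/')

printf '%s\n' "greedy:  $greedy" "lazy:    $lazy" "negated: $negated"
# prints:
# greedy:  [aabb]
# lazy:    [aab]b
# negated: [aab]b
```

The brackets mark the matched part. The `[^;]*` in the answer's command plays the same role as the negated class here: it can never run past the first `;`.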


Hi, I ran your command (sed -r 's/(>*).*_B[0-50]_([^;]+).*/\1\2/'g test.fasta > out_test_.fasta), but it actually does the opposite of what I want:

>VG6RM_00466_00157
TTAAGTTTTTTCAGACGCTGATTGCAACTGCAAATGGTTTAAATTGTCCAATCGGCGGGCGGACCCGCCGAGGAAACGTA
AGGTACTTAAAAGACATGGGTAAGAGATAGCAGGCAAAGCCTACAACTCTAGGTAATGATCCTTCCGCAGGTTCACCTAC
GGAAACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG
>VG6RM_00284_01321
TTAATTTGTTACTGACGCTGATTGCAATTACAAAAGGTTTATGTTTGTCCTAGTGGTGGGCGAACCCACCAAGGAAACAA
GAAGTACGCAAAAGACAAGGGTGAATAATTCAGCAAGGCTGTAACCCCGAGAGGTTCCAGCCCGCCTTCATATTTGTGTA
ATGATCCCTCCGCAGGTTCACCTACGGAGACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG

Sorry, it might not have been clear beforehand, but I am looking to get something like this:

>barcodelabel=#ITS2_A_B1_;size=8857;
TTAATTTGTTACTGACGCTGATTGCAATTACAAAAGGTTTATGTTTGTCCTAGTGGTGGGCGAACCCACCAAGGAAACAA
GAAGTACGCAAAAGACAAGGGTGAATAATTCAGCAAGGCTGTAACCCCGAGAGGTTCCAGCCCGCCTTCATATTTGTGTA
ATGATCCCTCCGCAGGTTCACCTACGGAGACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG

Yep, sorry, I understood the opposite. See RamRS's comment; he has written the correct command for your goal.
