Question

Shortening fasta sequence names

0

7.0 years ago

=) • 0

Hi all,

I have ridiculously long sequence titles for the sequences in my fasta file, and they are preventing me from making a local blast database:

SNP_lower_path_271686|P_1:30_A/C|high|nb_pol_1|left_unitig_length_32|right_unitig_length_2|left_contig_length_32|right_contig_length_2|C1_1|C2_2|C3_0|C4_0|C5_4|C6_4|C7_3|C8_2|C9_1|C10_1|C11_1|C12_0|C13_3|C14_3|C15_1|C16_1|C17_1|C18_0|C19_2|C20_2|C21_1|C22_2|C23_2|C24_1|C25_4|C26_2|C27_1|C28_1|C29_0|C30_0|C31_0|C32_0|C33_1|C34_2|C35_3|C36_3|C37_2|C38_2|C39_0|C40_0|C41_3|C42_3|C43_0|C44_0|C45_0|C46_0|C47_1|C48_1|C49_0|C50_0|C51_0|C52_0|C53_0|C54_0|C55_0|C56_0|C57_0|C58_0|C59_0|C60_0|Q1_71|Q2_72|Q3_0|Q4_0|Q5_64|Q6_68|Q7_72|Q8_73|Q9_73|Q10_73|Q11_71|Q12_0|Q13_69|Q14_72|Q15_73|Q16_71|Q17_73|Q18_0|Q19_69|Q20_69|Q21_73|Q22_72|Q23_72|Q24_73|Q25_72|Q26_71|Q27_73|Q28_65|Q29_0|Q30_0|Q31_0|Q32_0|Q33_71|Q34_73|Q35_68|Q36_70|Q37_73|Q38_72|Q39_0|Q40_0|Q41_69|Q42_72|Q43_0|Q44_0|Q45_0|Q46_0|Q47_73|Q48_73|Q49_0|Q50_0|Q51_0|Q52_0|Q53_0|Q54_0|Q55_0|Q56_0|Q57_0|Q58_0|Q59_0|Q60_0|rank_1

I only want to retain the "SNP_lower_path_271686" part of the title, before the first '|' sign. I've tried several sed commands on here with no luck. Does anyone have a sed command for me to try?

Thank you in advance!

blast • 4.4k views

ADD COMMENT • link updated 7.0 years ago by Joe 21k • written 7.0 years ago by =) • 0

1

As many people will probably tell you, this is not a "bioinformatic question". Try to ask those on http://stackoverflow.com/.

Depending on your favorite programming language you can do many things. In R you would simply do:

a <- "SNP_lower_path_271686|P_1:30_A/C|high|nb_pol_1|left_unitig_length_32|right_unitig_length_2|left_contig_length_32|..."
sapply(strsplit(a, split="\\|"), function(x) x[[1]])
[1] "SNP_lower_path_271686"

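and with a Unix command:

a="SNP_lower_path_271686|P_1:30_A/C|high|nb_pol_1|left_unitig_length_32|right_unitig_length_2|left_contig_length_3"
# quote the variable so the shell passes it through unchanged; -f1 keeps the text before the first "|"
echo "$a" | cut -d "|" -f1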
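
The same cut call can be pointed at a whole file rather than a single variable. The following is a minimal sketch, assuming the sequence lines never contain a "|" (cut prints lines without the delimiter unchanged). The file names example.fa and example.short.fa and the two records are made up for illustration.

```shell
# Build a tiny example FASTA file (hypothetical content, for illustration only).
printf '>SNP_lower_path_271686|P_1:30_A/C|high|nb_pol_1\nACGTACGT\n' > example.fa
printf '>SNP_higher_path_271686|P_1:30_A/C|high|nb_pol_1\nACGTCCGT\n' >> example.fa

# Keep only the text before the first "|" on every line.
# Lines without a "|" (the sequences) are printed as they are.
cut -d '|' -f1 example.fa > example.short.fa

# Show the shortened file.
cat example.short.fa
```

Writing to a new file rather than overwriting the input keeps the original headers around in case the extra fields are needed later.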

ADD REPLY • link 7.0 years ago by VHahaut ★ 1.2k

1

> As many people will probably tell you, this is not a "bioinformatic question". Try to ask those on http://stackoverflow.com/.

I'm not sure if I agree with that. Fasta and blast seem pretty bioinformatic to me. Many solutions to bioinformatic problems are small shell commands.

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

It is duly noted.

Cheers, S

ADD REPLY • link 7.0 years ago by =) • 0

1

There are previous threads that offer solutions to problems similar to yours. Here is one example: "Trim The Fasta Title". They may not be identical to your case, but trying out variations helps you learn. You can always ask a question if you are not able to get something to work just right. Searching Biostars before asking a new question is always useful.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

Precisely why I asked!

ADD REPLY • link 7.0 years ago by =) • 0

0

But you provided no examples of what sed commands you had tried. Add that information next time around.

ADD REPLY • link 7.0 years ago by GenoMax 141k

1

Providing your failed commands lets people help you correct them, so you can improve your shell skills.

ADD REPLY • link 7.0 years ago by shenwei356 8.4k

0

Thank you. I consider this a constructive response.

ADD REPLY • link 7.0 years ago by =) • 0

score 2 · Answer 1 · 2017-04-15

2

7.0 years ago

shenwei356 8.4k

On Linux/Mac OS X, the shell command sed is the simplest way:

# by removing "|" and later characters:
sed -r 's/\|.+//' seqs.fa > newseqs.fa
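
A slightly more defensive variant is sketched below. It is not part of the original answer and rests on two assumptions: that -E is the spelling for extended regular expressions accepted by both GNU and BSD sed, and that only header lines (those starting with ">") should be touched. The file names seqs_demo.fa and newseqs_demo.fa are made up for illustration.

```shell
# Build a one-record example file (hypothetical content, for illustration only).
printf '>SNP_lower_path_271686|P_1:30_A/C|high|rank_1\nACGTACGT\n' > seqs_demo.fa

# /^>/ restricts the substitution to header lines.
# \|.* matches the first "|" and everything after it, which is then deleted.
sed -E '/^>/ s/\|.*//' seqs_demo.fa > newseqs_demo.fa

# List any IDs that are no longer unique after truncation.
# Nothing is printed when every shortened ID is unique.
grep '^>' newseqs_demo.fa | sort | uniq -d
```

The duplicate check is worth running before building the database, since shortening headers can make two records share the same ID.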

Since you asked for a sed command, please ignore the method below:

~~If you need to run on Windows, the simplest way is using seqkit, just download the tarball of executable binary files, decompress and immediately run:~~

~~# specifying the leading non-"|" characters as sequence ID~~
~~# -i means only output the sequence ID not whole FASTA header~~
~~seqkit seq -i --id-regexp "^([^\|]+)" seqs.fa > newseqs.fa~~

ADD COMMENT • link 7.0 years ago by shenwei356 8.4k

0

Thank you. Your sed command worked.

ADD REPLY • link 7.0 years ago by =) • 0

1

Better yet, you can "accept" this answer (by using the green check mark) to provide "closure" for this question.

ADD REPLY • link 7.0 years ago by GenoMax 141k

0

An "upvote" (click the "thumbs up" icon to the left of the answer) is appreciated too.

ADD REPLY • link 7.0 years ago by shenwei356 8.4k

1

You can have my vote!

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

score 0 · Answer 2 · 2017-04-15

0

7.0 years ago

Joe 21k

The same answer I gave in this thread will do what you need:

A: fasta seq header

ADD COMMENT • link 7.0 years ago by Joe 21k

0

NB: you haven't said what platform you're using. I don't think it will matter for this example, but the BSD sed that ships with OS X and the GNU sed found on most Linux systems differ slightly (unless you installed the GNU utilities on your Mac), which is something to be wary of.
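
One place where the two flavours commonly differ is in-place editing. Here is a rough sketch, assuming that GNU sed answers --version while BSD sed rejects it, and that attaching a backup suffix directly to -i (as in -i.bak) is accepted by both. The file name inplace_demo.fa is made up for illustration.

```shell
# Guess which sed is installed: GNU sed understands --version, BSD sed does not.
if sed --version >/dev/null 2>&1; then
    sed_flavour="GNU (or GNU-compatible)"
else
    sed_flavour="BSD or other"
fi
echo "sed flavour: $sed_flavour"

# Build a one-record example file (hypothetical content, for illustration only).
printf '>id1|extra|fields\nACGT\n' > inplace_demo.fa

# GNU sed accepts a bare -i, whereas BSD sed expects a backup suffix after -i.
# Writing the suffix attached to the flag works with both and leaves
# the original file behind as inplace_demo.fa.bak.
sed -i.bak -E '/^>/ s/\|.*//' inplace_demo.fa

cat inplace_demo.fa
```

Redirecting to a new file, as in the accepted answer, avoids the -i difference altogether.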

ADD REPLY • link 7.0 years ago by Joe 21k