Shortening fasta sequence names
2
0
7.0 years ago
=) • 0

Hi all,

I have ridiculously long sequence titles for my sequences in a fasta file and it is preventing me from making a local blast database:

SNP_lower_path_271686|P_1:30_A/C|high|nb_pol_1|left_unitig_length_32|right_unitig_length_2|left_contig_length_32|right_contig_length_2|C1_1|C2_2|C3_0|C4_0|C5_4|C6_4|C7_3|C8_2|C9_1|C10_1|C11_1|C12_0|C13_3|C14_3|C15_1|C16_1|C17_1|C18_0|C19_2|C20_2|C21_1|C22_2|C23_2|C24_1|C25_4|C26_2|C27_1|C28_1|C29_0|C30_0|C31_0|C32_0|C33_1|C34_2|C35_3|C36_3|C37_2|C38_2|C39_0|C40_0|C41_3|C42_3|C43_0|C44_0|C45_0|C46_0|C47_1|C48_1|C49_0|C50_0|C51_0|C52_0|C53_0|C54_0|C55_0|C56_0|C57_0|C58_0|C59_0|C60_0|Q1_71|Q2_72|Q3_0|Q4_0|Q5_64|Q6_68|Q7_72|Q8_73|Q9_73|Q10_73|Q11_71|Q12_0|Q13_69|Q14_72|Q15_73|Q16_71|Q17_73|Q18_0|Q19_69|Q20_69|Q21_73|Q22_72|Q23_72|Q24_73|Q25_72|Q26_71|Q27_73|Q28_65|Q29_0|Q30_0|Q31_0|Q32_0|Q33_71|Q34_73|Q35_68|Q36_70|Q37_73|Q38_72|Q39_0|Q40_0|Q41_69|Q42_72|Q43_0|Q44_0|Q45_0|Q46_0|Q47_73|Q48_73|Q49_0|Q50_0|Q51_0|Q52_0|Q53_0|Q54_0|Q55_0|Q56_0|Q57_0|Q58_0|Q59_0|Q60_0|rank_1

I only want to retain the "SNP_lower_path_271686" part of the title, before the first '|' sign. I've tried several sed commands on here with no luck. Does anyone have a sed command for me to try?

Thank you in advance!

blast • 4.4k views
ADD COMMENT
1

As many people will probably tell you, this is not a "bioinformatics question". Try asking questions like this on http://stackoverflow.com/.

Depending on your favorite programming language you can do many things. In R you would simply do:

a <- "SNP_lower_path_271686|P_1:30_A/C|high|nb_pol_1|left_unitig_length_32|right_unitig_length_2|left_contig_length_32|..."
sapply(strsplit(a, split="\\|"), function(x) x[[1]])
[1] "SNP_lower_path_271686"

and with a Unix command:

a="SNP_lower_path_271686|P_1:30_A/C|high|nb_pol_1|left_unitig_length_32|right_unitig_length_2|left_contig_length_3"
echo "$a" | cut -d "|" -f1
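
To apply the same idea to a whole FASTA file rather than a single string, one option is awk, splitting on "|" only for header lines. A minimal sketch, assuming a hypothetical input file demo.fa with made-up records:

```shell
# Build a tiny demo FASTA file (hypothetical content, for illustration only).
printf '>SNP_lower_path_1|P_1:30_A/C|high\nACGT\n>SNP_upper_path_1|P_1:30_A/C|high\nACCT\n' > demo.fa

# On header lines (starting with ">"), keep only the first "|"-separated field;
# sequence lines are printed unchanged.
awk -F'|' '/^>/ { print $1; next } { print }' demo.fa > demo_short.fa

cat demo_short.fa
```

Limiting the change to lines that start with ">" keeps the sequence lines untouched.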
ADD REPLY
1

> As many people will probably tell you, this is not a "bioinformatics question". Try asking questions like this on http://stackoverflow.com/.

I'm not sure if I agree with that. Fasta and blast seem pretty bioinformatic to me. Many solutions to bioinformatic problems are small shell commands.

ADD REPLY
0

It is duly noted.

Cheers, S

ADD REPLY
1

There are previous threads that offer solutions for requests similar to yours. Here is one example: "Trim The Fasta Title". The solutions may not match your case exactly, but trying out variations helps you learn. You can always ask a question if you are not able to get something to work just right. Searching Biostars before asking a new question is always useful.

ADD REPLY
0

Precisely why I asked!

ADD REPLY
0

But you provided no examples of what sed commands you had tried. Add that information next time around.

ADD REPLY
1

Providing your failed commands lets people help correct them, so you can improve your shell skills.

ADD REPLY
0

Thank you. I consider this a constructive response.

ADD REPLY
2
7.0 years ago

On Linux/Mac OS X, the shell command sed is the simplest way:

# remove the first "|" and everything after it
# (-E enables extended regex in both GNU and BSD sed; -r is GNU-only)
sed -E 's/\|.+//' seqs.fa > newseqs.fa
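
Truncating headers can leave several records with the same ID, which may cause trouble when building a BLAST database. As a rough sketch (the demo file names and records here are hypothetical), the substitution can be limited to header lines and the result checked for duplicates:

```shell
# Hypothetical demo input; in practice this would be your own seqs.fa.
printf '>id1|extra|fields\nACGT\n>id2|more|fields\nGGCC\n' > seqs_demo.fa

# Restrict the substitution to header lines (those starting with ">").
sed -E '/^>/ s/\|.*//' seqs_demo.fa > newseqs_demo.fa

# List any header that now occurs more than once
# (prints nothing if all shortened headers are unique).
grep '^>' newseqs_demo.fa | sort | uniq -d
```

If the last command prints anything, those IDs would need to be made unique again before indexing.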

Since you asked for a sed command, feel free to ignore the method below:

If you need to run this on Windows, the simplest way is to use seqkit. Just download the tarball of executable binary files, decompress it, and run:

# specifying the leading non-"|" characters as sequence ID
# -i means only output the sequence ID not whole FASTA header
seqkit seq -i --id-regexp "^([^\|]+)" seqs.fa > newseqs.fa

ADD COMMENT
0

Thank you. Your sed command worked.

ADD REPLY
1

Better yet, you can "accept" this answer (by using the green check mark) to provide "closure" for this question.

ADD REPLY
0

An "upvote", given by clicking the "thumbs up" icon to the left of the answer, is also appreciated.

ADD REPLY
1

You can have my vote!

ADD REPLY
0
7.0 years ago
Joe 21k

The same answer I gave in this thread will do what you need:

A: fasta seq header

ADD COMMENT
0

NB: you haven't said what platform you're using. I don't think it will matter for this example, but sed on OS X (BSD sed) and on Linux (GNU sed) are slightly different (unless you installed the GNU utils on your Mac), which is something to be wary of.
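
One way to sidestep the GNU/BSD difference for this particular task is to avoid the extended-regex flags altogether. In basic regular expressions a bare | is a literal character, so a sketch like the following should behave the same with either sed (the demo file name and record are hypothetical):

```shell
# Hypothetical one-record demo file.
printf '>seqA|x|y\nTTGA\n' > portable_demo.fa

# Basic regex: "|" is literal here, so no -r/-E flag is needed.
sed '/^>/ s/|.*//' portable_demo.fa
```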

ADD REPLY
