Question

How to extract pattern matching sequences from a fasta file?

1

Entering edit mode

5.2 years ago

marlenejensen ▴ 20

Hi :) I have a huge fasta file containing sequences from several organisms (endosymbiosis) and I want to extract the sequences for each organism respectively. Since my scripting skills are very, very basic I was wondering if there is way to do this in my ubuntu command line without a script?

Of course I tried "grep" but this only gives me the headers/accesion numbers without the sequences....

Thanls in advance for your help!

sequencing • 5.1k views

ADD COMMENT • link 5.2 years ago by marlenejensen ▴ 20

0

Entering edit mode

Please post example input and expected output marlenejensen

ADD REPLY • link 5.2 years ago by cpad0112 21k

0

Entering edit mode

Finally fixed it. Thank your for your support guys!

ADD REPLY • link 5.2 years ago by marlenejensen ▴ 20

0

Entering edit mode

Hello marlenejensen,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

Please do the same for your previous posts as well.

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

score 2 · Answer 1 · 2019-02-09

2

Entering edit mode

5.2 years ago

WouterDeCoster 47k

Assuming you have a linear fasta file (without folded sequences on multiple lines) you can use grep with the -A 1 option, to give you the matching line and the one after the match. Other useful options would be -f, -F and -w.

ADD COMMENT • link 5.2 years ago by WouterDeCoster 47k

0

Entering edit mode

It has folded sequences with multiple lines....is it possible to take that into account?

ADD REPLY • link 5.2 years ago by marlenejensen ▴ 20

0

Entering edit mode

You can linearise the fasta sequences with the following command:

$ while read line ; do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line ; else  echo $line | tr -d '\n' ; fi ; done < seqs.fasta | tail -n+2 > linear.fasta

Then follow Wouter's suggestion using the new linear file.

ADD REPLY • link 5.2 years ago by Joe 21k

0

Entering edit mode

I was using this suggestion and it was going to take hours. Instead I used seqkit:

seqkit seq -w 0 species.fa > species_linear.fa

And it took a few seconds.

ADD REPLY • link 4.0 years ago by iwt • 0

WouterDeCoster · Answer 2 · 2019-02-09

1

Entering edit mode

5.2 years ago

finswimmer 16k

samtools faidx seq.fa headerid will give you the sequence and the header for the given ID.

The first time you are using this command a index for random access is created. So it can take some time. After that every access to a sequence will be fast.

ADD COMMENT • link updated 5.2 years ago by WouterDeCoster 47k • written 5.2 years ago by finswimmer 16k

0

Entering edit mode

Thank you! I tried this bit there is a different line length in one of the sequences...is there an option to ignore that case?

ADD REPLY • link 5.2 years ago by marlenejensen ▴ 20

2

Entering edit mode

I'm not sure if I understood your question correctly. samtools faidx give you the sequence with the given ID in one fasta file containing multiple sequences.

So what is your question/problem about sequence length?

fin swimmer

ADD REPLY • link 5.2 years ago by finswimmer 16k

0

Entering edit mode

Yes I tried to run the command but then I got the errror that there is one sequence with a differnet line length which is why it cannot work. So I was thinking that I maybe need to normalize my file?

ADD REPLY • link 5.2 years ago by marlenejensen ▴ 20

0

Entering edit mode

Please add the command you used and the exact error / warning you got.

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

Entering edit mode

samtools faidx OlaviusV10.fasta Delta1
[E::fai_build_core] Different line length in sequence 'Symbionts_6frame_scaffold_2458_1'
Could not load fai index of OlaviusV10.fasta

ADD REPLY • link updated 5.2 years ago by finswimmer 16k • written 5.2 years ago by marlenejensen ▴ 20

0

Entering edit mode

Have you tried first to linearize your fasta file?

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Linearize the fasta file like Wouter suggested is one solution. The other is to use reformat.sh from bbtools to reiapr your fasta file.

$ reformat.sh in=input.fasta out=output.fasta

Now try samtools faidxwith the output.fasta.

fin swimmer

ADD REPLY • link 5.2 years ago by finswimmer 16k

0

Entering edit mode

When I tried to use that I got this error.

Error: Could not find or load main class Lab.Data.current. Caused by: java.lang.ClassNotFoundException: Lab.Data.current.

I am super sorry for all the confusion...I just started with all this and I am still struggling a lot. I think I forgot to say that I work with a windows system/ubuntu.

What do I need to do to get this script to run? Thanks ind avance for your help and your patience...

ADD REPLY • link 5.2 years ago by marlenejensen ▴ 20

1

Entering edit mode

For linearizing fasta file, you can use seqkit and run following command:

seqkit seq <yourfile.fa> -w 0 -o <output.fa>

ADD REPLY • link 5.2 years ago by cpad0112 21k