How to extract pattern matching sequences from a fasta file?
2
1
Entering edit mode
5.2 years ago

Hi :) I have a huge fasta file containing sequences from several organisms (endosymbiosis) and I want to extract the sequences for each organism respectively. Since my scripting skills are very, very basic I was wondering if there is way to do this in my ubuntu command line without a script?

Of course I tried "grep" but this only gives me the headers/accesion numbers without the sequences....

Thanls in advance for your help!

sequencing • 5.1k views
ADD COMMENT
0
Entering edit mode

Please post example input and expected output marlenejensen

ADD REPLY
0
Entering edit mode

Finally fixed it. Thank your for your support guys!

ADD REPLY
0
Entering edit mode

Hello marlenejensen,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

Please do the same for your previous posts as well.

ADD REPLY
2
Entering edit mode
5.2 years ago

Assuming you have a linear fasta file (without folded sequences on multiple lines) you can use grep with the -A 1 option, to give you the matching line and the one after the match. Other useful options would be -f, -F and -w.

ADD COMMENT
0
Entering edit mode

It has folded sequences with multiple lines....is it possible to take that into account?

ADD REPLY
0
Entering edit mode

You can linearise the fasta sequences with the following command:

$ while read line ; do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line ; else  echo $line | tr -d '\n' ; fi ; done < seqs.fasta | tail -n+2 > linear.fasta

Then follow Wouter's suggestion using the new linear file.

ADD REPLY
0
Entering edit mode

I was using this suggestion and it was going to take hours. Instead I used seqkit:

seqkit seq -w 0 species.fa > species_linear.fa

And it took a few seconds.

ADD REPLY
1
Entering edit mode
5.2 years ago

samtools faidx seq.fa headerid will give you the sequence and the header for the given ID.

The first time you are using this command a index for random access is created. So it can take some time. After that every access to a sequence will be fast.

ADD COMMENT
0
Entering edit mode

Thank you! I tried this bit there is a different line length in one of the sequences...is there an option to ignore that case?

ADD REPLY
2
Entering edit mode

I'm not sure if I understood your question correctly. samtools faidx give you the sequence with the given ID in one fasta file containing multiple sequences.

So what is your question/problem about sequence length?

fin swimmer

ADD REPLY
0
Entering edit mode

Yes I tried to run the command but then I got the errror that there is one sequence with a differnet line length which is why it cannot work. So I was thinking that I maybe need to normalize my file?

ADD REPLY
0
Entering edit mode

Please add the command you used and the exact error / warning you got.

ADD REPLY
0
Entering edit mode
samtools faidx OlaviusV10.fasta Delta1
[E::fai_build_core] Different line length in sequence 'Symbionts_6frame_scaffold_2458_1'
Could not load fai index of OlaviusV10.fasta
ADD REPLY
0
Entering edit mode

Have you tried first to linearize your fasta file?

ADD REPLY
0
Entering edit mode

Linearize the fasta file like Wouter suggested is one solution. The other is to use reformat.sh from bbtools to reiapr your fasta file.

$ reformat.sh in=input.fasta out=output.fasta

Now try samtools faidxwith the output.fasta.

fin swimmer

ADD REPLY
0
Entering edit mode

When I tried to use that I got this error.

Error: Could not find or load main class Lab.Data.current. Caused by: java.lang.ClassNotFoundException: Lab.Data.current.

I am super sorry for all the confusion...I just started with all this and I am still struggling a lot. I think I forgot to say that I work with a windows system/ubuntu.

What do I need to do to get this script to run? Thanks ind avance for your help and your patience...

ADD REPLY
1
Entering edit mode

For linearizing fasta file, you can use seqkit and run following command:

seqkit seq <yourfile.fa> -w 0 -o <output.fa>
ADD REPLY

Login before adding your answer.

Traffic: 2988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6