Question: How to extract pattern matching sequences from a fasta file?
1
10 days ago by
marlenejensen10 wrote:

Hi :) I have a huge fasta file containing sequences from several organisms (endosymbiosis) and I want to extract the sequences for each organism separately. Since my scripting skills are very, very basic, I was wondering if there is a way to do this on my Ubuntu command line without a script?

Of course I tried "grep", but this only gives me the headers/accession numbers without the sequences...

Thanks in advance for your help!

sequencing • 245 views
modified 8 days ago • written 10 days ago by marlenejensen10

Please post example input and expected output, marlenejensen.

written 9 days ago by cpad011211k

Finally fixed it. Thank you for your support, guys!

written 8 days ago by marlenejensen10

Hello marlenejensen,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

Please do the same for your previous posts as well.

written 8 days ago by Kevin Blighe37k
2
10 days ago by
Belgium
WouterDeCoster36k wrote:

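Assuming you have a linear fasta file (without sequences folded over multiple lines), you can use grep with the -A 1 option to give you the matching line and the one line after the match. Other useful options would be -f (read patterns from a file), -F (treat patterns as fixed strings rather than regular expressions) and -w (match whole words only).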
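
A minimal sketch of the grep approach, assuming GNU grep (the default on Ubuntu) and a file that is already linear. The function name extract_linear, the file names and the pattern Delta1 are placeholders for illustration, not part of the original answer. Note that -F matches the pattern anywhere in a line, so pick a pattern that cannot occur inside a sequence.

```shell
# Sketch: pull header + sequence pairs out of a linear FASTA
# (one header line followed by exactly one sequence line per record).
# Assumes GNU grep; extract_linear is a hypothetical helper name.
extract_linear() {
    # $1 = fixed string to look for, $2 = linear FASTA file
    # -F: fixed-string match; -A 1: also print the line after each match;
    # --no-group-separator: suppress the "--" lines GNU grep prints
    # between non-adjacent groups of matches.
    grep -F -A 1 --no-group-separator -- "$1" "$2"
}

# Usage (placeholder names):
#   extract_linear "Delta1" linear.fasta > Delta1.fasta
```

Without --no-group-separator the output would contain "--" lines between records, which would have to be filtered out before the result is a valid fasta file.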

written 10 days ago by WouterDeCoster36k

It has folded sequences over multiple lines... Is it possible to take that into account?

written 10 days ago by marlenejensen10

You can linearise the fasta sequences with the following command:

$ { while IFS= read -r line; do if [ "${line:0:1}" == ">" ]; then printf '\n%s\n' "$line"; else printf '%s' "$line"; fi; done < seqs.fasta; echo; } | tail -n +2 > linear.fasta

Then follow Wouter's suggestion using the new linear file.
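
The same linearisation can also be sketched with awk, which tends to be much faster than a shell read loop on a huge file. This is an alternative under stated assumptions, not the command from the reply above: linearize_fasta is a hypothetical function name, and the carriage-return handling is an extra added in case the file was written on Windows.

```shell
# Sketch: join folded sequence lines so that every record becomes
# one header line followed by one sequence line.
# linearize_fasta is a hypothetical helper name.
linearize_fasta() {
    awk '
        { sub(/\r$/, "") }                          # drop a trailing carriage return, if present
        /^>/ { if (n++) printf "\n"; print; next }  # header: end the previous sequence line, then print the header
        { printf "%s", $0 }                         # sequence chunk: append without a newline
        END { if (n) printf "\n" }                  # end the last sequence line
    ' "$1"
}

# Usage (placeholder names):
#   linearize_fasta seqs.fasta > linear.fasta
```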

modified 10 days ago • written 10 days ago by jrj.healey10k
1
10 days ago by
Germany
finswimmer9.9k wrote:

samtools faidx seq.fa headerid will give you the sequence and the header for the given ID.

The first time you use this command, an index for random access is created, so it can take some time. After that, every access to a sequence will be fast.
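
samtools faidx expects exact sequence IDs rather than a pattern. One possible way to bridge that gap, sketched here under assumptions, is to collect the matching IDs with grep first and then hand them to samtools faidx. list_ids is a hypothetical helper name, and the pattern and file names are placeholders.

```shell
# Sketch: print the IDs (first word of the header, without ">") of all
# records whose header line contains a fixed string.
# list_ids is a hypothetical helper name.
list_ids() {
    # $1 = fixed string to look for, $2 = FASTA file
    grep '^>' "$2" |                 # keep header lines only
        grep -F -- "$1" |            # keep headers containing the pattern
        sed -e 's/^>//' -e 's/[[:space:]].*$//'   # strip ">" and any description
}

# Usage (placeholder names; assumes samtools is installed and
# seq.fa can be indexed):
#   list_ids "Delta1" seq.fa | xargs samtools faidx seq.fa > Delta1.fa
```

samtools faidx accepts several IDs in one call, which is why xargs can pass the whole list at once.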

modified 9 days ago by WouterDeCoster36k • written 10 days ago by finswimmer9.9k

Thank you! I tried this, but there is a different line length in one of the sequences... Is there an option to ignore that case?

written 10 days ago by marlenejensen10
2

I'm not sure if I understood your question correctly. samtools faidx gives you the sequence with the given ID from a fasta file containing multiple sequences.

So what is your question/problem about sequence length?

fin swimmer

written 10 days ago by finswimmer9.9k

Yes, I tried to run the command, but then I got the error that one sequence has a different line length, which is why it cannot work. So I was thinking that maybe I need to normalize my file?

written 10 days ago by marlenejensen10

Please add the command you used and the exact error / warning you got.

written 9 days ago by WouterDeCoster36k
samtools faidx OlaviusV10.fasta Delta1
[E::fai_build_core] Different line length in sequence 'Symbionts_6frame_scaffold_2458_1'
Could not load fai index of OlaviusV10.fasta
modified 9 days ago by finswimmer9.9k • written 9 days ago by marlenejensen10

Have you tried linearizing your fasta file first?

written 9 days ago by WouterDeCoster36k

Linearizing the fasta file as Wouter suggested is one solution. Another is to use reformat.sh from BBTools to repair your fasta file.

$ reformat.sh in=input.fasta out=output.fasta

Now try samtools faidx with output.fasta.

fin swimmer

written 9 days ago by finswimmer9.9k

When I tried to use that I got this error.

Error: Could not find or load main class Lab.Data.current. Caused by: java.lang.ClassNotFoundException: Lab.Data.current.

I am super sorry for all the confusion... I just started with all this and I am still struggling a lot. I think I forgot to say that I work with a Windows system/Ubuntu.

What do I need to do to get this script to run? Thanks in advance for your help and your patience...

written 9 days ago by marlenejensen10
1

To linearize a fasta file, you can use seqkit and run the following command:

seqkit seq <yourfile.fa> -w 0 -o <output.fa>
written 9 days ago by cpad011211k