Question

grep/awk issue: can't extract sequences from a fasta file

0

Entering edit mode

5.8 years ago

benoit.kunath ▴ 10

Hey everybody,

I have an issue and I'm completely running out of idea here.

I have a fasta file with mutiple sequences. I have an ID file with IDs corresponding to some of the sequences in the fasta file and I want to extract those sequences.

Typical task, I've done it many times either using grep or awk like:

 grep -A1 -w -f id_file.txt Input.fasta > output.fasta

Thing is this time, It doesn't work. Even though I can find the ID when using ctrl F look for in notepad. When I used grep or awk it doesn't give me anything. I've suspected hidden character or "to the next line" character issue but none of those were the issue.

Does any of you have an idea on what could prevent grep to recognize my sequences and to extract them?

Thank you very much for any help. Best,

Ben

fasta sequence extraction grep shell • 2.9k views

ADD COMMENT • link updated 5.4 years ago by Biostar 20 • written 5.8 years ago by benoit.kunath ▴ 10

2

Entering edit mode

Since you mention notepad I assume this file has been opened in that program on Windows? I suggest using dos2unix on your server to change the line endings to unix format to see if that fixes the problem.

Generally for the task you are referring to, faSomeRecords utility from Jim Kent (UCSC) works wonders. That tool is described in this recent thread along with several other options: Extract fasta sequences from a file using a list in another file.

ADD REPLY • link 5.8 years ago by GenoMax 141k

0

Entering edit mode

Sorry, have to go to the dry cleaners.

Paste some sample data for the others. Also mention the system (operating system) and shell that you're using.

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

0

Entering edit mode

try

 tr -d '\r' < Input.fasta | grep -A1 -w -f <(tr -d '\r' < id_file.txt) > output.fasta

ADD REPLY • link 5.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hello benoit.kunath,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Additionally, please use appropriate tags. Your question is about "fasta sequence extraction". That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster. Thank you!

ADD REPLY • link 5.8 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Hello,

Thanks everyone for your help. I used notepad++ to change the line endings on both the ID and the fasta file. I've also made sure that my sequences were properly on one line each.

Now, grep gives me a output, but some of the sequences are not complete. It looks like something like that:

>Protein52886 yypfschkkGQSGGPTSVINASAAGVIQEALKQDCITAVYGAAHGIKpfiisartltvtplffrilhhrrpqwkafllltaqyfllyprslllsiphkkhnailssrfrktl
 --

The sequences from notepad after conversion of the ending looks like that:

>Protein33753
RTTAEEIWRDTDGQVDIFVAGVGTGGTISGVGEVLKlvirdlvfcnytlphlrgkmrpsickaykdfydeiedifgmlds
nevntywnylknykfeggmlq
>Protein48634
IGYAMIKDAEEKGIINKDTVIIEPTSGNTGIALAFVAAARyymrfknrnfimirraynmftkaeeiyskfneeniqimip
kkllftllqqvdrllellsneevasnfaiydyisnaemlmvklyilsaepynqkeviletsiaeflvirdlvfcnytlph
lrgkmrpsickaykdfydeiedifgmldsnevntywnylknykfeggmlq

Any Idea why some of the sequences are not complete? I've had that issue before I think, but it was because my sequences were not "really" on 1 line. But this time I made sure it was using this command line:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' input.fasta | awk '{if($0~/^>/)print $1; else print}' > output.fasta

Thanks a lot for your help. (and thanks for the edit on the post. I'll pay more attention!

ADD REPLY • link 5.8 years ago by benoit.kunath ▴ 10

0

Entering edit mode

Please post problematic example input sequences and output sequences and your final code.

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

I posted that already, it's in my previous post

ADD REPLY • link 5.8 years ago by benoit.kunath ▴ 10

0

Entering edit mode

Use an established fasta parsing library or program. They've already ironed out all kinds of quirks of fasta files.

ADD REPLY • link 5.8 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

any example? Thanks

ADD REPLY • link 5.8 years ago by benoit.kunath ▴ 10

1

Entering edit mode

Perl: Bioperl module Bio::SeqIO
R: function read.fasta() in package seqinr
C: Li Heng's parser
Python: there should be an equivalent to SeqIO in biopython

ADD REPLY • link 5.8 years ago by Jean-Karim Heriche 27k

1

Entering edit mode

Using seqkit

How to extract sequences subset from FASTA/Q file with name/ID list file?

ADD REPLY • link 5.8 years ago by lakhujanivijay 5.8k