grep/awk issue: can't extract sequences from a fasta file
0
0
Entering edit mode
3.9 years ago

Hey everybody,

I have an issue and I'm completely running out of idea here.

I have a fasta file with mutiple sequences. I have an ID file with IDs corresponding to some of the sequences in the fasta file and I want to extract those sequences.

Typical task, I've done it many times either using grep or awk like:

 grep -A1 -w -f id_file.txt Input.fasta > output.fasta


Thing is this time, It doesn't work. Even though I can find the ID when using ctrl F look for in notepad. When I used grep or awk it doesn't give me anything. I've suspected hidden character or "to the next line" character issue but none of those were the issue.

Does any of you have an idea on what could prevent grep to recognize my sequences and to extract them?

Thank you very much for any help. Best,

Ben

fasta sequence extraction grep shell • 2.1k views
2
Entering edit mode

Since you mention notepad I assume this file has been opened in that program on Windows? I suggest using dos2unix on your server to change the line endings to unix format to see if that fixes the problem.

Generally for the task you are referring to, faSomeRecords utility from Jim Kent (UCSC) works wonders. That tool is described in this recent thread along with several other options: Extract fasta sequences from a file using a list in another file.

0
Entering edit mode

Sorry, have to go to the dry cleaners.

Paste some sample data for the others. Also mention the system (operating system) and shell that you're using.

0
Entering edit mode

try

 tr -d '\r' < Input.fasta | grep -A1 -w -f <(tr -d '\r' < id_file.txt) > output.fasta

0
Entering edit mode

Hello benoit.kunath,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Additionally, please use appropriate tags. Your question is about "fasta sequence extraction". That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster. Thank you!

0
Entering edit mode

Hello,

Thanks everyone for your help. I used notepad++ to change the line endings on both the ID and the fasta file. I've also made sure that my sequences were properly on one line each.

Now, grep gives me a output, but some of the sequences are not complete. It looks like something like that:

>Protein52886 yypfschkkGQSGGPTSVINASAAGVIQEALKQDCITAVYGAAHGIKpfiisartltvtplffrilhhrrpqwkafllltaqyfllyprslllsiphkkhnailssrfrktl
--


The sequences from notepad after conversion of the ending looks like that:

>Protein33753
RTTAEEIWRDTDGQVDIFVAGVGTGGTISGVGEVLKlvirdlvfcnytlphlrgkmrpsickaykdfydeiedifgmlds
nevntywnylknykfeggmlq
>Protein48634
IGYAMIKDAEEKGIINKDTVIIEPTSGNTGIALAFVAAARyymrfknrnfimirraynmftkaeeiyskfneeniqimip
kkllftllqqvdrllellsneevasnfaiydyisnaemlmvklyilsaepynqkeviletsiaeflvirdlvfcnytlph
lrgkmrpsickaykdfydeiedifgmldsnevntywnylknykfeggmlq


Any Idea why some of the sequences are not complete? I've had that issue before I think, but it was because my sequences were not "really" on 1 line. But this time I made sure it was using this command line:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' input.fasta | awk '{if($0~/^>/)print$1; else print}' > output.fasta


Thanks a lot for your help. (and thanks for the edit on the post. I'll pay more attention!

0
Entering edit mode

Please post problematic example input sequences and output sequences and your final code.

0
Entering edit mode

I posted that already, it's in my previous post

0
Entering edit mode

Use an established fasta parsing library or program. They've already ironed out all kinds of quirks of fasta files.

0
Entering edit mode

any example? Thanks

1
Entering edit mode

Perl: Bioperl module Bio::SeqIO
R: function read.fasta() in package seqinr
C: Li Heng's parser
Python: there should be an equivalent to SeqIO in biopython

1
Entering edit mode