Question: grep/awk issue: can't extract sequences from a fasta file
0
gravatar for benoit.kunath
13 months ago by
benoit.kunath10 wrote:

Hey everybody,

I have an issue and I'm completely running out of idea here.

I have a fasta file with mutiple sequences. I have an ID file with IDs corresponding to some of the sequences in the fasta file and I want to extract those sequences.

Typical task, I've done it many times either using grep or awk like:

 grep -A1 -w -f id_file.txt Input.fasta > output.fasta

Thing is this time, It doesn't work. Even though I can find the ID when using ctrl F look for in notepad. When I used grep or awk it doesn't give me anything. I've suspected hidden character or "to the next line" character issue but none of those were the issue.

Does any of you have an idea on what could prevent grep to recognize my sequences and to extract them?

Thank you very much for any help. Best,

Ben

ADD COMMENTlink modified 8 months ago by Biostar ♦♦ 20 • written 13 months ago by benoit.kunath10
2

Since you mention notepad I assume this file has been opened in that program on Windows? I suggest using dos2unix on your server to change the line endings to unix format to see if that fixes the problem.

Generally for the task you are referring to, faSomeRecords utility from Jim Kent (UCSC) works wonders. That tool is described in this recent thread along with several other options: Extract fasta sequences from a file using a list in another file.

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax70k

Sorry, have to go to the dry cleaners.

Paste some sample data for the others. Also mention the system (operating system) and shell that you're using.

ADD REPLYlink modified 13 months ago • written 13 months ago by Kevin Blighe45k

try

 tr -d '\r' < Input.fasta | grep -A1 -w -f <(tr -d '\r' < id_file.txt) > output.fasta
ADD REPLYlink written 13 months ago by Pierre Lindenbaum121k

Hello benoit.kunath,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Additionally, please use appropriate tags. Your question is about "fasta sequence extraction". That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster. Thank you!

ADD REPLYlink modified 13 months ago • written 13 months ago by lakhujanivijay4.2k

Hello,

Thanks everyone for your help. I used notepad++ to change the line endings on both the ID and the fasta file. I've also made sure that my sequences were properly on one line each.

Now, grep gives me a output, but some of the sequences are not complete. It looks like something like that:

>Protein52886 yypfschkkGQSGGPTSVINASAAGVIQEALKQDCITAVYGAAHGIKpfiisartltvtplffrilhhrrpqwkafllltaqyfllyprslllsiphkkhnailssrfrktl
 --

The sequences from notepad after conversion of the ending looks like that:

>Protein33753
RTTAEEIWRDTDGQVDIFVAGVGTGGTISGVGEVLKlvirdlvfcnytlphlrgkmrpsickaykdfydeiedifgmlds
nevntywnylknykfeggmlq
>Protein48634
IGYAMIKDAEEKGIINKDTVIIEPTSGNTGIALAFVAAARyymrfknrnfimirraynmftkaeeiyskfneeniqimip
kkllftllqqvdrllellsneevasnfaiydyisnaemlmvklyilsaepynqkeviletsiaeflvirdlvfcnytlph
lrgkmrpsickaykdfydeiedifgmldsnevntywnylknykfeggmlq

Any Idea why some of the sequences are not complete? I've had that issue before I think, but it was because my sequences were not "really" on 1 line. But this time I made sure it was using this command line:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' input.fasta | awk '{if($0~/^>/)print $1; else print}' > output.fasta

Thanks a lot for your help. (and thanks for the edit on the post. I'll pay more attention!

ADD REPLYlink modified 13 months ago • written 13 months ago by benoit.kunath10

Please post problematic example input sequences and output sequences and your final code.

ADD REPLYlink written 13 months ago by cpad011211k

I posted that already, it's in my previous post

ADD REPLYlink written 13 months ago by benoit.kunath10

Use an established fasta parsing library or program. They've already ironed out all kinds of quirks of fasta files.

ADD REPLYlink written 13 months ago by Jean-Karim Heriche20k

any example? Thanks

ADD REPLYlink written 13 months ago by benoit.kunath10
1

Perl: Bioperl module Bio::SeqIO
R: function read.fasta() in package seqinr
C: Li Heng's parser
Python: there should be an equivalent to SeqIO in biopython

ADD REPLYlink modified 13 months ago • written 13 months ago by Jean-Karim Heriche20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1709 users visited in the last hour