Question

How to extract information from headers of fasta file

3

Entering edit mode

9.6 years ago

Crystal ▴ 70

Hi All,

I know a lot of people asked similar questions before. I want to specify my question. I have a database with 5000+ sequences. The format of the header for each sequence is

>AAA23421(AI041) fim41, [Escherichia coli]

>AAA23421 is the gene ID and AI041 is the VFID. I want to extract gene ID in one txt file and VFID in another txt file. The code I used before is:

grep "^>" file.fa | cut -c 2-9 > destination_file.txt
grep "^>" file.fa | cut -c 11-16 > destination_file.txt

because I thought all gene ID is the same length. BUT, I was wrong. So I can't extract right information. Is there any modification I can do to extract gene ID between > and ( and then extract VFID between ()? I have another database which I asked yesterday. The format of the headers (before I remove all the space) is

>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)

Is there anyway I can only extract VFG0676 and (VF0142) together to a new txt file? Since some of VFGs do not have their corresponding VFs, so I'd like to extract them in two columns of the same file. PS: the lengths of the headers are definitely not the same. But all the VFG ID are in the front with same length and if they have VF ID, all the VF IDs are in () with same length.

Thank you

sequence • 10k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Crystal ▴ 70

Ram · Answer 1 · 2015-03-19

2

Entering edit mode

9.6 years ago

Devon Ryan 104k

For the first:

grep "^>" foo.fa | sed 's/[>()]/\t/g' | cut -f 2,3 > destination_file.txt

for the second:

grep "^>" foo.fa | perl -ne '@f=/^>(\w+) .+\((\w+)\)$/;printf "%s\t%s\n", @f' > destination_file.txt

Edit: There are likely nicer ways to do that in perl, it's not a language I particularly enjoy.

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

The first one works fantastically!!!!!!!

The second one didn't work. :(

It is really complicated for me, and I can only copy everything you typed and can't figure out where the problem is.

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

1

Entering edit mode

For the second case, the following might work

grep "^>" foo.fa | perl -ne '@f=/^>(\w+) .+?\(?(\w+)?\)?$/;printf "%s\t%s\n", @f' > destination_file.txt

The various question marks make things optional. This will print both the VFG... and VF... part if they're both there and use print nothing in place of the VF... portion if it's not there. That's at least true of the example I tested locally. There are likely edge cases I've not considered, so scroll through the results to ensure they look essentially correct.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

It was weird. My output is 4KB, but the inside is blank. I can even scroll down in the txt file, but just couldn't see anything there. Are all the space in the code necessary or did I missed anything?

Thank you.

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

1

Entering edit mode

Almost every space in there is needed. If you get a bunch of blank lines (cat -A destination_file.txt | head to see line endings and such) then either you copied something over incorrectly or there's some kind of difference between either what perl is doing on each of our systems or my assumptions about the input file's format. You could always post the output of grep "^>" foo.fa > header_lines.txt somewhere and then I or someone else can look to see if this is a silly mistake on my end or what.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you so much for your patience.

I tried the code twice (copied the code very carefully), and got the same results (all blanks)

I used cat -A destination_file.txt | head, and it showed a list of ^I$.

I never tried perl on linux terminal, is this the problem?

Also, how I can attach my files here? I've had the header_lines file ready.

Thank you.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Crystal ▴ 70

0

Entering edit mode

You'll have to upload that file elsewhere, biostars doesn't allow attachments. If the file isn't too big then pastebin would probably work. There's also always, dropbox, google drive, etc.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

I just added a link to the original post. It is the file of headers for the second database.

I really want to solve the problem. Otherwise, i have to manually extract the information during the weekend.

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

1

Entering edit mode

Oh, well the example in your post didn't match what was in the file, where all spaces were replaced with underscores. That's why nothing was working. Anyway, you can just download the results here. For future reference, the appropriate perl part of the command was:

perl -ne '@f=/^>(\w+?)_.+?\(?(\w+)?\)?$/;printf "%s\t%s\n", @f'

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

I'm so sorry for this confusion. I asked the question on how to remove space in headers two days ago and the code just helped me replace my original file.

I should try your code on my original file.

But, THANK YOU SO MUCH FOR HELPING ME!!

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

0

Entering edit mode

Ah, I'd missed that the VF... bit was optional in the second case (just reread your post). I'll have to think about that.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

Hi,

I have another question, is it possible to extract the both VFG and VF together for the first database.

>AAA23421(AI041) fim41, [Escherichia coli]

Extract "AAA23421(AI041)" together from the header?

Thank you.

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

0

Entering edit mode

grep "^>" some_file | sed -e "s/>//g;s/ /\t/g" | cut -f 1

grep finds all lines starting with >. Sed then replaces all > with nothing and then all spaces with tabs. cut then displays only the first column. Note that I didn't test that, but it should more or less work.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

I followed the code, but the output seemed like remove all space and extract all the information from the header.

This is the format looks like:

AAA23421(AI041) fim41,[Escherichiacoli];

Basically, what I need is all the information before the first column, there is a space between fim41 and ).

I have attached the link here, would you like to check it, please.

BTW, just a stupid question, what is t/g mean in the code?

Thanks
Crystal

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Crystal ▴ 70

1

Entering edit mode

It works on my computer:

AAA23421(AI041)
AAA23584(AI036)
AAA23585(AI036)
AAA23586(AI036)
AAA23598(AI036)
AAA23661(AI016)
AAA23662(AI016)
AAA23663(AI016)
AAA23780(AI036)
AAA57451(AI007)

\t is a tab. s/>//g means search (s) for > and replace it with nothing (what's between the 2nd and third /). The results of that are then put through s/ /\t/g, which means search (s) for a space ( ) and replace it with a tab (\t).

BTW, the g part of all of that just means "globally", which I guess isn't needed here. This is standard regex syntax. Regex is very confusing to learn, but extremely useful.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

I know what the problem is!

I ran your code on my local computer terminal and it didn't work.

Then I ran yours on the server and it worked well! The same as the other code I modified!

So all problems have been solved. :)

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

1

Entering edit mode

BTW, here are the results.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you so so so much, Devon!!!!!!!!!!!!

Yes, your code and the other one made sense to me and I tried to study from them.

But I don't know why both didn't work on my Mac terminal. Is this the system you used?

Again, I really appreciate your help these days.

Have a nice day!

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

0

Entering edit mode

Hi Devon,

Is there anyway that I can extract sequence with certain lengths (< 100 bp) from the fasta file?

I have some short sequences in my database and I want to find them out.

Thanks
Crystal

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Crystal ▴ 70

0

Entering edit mode

Sure, you can do that with biopython easily enough. Something along the line of:

for record in Seqio.parse(open("something.fa","r"), "fasta") :
    if(len(record.seq) < 100) :
        do stuff

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon,

I have a new file and I need to do similar thing to extract ID from the headers. The ID is like VFG000676(gi:4894323) (lef) anthrax toxin lethal factor precursor [Anthrax toxin (VF0142)] [Bacillus anthracis str. Sterne] Is there a way just extract information of of VF0142 in the third ()? Thanks

ADD REPLY • link 8.6 years ago by Crystal ▴ 70

0

Entering edit mode

Use something similar with the re module, namely re.match().

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

Do I need to write a python code to run this re.match()?

ADD REPLY • link 8.6 years ago by Crystal ▴ 70

0

Entering edit mode

Yes

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

if I want to specify I want to extract VF#### in the (). Should my code be m=re.match("()\(VF\wt)",element)?

ADD REPLY • link 8.6 years ago by Crystal ▴ 70

0

Entering edit mode

m = re.search("(VF[\d]+)", element)

This will find VF followed by 1 or more digits.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

But I also need to specify the VF should be within the (), otherwise, there is a chance to extract VFG#### from the beginning of the header. Also should I import re? I found a example of using re.match, it seems I also need to import re? I haven't had any experience using python. Sorry for the stupid questions.

ADD REPLY • link 8.6 years ago by Crystal ▴ 70

0

Entering edit mode

VFG### isn't VF followed by one or more integers. Yes, you'll need to import re.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

1

Entering edit mode

Great. So here is what I have so far:

> import re 
header=('header.txt'),
 for element in header:
> m=re.search("(VF[\d]+)",element) 


if m: 
 print(m)

How can I have a output.txt?

ADD REPLY • link 8.6 years ago by Crystal ▴ 70

0

Entering edit mode

Quite close, once you have m:

of = open("output.txt", "w") if m is not None: of.write("{}".format(m.group(0))) of.close()

You'll need to modify header=('header.txt') to header = open('header.txt') as well. Otherwise, you exactly got the idea!

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you Devon. I tried the code using print, it worked well. However, then i figured out in some IDs, there was no VF###. Instead, they are SS###, AI### or CVF###. Then I tried to modify the code as:

m=re.search("(VF[\d]+)\(SS[\d]+)\(AI[\d]+)\(CVF[\d]+)",element)

But I got the error message as "unbalanced parentheses".

Also for your code about the output, should I separate them into several lines? Or should I just write after I define m:

   of=open("output.txt,"w")
   if m:
   of.write("{}".format(m.group(0))) of.close()

Thank you.

ADD REPLY • link 8.6 years ago by Crystal ▴ 70

0

Entering edit mode

Instead of \), try |(. \ is an escape character, so it says, "Treat the next character as just a normal character and not something special, like an enclosure." So you ended up removing a bunch of the left parentheses.

BTW, you might find this site useful. You can play around with regular expressions and input a test string. I find that to be a more convenient way to use to get things like this working.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you so much, Devon.

ADD REPLY • link 8.6 years ago by Crystal ▴ 70

Ram · Answer 2 · 2015-03-19

2

Entering edit mode

9.6 years ago

Pierre Lindenbaum 164k

Using sed and group expression

grep '^>' input.fa | sed -r 's/>([A-Z0-9]+)[^\(]*\(([A-Z0-9]+)\).*/\1,\2/g' > out.txt

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Hi Pierre, I suggest a few cosmetic modifications (remove the global flag g, target headers with /^>/):

sed -r '/^>/ s/>([A-Z0-9]+)[^\(]*\(([A-Z0-9]+)\).*/\1,\2/' input.fa > out.txt

ADD REPLY • link 9.6 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Is this code for the first database?

How about the second database, it seems more difficult to grep VFG id from the beginning and VF from the end.

ADD REPLY • link 9.6 years ago by Crystal ▴ 70

Ram · Answer 3 · 2015-03-20

I want to extract gene ID in one txt file

grep -Po -e ">.*?\(" fileName.fa | sed 's/[>(]//g' >file1.txt

Find pattern between > and (
Replace > and ( with null
Redirect to a file

...and VFID in another txt file.

grep -Po -e "\(.*?\)" fileName.fa | sed 's/[()]//g' >file2.txt

Similar summary like the above.

...can only extract VFG0676 and (VF0142) together to a new txt file?

grep -Po -e "VF.*? |\(VF.*?\)" fileName.fa | sed -e :a -e '$!N;s/\n(//;ta' -e 'P;D' | tr ')' ' ' >file3.txt

grep: Extract characters after VF until it hits space OR characters after VF inside the brackets (in this case it hits both, prints in two seperate lines)
sed: If line starts with "(", append to previous line
tr the ")" into null
redirect to a file

Ram · Answer 4 · 2015-03-24

0

Entering edit mode

9.6 years ago

Bharat Iyengar ▴ 330

awk  -F "(" '/^>/{print substr($1,2) > "GeneId.txt"; x=match($2,/\)/);print substr($2,1,x-1) > "VFID.txt"}' yourfile.fa

To get both the ids in the same file:

awk 'BEGIN{FS="(";OFS="\t"} /^>/{x=match($2,/\)/);print substr($1,2), substr($2,1,x-1)}' yourfile.fa > tab_separated_ids.txt

For your other database (getting both the ids in the same file):

>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)

awk -v OFS="\t"  '/^>/{gsub(/\(|\)/,"",$NF);  print sub(print substr($1,2), $NF}'  database.fa > tab_separated_ids.txt

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Bharat Iyengar ▴ 330

0

Entering edit mode

It's typically unwise to assume constant length IDs, particularly when they can be separated by variable width descriptors.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.6 years ago by Devon Ryan 104k

1

Entering edit mode

yep.. Didn't see that.. Edited

ADD REPLY • link 9.6 years ago by Bharat Iyengar ▴ 330