Question: How to extract information from headers of fasta file
1
gravatar for Crystal
4.7 years ago by
Crystal30
United States
Crystal30 wrote:

Hi All,

I know a lot of people asked similar questions before. I want to specify my question.

I have a database with 5000+ sequences. The format of the header for each sequence is

>AAA23421(AI041) fim41, [Escherichia coli]

>AAA23421 is the gene ID and AI041 is the VFID.

I want to extract gene ID in one txt file and VFID in another txt file.

The code I used before is: 

grep "^>" file.fa | cut -c 2-9 > destination_file.txt

grep "^>" file.fa | cut -c 11-16 > destination_file.txt

because i thought all gene ID is the same length.

BUT, i was wrong. So I can't extract right information.

Is there any modification I can do to extract gene ID between ">" and "(" and then extract VFID between "()"?

 

I have another database which I asked yesterday. The format of the headers (before I remove all the space) is

>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)

Is there anyway I can only extract VFG0676 and (VF0142) together to a new txt file? Since some of VFGs do not have their corresponding VFs, so I'd like to extract them in two columns of the same file. PS: the lengths of the headers are definitely not the same. But all the VFG ID are in the front with same length and if they have VF ID, all the VF IDs are in () with same length.

Thank you.

sequence • 4.5k views
ADD COMMENTlink modified 4.7 years ago by Bharat Iyengar270 • written 4.7 years ago by Crystal30
2
gravatar for Devon Ryan
4.7 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

For the first:

grep "^>" foo.fa | sed 's/[>()]/\t/g' | cut -f 2,3 > destination_file.txt

for the second:

grep "^>" foo.fa | perl -ne '@f=/^>(\w+) .+\((\w+)\)$/;printf "%s\t%s\n", @f' > destination_file.txt

Edit: There are likely nicer ways to do that in perl, it's not a language I particularly enjoy.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Devon Ryan92k

The first one works fantastically!!!!!!!

The second one didn't work. :(

It is really complicated for me, and I can only copy everything you typed and can't figure out where the problem is.

ADD REPLYlink written 4.7 years ago by Crystal30
1

For the second case, the following might work

grep "^>" foo.fa | perl -ne '@f=/^>(\w+) .+?\(?(\w+)?\)?$/;printf "%s\t%s\n", @f' > destination_file.txt

The various question marks make things optional. This will print both the VFG... and VF... part if they're both there and use print nothing in place of the VF... portion if it's not there. That's at least true of the example I tested locally. There are likely edge cases I've not considered, so scroll through the results to ensure they look essentially correct.

ADD REPLYlink written 4.7 years ago by Devon Ryan92k

It was weird. My output is 4KB, but the inside is blank. I can even scroll down in the txt file, but just couldn't see anything there. Are all the space in the code necessary or did I missed anything?

Thank you.

ADD REPLYlink written 4.7 years ago by Crystal30
1

Almost every space in there is needed. If you get a bunch of blank lines (cat -A destination_file.txt | head to see line endings and such) then either you copied something over incorrectly or there's some kind of difference between either what perl is doing on each of our systems or my assumptions about the input file's format. You could always post the output of grep "^>" foo.fa > header_lines.txt somewhere and then I or someone else can look to see if this is a silly mistake on my end or what.
 

ADD REPLYlink written 4.7 years ago by Devon Ryan92k

Thank you so much for your patience.

I tried the code twice (copied the code very carefully), and got the same results (all blanks)

I used cat -A destination_file.txt | head, and it showed a list of ^I$.

I never tried perl on linux terminal, is this the problem?

Also, how I can attach my files here? I've had the header_lines file ready.

Thank you.

ADD REPLYlink written 4.7 years ago by Crystal30

You'll have to upload that file elsewhere, biostars doesn't allow attachments. If the file isn't too big then pastebin would probably work. There's also always, dropbox, google drive, etc.

ADD REPLYlink written 4.7 years ago by Devon Ryan92k

I just added a link to the original post. It is the file of headers for the second database.

I really want to solve the problem. Otherwise, i have to manually extract the information during the weekend.

ADD REPLYlink written 4.7 years ago by Crystal30
1

Oh, well the example in your post didn't match what was in the file, where all spaces were replaced with underscores. That's why nothing was working. Anyway, you can just download the results here. For future reference, the appropriate perl part of the command was:

perl -ne '@f=/^>(\w+?)_.+?\(?(\w+)?\)?$/;printf "%s\t%s\n", @f'
ADD REPLYlink written 4.7 years ago by Devon Ryan92k

I'm so sorry for this confusion. I asked the question on how to remove space in headers two days ago and the code just helped me replace my original file.

I should try your code on my original file.

But, THANK YOU SO MUCH FOR HELPING ME!!

ADD REPLYlink written 4.7 years ago by Crystal30

Ah, I'd missed that the VF... bit was optional in the second case (just reread your post). I'll have to think about that.

ADD REPLYlink written 4.7 years ago by Devon Ryan92k

Hi,

I have another question, is it possible to extract the both VFG and VF together for the first database.

>AAA23421(AI041) fim41, [Escherichia coli]

Extract "AAA23421(AI041)" together from the header?

Thank you.

ADD REPLYlink written 4.7 years ago by Crystal30
grep "^>" some_file | sed -e "s/>//g;s/ /\t/g" | cut -f 1

grep finds all lines starting with >. Sed then replaces all > with nothing and then all spaces with tabs. cut then displays only the first column. Note that I didn't test that, but it should more or less work.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Devon Ryan92k

I followed the code, but the output seemed like remove all space and extract all the information from the header.

This is the format looks like:

AAA23421(AI041)fim41,[Escherichiacoli];

Basically, what I need is all the information before the first column, there is a space between fim41 and ")".

I have attached the link here, would you like to check it, please. https://www.dropbox.com/s/i3v41pigc0db9pa/R3_headers.txt?dl=0&s=sl

BTW, just a stupid question, what is t/g mean in the code?

Thanks

Crystal

ADD REPLYlink written 4.7 years ago by Crystal30
1

It works on my computer:

AAA23421(AI041)
AAA23584(AI036)
AAA23585(AI036)
AAA23586(AI036)
AAA23598(AI036)
AAA23661(AI016)
AAA23662(AI016)
AAA23663(AI016)
AAA23780(AI036)
AAA57451(AI007)

\t is a tab. s/>//g means search (s) for > and replace it with nothing (what's between the 2nd and third /). The results of that are then put through s/ /\t/g, which means search (s) for a space ( ) and replace it with a tab (\t).

BTW, the g part of all of that just means "globally", which I guess isn't needed here. This is standard regex syntax. Regex is very confusing to learn, but extremely useful.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Devon Ryan92k

I know what the problem is!

I ran your code on my local computer terminal and it didn't work.

Then I ran yours on the server and it worked well! The same as the other code I modified!

So all problems have been solved. :)

ADD REPLYlink written 4.7 years ago by Crystal30
1

BTW, here are the results.

ADD REPLYlink written 4.7 years ago by Devon Ryan92k

Thank you so so so much, Devon!!!!!!!!!!!!

Yes, your code and the other one made sense to me and I tried to study from them.

But I don't know why both didn't work on my Mac terminal. Is this the system you used?

Again, I really appreciate your help these days.

Have a nice day!

ADD REPLYlink written 4.7 years ago by Crystal30

Hi Devon,

Is there anyway that I can extract sequence with certain lengths (< 100 bp) from the fasta file?

I have some short sequences in my database and I want to find them out. 

Thanks

Crystal

ADD REPLYlink written 4.6 years ago by Crystal30

Sure, you can do that with biopython easily enough. Something along the line of:

for record in Seqio.parse(open("something.fa","r"), "fasta") :
    if(len(record.seq) < 100) :
        do stuff
ADD REPLYlink written 4.6 years ago by Devon Ryan92k

Hi Devon,

I have a new file and I need to do similar thing to extract ID from the headers. The ID is like VFG000676(gi:4894323) (lef) anthrax toxin lethal factor precursor [Anthrax toxin (VF0142)] [Bacillus anthracis str. Sterne] Is there a way just extract information of of VF0142 in the third ()? Thanks

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Crystal30

Use something similar with the re module, namely re.match().

ADD REPLYlink written 3.6 years ago by Devon Ryan92k

Do I need to write a python code to run this re.match()?

ADD REPLYlink written 3.6 years ago by Crystal30

Yes

ADD REPLYlink written 3.6 years ago by Devon Ryan92k

if I want to specify I want to extract VF#### in the (). Should my code be m=re.match("()\(VF\wt)",element)?

ADD REPLYlink written 3.6 years ago by Crystal30

m = re.search("(VF[\d]+)", element)

This will find VF followed by 1 or more digits.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Devon Ryan92k

But I also need to specify the VF should be within the (), otherwise, there is a chance to extract VFG#### from the beginning of the header. Also should I import re? I found a example of using re.match, it seems I also need to import re? I haven't had any experience using python. Sorry for the stupid questions.

ADD REPLYlink written 3.6 years ago by Crystal30

VFG### isn't VF followed by one or more integers. Yes, you'll need to import re.

ADD REPLYlink written 3.6 years ago by Devon Ryan92k
1

Great. So here is what I have so far:

> import re 
header=('header.txt'),
 for element in header:
> m=re.search("(VF[\d]+)",element) 


if m: 
 print(m)

How can I have a output.txt?

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Crystal30

Quite close, once you have m:

of = open("output.txt", "w") if m is not None: of.write("{}".format(m.group(0))) of.close()

You'll need to modify header=('header.txt') to header = open('header.txt') as well. Otherwise, you exactly got the idea!

ADD REPLYlink written 3.6 years ago by Devon Ryan92k

Thank you Devon. I tried the code using print, it worked well. However, then i figured out in some IDs, there was no VF###. Instead, they are SS###, AI### or CVF###. Then I tried to modify the code as:

m=re.search("(VF[\d]+)\(SS[\d]+)\(AI[\d]+)\(CVF[\d]+)",element)

But I got the error message as "unbalanced parentheses".

Also for your code about the output, should I separate them into several lines? Or should I just write after I define m:

   of=open("output.txt,"w")
   if m:
   of.write("{}".format(m.group(0))) of.close()

Thank you.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Crystal30

Instead of \), try |(. \ is an escape character, so it says, "Treat the next character as just a normal character and not something special, like an enclosure." So you ended up removing a bunch of the left parentheses.

BTW, you might find this site useful. You can play around with regular expressions and input a test string. I find that to be a more convenient way to use to get things like this working.

ADD REPLYlink written 3.6 years ago by Devon Ryan92k

Thank you so much, Devon.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Crystal30
1
gravatar for Pierre Lindenbaum
4.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

using sed and group expression

grep '^>' input.fa | sed -r 's/>([A-Z0-9]+)[^\(]*\(([A-Z0-9]+)\).*/\1,\2/g' > out.txt

 

ADD COMMENTlink written 4.7 years ago by Pierre Lindenbaum124k
1

Hi Pierre, I suggest a few cosmetic modifications (remove the global flag g, target headers with /^>/):

sed -r '/^>/ s/>([A-Z0-9]+)[^\(]*\(([A-Z0-9]+)\).*/\1,\2/' input.fa > out.txt
ADD REPLYlink written 4.7 years ago by Frédéric Mahé3.0k

Is this code for the first database?

How about the second database, it seems more difficult to grep VFG id from the beginning and VF from the end.

ADD REPLYlink written 4.7 years ago by Crystal30
1
gravatar for Prakki Rama
4.7 years ago by
Prakki Rama2.3k
Singapore
Prakki Rama2.3k wrote:

I want to extract gene ID in one txt file

grep -Po -e ">.*?\(" fileName.fa | sed 's/[>(]//g' >file1.txt

1. Find pattern between > and (

2. replace > and ( with null

3. redirect to a file

 ...and VFID in another txt file.

grep -Po -e "\(.*?\)" fileName.fa | sed 's/[()]//g' >file2.txt

Similar summary like the above.

...can only extract VFG0676 and (VF0142) together to a new txt file?

grep -Po -e "VF.*? |\(VF.*?\)" fileName.fa | sed -e :a -e '$!N;s/\n(//;ta' -e 'P;D' | tr ')' ' ' >file3.txt

1. grep: Extract characters after VF until it hits space OR characters after VF inside the brackets (in this case it hits both, prints in two seperate lines)

2. sed: If line starts with "(", append to previous line

3. tr the ")" into null

4. redirect to a file

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Prakki Rama2.3k

Your explanation really make sense to me, and at least I knew how did these code work.

Thank You.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Crystal30
0
gravatar for Bharat Iyengar
4.7 years ago by
Bombay, India
Bharat Iyengar270 wrote:
awk  -F "(" '/^>/{print substr($1,2) > "GeneId.txt"; x=match($2,/\)/);print substr($2,1,x-1) > "VFID.txt"}' yourfile.fa

To get both the ids in the same file:

awk 'BEGIN{FS="(";OFS="\t"} /^>/{x=match($2,/\)/);print substr($1,2), substr($2,1,x-1)}' yourfile.fa > tab_separated_ids.txt

For your other database (getting both the ids in the same file):

>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)

awk -v OFS="\t"  '/^>/{gsub(/\(|\)/,"",$NF);  print sub(print substr($1,2), $NF}'  database.fa > tab_separated_ids.txt
ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Bharat Iyengar270

It's typically unwise to assume constant length IDs, particularly when they can be separated by variable width descriptors.
 

ADD REPLYlink written 4.7 years ago by Devon Ryan92k
1

yep.. Didn't see that.. Edited

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Bharat Iyengar270
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1758 users visited in the last hour