Hi All,
I know a lot of people asked similar questions before. I want to specify my question. I have a database with 5000+ sequences. The format of the header for each sequence is
>AAA23421(AI041) fim41, [Escherichia coli]
>AAA23421
is the gene ID and AI041
is the VFID. I want to extract gene ID in one txt file and VFID in another txt file. The code I used before is:
grep "^>" file.fa | cut -c 2-9 > destination_file.txt
grep "^>" file.fa | cut -c 11-16 > destination_file.txt
because I thought all gene ID is the same length. BUT, I was wrong. So I can't extract right information. Is there any modification I can do to extract gene ID between >
and (
and then extract VFID between ()
? I have another database which I asked yesterday. The format of the headers (before I remove all the space) is
>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)
Is there anyway I can only extract VFG0676 and (VF0142) together to a new txt file? Since some of VFGs do not have their corresponding VFs, so I'd like to extract them in two columns of the same file. PS: the lengths of the headers are definitely not the same. But all the VFG ID are in the front with same length and if they have VF ID, all the VF IDs are in ()
with same length.
Thank you
The first one works fantastically!!!!!!!
The second one didn't work. :(
It is really complicated for me, and I can only copy everything you typed and can't figure out where the problem is.
For the second case, the following might work
The various question marks make things optional. This will print both the VFG... and VF... part if they're both there and use print nothing in place of the VF... portion if it's not there. That's at least true of the example I tested locally. There are likely edge cases I've not considered, so scroll through the results to ensure they look essentially correct.
It was weird. My output is 4KB, but the inside is blank. I can even scroll down in the txt file, but just couldn't see anything there. Are all the space in the code necessary or did I missed anything?
Thank you.
Almost every space in there is needed. If you get a bunch of blank lines (
cat -A destination_file.txt | head
to see line endings and such) then either you copied something over incorrectly or there's some kind of difference between either what perl is doing on each of our systems or my assumptions about the input file's format. You could always post the output ofgrep "^>" foo.fa > header_lines.txt
somewhere and then I or someone else can look to see if this is a silly mistake on my end or what.Thank you so much for your patience.
I tried the code twice (copied the code very carefully), and got the same results (all blanks)
I used
cat -A destination_file.txt | head
, and it showed a list of^I$
.I never tried perl on linux terminal, is this the problem?
Also, how I can attach my files here? I've had the header_lines file ready.
Thank you.
You'll have to upload that file elsewhere, biostars doesn't allow attachments. If the file isn't too big then pastebin would probably work. There's also always, dropbox, google drive, etc.
I just added a link to the original post. It is the file of headers for the second database.
I really want to solve the problem. Otherwise, i have to manually extract the information during the weekend.
Oh, well the example in your post didn't match what was in the file, where all spaces were replaced with underscores. That's why nothing was working. Anyway, you can just download the results here. For future reference, the appropriate perl part of the command was:
I'm so sorry for this confusion. I asked the question on how to remove space in headers two days ago and the code just helped me replace my original file.
I should try your code on my original file.
But, THANK YOU SO MUCH FOR HELPING ME!!
Ah, I'd missed that the VF... bit was optional in the second case (just reread your post). I'll have to think about that.
Hi,
I have another question, is it possible to extract the both VFG and VF together for the first database.
>AAA23421(AI041) fim41, [Escherichia coli]
Extract "AAA23421(AI041)" together from the header?
Thank you.
grep finds all lines starting with
>
. Sed then replaces all>
with nothing and then all spaces with tabs.cut
then displays only the first column. Note that I didn't test that, but it should more or less work.I followed the code, but the output seemed like remove all space and extract all the information from the header.
This is the format looks like:
Basically, what I need is all the information before the first column, there is a space between
fim41
and)
.I have attached the link here, would you like to check it, please.
BTW, just a stupid question, what is t/g mean in the code?
Thanks
Crystal
It works on my computer:
\t
is a tab.s/>//g
means search (s
) for>
and replace it with nothing (what's between the 2nd and third/
). The results of that are then put throughs/ /\t/g
, which means search (s
) for a space (\t
).BTW, the
g
part of all of that just means "globally", which I guess isn't needed here. This is standard regex syntax. Regex is very confusing to learn, but extremely useful.I know what the problem is!
I ran your code on my local computer terminal and it didn't work.
Then I ran yours on the server and it worked well! The same as the other code I modified!
So all problems have been solved. :)
BTW, here are the results.
Thank you so so so much, Devon!!!!!!!!!!!!
Yes, your code and the other one made sense to me and I tried to study from them.
But I don't know why both didn't work on my Mac terminal. Is this the system you used?
Again, I really appreciate your help these days.
Have a nice day!
Hi Devon,
Is there anyway that I can extract sequence with certain lengths (< 100 bp) from the fasta file?
I have some short sequences in my database and I want to find them out.
Thanks
Crystal
Sure, you can do that with biopython easily enough. Something along the line of:
Hi Devon,
I have a new file and I need to do similar thing to extract ID from the headers. The ID is like
VFG000676(gi:4894323) (lef) anthrax toxin lethal factor precursor [Anthrax toxin (VF0142)] [Bacillus anthracis str. Sterne]
Is there a way just extract information of of VF0142 in the third ()? ThanksUse something similar with the
re
module, namelyre.match()
.Do I need to write a python code to run this re.match()?
Yes
if I want to specify I want to extract VF#### in the (). Should my code be
m=re.match("()\(VF\wt)",element)
?m = re.search("(VF[\d]+)", element)
This will find
VF
followed by 1 or more digits.But I also need to specify the VF should be within the (), otherwise, there is a chance to extract VFG#### from the beginning of the header. Also should I import re? I found a example of using re.match, it seems I also need to import re? I haven't had any experience using python. Sorry for the stupid questions.
VFG### isn't VF followed by one or more integers. Yes, you'll need to
import re
.Great. So here is what I have so far:
How can I have a output.txt?
Quite close, once you have
m
:of = open("output.txt", "w") if m is not None: of.write("{}".format(m.group(0))) of.close()
You'll need to modify
header=('header.txt')
toheader = open('header.txt')
as well. Otherwise, you exactly got the idea!Thank you Devon. I tried the code using print, it worked well. However, then i figured out in some IDs, there was no VF###. Instead, they are SS###, AI### or CVF###. Then I tried to modify the code as:
But I got the error message as "unbalanced parentheses".
Also for your code about the output, should I separate them into several lines? Or should I just write after I define m:
Thank you.
Instead of
\)
, try|(
.\
is an escape character, so it says, "Treat the next character as just a normal character and not something special, like an enclosure." So you ended up removing a bunch of the left parentheses.BTW, you might find this site useful. You can play around with regular expressions and input a test string. I find that to be a more convenient way to use to get things like this working.
Thank you so much, Devon.