I know similar questions have been asked many times before, but I want to describe my specific case.
I have a database with 5000+ sequences. The format of the header for each sequence is
>AAA23421(AI041) fim41, [Escherichia coli]
Here AAA23421 is the gene ID and AI041 is the VFID.
I want to extract the gene IDs into one txt file and the VFIDs into another txt file.
The code I used before is:
grep "^>" file.fa | cut -c 2-9 > destination_file.txt
grep "^>" file.fa | cut -c 11-16 > destination_file.txt
because I thought all gene IDs had the same length.
But I was wrong: the lengths vary, so fixed character positions do not extract the right information.
Is there any modification I can make to extract the gene ID between ">" and "(", and then the VFID between "(" and ")"?
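To make the goal concrete, here is a rough sketch of the kind of thing I imagine. It assumes every header has the form >GENEID(VFID) with no other "(" before the VFID. The file names sample.fa, gene_ids.txt and vf_ids.txt are made up for illustration, and the second header is an invented example with a shorter ID.

```shell
# Hypothetical sample input; the real data would be file.fa
printf '%s\n' \
  '>AAA23421(AI041) fim41, [Escherichia coli]' \
  'ATGCATGC' \
  '>AB1(AI7) example, [Some species]' > sample.fa

# Gene ID: everything between ">" and the first "("
# -n plus the trailing "p" prints only lines where the substitution matched,
# so sequence lines and headers without "(" are skipped.
sed -n 's/^>\([^(]*\)(.*/\1/p' sample.fa > gene_ids.txt

# VFID: everything inside the first pair of parentheses
sed -n 's/^>[^(]*(\([^)]*\)).*/\1/p' sample.fa > vf_ids.txt
```

Because the patterns key on the ">" and "(" characters rather than on column positions, the length of the gene ID should no longer matter.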
I also have another database, which I asked about yesterday. The format of the headers (before I removed all the spaces) is
>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)
Is there any way I can extract only VFG0676 and (VF0142) together into a new txt file? Some of the VFGs do not have a corresponding VF, so I'd like to extract them as two columns of the same file. PS: the headers definitely do not all have the same length. However, all the VFG IDs are at the front and have the same length, and where a VF ID exists it is always in parentheses and always the same length.
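As an illustration of the two-column output I have in mind, here is a minimal sketch. It assumes the headers still contain their spaces, so that the VFG ID is the first whitespace-separated field, and that a VF ID always looks like "(VF" followed by digits and ")". The file names sample2.fa and vfg_vf.txt and the second header are invented for the example.

```shell
# Hypothetical sample input: one header with a VF ID, one without
printf '%s\n' \
  '>VFG0676 lef - anthrax toxin lethal factor, lef, [Bacteria Name] (VF0142)' \
  'ATGCATGC' \
  '>VFG0001 abc - example protein, abc, [Bacteria Name]' > sample2.fa

awk '/^>/ {
    vfg = substr($1, 2)               # first field without the leading ">"
    vf = ""                           # stays empty when no VF ID is present
    if (match($0, /\(VF[0-9]+\)/))    # first parenthesised VF ID in the header
        vf = substr($0, RSTART, RLENGTH)
    print vfg "\t" vf                 # tab-separated: VFG ID, then VF ID or nothing
}' sample2.fa > vfg_vf.txt
```

Headers without a VF ID would get an empty second column, so every VFG still appears on its own row.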