Question: How to Extract some specific information from a genome annotation file
4.5 years ago by
Vinay Singh50
INDIA, New Delhi, JNU
Vinay Singh50 wrote:


I have a genome annotation file from which I want to extract certain attributes, such as gene and Note. Please suggest a Unix command that writes the desired information to a file. Sample file:

NC_012870.1 RefSeq  CDS 10199186    10199404    .   -   0   ID=cds1181;Parent=rna1181;Dbxref=InterPro:IPR006043,JGIDB:Sorbi1_5048357,Genbank:XP_002466633.1,GeneID:8062961;Name=XP_002466633.1;**Note**=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;**gene**=Sb01g011360;product=hypothetical protein;protein_id=XP_002466633.1
NC_012870.1 RefSeq  CDS 10199487    10199647    .   -   2   ID=cds1181;Parent=rna1181;Dbxref=InterPro:IPR006043,JGIDB:Sorbi1_5048357,Genbank:XP_002466633.1,GeneID:8062961;Name=XP_002466633.1;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360;product=hypothetical protein;protein_id=XP_002466633.1

My file contains many lines like these, and I want to pull out specific attributes such as gene and Note.


Thanks a lot, Mr. Pierre. Could you please point me to some resources for learning these Unix commands? It would be a great help.

written 4.5 years ago by Vinay Singh50

Just google bash learning exercises.

written 4.5 years ago by RamRS30k

I would suggest starting with the basic Unix commands for processing records in files. grep, awk, sed and bioawk would be the first tools to learn for working with the information in your files. Life gets easier with these. Take a look at this link and this for first-hand use.

P.S.: I am not saying you will be able to do everything, but you will have a start, and then people in the community can help you more. It is important for you to learn as well; it will help you in the future. I hope you have already understood the command line that Pierre posted as an answer. Also, this is really a Stack Overflow query that a simple search would answer. I would not really call it a bioinformatics question, though people may differ.

written 4.5 years ago by ivivek_ngs5.0k

I'd recommend against bioawk as an initial piece. Focus on core utils, not sed and awk. cat, cut, tr, wc, sort, uniq and the like. Then, work on pipes and redirects. Move on then to grep, then awk and sed. Practice better regular expressions. Starting with grep will only confuse people unless they have a strong regex background.
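As a rough illustration of that core-utils-first order, here is a minimal sketch using only cut, sort, uniq, wc, a pipe and a redirect. The file name toy.tsv and its contents are made up for the example and are not from the question.

```shell
# Hypothetical three-line, tab-separated toy file (not the asker's data).
printf 'chr1\tRefSeq\tgene\t1\t100\n'  > toy.tsv
printf 'chr1\tRefSeq\tCDS\t10\t50\n'  >> toy.tsv
printf 'chr1\tRefSeq\tCDS\t60\t90\n'  >> toy.tsv

# How many records are there?
wc -l < toy.tsv

# Tally the values in column 3: cut picks the column, sort groups
# identical values together, uniq -c counts each group.
cut -f 3 toy.tsv | sort | uniq -c > type_counts.txt
cat type_counts.txt
```

No regular expressions are needed for any of this, which is the point of starting here before grep, awk and sed.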

written 4.5 years ago by RamRS30k

In fact, I did not go into every detail. Yes, before jumping to sed and awk it is important to learn the core utils. Starting with grep will definitely confuse. I assumed some regex is learned during a Master's, which was my point. However, what Ram has said is a more precise and directed approach.

written 4.5 years ago by ivivek_ngs5.0k
4.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:
cut -f 9 input | tr ";" "\n" | grep -E '^(Note|gene)=' > out.txt
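To make the pipeline easier to follow, here is a small self-contained sketch. It assumes the annotation file is tab-separated with the attributes in column 9, as in GFF files; the file names sample.gff and gene_note.tsv and the shortened attribute strings are illustrative only. The awk step at the end is an optional variation, not part of the command above, for the case where one row per feature is preferred.

```shell
# Build a two-line, tab-separated sample (attributes shortened for readability).
printf 'NC_012870.1\tRefSeq\tCDS\t10199186\t10199404\t.\t-\t0\tID=cds1181;Parent=rna1181;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360\n' > sample.gff
printf 'NC_012870.1\tRefSeq\tCDS\t10199487\t10199647\t.\t-\t2\tID=cds1181;Parent=rna1181;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360\n' >> sample.gff

# cut  -> keep only column 9 (the attribute string)
# tr   -> put each key=value pair on its own line
# grep -> keep the lines that start with Note= or gene=
cut -f 9 sample.gff | tr ';' '\n' | grep -E '^(Note|gene)=' > out.txt

# Optional variation: one row per feature, gene and Note side by side.
awk -F'\t' '{
  gene = ""; note = ""
  n = split($9, attrs, ";")
  for (i = 1; i <= n; i++) {
    if (attrs[i] ~ /^gene=/) gene = substr(attrs[i], 6)   # drop "gene="
    if (attrs[i] ~ /^Note=/) note = substr(attrs[i], 6)   # drop "Note="
  }
  print gene "\t" note
}' sample.gff > gene_note.tsv
```

With the first command, out.txt holds each matching attribute on a separate line, so the link between a gene and its Note is only implied by line order. The awk variation keeps them on the same row.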