Question

How to extract specific information from a genome annotation file

0

8.1 years ago

Vinay Singh ▴ 50

Hello,

I have a genome annotation file, and I want to extract some information from it, such as gene, Note, etc. Please suggest a Unix command I can use to write the desired information to a file. Sample file:

NC_012870.1 RefSeq  CDS 10199186    10199404    .   -   0   ID=cds1181;Parent=rna1181;Dbxref=InterPro:IPR006043,JGIDB:Sorbi1_5048357,Genbank:XP_002466633.1,GeneID:8062961;Name=XP_002466633.1;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360;product=hypothetical protein;protein_id=XP_002466633.1
NC_012870.1 RefSeq  CDS 10199487    10199647    .   -   2   ID=cds1181;Parent=rna1181;Dbxref=InterPro:IPR006043,JGIDB:Sorbi1_5048357,Genbank:XP_002466633.1,GeneID:8062961;Name=XP_002466633.1;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360;product=hypothetical protein;protein_id=XP_002466633.1

I have a long file like this, and I want to pull out fields such as gene and Note.

unix command perl genome • 3.8k views


0

Thanks a lot, Mr. Pierre. Could you please point me to some resources for learning these Unix commands? It would be a great help.

8.1 years ago by Vinay Singh ▴ 50

1

Just google bash learning exercises.

8.1 years ago by Ram 43k

1

I would suggest starting with the basic Unix commands for processing records in files. grep, awk, sed and bioawk would be the first to learn, so that you can work with all the information in your files. Life gets easier with these. Take a look at this link and this for first-hand use.

P.S.: I am not saying you will be able to do everything, but you will have a start, and then people in the community can help you more. It is important for you to learn as well; it will help you in the future. I hope you already understand the command line that Pierre posted as an answer. Also, this question is really a Stack Overflow query that could be answered with a simple search. I would not really call it a bioinformatics question, though people may beg to differ.

8.1 years ago by ivivek_ngs ★ 5.2k

2

I'd recommend against bioawk as a starting point. Focus on core utils first, not sed and awk: cat, cut, tr, wc, sort, uniq and the like. Then work on pipes and redirects. After that, move on to grep, then awk and sed, and practice writing better regular expressions. Starting with grep will only confuse people unless they have a strong regex background.
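To make that learning order concrete, here is a minimal sketch that uses only core utils, a pipe chain and a redirect on a cut-down version of the sample from the question. The file names sample.gff and keys.txt are hypothetical, the attribute column is shortened for readability, and the sketch assumes the real file is tab-separated, as GFF files normally are.

```shell
# Build a two-record, tab-separated sample (hypothetical file name: sample.gff).
# The attribute column is shortened compared with the sample in the question.
printf 'NC_012870.1\tRefSeq\tCDS\t10199186\t10199404\t.\t-\t0\tID=cds1181;Parent=rna1181;Name=XP_002466633.1;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360;product=hypothetical protein\n' > sample.gff
printf 'NC_012870.1\tRefSeq\tCDS\t10199487\t10199647\t.\t-\t2\tID=cds1181;Parent=rna1181;Name=XP_002466633.1;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360;product=hypothetical protein\n' >> sample.gff

# cut -f 9      : keep the 9th tab-separated column (the attributes)
# tr ';' '\n'   : put each key=value pair on its own line
# cut -d = -f 1 : keep only the key, i.e. the text before the first "="
# sort | uniq -c: count how many times each key occurs
# > keys.txt    : redirect the result into a file
cut -f 9 sample.gff | tr ';' '\n' | cut -d '=' -f 1 | sort | uniq -c > keys.txt

cat keys.txt          # one line per attribute key, with its count
wc -l < keys.txt      # number of distinct attribute keys
```

Listing the keys this way is a quick check of which attributes a file actually contains before trying to extract any of them.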

8.1 years ago by Ram 43k

0

In fact, I did not go into every detail. Yes, before jumping to sed and awk it is important to learn the core utils, and starting with grep will definitely confuse. I assumed that some regex learning is done during a Master's, so that was my point. However, what Ram has said is a more precise and directed way.

8.1 years ago by ivivek_ngs ★ 5.2k

score 4 · Accepted Answer · 2016-04-07

4

8.1 years ago

Pierre Lindenbaum 161k

cut -f 9 input | tr ";" "\n" | grep -E '^(Note|gene)=' > out.txt
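Reading the one-liner left to right: cut -f 9 keeps the attributes column (cut splits on tabs by default), tr puts each key=value pair on its own line, and grep -E keeps only the lines that start with Note= or gene=. Its output has one attribute per line, so the link between a gene and its Note is only implied by line order. If one row per record is preferred, a possible variant (a sketch, not part of the original answer) is shown below. It assumes a tab-separated file, skips comment lines that start with "#", and uses the hypothetical file names sample.gff and gene_note.tsv.

```shell
# Build a two-record, tab-separated sample (hypothetical file name: sample.gff).
# The attribute column is shortened compared with the sample in the question.
printf 'NC_012870.1\tRefSeq\tCDS\t10199186\t10199404\t.\t-\t0\tID=cds1181;Parent=rna1181;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360;product=hypothetical protein\n' > sample.gff
printf 'NC_012870.1\tRefSeq\tCDS\t10199487\t10199647\t.\t-\t2\tID=cds1181;Parent=rna1181;Note=similar to Nucleobase-ascorbate transporter LPE1;gbkey=CDS;gene=Sb01g011360;product=hypothetical protein\n' >> sample.gff

# For every non-comment line, split column 9 on ";" and pick out the
# gene= and Note= values, then print them as gene<TAB>Note.
# sort -u collapses the repeats that arise because each CDS segment of
# the same gene carries identical attributes.
awk -F '\t' '
!/^#/ {
    gene = ""; note = ""
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^gene=/)      gene = substr(attrs[i], 6)   # drop "gene="
        else if (attrs[i] ~ /^Note=/) note = substr(attrs[i], 6)   # drop "Note="
    }
    print gene "\t" note
}' sample.gff | sort -u > gene_note.tsv

cat gene_note.tsv
```

Records that lack a gene or Note attribute come out with an empty field in this sketch, which makes them easy to spot.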
