3.0 years ago
storm1907
Hello, I have an NCBI reference .gtf file containing annotations for genes, transcripts, protein IDs, etc. I need to extract only the rows whose feature type (third column) is "gene". Can that be done with bedtools intersect, or should I use awk?
The input file looks similar to this:
1 BestRefSeq gene 943678 943679 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943678 943678 0 T T
1 BestRefSeq gene 943682 943683 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943682 943682 0 T T
1 BestRefSeq gene 943686 943687 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943686 943686 0 T T
1 BestRefSeq gene 943692 943693 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943692 943692 0 T T
1 BestRefSeq transcript 924024 924025 . + . gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA"; 1 924024 924024 0 G G
1 BestRefSeq transcript 924310 924311 . + . gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA"; 1 924310 924310 0 G G
1 BestRefSeq transcript 924321 924322 . + . gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA"; 1 924321 924321 0 G G
1 BestRefSeq transcript 924533 924534 . + . gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA"; 1 924533 924533 0 G G
And I need only this part of the GTF file:
1 BestRefSeq gene 943678 943679 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943678 943678 0 T T
1 BestRefSeq gene 943682 943683 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943682 943682 0 T T
1 BestRefSeq gene 943686 943687 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943686 943686 0 T T
1 BestRefSeq gene 943692 943693 . + . gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS"; 1 943692 943692 0 T T
I would appreciate any tips.
Thank you!
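For illustration, the filter described above (keep only rows whose third column is exactly "gene") could be sketched in Python roughly as follows. This is a minimal sketch, not a definitive solution: the function name `filter_feature` and the abbreviated sample rows are hypothetical, and it assumes the file is tab-delimited as GTF normally is, falling back to whitespace splitting if the tabs were lost.

```python
from typing import Iterable, Iterator


def filter_feature(lines: Iterable[str], feature: str = "gene") -> Iterator[str]:
    """Yield GTF lines whose feature type (column 3) equals `feature`."""
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # GTF is tab-delimited; fall back to whitespace if tabs were lost.
        fields = line.split("\t") if "\t" in line else line.split(None, 3)
        # Compare the whole field, so "gene_id" in the attributes never matches.
        if len(fields) >= 3 and fields[2] == feature:
            yield line


# Hypothetical sample rows with the attribute column abbreviated.
sample = [
    '1\tBestRefSeq\tgene\t943678\t943679\t.\t+\t.\tgene_id "SAMD11";\n',
    '1\tBestRefSeq\ttranscript\t924024\t924025\t.\t+\t.\tgene_id "SAMD11";\n',
]
genes = list(filter_feature(sample))
print(len(genes))  # prints 1
```

Matching on the third field rather than searching the whole line matters here, because every row contains the text "gene" inside its attributes (for example `gene_id`). The same comparison can be written in awk as `awk -F'\t' '$3 == "gene"' file.gtf`; bedtools intersect is meant for interval overlaps rather than filtering by feature type.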