bedtools intersect combined with filtering after particular word
3.0 years ago
storm1907 ▴ 30

Hello, I have an NCBI reference .gtf file containing annotations for genes, transcripts, protein IDs, etc. I need to extract only those rows that have the word "gene" in the third (feature type) column. Can that be done with bedtools intersect, or should I use awk?

The input file looks similar to this:

1       BestRefSeq      gene    943678  943679  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943678  943678  0       T       T
1       BestRefSeq      gene    943682  943683  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943682  943682  0       T       T
1       BestRefSeq      gene    943686  943687  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943686  943686  0       T       T
1       BestRefSeq      gene    943692  943693  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943692  943692  0       T       T
1       BestRefSeq      transcript      924024  924025  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924024  924024  0       G       G
1       BestRefSeq      transcript      924310  924311  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924310  924310  0       G       G
1       BestRefSeq      transcript      924321  924322  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924321  924321  0       G       G
1       BestRefSeq      transcript      924533  924534  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924533  924533  0       G       G

And I need only this part of the GTF file:

1       BestRefSeq      gene    943678  943679  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943678  943678  0       T       T
1       BestRefSeq      gene    943682  943683  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943682  943682  0       T       T
1       BestRefSeq      gene    943686  943687  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943686  943686  0       T       T
1       BestRefSeq      gene    943692  943693  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943692  943692  0       T       T

I would appreciate any tips.

Thank you!

bedtools • 719 views
3.0 years ago
Ram 43k

You should be able to use awk:

awk -F"\t" '$3=="gene"' my_file.gtf > my_genes_only_file.gtf
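
To illustrate why the exact match on column 3 matters, here is a minimal sketch on a made-up three-row sample. The file names sample.gtf and sample_genes_only.gtf are hypothetical, and the sketch assumes the real file is tab-separated, as GTF normally is. bedtools intersect compares genomic coordinates, so it is not needed for this kind of text filter. A plain substring search for "gene" would not work either, because every row carries a gene_id attribute.

```shell
# Build a tiny tab-separated sample (hypothetical file name, shortened attributes).
printf '1\tBestRefSeq\tgene\t943678\t943679\t.\t+\t.\tgene_id "SAMD11";\n' > sample.gtf
printf '1\tBestRefSeq\ttranscript\t924024\t924025\t.\t+\t.\tgene_id "SAMD11"; gene "SAMD11";\n' >> sample.gtf
printf '1\tBestRefSeq\tgene\t943682\t943683\t.\t+\t.\tgene_id "SAMD11";\n' >> sample.gtf

# Keep only rows whose third tab-separated field is exactly "gene".
awk -F'\t' '$3 == "gene"' sample.gtf > sample_genes_only.gtf

# For comparison: a substring match also hits the transcript row,
# because "gene" appears in its attribute column.
grep -c 'gene' sample.gtf
```

If the columns in your file turn out to be separated by spaces rather than tabs, dropping the -F option lets awk split on any whitespace; column 3 is still the feature type in that case.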