Closed: How to extract transcripts with specific class_codes from a GTF file
3.4 years ago
newbie ▴ 120

I have a sample.gtf file like the one below:

chr1    StringTie       transcript      10001   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10179   10390   .       +       .       transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10224   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10391   10467   .       +       .       transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10001   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10001   10101   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10173   10249   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       exon    10398   10467   .       +       .       transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1    StringTie       transcript      10005   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10005   10178   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10361   10467   .       +       .       transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      10011   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1    StringTie       exon    10011   10178   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1    StringTie       exon    10405   10467   .       +       .       transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1    StringTie       transcript      57598   58856   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; gene_name "OR4G11P"; xloc "XLOC_000002"; ref_gene_id "ENSG00000240361.2"; cmp_ref "ENST00000642116.1"; class_code "c"; tss_id "TSS2";
chr1    StringTie       exon    57598   57653   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";
chr1    StringTie       exon    58700   58856   .       +       .       transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "2";
chr1    StringTie       transcript      65419   71585   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; cmp_ref "ENST00000641515.1"; class_code "="; tss_id "TSS3";
chr1    StringTie       exon    65419   65433   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    65520   65573   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    69037   71585   .       +       .       transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      65572   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1    StringTie       exon    65572   65573   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       exon    69037   69093   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1    StringTie       exon    74913   75288   .       +       .       transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1    StringTie       transcript      69055   71585   .       +       .       transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; contained_in "ENST00000641515.1"; cmp_ref "ENST00000641515.1"; class_code "c"; tss_id "TSS5";
chr1    StringTie       exon    69055   71585   .       +       .       transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; exon_number "1";
chr1    StringTie       transcript      83779   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1    StringTie       exon    83779   83829   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1    StringTie       exon    83854   84926   .       +       .       transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1    StringTie       transcript      89710   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1    StringTie       exon    89710   90050   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1    StringTie       exon    90287   90455   .       +       .       transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";

I tried to extract the transcripts with class_code "u", together with their exons, like this:

awk -F "\t" '/class_code "u"/ {print $0}' sample.gtf > new_filename.gtf

The awk command above returned only the transcript lines; their exon lines are missing from new_filename.gtf, because the class_code attribute appears only on transcript lines. I actually want to extract transcripts with several class_codes, together with their exons. How can I do that with awk?

I need the transcripts with class_codes u, s and j, along with their exons.
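
One possible approach, offered as a minimal sketch rather than a definitive answer: read the file twice with awk. The first pass collects the transcript_id of every transcript line whose class_code is one of the wanted codes; the second pass prints every line (transcript or exon) that carries one of those IDs. The function name filter_class_codes and the file demo.gtf are hypothetical names used only for illustration, and the sketch assumes the class codes are single characters that are safe inside a regex bracket expression (so not "^", "]" or "-").

```shell
# Hypothetical helper: keep transcripts with the given class codes plus their exons.
#   $1 = wanted class codes written as one string, e.g. "usj" (assumption: single-char codes)
#   $2 = input GTF; it is read twice, so it must be a regular file, not a pipe
filter_class_codes() {
  awk -F '\t' -v codes="$1" '
    BEGIN { pat = "class_code \"[" codes "]\"" }
    # First pass: remember the transcript_id of each matching transcript line.
    NR == FNR {
      if ($3 == "transcript" && $9 ~ pat && match($9, /transcript_id "[^"]+"/))
        keep[substr($9, RSTART, RLENGTH)] = 1
      next
    }
    # Second pass: print any line whose transcript_id was remembered.
    match($9, /transcript_id "[^"]+"/) && (substr($9, RSTART, RLENGTH) in keep)
  ' "$2" "$2"
}

# Small demo input (a trimmed subset of the lines shown in the question).
printf 'chr1\tStringTie\ttranscript\t10001\t10390\t.\t+\t.\ttranscript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; class_code "u";\n'  > demo.gtf
printf 'chr1\tStringTie\texon\t10001\t10101\t.\t+\t.\ttranscript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";\n'      >> demo.gtf
printf 'chr1\tStringTie\ttranscript\t57598\t58856\t.\t+\t.\ttranscript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; class_code "c";\n' >> demo.gtf
printf 'chr1\tStringTie\texon\t57598\t57653\t.\t+\t.\ttranscript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";\n'  >> demo.gtf
printf 'chr1\tStringTie\ttranscript\t65572\t75288\t.\t+\t.\ttranscript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; class_code "j";\n'  >> demo.gtf
printf 'chr1\tStringTie\texon\t65572\t65573\t.\t+\t.\ttranscript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";\n'      >> demo.gtf

# Keep class codes u, s and j; the "c" transcript and its exon are dropped.
filter_class_codes usj demo.gtf
```

On the real data the call would look like `filter_class_codes usj sample.gtf > new_filename.gtf`. Matching on the quoted transcript_id (including the closing quote) avoids confusing IDs that share a prefix, such as MSTRG.6917.1 and MSTRG.6917.10. The two-pass design does not depend on exon lines directly following their transcript line, at the cost of reading the file twice.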

RNA-Seq awk gtf grep • 120 views