3.4 years ago
newbie
I have a sample.gtf file like below:
chr1 StringTie transcript 10001 10390 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10179 10390 . + . transcript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 10001 10467 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10173 10224 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie exon 10391 10467 . + . transcript_id "MSTRG.6917.3"; gene_id "MSTRG.6917"; exon_number "3";
chr1 StringTie transcript 10001 10467 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10001 10101 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10173 10249 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie exon 10398 10467 . + . transcript_id "MSTRG.6917.2"; gene_id "MSTRG.6917"; exon_number "3";
chr1 StringTie transcript 10005 10467 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10005 10178 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10361 10467 . + . transcript_id "MSTRG.6917.4"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 10011 10467 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
chr1 StringTie exon 10011 10178 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "1";
chr1 StringTie exon 10405 10467 . + . transcript_id "MSTRG.6917.5"; gene_id "MSTRG.6917"; exon_number "2";
chr1 StringTie transcript 57598 58856 . + . transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; gene_name "OR4G11P"; xloc "XLOC_000002"; ref_gene_id "ENSG00000240361.2"; cmp_ref "ENST00000642116.1"; class_code "c"; tss_id "TSS2";
chr1 StringTie exon 57598 57653 . + . transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";
chr1 StringTie exon 58700 58856 . + . transcript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "2";
chr1 StringTie transcript 65419 71585 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; cmp_ref "ENST00000641515.1"; class_code "="; tss_id "TSS3";
chr1 StringTie exon 65419 65433 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "1";
chr1 StringTie exon 65520 65573 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "2";
chr1 StringTie exon 69037 71585 . + . transcript_id "ENST00000641515.1"; gene_id "MSTRG.7563"; exon_number "3";
chr1 StringTie transcript 65572 75288 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; cmp_ref "ENST00000641515.1"; class_code "j"; tss_id "TSS4";
chr1 StringTie exon 65572 65573 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";
chr1 StringTie exon 69037 69093 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "2";
chr1 StringTie exon 74913 75288 . + . transcript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "3";
chr1 StringTie transcript 69055 71585 . + . transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; gene_name "OR4F5"; xloc "XLOC_000003"; ref_gene_id "ENSG00000186092.5"; contained_in "ENST00000641515.1"; cmp_ref "ENST00000641515.1"; class_code "c"; tss_id "TSS5";
chr1 StringTie exon 69055 71585 . + . transcript_id "ENST00000335137.4"; gene_id "MSTRG.7563"; exon_number "1";
chr1 StringTie transcript 83779 84926 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; xloc "XLOC_000004"; class_code "u"; tss_id "TSS6";
chr1 StringTie exon 83779 83829 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "1";
chr1 StringTie exon 83854 84926 . + . transcript_id "MSTRG.7564.1"; gene_id "MSTRG.7564"; exon_number "2";
chr1 StringTie transcript 89710 90455 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; gene_name "AL627309.3"; xloc "XLOC_000005"; cmp_ref "ENST00000495576.1"; class_code "s"; tss_id "TSS7";
chr1 StringTie exon 89710 90050 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "1";
chr1 StringTie exon 90287 90455 . + . transcript_id "MSTRG.7565.1"; gene_id "MSTRG.7565"; exon_number "2";
I tried extracting the transcripts and their exons with class_code "u" like below:
awk -F "\t" '/class_code "u"/ {print $0}' sample.gtf > new_filename.gtf
The above awk command returned only the transcript lines; their exons are missing from new_filename.gtf. I actually want to extract transcripts with multiple class_codes, together with their exons. How can I use awk for that?
I need the transcripts with class_codes u, s, and j, along with their exons.
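One possible approach, as a rough sketch rather than a definitive answer: in the sample above only the transcript lines carry a class_code attribute, so a pattern on class_code can never match the exon lines. A two-pass awk run can work around this by first collecting the transcript_id of every transcript whose class_code is wanted, then printing every line (transcript or exon) with one of those IDs. This assumes the real file is tab-delimited with the attributes in column 9, as in standard GTF. The names demo.gtf, demo_usj.gtf, the codes variable, and the attr helper are made up for illustration.

```shell
# Build a tiny tab-delimited demo file (hypothetical name) so the sketch runs on its own.
{
  printf 'chr1\tStringTie\ttranscript\t10001\t10390\t.\t+\t.\ttranscript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; class_code "u";\n'
  printf 'chr1\tStringTie\texon\t10001\t10101\t.\t+\t.\ttranscript_id "MSTRG.6917.1"; gene_id "MSTRG.6917"; exon_number "1";\n'
  printf 'chr1\tStringTie\ttranscript\t57598\t58856\t.\t+\t.\ttranscript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; class_code "c";\n'
  printf 'chr1\tStringTie\texon\t57598\t57653\t.\t+\t.\ttranscript_id "ENST00000642116.1"; gene_id "MSTRG.7562"; exon_number "1";\n'
  printf 'chr1\tStringTie\ttranscript\t65572\t75288\t.\t+\t.\ttranscript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; class_code "j";\n'
  printf 'chr1\tStringTie\texon\t65572\t65573\t.\t+\t.\ttranscript_id "MSTRG.7563.2"; gene_id "MSTRG.7563"; exon_number "1";\n'
} > demo.gtf

# The input file is given twice: pass 1 (NR == FNR) collects IDs, pass 2 prints matching lines.
awk -F '\t' -v codes='u,s,j' '
# Return the quoted value of attribute "key" from the attribute column, or "" if absent.
function attr(s, key,    pat, v) {
    pat = key " \"[^\"]*\""
    if (match(s, pat) == 0) return ""
    v = substr(s, RSTART, RLENGTH)   # e.g. class_code "u"
    sub(/^[^"]*"/, "", v)            # drop everything up to the opening quote
    sub(/"$/, "", v)                 # drop the closing quote
    return v
}
BEGIN {
    # Turn the comma-separated list of wanted class codes into a lookup table.
    n = split(codes, c, ",")
    for (i = 1; i <= n; i++) want[c[i]] = 1
}
NR == FNR {
    # Pass 1: remember the ID of each transcript whose class_code is wanted.
    if ($3 == "transcript" && (attr($9, "class_code") in want))
        keep[attr($9, "transcript_id")] = 1
    next
}
# Pass 2: print any line (transcript or exon) whose transcript_id was remembered.
(attr($9, "transcript_id") in keep)
' demo.gtf demo.gtf > demo_usj.gtf
```

For the real data the call would end in `sample.gtf sample.gtf > new_filename.gtf`. Reading the file twice keeps the logic independent of line order, and exact lookup of the class code (rather than a regex) means codes such as "=" could be added to the list without escaping.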