Question

How to get transcript ID and tpm from the gtf files of ballgown outputs?

0

Entering edit mode

3.4 years ago

newbie ▴ 120

I have used ballgown on 150 samples. And the ballgown outputs are like below:

Sample1
    |_____ e2t.ctab
    |_____ e_data.ctab
    |_____ Sample1.gtf
    |_____ i2t.ctab
    |_____ i_data.ctab
    |_____ t_data.ctab

Sample1.gtf looks:

# stringtie -e -B -p 8 -G /path/stringtie_output/stringtie_merged.gtf -o /path/Sample1.gtf /path/Sample1.sorted.bam
# StringTie version 1.3.3
chr1    StringTie       transcript      10001   10390   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10101   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.1"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10179   10390   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.1"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10001   10465   .       -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.2"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10167   .       -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.2"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10423   10465   .       -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.2"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10001   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10101   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10173   10249   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    10398   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; exon_number "3"; cov "0.0";
chr1    StringTie       transcript      10001   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10101   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10173   10224   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    10391   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; exon_number "3"; cov "0.0";
chr1    StringTie       transcript      10005   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.4"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10005   10178   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.4"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10361   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.4"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10011   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.5"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10011   10178   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.5"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10405   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.5"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10001   10465   1000    -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.1"; cov "0.567742"; FPKM "0.066922"; TPM "0.283503";
chr1    StringTie       exon    10001   10465   1000    -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.1"; exon_number "1"; cov "0.567742";
chr1    StringTie       transcript      11612   14409   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    11612   12697   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    12975   13052   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    13221   14409   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; exon_number "3"; cov "0.0";

I actually want the transcript_id and TPM as two columns for each sample of my 150 sample gtfs. How do I do that with awk or any other way to export the two columns from all files into a single file.

RNA-Seq ballgown hisat2 gtf awk • 1.5k views

ADD COMMENT • link updated 3.4 years ago by Ram 43k • written 3.4 years ago by newbie ▴ 120

score 0 · Answer 1 · 2020-12-05

0

Entering edit mode

3.4 years ago

Ram 43k

For each of the 150 files, assign an awk variable with the filename without the .gtf extension. Use ; as a delimiter and where the first column matches "trascript", print 3 columns, the first being the sample id variable you assigned above, the second being the transcript_id field and the third being the TPM field. Append output across all files to a single combined output file.

You should be able to write your awk command using the above logic.

ADD COMMENT • link 3.4 years ago by Ram 43k

0

Entering edit mode

this is what I don't know how to do. Can you please help me with an example. thanq

I actually tried the below command for one of the file

grep -P "\ttranscript\t"  sample.gtf  | cut -f9 | awk '{gsub("\"","",$0);gsub(";","",$0);print $4,$12}'

And this gave an output like below:

MSTRG.7557.1
ENST00000456328.2 0.000000
ENST00000450305.2 0.000000
MSTRG.7557.4
MSTRG.7558.1
MSTRG.7558.3
MSTRG.7558.4
MSTRG.7558.10
MSTRG.7558.14
MSTRG.7558.15
MSTRG.7558.18
MSTRG.7558.23

ADD REPLY • link 3.4 years ago by newbie ▴ 120

0

Entering edit mode

Explore awk online. You'll need the awk -v option. And always use the -F option to specify IFS - better safe than sorry.

If you use cut first and pipe the output to awk, you'll need to use something like a shell variable to get the sample name from the filename. You need neither grep nor awk. awk can operate on lines that match a pattern, like so: $1 ~ /transcript/ { do_something; }

for f in *.gtf
do
  awk -v SMPL_NAME=$(basename $f} -F ";" ' .... ' $f >>overall_output.txt
done

ADD REPLY • link 3.4 years ago by Ram 43k