How to get transcript ID and tpm from the gtf files of ballgown outputs?
1
0
Entering edit mode
3.4 years ago
newbie ▴ 120

I have used ballgown on 150 samples. And the ballgown outputs are like below:

Sample1
    |_____ e2t.ctab
    |_____ e_data.ctab
    |_____ Sample1.gtf
    |_____ i2t.ctab
    |_____ i_data.ctab
    |_____ t_data.ctab

Sample1.gtf looks:

# stringtie -e -B -p 8 -G /path/stringtie_output/stringtie_merged.gtf -o /path/Sample1.gtf /path/Sample1.sorted.bam
# StringTie version 1.3.3
chr1    StringTie       transcript      10001   10390   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10101   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.1"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10179   10390   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.1"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10001   10465   .       -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.2"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10167   .       -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.2"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10423   10465   .       -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.2"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10001   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10101   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10173   10249   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    10398   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.2"; exon_number "3"; cov "0.0";
chr1    StringTie       transcript      10001   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10001   10101   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10173   10224   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    10391   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.3"; exon_number "3"; cov "0.0";
chr1    StringTie       transcript      10005   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.4"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10005   10178   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.4"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10361   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.4"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10011   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.5"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    10011   10178   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.5"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    10405   10467   .       +       .       gene_id "MSTRG.6917"; transcript_id "MSTRG.6917.5"; exon_number "2"; cov "0.0";
chr1    StringTie       transcript      10001   10465   1000    -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.1"; cov "0.567742"; FPKM "0.066922"; TPM "0.283503";
chr1    StringTie       exon    10001   10465   1000    -       .       gene_id "MSTRG.6918"; transcript_id "MSTRG.6918.1"; exon_number "1"; cov "0.567742";
chr1    StringTie       transcript      11612   14409   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    11612   12697   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    12975   13052   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    13221   14409   .       +       .       gene_id "MSTRG.7557"; transcript_id "MSTRG.7557.1"; exon_number "3"; cov "0.0";

I actually want the transcript_id and TPM as two columns for each sample of my 150 sample gtfs. How do I do that with awk or any other way to export the two columns from all files into a single file.

RNA-Seq ballgown hisat2 gtf awk • 1.5k views
ADD COMMENT
0
Entering edit mode
3.4 years ago
Ram 43k

For each of the 150 files, assign an awk variable with the filename without the .gtf extension. Use ; as a delimiter and where the first column matches "trascript", print 3 columns, the first being the sample id variable you assigned above, the second being the transcript_id field and the third being the TPM field. Append output across all files to a single combined output file.

You should be able to write your awk command using the above logic.

ADD COMMENT
0
Entering edit mode

this is what I don't know how to do. Can you please help me with an example. thanq

I actually tried the below command for one of the file

grep -P "\ttranscript\t"  sample.gtf  | cut -f9 | awk '{gsub("\"","",$0);gsub(";","",$0);print $4,$12}'

And this gave an output like below:

MSTRG.7557.1
ENST00000456328.2 0.000000
ENST00000450305.2 0.000000
MSTRG.7557.4
MSTRG.7558.1
MSTRG.7558.3
MSTRG.7558.4
MSTRG.7558.10
MSTRG.7558.14
MSTRG.7558.15
MSTRG.7558.18
MSTRG.7558.23
ADD REPLY
0
Entering edit mode

Explore awk online. You'll need the awk -v option. And always use the -F option to specify IFS - better safe than sorry.

If you use cut first and pipe the output to awk, you'll need to use something like a shell variable to get the sample name from the filename. You need neither grep nor awk. awk can operate on lines that match a pattern, like so: $1 ~ /transcript/ { do_something; }

for f in *.gtf
do
  awk -v SMPL_NAME=$(basename $f} -F ";" ' .... ' $f >>overall_output.txt
done
ADD REPLY

Login before adding your answer.

Traffic: 1709 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6