Question: Merged.gtf vs. Rnor_6.0.gtf for Raw Counts
0
gravatar for jmsyl.hong
13 months ago by
jmsyl.hong0
jmsyl.hong0 wrote:

Hey guys,

The core facility for RNA-Seq used cuffmerge to generated a merged.gtf file (i.e. a merged assembly file).

I wanted to know what the difference was using the merged.gtf file to get raw counts using subread from .bams vs. using the most recent ensembl .gtf for my species.

When I run the subread with the merged.gtf file, I only get cufflinks IDs for the gene names with no gene symbols. Does anyone know if there is gene symbol or shortname information in the merged.gtf?

Thanks in advance!

raw counts cuffmerge rna-seq • 393 views
ADD COMMENTlink written 13 months ago by jmsyl.hong0
1

Can you show the first few lines of your gtf file, with unix head command?

ADD REPLYlink modified 13 months ago • written 13 months ago by b.nota3.6k

I'm on a mac (am also new with command line) is there something i can run on terminal or perhaps i can try to open this .gtf file with a text editor?

ADD REPLYlink written 13 months ago by jmsyl.hong0
1

Mac should have a bash terminal

Go to the folder with your file and type:

head merged.gtf
ADD REPLYlink modified 13 months ago • written 13 months ago by b.nota3.6k
1   Cufflinks   exon    312155  312360  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    315197  315479  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    315933  316737  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    317591  317815  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    321633  321756  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    322916  323800  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "6"; gene_name "Vom2r3"; oId "ENSRNOT00000044669"; nearest_ref "ENSRNOT00000044669"; class_code "="; tss_id "TSS1"; p_id "P2";
1   Cufflinks   exon    312155  312383  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "1"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
1   Cufflinks   exon    386141  386392  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "2"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
1   Cufflinks   exon    391729  392532  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "3"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
1   Cufflinks   exon    393385  393606  .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000005"; exon_number "4"; gene_name "Vom2r3"; oId "ENSRNOT00000046586"; nearest_ref "ENSRNOT00000046586"; class_code "="; tss_id "TSS1"; p_id "P1";
ADD REPLYlink modified 13 months ago by WouterDeCoster24k • written 13 months ago by jmsyl.hong0

from this it actually seems like there is a ton of information in the merged.gtf, including the gene_names... I'm using RNASeqGUI (because from command-line dumb) to do the raw counts it probably only gets the first column (e.g. gene_id which is the XLOC prefix) instead of the gene name.

is there a way for me to convert my existing raw counts which is mapped to XLOC IDs to the gene_names through this file, perhaps I can open it up in Excel and do some matching?

ADD REPLYlink modified 13 months ago • written 13 months ago by jmsyl.hong0
1

No processing in excel! Is there a 1-on-1 relationship for gene_id "XLOC" and gene_name? There should be a better way. Either you change the raw counts, e.g. by a python script I could write or you change the gtf and repeat the counting with featurecounts and set the -g flag to gene_name.

ADD REPLYlink written 13 months ago by WouterDeCoster24k

I would prefer if I could change the raw counts as I already did them for the entire dataset. That python script would be amazing. I have a tab delimited file that has XLOC in column 1, with each sample as headers thereafter, so essentially, it would have to match the XLOC from the first column of that file to the XLOC in the .gtf then make a new column on the tab delimited file with the gene name.

There is a 1:1 relationship between the XLOC and the gene name.

ADD REPLYlink written 13 months ago by jmsyl.hong0
1

Do you have a bit of experience in programming? It's obviously more interesting for you if you code a bit yourself, and I don't mind guiding you in that for this script.

If not, could you show me a bit of data, first 10 lines for example, so I know what I'm working on?

ADD REPLYlink written 13 months ago by WouterDeCoster24k

Hey, it wouldn't let me post more than 5 posts per 6 hours... I would be very interested in learning how to get through the timecourse pipeline by deseq2 starting with the generation of raw counts. Is there a way we could communicate? I would be happy to sent you tons of coffees for your troubles!

ADD REPLYlink written 13 months ago by james.hong0

I've never done a time course experiment, so I won't be of much use for that. I was mainly talking about converting the gene identifiers.

ADD REPLYlink written 13 months ago by WouterDeCoster24k

Sure, that would definitely help too.

I have two files:

  1. merged.gtf have a gene_name (gene names) column that is matched 1:1 to the gene_id column (XLOC)

  2. treatment1.txt have a gene_id column (XLOC) column with subsequent columns that contain the sample ids with their respective reads for each of the XLOCs.

I would like to match the XLOCs to their gene names from the merged.gtf file and have a column in the treatment1.txt that has the gene names.

Thanks!!!

ADD REPLYlink written 13 months ago by james.hong0
1

Can you show me a part of treatment1.txt? Apart from knowing that structure the script seems ready.

I made some assumptions about the countfile but will update my code if those were not correct, see code below:

ADD REPLYlink modified 13 months ago • written 13 months ago by WouterDeCoster24k

Gene ID 3.4 3.7 3.9 3.3 3.1 3.1S 3.6S 3.2S 1.4 1.8 1.1 1.1 1.11 1.5S 1.1S 1.4S 2.11 2.1 2.15 2.1 2.16 2.7S 2.5S 2.6S 8.16 8.5 8.12 8.13 8.15 8.3S 8.4S 8.5S WT.1 WT.2 WT3 XLOC_000001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000002 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000003 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 XLOC_000004 5 3 1 1 6 5 1 1 7 3 8 4 4 2 7 3 4 7 2 0 5 0 2 0 2 3 4 10 3 1 4 1 0 1 2 XLOC_000005 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 XLOC_000006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

ADD REPLYlink written 13 months ago by jmsyl.hong0
1

Script needs modification of one line because the header will be a problem

ADD REPLYlink written 13 months ago by WouterDeCoster24k

thanks so much! let me know how i can execute it after.

ADD REPLYlink written 13 months ago by jmsyl.hong0
1

Instructions are at the top of the script, I edited it on my phone so I hope I didn't make a typo.

ADD REPLYlink written 13 months ago by WouterDeCoster24k

Got an error:

python renameCountIdentifiers.py merged.gtf cervical1.txt
Traceback (most recent call last):
  File "renameCountIdentifiers.py", line 28, in <module>
    changeIdentifierCounts(sys.argv[1], sys.argv[2])
  File "renameCountIdentifiers.py", line 24, in changeIdentifierCounts
    print("{}\t{}".format(gtfdict[line.strip().split('\t')[0]], '\t'.join(line.strip().split('\t')[1:])))
KeyError: 'c'
ADD REPLYlink written 13 months ago by jmsyl.hong0

Odd. Hope to get near computer soon, maybe tomorrow, if not Sunday. Already seen a typo, gene instead of Gene, but doesn't explain the error you obtained. Files are tab separated right? Problem comes from your counts file.

ADD REPLYlink written 13 months ago by WouterDeCoster24k

Should be tab-delimited. I'll try a few things in the meanwhile.

ADD REPLYlink written 13 months ago by james.hong0

Hi James.hong,

I added a few lines to make sure a warning is printed and script isn't halted when a KeyError is found (meaning that the identifier wasn't present in the dictionary generated from the gtf file and as such can't be converted). The offending line will be printed to your terminal and the identifier won't be converted, just kept as it was. This should help us to track down the error.

Let me know the result!

ADD REPLYlink written 13 months ago by WouterDeCoster24k

Yeah there is info about e.g. the gene name (or ensembl code), which is probably what you want. You can use AWK to extract that out of your file, but if you're not into command line, you probably want to ask your core facility for help with this.

ADD REPLYlink modified 13 months ago • written 13 months ago by b.nota3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1492 users visited in the last hour