Question: How do I generate the GFF3 format that I want?
11 months ago, bioinfo wrote:

Hi, I'm trying to analyze bacterial RNA-seq data. For the analysis I have an annotation file (GFF3) in the format described below.

The modified GFF3 format I have looks like this:

Sequence_Name    .     CDS        StartPosition  EndPosition

But I want to generate a GFF3 file like this:

Sequence_Name    .     gene       StartPosition  EndPosition
Sequence_Name    .     CDS        StartPosition  EndPosition

When I ran Cuffdiff on the RNA-seq data, I couldn't get gene names in the gene column. Is there any way to get a GFF3 file that has two lines for each gene?

Thank you in advance.

rna-seq sequence • 454 views
written 11 months ago by bioinfo

Why are you talking about GFF3? GFF3 is a format with 9 columns. What you show here is a tab-delimited format with 5 columns. It is definitely not GFF3.

modified 11 months ago • written 11 months ago by Juke-34

Yes, I know that GFF3 has 9 columns. I just wanted to emphasize the difference in the lines, not the columns. Do you know how to generate such a file? The format I have looks like this, e.g.:

Seq_ID  FIG CDS 2236    3102    .   +   1   gene product

But I want to generate a reference in this format, e.g.:

Seq_ID  .   gene    2236    3102    .   +   0   gene abbreviation name
Seq_ID  .   CDS     2236    3102    .   +   0   gene product
modified 10 months ago • written 10 months ago by bioinfo

I implemented a parser that checks, fixes, and pads any kind of GTF or GFF file to create a complete, standardized GFF3 file. If you are interested, you will find it in the NBISweden repository on GitHub. Follow the installation instructions here; then the easiest approach is to use the script gxf_to_gff3.pl.

If you only have CDS (and/or exon) features, you will have to use the locus_tag option of this script. Look at the help for more information. It will create exon, mRNA, and gene features if they are absent.

The output should be fine to feed into any downstream tool expecting GFF3-like input.

So this approach should do the trick.

written 10 months ago by Juke-34

Have you tried the awk-based one-liner that I noted below?

written 10 months ago by Friederike

Actually, I'm a beginner at programming... I'm trying to...

written 10 months ago by bioinfo

If I have 9 columns with "CDS" in column $3, then the awk command would be changed

awk  '$3 == "CDS" {OFS="\t"; print $1,$2,"gene",$4,$5,$6,$7,$8,$9}' original.file >> original.file

like this?

written 10 months ago by bioinfo

Did you test this? Was the outcome what you expected it to be?

Based on the little information you've offered, I would think the answer is yes, but I have no way of knowing for sure, or whether that's actually what you need. If you had answered my other questions, we might be able to give you more targeted help, but if you really want to duplicate lines, that should do the trick.
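
One possible way to test it, sketched as a hypothetical helper (count_features is not from this thread, and a tab-separated file is assumed): tally the feature types in column 3 before and after running the command. If the file had no gene lines to begin with, the gene count afterwards should match the CDS count.

```shell
# Hypothetical helper: tally the feature types found in column 3
# of a tab-separated annotation file ($1 is a placeholder file name).
count_features() {
    awk -F'\t' 'NF >= 3 { n[$3]++ }                       # count each feature type
                END { for (t in n) print t "\t" n[t] }    # one "type<TAB>count" row per type
    ' "$1" | sort
}
```

It would be called as count_features original.file, ideally on a small copy of the data first.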

written 10 months ago by Friederike

I'm doing RNA-seq analysis using the Tuxedo pipeline. The output from Cuffdiff doesn't have gene names, just ' - '. When I ran Cuffdiff with a reference that has two lines per gene, it worked well and I could see the gene names.

written 10 months ago by bioinfo

And does it work for all of the lines? I mean, that example is just a simple example. The file I have contains thousands of lines. And I want to add one additional line, not change CDS->gene

modified 10 months ago • written 10 months ago by bioinfo

Does it work for all of the lines

yes

And I want to add one additional line, not change CDS->gene

No idea what this means, but I'm guessing you want to duplicate the lines and just replace "CDS" with "gene" in the third column. Which is what the one-liner does.

Btw, I will not respond to any other message that makes it clear that you haven't even tried out the code.

On a side note: don't use the Tuxedo suite unless you have a better reason than "it's easy to run". Numerous benchmarking papers have shown it to perform less well in most standard applications. Many people here would probably recommend, for example, the workflow described in this paper.

modified 10 months ago • written 10 months ago by Friederike

I think you need to clarify a couple of details:

  1. What do you need the GFF3 file for?
  2. Do you really want to duplicate the lines with "CDS" in the third column and just replace "CDS" with "gene"? That could be achieved like this:
awk  '$3 == "CDS" {OFS="\t"; print $1,$2,"gene",$4,$5,$6}' original.file >> original.file

This reads in the original file and checks every line for "CDS" in the third column. For each such line, columns 1, 2, and 4-6 are printed with "gene" in the third column, and the >> redirection appends these new lines to the end of the original file.

  3. What does Cuffdiff have to do with this?
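
For a 9-column file, the one-liner in point 2 could be sketched as a small shell function. This is only a sketch under stated assumptions: the input is tab-separated, the feature type is in column 3, and the function name add_gene_lines and both file names are made up for illustration. Unlike the one-liner, it writes to a separate output file rather than appending to the file being read, and it prints each new gene line directly before its CDS line. Column 9 is copied unchanged, so the gene line carries the CDS description rather than a gene name.

```shell
# Hypothetical helper: add a "gene" line in front of every "CDS" line.
# $1 = input annotation file, $2 = output file (both names are placeholders).
# Assumes tab-separated input with the feature type in column 3.
add_gene_lines() {
    awk 'BEGIN { FS = OFS = "\t" }                            # read and write tab-separated fields
         $3 == "CDS" {                                        # for each CDS record ...
             print $1, $2, "gene", $4, $5, $6, $7, $8, $9     # ... emit a gene copy first
         }
         { print }                                            # then every original line, unchanged
    ' "$1" > "$2"
}
```

It would be called as add_gene_lines original.file with_genes.gff3, which leaves original.file untouched so the result can be inspected before it is used.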
written 11 months ago by Friederike