Question

How to generate GFF3 format that I want?

0

5.9 years ago

bioinfo ▴ 10

Hi, I'm trying to analyze bacterial RNA-seq data. For the analysis, I have an annotation file in a modified GFF3 format, described below.

The modified GFF3 format I have looks like this:

Sequence_Name    .     CDS        StartPosition  EndPosition

But I want to generate a GFF3 file like this:

Sequence_Name    .     gene       StartPosition  EndPosition
Sequence_Name    .     CDS        StartPosition  EndPosition

When running Cuffdiff on my RNA-seq data, I couldn't get the gene name in the gene column. Is there any way to get a GFF3 file with two lines for each gene?

Thank you in advance.

sequence RNA-Seq • 2.5k views

ADD COMMENT • link 5.9 years ago by bioinfo ▴ 10

1

Why are you talking about GFF3? GFF3 is a format with 9 columns. This is a 5-column tabulated format. It is definitely not GFF3.
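
A quick way to check the column count is sketched below. It assumes the file is tab-separated; the function name check_gff3_columns is made up for illustration and is not part of any existing tool.

```shell
# Report lines that do not have exactly 9 tab-separated columns.
# Comment lines (starting with "#") and empty lines are skipped.
# Returns a non-zero status if at least one offending line is found.
check_gff3_columns() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }
        NF != 9 { printf "line %d has %d columns\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

It could be called as check_gff3_columns annotation.gff3, where annotation.gff3 is a placeholder file name. No output means every feature line has 9 columns.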

ADD REPLY • link 5.9 years ago by Juke34 8.5k

0

Yes, I know that GFF3 has 9 columns. I just wanted to emphasize the difference in lines, not columns. Do you know how to generate it? The format I have looks like this, e.g.:

Seq_ID  FIG CDS 2236    3102    .   +   1   gene product

But I want to generate a reference in this format, e.g.:

Seq_ID  .   gene    2236    3102    .   +   0   gene abbreviation name
Seq_ID  .   CDS     2236    3102    .   +   0   gene product

ADD REPLY • link 5.9 years ago by bioinfo ▴ 10

1

I implemented a parser to check, fix, and pad any kind of GTF or GFF file to create a complete, standardized GFF3 file. If you are interested, you will find it in the NBISweden repository on GitHub. Follow the installation instructions here; then the easiest way is to use the script gxf_to_gff3.pl.

If you only have CDS (and/or exon) features, you will have to use the locus_tag option of this script. Look at the help for more information. It will create exon, mRNA, and gene features if they are absent.

The output should be fine to feed to any downstream tool expecting GFF3-like input.

So this approach should do the trick.

ADD REPLY • link 5.9 years ago by Juke34 8.5k

0

Have you tried the awk-based one-liner that I noted below?

ADD REPLY • link 5.9 years ago by Friederike 8.9k

0

Actually, I'm a beginner at programming... I'm trying to...

ADD REPLY • link 5.9 years ago by bioinfo ▴ 10

0

If I have 9 columns and "CDS" is in column $3, then the awk command would change to

awk  '$3 == "CDS" {OFS="\t"; print $1,$2,"gene",$4,$5,$6,$7,$8,$9}' original.file >> original.file

like this?

ADD REPLY • link 5.9 years ago by bioinfo ▴ 10

0

Did you test this? Was the outcome what you expected?

Based on the little information you've offered, I would think the answer is yes, but I have no way of knowing for sure, or whether that's actually what you need. If you had answered my other questions, we might be able to help you in a more targeted way, but if you really want to duplicate lines, that should do the trick.
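
One way to check the outcome yourself is to count how many lines carry each feature type; after duplicating, the "gene" and "CDS" counts should match. This is only a minimal sketch, assuming a tab-separated file; the function name count_features is hypothetical.

```shell
# Count how many lines carry each feature type found in column 3.
# Comment lines (starting with "#") and empty lines are skipped.
# Output: one "type<TAB>count" line per feature type, sorted by type.
count_features() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }
        { n[$3]++ }
        END { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}
```

Running count_features on the file before and after the change shows whether a "gene" line was added for every "CDS" line.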

ADD REPLY • link 5.9 years ago by Friederike 8.9k

0

I'm doing RNA-seq analysis using the Tuxedo pipeline. The output from Cuffdiff doesn't have gene names, just ' - '. When I ran Cuffdiff using a reference with two lines per gene, it worked well and I could see the gene names.

ADD REPLY • link 5.9 years ago by bioinfo ▴ 10

0

And does it work on all of the lines? I mean, that example is just a simple example. The file I have contains thousands of lines. And I want to make one additional line, not change CDS->gene.

ADD REPLY • link 5.9 years ago by bioinfo ▴ 10

0

Does it work on all of the lines?

yes

And I want to make one additional line not change CDS->gene

No idea what this means, but I'm guessing you want to duplicate the lines and just replace "CDS" with "gene" in the third column. Which is what the one-liner does.

Btw, I will not respond to any other message that makes it clear that you haven't even tried out the code.

On a side note: don't use the Tuxedo suite unless you have a better reason than "it's easy to run". Numerous benchmarking papers have shown it to perform less well in most standard applications. Many people here would probably recommend, for example, the workflow described in this paper

ADD REPLY • link 5.9 years ago by Friederike 8.9k

0

I think you need to clarify a couple of details:

What do you need the GFF3 file for?
Do you really want to duplicate the lines with "CDS" in the third column and just replace "CDS" with "gene"? That could be achieved like this:

awk  '$3 == "CDS" {OFS="\t"; print $1,$2,"gene",$4,$5,$6}' original.file >> original.file

This would read in the original file and check the third column of every line for "CDS"; if it matches, columns 1, 2, and 4-6 will be printed, with "gene" put into the third column. The >> redirection then appends these new lines to the end of the original file.

What does Cuffdiff have to do with this?
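
The duplication idea above can also be sketched so that each "gene" line lands directly before its "CDS" line and the result goes to a new file instead of being appended to the input. This is only a sketch under assumptions: the file has 9 tab-separated columns, and the function name add_gene_lines is made up for illustration.

```shell
# For every line with "CDS" in column 3, print a "gene" copy first,
# then the original line. All other lines pass through unchanged.
# Tabs are used for both input and output, so a 9th column that
# contains spaces (e.g. "gene product") stays in one piece.
add_gene_lines() {
    awk 'BEGIN { FS = OFS = "\t" }
         $3 == "CDS" { print $1, $2, "gene", $4, $5, $6, $7, $8, $9 }
         { print }' "$1"
}
```

It could be run as add_gene_lines original.file > with_genes.file, where with_genes.file is a placeholder name. Note that the "gene" line simply copies the ninth column of the "CDS" line; if a different gene name is wanted there, that information has to come from somewhere else.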

ADD REPLY • link 5.9 years ago by Friederike 8.9k