5.9 years ago
bioinfo
Hi, I'm trying to analyze bacterial RNA-seq data. For the analysis I have an annotation file in a modified GFF3 format, described below.
The modified GFF3 format I have looks like this:
Sequence_Name . CDS StartPosition EndPosition
But I want to generate gff3 format like this:
Sequence_Name . gene StartPosition EndPosition
Sequence_Name . CDS StartPosition EndPosition
When I ran Cuffdiff on the RNA-seq data, I couldn't get gene names in the gene column. Is there any way to produce a GFF3 file that has two lines for each gene?
Thank you in advance.
Why are you talking about GFF3? GFF3 is a format with 9 columns. What you show here is a 5-column tab-delimited format. It is definitely not GFF3.
Yes, I know that GFF3 has 9 columns. I just wanted to emphasize the difference in lines, not columns. Do you know how to generate that? The format I have looks like this: Ex)
But I want to generate a reference in this format: Ex)
I implemented a parser that checks, fixes, and pads any kind of GTF or GFF file to create complete, standardized GFF3. If you are interested, you will find it in the NBISweden repository on GitHub. Follow the installation instructions here; the easiest approach is then to use the script gxf_to_gff3.pl
If you only have CDS (and/or exon) features, you will have to use the locus_tag option of this script. Look at the help for more information. It will create exon, mRNA, and gene features if they are absent.
The output should be fine to feed to any downstream tool expecting GFF3-like input.
So this approach should do the trick.
Have you tried the awk-based one-liner that I noted below?
Actually, I'm a beginner at programming.. I'm trying to ..
If I have 9 columns and "CDS" is in column $3, would the awk command change like this?
Did you test this? Was the outcome what you expected it to be?
Based on the little information you've offered, I would think the answer is yes, but I have no way of knowing for sure, or whether that's actually what you need. If you had answered my other questions, we might be able to help you in a more targeted way, but if you really want to duplicate lines, that should do the trick.
I'm doing RNA-seq analysis using the Tuxedo pipeline. The output from Cuffdiff doesn't have gene names, just '-', like this. When I ran Cuffdiff using a reference with two lines per gene, it worked well and I could see the gene names.
And does it work on all of the lines? I mean, that example is just a simple one. The file I have contains thousands of lines. And I want to add one additional line, not change CDS -> gene.
yes
No idea what this means, but I'm guessing you want to duplicate the lines and just replace "CDS" with "gene" in the third column. Which is what the one-liner does.
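For readers who would rather not use awk, the same duplication can be sketched in Python. This is not the one-liner referred to in this thread; it is a minimal illustration that assumes a tab-delimited file with the feature type in column 3, and the function name `add_gene_lines` and the sample rows are made up for the example. It keeps every original line and writes a "gene" copy in front of each CDS line, so it works the same way on two lines or on thousands.

```python
def add_gene_lines(lines):
    """Yield every input line, preceded by a 'gene' copy when column 3 is 'CDS'."""
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Comment lines, blank lines and non-CDS features pass through unchanged.
        if not line.startswith("#") and len(fields) >= 3 and fields[2] == "CDS":
            # Same columns as the CDS line, only the feature type is swapped.
            yield "\t".join(fields[:2] + ["gene"] + fields[3:])
        yield line


# Hypothetical 9-column input rows, for illustration only.
example = [
    "contig1\t.\tCDS\t100\t900\t.\t+\t0\tID=cds1",
    "contig1\t.\tCDS\t1200\t2000\t.\t-\t0\tID=cds2",
]
result = list(add_gene_lines(example))
for row in result:
    print(row)
```

Note that this copies column 9 verbatim, so the gene and CDS lines end up sharing the same attributes. A strictly valid GFF3 file would need distinct IDs and a Parent relationship, which this sketch does not attempt.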
Btw, I will not respond to any other message that makes it clear that you haven't even tried out the code.
On a side note: don't use the Tuxedo suite unless you have a better reason than "it's easy to run". Numerous benchmarking papers have shown that it performs worse than alternatives in most standard applications. Many people here would probably recommend, for example, the workflow described in this paper
I think you need to clarify a couple of details:
This would read in the original file, check every line for the presence of "CDS" -- if "CDS" is present, columns 1, 2, 4-6 will be printed and "gene" will be put into the third column. This could then be pasted into the original file.
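The filtering step described above can be approximated in Python as follows. This is a rough sketch, not the original command: it assumes whitespace-separated columns (as awk splits by default), it tests column 3 for an exact "CDS" match rather than searching the whole line, and `gene_lines_from_cds` and the sample rows are hypothetical.

```python
def gene_lines_from_cds(lines):
    """Return one 'gene' line for each input line whose third column is 'CDS'."""
    out = []
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if len(fields) >= 3 and fields[2] == "CDS":
            # Keep columns 1, 2 and 4-6 (as many as exist); put "gene" in column 3.
            out.append("\t".join(fields[:2] + ["gene"] + fields[3:6]))
    return out


# Hypothetical input rows, for illustration only.
rows = ["seqA . CDS 10 250 .", "seqA . tRNA 300 380 .", "seqB . CDS 5 95 ."]
genes = gene_lines_from_cds(rows)
print("\n".join(genes))
```

The returned lines contain only the new gene records, so they still have to be combined with the original file, as noted above.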