Question

DEXSeq error: Count files do not correspond to the flattened annotation file

0

Entering edit mode

9.5 years ago

Assa Yeroslaviz ★ 1.8k

Hi,

I am running the DEXSeq (v. 1.12.1) with two different modi, one with the strand information and one without it. I just wanted to see the differences.

The run with strand information works fine, but the one without gives me the above error message.

I first ran the dexseq_prepare_annotation.py script, but modified it:

exons = HTSeq.GenomicArrayOfSets( "auto", stranded=False )

This gives me a gff file with no strand information, where the 7th column of the file is overall '.'

I than run the dexseq_count.py script with the gff file.

python dexseq_count.py -s no -f bam cDmel_NoStrandInfo_DEXSeq.gff $file

I now get the error message with the genes which are being aggregated, due to the fact the the strand information is missing. I know I can use the -r no option, but I would like to count also these genes.

The first gene name I can't find is XLOC_000002 (these are cufflinks IDs). It is aggregated together with XLOC_001386.

The problem I have is that these genes are aggregated both in my newly created gff file

grep -Hrn "XLOC_000002" NoStrandInfo_DEXSeq.gff
NoStrandInfo_DEXSeq.gff:27:2L     dexseq.py   aggregate_gene  21842   25151   .       .       .       gene_id "XLOC_001386+XLOC_000002"
NoStrandInfo_DEXSeq.gff:28:2L     dexseq.py   exonic_part     21842   21918   .       .       .       transcripts "TCONS_00000004"; exonic_part_number "001"; gene_id "XLOC_001386+XLOC_000002"
NoStrandInfo_DEXSeq.gff:29:2L     dexseq.py   exonic_part     21919   22687   .       .       .       transcripts "TCONS_00000004+TCONS_00003381"; exonic_part_number "002"; gene_id "XLOC_001386+XLOC_00000 "
NoStrandInfo_DEXSeq.gff:30:2L     dexseq.py   exonic_part     22688   22709   .       .       .       transcripts "TCONS_00000004"; exonic_part_number "003"; gene_id "XLOC_001386+XLOC_000002"
NoStrandInfo_DEXSeq.gff:31:2L     dexseq.py   exonic_part     22743   22768   .       .       .       transcripts "TCONS_00003381"; exonic_part_number "004"; gene_id "XLOC_001386+XLOC_000002"
NoStrandInfo_DEXSeq.gff:32:2L     dexseq.py   exonic_part     22769   22935   .       .       .       transcripts "TCONS_00000004+TCONS_00003381"; exonic_part_number "005"; gene_id "XLOC_001386+XLOC_00000 "
...

as well as in the counts file:

 grep -Hrn "XLOC_000002" cufflinks.txt
cufflinks.txt:9237:XLOC_001386+XLOC_000002:001        0
cufflinks.txt:9238:XLOC_001386+XLOC_000002:002        18
cufflinks.txt:9239:XLOC_001386+XLOC_000002:003        1
cufflinks.txt:9240:XLOC_001386+XLOC_000002:004        1
cufflinks.txt:9241:XLOC_001386+XLOC_000002:005        1
...

So I don't understand why they can't be found as a pair. Is there a different reason for this error message?

Thanks
Assa

dexseq gtf • 3.0k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Assa Yeroslaviz ★ 1.8k