Question: HTSeq & featureCounts show no features
0
gravatar for bkinzep
28 days ago by
bkinzep0
bkinzep0 wrote:

Hi - I am trying to use HTSeq (or even featureCounts) to generate counts for RNAseq data. The data set is from a bacteria (Illumina HiSeq PE). It was mapped to the reference genome with Bowtie - and all looks good. When I try to run HTSeq all the mapped reads are designated as "no feature". Below are samples of data/output/what I've tried . Here is what I ran in Bowtie and the results :

bowtie2 -x 206_old --local --no-unal -p 12 -1 ~/bioinformatics3/206/206raw/z44h_1a.fastq.gz -2 ~/bioinformatics3/206/206raw/z44h_1b.fastq.gz -S ~/bioinformatics3/206/Bowtie/PEalgin_44h_1.sam

produces:

20417626 reads; of these:
20417626 (100.00%) were paired; of these:
10819260 (52.99%) aligned concordantly 0 times
8177956 (40.05%) aligned concordantly exactly 1 time
1420410 (6.96%) aligned concordantly >1 times
10819260 pairs aligned concordantly 0 times; of these:
1548865 (14.32%) aligned discordantly 1 time
9270395 pairs aligned 0 times concordantly or discordantly; of these:
18540790 mates make up the pairs; of these:
17336065 (93.50%) aligned 0 times
626560 (3.38%) aligned exactly 1 time
578165 (3.12%) aligned >1 times
57.55% overall alignment rate

When I run HTSeq: htseq-count -s no -t CDS -i ID -a 0 -o ~/bioinformatics3/206/HTSeq/htPE44h_1.sam PEalgin_44h_1.sam ~/bioinformatics3/206/206.gtf

I get:

__no_feature    12053420
__ambiguous     0
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  0

My GTF file:

##gff-version 3
scaffold-0      FIG     CDS     99      914     .       -       0       ID=fig|6666666.35312.peg.1;Name=Thiazole biosynthesis protein ThiG
scaffold-0      FIG     CDS     948     1148    .       -       0       ID=fig|6666666.35312.peg.2;Name=Sulfur carrier protein ThiS
scaffold-0      FIG     CDS     1180    2415    .       -       1       ID=fig|6666666.35312.peg.3;Name=Glycine oxidase ThiO (EC 1.4.3.19);Ontology_term=KEGG_ENZYME:1.4.3.19
scaffold-0      FIG     CDS     2744    2866    .       -       2       ID=fig|6666666.35312.peg.4;Name=hypothetical protein

And a sample of the (sorted by coordinate) Bowtie alignment:

GWNJ-0850:541:GW1903111949th:1:1101:12236:56334 99      fig|6666666.35312.peg.1 1       41     5                                                                 8S92M   =       302     509     CGTGCATCCGGTGGCGGGCGGCTGACGCCGCCGCTTCCGGCCCAACCAGGATTCTTCGATGACG                                                                 TCCCTTCCTTCCGCCGACGCCCTCACGCTGTACGGCGAAACCTTCGCGAGCCGCGTGCTGATCGGCACGTCGCGCTATCCGTCCCT  A-A-FJ-F                                                                 AJAF<JJJFAA<<J<JFJJJJ7JFJF-7A-AJJJFF<JJJFJFJFFFJAJ--7F<J--7A-AFAAAJJAAJJ<JAA-7-AF7FJ7JFJ<A<-7<F-                                                                 FA7<7--A<77FJJ---7F7A7FF7F-7<FA7--7<<--7-A<A<F  AS:i:179        XN:i:0  XM:i:1  XO:i:0  XG:i:0 N                                                                 M:i:1   MD:Z:26G65      YS:i:256        YT:Z:CP
GWNJ-0850:541:GW1903111949th:1:1102:3275:12402  99      fig|6666666.35312.peg.1 1       16     6                                                                 5S80M5S =       1       -155    TCGATGTCGTGCATCCGGTGGCGGGCGGCTGACGCCGCCGCTTCCGGCCCAACCAGGATTCTTC                                                                 GATGACGTCCCTTCCTTCCGCCGACGCGCTCACGCTGTACGGCGAAACCTTCGCGAGCCGCGTGCTGATCGGCACGTCGCGAGATC  AAAFFJJJ                                                                 JJFJJJJJJJJJJJJ-FFFFJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJ<AAAJJJJJJJJJFJJJJJJJJJ<JFJJJFJJJJ                                                                 JJJAJJFJFFAJJJJJAJJJJJFJJFJJ7FJJFFJJJJJJF7FFFJ  AS:i:160        XS:i:64 XN:i:0  XM:i:0  XO:i:0 X                                                                 G:i:0   NM:i:0  MD:Z:80 YS:i:160        YT:Z:CP
GWNJ-0850:541:GW1903111949th:1:1102:23937:39405 99      fig|6666666.35312.peg.1 1       36     1                                                                 14S36M  =       39      302     GTCGCGCGCACCCAGCACGCGGCGCGTGCGCTCGCCGGCGGCGACCGGCTCGATGTCGTGCATC                                                                 CGGTGGCGGGCGGCTGACGCCGCCGCTTCCGGCCCAACCAGGATTCTTCGATGACGTCCCTTCCTTCCGCCGACGCGCTCACGCTG  AAFFFJJJ                                                                 JJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ                                                                 JJFJJJFJJJJJFAJJAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  AS:i:72 XS:i:162        XN:i:0  XM:i:0  XO:i:0 X                                                                 G:i:0   NM:i:0  MD:Z:36 YS:i:300        YT:Z:CP

I have tried sorting by name and position, I've tried using GFF instead of GTF, I've tried using only the forward data instead of PE, I've tried featureCounts and HTSeq, I've run HTSeq with bam and sam files... I'm at a complete loss for what might be the problem and I've read just about every thread/post on related issue. The only think I can thing is it's an issue with the GTF file (downloaded from RAST). Any help or advice would be greatly appreciated!

ADD COMMENTlink modified 27 days ago by h.mon24k • written 28 days ago by bkinzep0
2

Your "GTF" file is actually a malformed GFF file, is there an actual GTF file available for this organism?

ADD REPLYlink written 27 days ago by Devon Ryan89k

Thanks, that is what is was starting to suspect. The "GTF" file was downloaded from RAST under the "GTF" option so I suspect this is an issue with RAST. Is there a good way to generate a GTF file that will be compatible with HTSeq-count? Re-annotation is fine if it will result in a better GTF file (small bacterial genome).

I just noticed I accidentally pasted two versions of the genome above (bowtie vs HTSeq) but the ID's do match when I've run this.

ADD REPLYlink written 27 days ago by bkinzep0
1

Can you update your post so the contig IDs are compatible (just so we can ensure they are)? The easiest solution to your GFF problem is to write a small program to convert it to a proper GTF file. How easy this is will be entirely dependent upon how familiar you are with a scripting language like perl or python. Alternatively, you might be able to import the GFF file into R and export it as a proper GTF file (you'll have to look up the details of that).

ADD REPLYlink written 27 days ago by Devon Ryan89k

Yes the post is updated with the corresponding GTF file for the Bowtie alignment shown.

ADD REPLYlink written 27 days ago by bkinzep0

I noted by michael.ante, you aligned against the transcriptome rather than the genome. Align against the genome and things will work better.

ADD REPLYlink written 25 days ago by Devon Ryan89k

Hi one of the easy solution may be to try converting the gff file to gig using gff2gtf.. As such I guess this is formatting issue with GTE file

ADD REPLYlink written 26 days ago by gurunath.katagi0

You mapped apparently against the transcritome, since the mapping position in the bam file corresponds to the gene I'd of your annotation file.

ADD REPLYlink written 27 days ago by michael.ante3.2k

I intended to map RNAseq data to the genome. I see multiple reads mapping to each gene... is this incorrect? I don't think I understand your comment, sorry.

ADD REPLYlink written 27 days ago by bkinzep0

Your reads are mapped against entities which are like "fig|6666666.35312.peg.1". HTSeq-count is looking for exon covering reads which are located at entries with the name "scaffold-0". Since there is no overlap, HTSeq is reporting only "__no_features".

You need either to map against the genomic reference (the fasta containing scaffold-0 and so on) and use htseq with the corresponding GTF.

Alternatively, you can map against the transcriptome (the fasta containing names as fig|6666666.35312.peg.1), extract the uniquely mapping reads, and preforme a samtools idxstats.

ADD REPLYlink modified 25 days ago • written 25 days ago by michael.ante3.2k

These are the headers on the genomic fasta file:

fig|6666666.35312.peg.1 fig|6666666.35312.peg.2 fig|6666666.35312.peg.3

I am specifically telling HTSeq count to read the ID, which as above in the GFF says ID=fig|6666666.35312.peg.1 So why would it be looking for Scaffold-0? I made the Bowtie index using the genomic fasta file with headers shown above

ADD REPLYlink written 25 days ago by bkinzep0

It looks like using the Scaffold FASTA file worked (with the original GFF)! Thank you everyone for your help, it is greatly appreciated!

ADD REPLYlink written 25 days ago by bkinzep0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 680 users visited in the last hour