Question

Why my HTseq result is all no_feature.

0

Entering edit mode

5.2 years ago

guotingfeng ▴ 10

Hi dear all, I used HTseq to count the number of reads that mapped into each genes through Bwa-mem alignment, but I found no matter what parameter I changed, the output of HTseq is 100% no_feature. Could you help me figure out my mistake please?

1. I used Bwa-mem to align my reads to Bacillus 168 genome fasta format from NCBI.

module load BWA/0.7.17-GCCcore-6.3.0
module load SAMtools/1.7-GCCcore-6.3.0

pe1_1="/scratch/user/guotingfeng/RNAseq/2Mg_S9_L002_R1_001.fastq.gz"
pe1_2="/scratch/user/guotingfeng/RNAseq/2Mg_S9_L002_R2_001.fastq.gz"

read_group_id='2Mggroup'

library='pe'

sample='2Mg'

ref_genome="/scratch/user/guotingfeng/RNAseq/genomereference/twaindex/sequence.fasta"

output_bam="/scratch/user/guotingfeng//RNAseq/Bwa-mem/${sample}.sam"

if [ ! -f ${ref_genome}.bwt ]; then
  bwa index $ref_genome
fi
platform='ILLUMINA'

**bwa \
mem \
-M \
-t $threads \
-R "@RG\tID:$readgroup\tLB:$library\tSM:$sample\tPL:$platform" $ref_genome $pe1_1 $pe1_2 | samtools view -h -Sb | \
samtools sort -m 2G -@ $threads -T $sample/sorted -n -o $output_bam**

I think output is fine for this process:

@HD VN:1.5  SO:queryname
@SQ SN:NC_000964.3  LN:4215606
@RG ID: LB:pe   SM:1Mg  PL:ILLUMINA
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -M -t 20 -R @RG\tID:\tLB:pe\tSM:1Mg\tPL:ILLUMINA /scratch/user/guotingfeng/RNAseq/genomereference/twaindex/sequence.fasta /scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R1_001.fastq.gz /scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R2_001.fastq.gz

K00333:90:H3TMTBBXY:2:1101:1133:28780   83  NC_000964.3 476475  60  75M =   476412  -138    GAAAAACGGGCGCGGTGAGAGAGTGCCGCGCGAAGTCTGTTATAATAACAGGATGAGCGTGAAAGAAAGAGAAGT     NM:i:1  MD:Z:16T58  MC:Z:75M    AS:i:70 XS:i:0

K00333:90:H3TMTBBXY:2:1101:1133:28780   163 NC_000964.3 476412  60  75M =   476475  138 NATTCACCAAACATCGCCCTCCGCGCAAACCGTCCCTCACGTCATTTTCTCAGAACTTGACCCGACAACCGGGCG 
NM:i:9  MD:Z:0C3A2A12A13A20A5A3A2A6 MC:Z:75M    AS:i:38 XS:i:0

2. The I used Bwa-mem output to analyze read number in the each genes through HTseq count.

module load HTSeq/0.9.1-intel-2017A-Python-2.7.12

pe1_1='/scratch/user/guotingfeng/RNAseq/Bwa-mem/1.sam'
pe1_2='/scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf'
sample='/scratch/user/guotingfeng/RNAseq/HTseq/Bwa-mem/bwamem_1sam.txt'

**htseq-count -f sam -m union -s yes -r name -t exon -i gene_id --additional-attr=gene_name $pe1_1 $pe1_2 > $sample**

The output is weird: because all the reads are no_feature.

gene:BSU05210   ydeI    0

gene:BSU05220   ydeJ    0

**gene:BSU05230   ydeK    0**

gene:BSU05240   ydeL    0

**__no_feature            24000765**

__ambiguous             0

__too_low_aQual         258765

__not_aligned           33939

__alignment_not_unique          0

Do you have any ideas or suggests please? Thank you very much. Have a good day.

RNA-Seq Bwa-mem HTseq • 1.9k views

ADD COMMENT • link updated 5.2 years ago by GenoMax 141k • written 5.2 years ago by guotingfeng ▴ 10

0

Entering edit mode

You already have a similar thread open: Why my Hisat2 result is 0% aligned concordantly and HTseq result is 100% not aligned.

Are you using the correct annotation file? If the file is not in right format you will not get any counts. Name of the chromosome has to match what is in your alignment file. What is the output of head -5 /scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf?

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Entering edit mode

You only have 4 genes in your gtf?

ADD REPLY • link 5.2 years ago by swbarnes2 14k

0

Entering edit mode

Hi swbarnes2, no, I have lots. I just show four as an example. The number of count is same. Thanks.

ADD REPLY • link 5.2 years ago by guotingfeng ▴ 10

0

Entering edit mode

Hi Genomax, Thank you for your reply. Here is my content of gtf file.

Chromosome  ena exon    410 1750    .   +   .   transcript_id "transcript:CAB11777"; gene_id "gene:BSU00010"; gene_name "dnaA";
Chromosome  ena CDS 410 1750    .   +   0   transcript_id "transcript:CAB11777"; gene_id "gene:BSU00010"; gene_name "dnaA";
Chromosome  ena exon    1939    3075    .   +   .   transcript_id "transcript:CAB11778"; gene_id "gene:BSU00020"; gene_name "dnaN";
Chromosome  ena CDS 1939    3075    .   +   0   transcript_id "transcript:CAB11778"; gene_id "gene:BSU00020"; gene_name "dnaN";
Chromosome  ena exon    3206    3421    .   +   .   transcript_id "transcript:CAB11779"; gene_id "gene:BSU00030"; gene_name "yaaA";

Does it look OK?

ADD REPLY • link updated 5.2 years ago by GenoMax 141k • written 5.2 years ago by guotingfeng ▴ 10

3

Entering edit mode

The file looks ok. But the chromosome name (Chromosome) in your GTF file does not match the one in your alignment file (NC_000964.3). That is your likely problem with counting. You could replace the name in your GTF file and try counting again.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Entering edit mode

Hi genomax, I have tried as you suggested. It is really improved. I can get some counts in several genes, but there are still 90% reads are no_feature. Here is what I did. 1. I change chromosome to NC_000964.3. NC_000964.3 ena exon 410 1750 . + . transcript_id "transcript:CAB11777"; gene_id "gene:BSU00010"; gene_name "dnaA";

NC_000964.3 ena CDS 410 1750 . + 0 transcript_id "transcript:CAB11777"; gene_id "gene:BSU00010"; gene_name "dnaA";

NC_000964.3 ena exon 1939 3075 . + . transcript_id "transcript:CAB11778"; gene_id "gene:BSU00020"; gene_name "dnaN";

NC_000964.3 ena CDS 1939 3075 . + 0 transcript_id "transcript:CAB11778"; gene_id "gene:BSU00020"; gene_name "dnaN";

NC_000964.3 ena exon 3206 3421 . + . transcript_id "transcript:CAB11779"; gene_id "gene:BSU00030"; gene_name "yaaA";

Then I run the HTseq-count as above script but change the gif file. Here is the output, I picked up several genes as example.

gene:BSU40000 yxnA 181

gene:BSU40010 yxaD 84

gene:BSU40021 yxzK 5

gene:BSU40022 yxaC 13

gene:BSU40030 yxaB 7

gene:BSU40040 glxK 44

gene:BSU40050 gntR 16

__no_feature 24555316

__ambiguous 179491

__too_low_aQual 296420

__not_aligned 48137

__alignment_not_unique 0

Could you help me please? Thank you very much.

ADD REPLY • link 5.2 years ago by guotingfeng ▴ 10

0

Entering edit mode

Hi genomax, I change -s yes to -s reverse and it works now.

This is my output:

gene:BSU39790 yxcE 1864

gene:BSU39800 yxcD 502

gene:BSU39810 csbC 23383

gene:BSU39820 htpG 7631

gene:BSU39830 yxcA 610

gene:BSU39840 yxbG 2596

gene:BSU39850 yxbF 928

__no_feature 980707

__ambiguous 3984165

__too_low_aQual 296420

__not_aligned 48137

__alignment_not_unique 0

Thank you for your advice and help. I really appreciate it. You save me lots of time. Have a good time.

ADD REPLY • link 5.2 years ago by guotingfeng ▴ 10

0

Entering edit mode

You are asking for trouble by using variable names with no relation to what they really are:

pe1_1='/scratch/user/guotingfeng/RNAseq/Bwa-mem/1.sam'
pe1_2='/scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf'

ADD REPLY • link 5.2 years ago by h.mon 35k