Question: Htseq: High number of no_feature
0
gravatar for nadja.g
8 months ago by
nadja.g0
nadja.g0 wrote:

Hi together,

I’m new in transcriptomics and need your help concerning htseq-counting. I have single-end data from Novaseq600. The alignment was performed with STAR, version 2.6.0c to a reference genome from the dog from ncbi. The percentage of uniquely-mapped reads was adequate and when watching at the bam files, the reads lie on the exon region.

For the counting with htseq (v. 0.9.1) I used the following command:

file=$(ls *.bam | sed -n ${SLURM_ARRAY_TASK_ID}p)

htseq-count -f bam -i gene_name $file /home/ubelix/vetsuisse/ng18q090/CanFam3.1-starIndex/ref_CanFam3.1_Nochr.modi.gtf >$file.count

The genome data files were all modified in this way, that they begin with chr . The lines of the GTF-file look like this:

chr1    Gnomon  exon    251990  252562  .       -       .       transcript_id "rna0"; gene_id "gene0"; gene_name "ENPP1";

My problem is the high number of no_feature after the htseq. 10 Mio makes around 95% of my mapped reads…:

__no_feature    10214983

__ambiguous     95

__too_low_aQual 0

__not_aligned   0

__alignment_not_unique  4758304

Because the bam-files look good, the problem seems not to be a DNA contamination with introns. Just as a try, I also changed the –s parameter and the –p parameter several times, but as expected in single end read data that did not change a lot.

What do you think? Where could lie the problem?

Thanks in advance!

rna-seq htseq-count • 421 views
ADD COMMENTlink modified 8 months ago by genomax67k • written 8 months ago by nadja.g0

What does samtools view -H your.bam show?

ADD REPLYlink written 8 months ago by genomax67k

Do you mean the following parameters:

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr.10       LN:69331447
@SQ     SN:chr.11       LN:74389097
@SQ     SN:chr.12       LN:72498081
@SQ     SN:chr.13       LN:63241923
@SQ     SN:chr.14       LN:60966679
@SQ     SN:chr.15       LN:64190966
@SQ     SN:chr.16       LN:59632846
@SQ     SN:chr.17       LN:64289059
@SQ     SN:chr.18       LN:55844845
@SQ     SN:chr.19       LN:53741614
@SQ     SN:chr.1        LN:122678785
@SQ     SN:chr.20       LN:58134056
@SQ     SN:chr.21       LN:50858623
@SQ     SN:chr.22       LN:61439934
@SQ     SN:chr.23       LN:52294480
@SQ     SN:chr.24       LN:47698779
@SQ     SN:chr.25       LN:51628933
@SQ     SN:chr.26       LN:38964690
@SQ     SN:chr.27       LN:45876710
@SQ     SN:chr.28       LN:41182112
@SQ     SN:chr.29       LN:41845238
@SQ     SN:chr.2        LN:85426708
@SQ     SN:chr.30       LN:40214260
@SQ     SN:chr.31       LN:39895921
@SQ     SN:chr.32       LN:38810281
@SQ     SN:chr.33       LN:31377067
@SQ     SN:chr.34       LN:42124431
@SQ     SN:chr.35       LN:26524999
@SQ     SN:chr.36       LN:30810995
@SQ     SN:chr.37       LN:30902991
@SQ     SN:chr.38       LN:23914537
@SQ     SN:chr.3        LN:91889043
@SQ     SN:chr.4        LN:88276631
@SQ     SN:chr.5        LN:88915250
@SQ     SN:chr.6        LN:77573801
@SQ     SN:chr.7        LN:80974532
@SQ     SN:chr.8        LN:74330416
@SQ     SN:chr.9        LN:61074082
@SQ     SN:chr.MT       LN:16727
@SQ     SN:chr.X        LN:123869142
@PG     ID:STAR PN:STAR VN:STAR_2.6.0c  CL:STAR   --runThreadN 8   --genomeDir /home/ubelix/vetsuisse/ng18q090/CanFam3.1-starIndex/genome_index   --readFilesIn F9_L2_R1_001_L4WxcGBKBUlB.fastq.gz      --readFilesCommand zcat      --outFileNamePrefix ./F9_L2_R1_001_L4WxcGBKBUlB   --outSAMtype BAM   SortedByCoordinate      --outFilterType BySJout   --outFilterMultimapNmax 50   --outFilterMismatchNmax 2   --outFilterMismatchNoverLmax 0.04   --alignIntronMin 20   --alignIntronMax 1000000   --alignMatesGapMax 1000000   --alignSJoverhangMin 1
@CO     user command line: STAR --runThreadN 8 --genomeDir /home/ubelix/vetsuisse/ng18q090/CanFam3.1-starIndex/genome_index --readFilesIn F9_L2_R1_001_L4WxcGBKBUlB.fastq.gz --readFilesCommand zcat --outFilterType BySJout --outFilterMultimapNmax 50 --alignSJoverhangMin 1 --outFilterMismatchNmax 2 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFileNamePrefix ./F9_L2_R1_001_L4WxcGBKBUlB --outSAMtype BAM SortedByCoordinate

It's just an example of one bam file...

ADD REPLYlink modified 8 months ago by genomax67k • written 8 months ago by nadja.g0
1

As you can see your alignment contains chromosome names in chr.N format where as your GTF file has them in chrN format. So if you edit your GTF file to match the names things should start working. Try this.

sed 's/chr/chr\./'  your.gtf > new.gtf

Use new.gtf with htseq-count.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax67k

Ou, I'm sorry... That was one of the elder .bam files... Here is the newer one

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr10        LN:69331447
@SQ     SN:chr11        LN:74389097
@SQ     SN:chr12        LN:72498081
@SQ     SN:chr13        LN:63241923
@SQ     SN:chr14        LN:60966679
@SQ     SN:chr15        LN:64190966
@SQ     SN:chr16        LN:59632846
@SQ     SN:chr17        LN:64289059
@SQ     SN:chr18        LN:55844845
@SQ     SN:chr19        LN:53741614
@SQ     SN:chr1 LN:122678785
@SQ     SN:chr20        LN:58134056
@SQ     SN:chr21        LN:50858623
@SQ     SN:chr22        LN:61439934
@SQ     SN:chr23        LN:52294480
@SQ     SN:chr24        LN:47698779
@SQ     SN:chr25        LN:51628933
@SQ     SN:chr26        LN:38964690
@SQ     SN:chr27        LN:45876710
@SQ     SN:chr28        LN:41182112
@SQ     SN:chr29        LN:41845238
@SQ     SN:chr2 LN:85426708
@SQ     SN:chr30        LN:40214260
@SQ     SN:chr31        LN:39895921
@SQ     SN:chr32        LN:38810281
@SQ     SN:chr33        LN:31377067
@SQ     SN:chr34        LN:42124431
@SQ     SN:chr35        LN:26524999
@SQ     SN:chr36        LN:30810995
@SQ     SN:chr37        LN:30902991
@SQ     SN:chr38        LN:23914537
@SQ     SN:chr3 LN:91889043
@SQ     SN:chr4 LN:88276631
@SQ     SN:chr5 LN:88915250
@SQ     SN:chr6 LN:77573801
@SQ     SN:chr7 LN:80974532
@SQ     SN:chr8 LN:74330416
@SQ     SN:chr9 LN:61074082
@SQ     SN:chrMT        LN:16727
@SQ     SN:chrX LN:123869142
@PG     ID:STAR PN:STAR VN:STAR_2.6.0c  CL:STAR   --runThreadN 8   --genomeDir /home/ubelix/vetsuisse/ng18q090/GeneDiffAna/genome_index   --readFilesIn F9_L3_R1_001_HllJvKZkHwYR.fastq.gz      --readFilesCommand zcat      --outFileNamePrefix ./F9_L3_R1_001_HllJvKZkHwYR   --outSAMtype BAM   SortedByCoordinate      --outFilterType BySJout   --outFilterMultimapNmax 50   --outFilterMismatchNmax 2   --outFilterMismatchNoverLmax 0.04   --alignIntronMin 20   --alignIntronMax 1000000   --alignMatesGapMax 1000000   --alignSJoverhangMin 1
@CO     user command line: STAR --runThreadN 8 --genomeDir /home/ubelix/vetsuisse/ng18q090/GeneDiffAna/genome_index --readFilesIn F9_L3_R1_001_HllJvKZkHwYR.fastq.gz --readFilesCommand zcat --outFilterType BySJout --outFilterMultimapNmax 50 --alignSJoverhangMin 1 --outFilterMismatchNmax 2 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFileNamePrefix ./F9_L3_R1_001_HllJvKZkHwYR --outSAMtype BAM SortedByCoordinate

We had it first with chr.N and changed it than to chrN just to be sure...

But thanks for your answer... Do you have another idea?

ADD REPLYlink modified 8 months ago by genomax67k • written 8 months ago by nadja.g0

Are you using the correct GTF/GFF file that was obtained from the same source where your reference sequence came from?

ADD REPLYlink written 8 months ago by genomax67k

Unfortunately I get the modified genome-files from people who have more experience in transcriptomics than I have (I began 2 weeks ago with bioinformatics :P)... They told be to have taken the file from ncbi. And they also do not know an advice now... You suppose the genome-files to be the problem, right?

I was thinking about load them again newly from ncbi and modify them as little as possible. But I'm not sure, which file I should take. Especially I can not find one good fasta-file... It should come from ncbi because that's the best source for the dog.

ADD REPLYlink written 8 months ago by nadja.g0

CanFam GFF file from NCBI can be obtained here and it does not have chr designations so I am not sure where you got the one you are using.

If you are going to do this again then you can find the genome file and the annotation on the dog genome page at NCBI.

ADD REPLYlink written 8 months ago by genomax67k

It's probably from UCSC. At least the one from Ensembl has the normal 1, 2, etc. chromosome names.

ADD REPLYlink written 8 months ago by Devon Ryan90k

I'm sorry for my late comment, on Friday I could only comment 5 times because I'm new on the forum. Thank you for your comments... We took the fasta-file from NCBI and changed it, because it's starting line begins with NC_00 or something like that. We changed it unfortunately to chrN. Then we had to change the gff and the gtf from NCBI also from N to chrN. That's why they have all chrN names. We changed all the files...

ADD REPLYlink written 8 months ago by nadja.g0

When I used -s no the no_features were around 9.5Mio but the ambigous get higher aroung 1500...

ADD REPLYlink written 8 months ago by nadja.g0

Look at the BAM file in IGV, I suspect you have a mismatch between the annotation and the genome used.

ADD REPLYlink written 8 months ago by Devon Ryan90k

We looked at the bam files, to see if the reads are on the genes. How can I see there, if genome and annotation are matching?

ADD REPLYlink written 8 months ago by nadja.g0

If they aren't then the reads will generally not be on the genes. Alternatively, look at the BAM header (samtools view -H) and see if the chromosome lengths match whatever genome version you're expecting.

ADD REPLYlink written 8 months ago by Devon Ryan90k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1955 users visited in the last hour