Question: HTSeq funny output
1
gravatar for --panda--
22 months ago by
--panda--30
Evanston
--panda--30 wrote:

I got some interesting output from the htseq-count.

 My alignment file have 9933765 reads

 My htseq.count have 60525122 none unique alignment

I wonder if someone can have a reasonable explanation that "none unique alignments" are more than my input reads. If not, how can you persuade yourself when you use htseq( eg. as an input for edgeR or DESeq2)?

I put the pipline, tophat summary as well as the htseq-count output below.

Command Line:

  ~/Documents/sratoolkit.2.7.0-ubuntu64/bin/fastq-dump -I --split-files -O ./fastq WTrep1_SRR2547503.sra

  java -jar ~/Documents/Trimmomatic-0.36/trimmomatic-0.36.jar SE -phred33 ./fastq/WTrep1_SRR2547503_1.fastq ./fastq/WTrep1_trim.fastq   ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

  ~/Documents/bowtie2-2.3.3.1/bowtie2-build ./genome/GCF_000005845.2_ASM584v2_genomic.fna e_coli

  ~/Documents/tophat-2.1.1.Linux_x86_64/tophat --no-coverage-search -o WTrep1 e_coli  ./fastq/WTrep1_trim.fastq

  htseq-count -f bam -i gene -t gene  WTrep1/accepted_hits.bam  ./genome/GCF_000005845.2_ASM584v2_genomic.gff > ./WTrep1.txt

Tophat alignment summary:

Reads:
      Input     :  10820715
      Mapped   :   9933765 (91.8% of input)
      of these:   9145766 (92.1%) have multiple alignments (675 have >20)
 91.8% overall read mapping rate.

Last few line of htseq-count output (using HTSeq (0.9.1)):

      __no_feature  760545
      __ambiguous   380
      __too_low_aQual   0
      __not_aligned 0
      __alignment_not_unique    60525122
rna-seq quantification htseq • 826 views
ADD COMMENTlink modified 22 months ago • written 22 months ago by --panda--30
0
gravatar for WouterDeCoster
22 months ago by
Belgium
WouterDeCoster40k wrote:

Your fastq has 10820715 reads, of which 9933765 were mapped by tophat (see also C: Tophat: Can I forbid splice junction? )

BUT 92% of your reads were aligned multiple times, resulting in 60525122 alignments which are not unique -> multimapping reads.

Nothing wrong here.

ADD COMMENTlink written 22 months ago by WouterDeCoster40k

as one of the reasons, could be due to low complexity sequences in fastq file ?

ADD REPLYlink written 22 months ago by Sam120

Absolutely, for RNA-seq reads derived from rRNA is a likely suspect.

ADD REPLYlink written 22 months ago by WouterDeCoster40k

As you may know, tophat only report the best alignment(it also applies to Bowtie2 or HISAT).

tophat2 manual

--report-secondary-alignments
By default TopHat reports best or primary alignments based on alignment scores (AS). Use this option if you want to output additional or secondary alignments (up to 20 alignments will be reported this way, this limit can be changed by using the -g/--max-multihits option above).

So although 92% reads aligned multiple times, htseq has no information about where are the alignments locate except for the best one. If I got 100 reads aligned 200 times, it should say "100 reads has none unique alignments" rather than "20000 not_unique alignments"

ADD REPLYlink modified 22 months ago • written 22 months ago by --panda--30

As you may know, tophat reports up to 20 alignments per read by default:

-g/--max-multihits <int> Instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Unless you use --report-secondary-alignments, TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments. In case of using --report-secondary-alignments, TopHat will try to report alignments up to this option value, and TopHat may randomly output some of the alignments with the same score to meet this number.

ADD REPLYlink written 22 months ago by WouterDeCoster40k

Sorry if you felt a bit uncomfortable, I did not mean that. I was just confused.

So if tophat found 25 alignments for one read:

 situation1:   1 has highest score(let's say 0) and the other 24 has lower score(let's say -10), only the one alignment is reported. 

 situation2:   10 has the same highest score(0) and the other 14 has lower score(-10), the first 10 alignments are reported. 

 situation3:   25 has the same highest score(0), random choose 20 out of the 25 alignment and report them.

 situation4:   if  use --report-secondary-alignments and 10 alignments has the same highest score(0) and the other 14 has lower score(-10),  the 10 highest score alignments are reported and 10 out of the 14 lower score reads are reported.

Am I correct? Thanks for you reply and patience

ADD REPLYlink written 22 months ago by --panda--30

I agree that it's confusing - don't worry. I didn't mean to be snarky.

Intuitively I would agree with your statements - although I haven't used Tophat in a long time (and you shouldn't use it too), so I'm not too sure. Perhaps someone else can chime in.

ADD REPLYlink written 22 months ago by WouterDeCoster40k

I agree that using HISAT may provide higher efficiency but I think there is no fundamental improvement in terms of precision.

Thanks!

ADD REPLYlink written 22 months ago by --panda--30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1457 users visited in the last hour