Question

I got an extremely low alignment rate running HISAT2 and TopHat2

7.2 years ago

oghzzang ▴ 50

Hi. I'm trying to map paired-end RNA-seq reads to GRCm38 (mm10) using HISAT2 and TopHat2, but the mapping rate is only about 0-5%.

(Sequenced on a HiSeq 2500; the sequencing fragment size is 300 bp.)

1.fastqc

1) fastqc summary

PASS Basic Statistics

PASS Per base sequence quality

PASS Per tile sequence quality

PASS Per sequence quality scores

FAIL Per base sequence content (similar to this image: https://rtsf.natsci.msu.edu/_rtsf/assets/Image/fastqc_images/TruSeqRNAPerBaseSeqContent.png)

PASS Per sequence GC content

PASS Per base N content

PASS Sequence Length Distribution

FAIL Sequence Duplication Levels

PASS Overrepresented sequences

PASS Adapter Content

2) read information

Measure Value

Filename sample_1.fastq.gz

File type Conventional base calls

Encoding Sanger / Illumina 1.9

Total Sequences 44728504

Sequences flagged as poor quality 0

Sequence length 101

%GC 50

2.Hisat

1) command

$AnacondaBin/hisat2 \
 -p 8 \
 --rg-id=sample \
 --rg SM:sample --rg LB:LB --rg PL:Illumina --rg PU:sample \
 -x $Reference_dir/Mus_musculus/NCBI/hisatIndex/GRCm38 \
 --dta \
 --rna-strandness FR \
 -1 $Fastq_dir/sample_1.fastq.gz \
 -2 $Fastq_dir/sample_2.fastq.gz \
 -S $Working_dir/Analysis/$Analysis_dir/NCBI/Pre_Tophat/sample_pe.sam 2

2) Result

44728504 reads; of these:

44728504 (100.00%) were paired; of these:

44358669 (99.17%) aligned concordantly 0 times

331704 (0.74%) aligned concordantly exactly 1 time

38131 (0.09%) aligned concordantly >1 times

----

44358669 pairs aligned concordantly 0 times; of these:

  11328 (0.03%) aligned discordantly 1 time

----

44347341 pairs aligned 0 times concordantly or discordantly; of these:

  88694682 mates make up the pairs; of these:

    87830960 (99.03%) aligned 0 times

    735195 (0.83%) aligned exactly 1 time

    128527 (0.14%) aligned >1 times

1.82% overall alignment rate
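As a sanity check on how to read this summary, the overall rate can be re-derived from the counts above. The sketch below assumes the rate is computed per mate: every concordantly or discordantly aligned pair contributes two aligned mates, the singly aligned mates are added on top, and the total is divided by twice the number of input pairs. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Counts copied from the HISAT2 summary above (variable names are illustrative).
total_pairs=44728504
conc_once=331704      # pairs aligned concordantly exactly 1 time
conc_multi=38131      # pairs aligned concordantly >1 times
discordant=11328      # pairs aligned discordantly 1 time
single_once=735195    # mates aligned exactly 1 time
single_multi=128527   # mates aligned >1 times

# Aligned mates = 2 per aligned pair + singly aligned mates.
rate=$(awk -v t="$total_pairs" -v c1="$conc_once" -v c2="$conc_multi" \
           -v d="$discordant" -v s1="$single_once" -v s2="$single_multi" \
  'BEGIN {
     aligned_mates = 2 * (c1 + c2 + d) + s1 + s2
     printf "%.2f", 100 * aligned_mates / (2 * t)
   }')

echo "overall alignment rate: ${rate}%"
```

With these counts the result agrees with the 1.82% reported by HISAT2, so almost none of the roughly 89 million mates found a placement on the mouse reference.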

3.Tophat

1) command

## The GTF and Bowtie2 index are from https://ccb.jhu.edu/software/tophat/igenomes.shtml
$AnacondaBin/tophat2 \
 --GTF $Reference_dir/Mus_musculus/UCSC/mm10/Annotation/Archives/archive-2015-07-17-14-33-26/Genes/genes.gtf \
 --output-dir $Working_dir/Analysis/$Analysis_dir/Tophat \
 --num-threads 1 \
 $Reference_dir/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
 $Fastq_dir/sample_1.fastq.gz \
 $Fastq_dir/sample_2.fastq.gz

2) result

Left reads:

      Input     :  44728504

       Mapped   :    355987 ( 0.8% of input)

        of these:      7756 ( 2.2%) have multiple alignments (0 have >20)

Right reads:

      Input     :  44728504

       Mapped   :    347193 ( 0.8% of input)

        of these:      7342 ( 2.1%) have multiple alignments (0 have >20)

0.8% overall read mapping rate.

Aligned pairs: 159136

 of these:      1209 ( 0.8%) have multiple alignments

                 218 ( 0.1%) are discordant alignments

0.4% concordant pair alignment rate.

Other things I tried:

1) Trimming the first 10 bp from the read 1 and read 2 FASTQ files.

--> The alignment rate was still extremely low.

2) I have seen this comment:

Reference species diverse

RNA-Seq MM10 alignment • 5.3k views

ADD COMMENT • link updated 7.2 years ago by Carlo Yague 9.0k • written 7.2 years ago by oghzzang ▴ 50

I have the same problem! Have you downloaded the index from HISAT2? I did, and even with mm9 I get the same alignment rate. I am using public NGS data :( which is supposed to be mouse!...

ADD REPLY • link 7.2 years ago by Buffo ★ 2.4k

Did you check your data source? I checked mine and found that it wasn't mouse sequence (following Carlo Yague's comment).

After I mapped my data to the human reference, I got a 95% mapping rate. I built the HISAT2 index with the following steps.

Download reference genome
https://ccb.jhu.edu/software/tophat/igenomes.shtml

Build hisat2 index
echo "2-1. Build Hisat2 index (Default Options)"
$AnacondaBin/hisat2-build \
 $Reference_dir/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/hisat2_index/mm10_genome.fa \
 mm10_genome
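The comparison described here (almost nothing maps to mouse, about 95% maps to human) can be run cheaply on a subsample before committing to a full alignment. The following is only a sketch: `overall_rate` and `screen_indexes` are hypothetical helper names, the index prefixes passed in are assumptions, and it relies on the HISAT2 options `-u` (stop after the first N reads or pairs) and `--summary-file`.

```shell
#!/usr/bin/env bash
# Print the overall alignment rate (e.g. "1.82") from a HISAT2 summary file.
overall_rate() {
  awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# Hypothetical helper: align the first 100,000 pairs against each candidate
# index prefix and print the resulting rate. SAM output is discarded.
# Usage: screen_indexes R1.fastq.gz R2.fastq.gz INDEX_PREFIX [INDEX_PREFIX...]
screen_indexes() {
  local r1=$1 r2=$2
  shift 2
  local idx summary
  for idx in "$@"; do
    summary="$(basename "$idx").summary.txt"
    hisat2 -p 4 -u 100000 -x "$idx" -1 "$r1" -2 "$r2" \
      -S /dev/null --summary-file "$summary"
    printf '%s\t%s%%\n' "$idx" "$(overall_rate "$summary")"
  done
}
```

A call might look like `screen_indexes sample_1.fastq.gz sample_2.fastq.gz /path/to/mouse_index /path/to/human_index`, where both index paths are placeholders. A large gap between the reported rates points to the species the reads actually come from.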

ADD REPLY • link 7.2 years ago by oghzzang ▴ 50

Yes, of course. I have mapped the data to several related genomes, including human. In the end I will write to the corresponding author :).

ADD REPLY • link 7.2 years ago by Buffo ★ 2.4k

score 1 · Answer 1 · 2018-04-12

7.2 years ago

Carlo Yague 9.0k

Given the extremely low mapping rate, my guess would be that your data is not mouse RNA. You can try to manually pick a few reads and BLAST them.

By the way, where does your data come from?

It is also quite unusual to not have overrepresented sequences in RNA-seq data.
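Picking a few reads for BLAST can be done directly from the gzipped FASTQ file. This is a minimal sketch, assuming standard four-line FASTQ records; the function name `fastq_head_to_fasta` is made up for illustration.

```shell
#!/usr/bin/env bash
# Convert the first N reads of a gzipped FASTQ file to FASTA on stdout.
# Assumes each record is exactly four lines (header, sequence, "+", qualities).
# Usage: fastq_head_to_fasta FILE.fastq.gz N
fastq_head_to_fasta() {
  local fastq_gz=$1 n=$2
  # Line 1 of each record is the header ("@" is swapped for ">"),
  # line 2 is the sequence; the "+" and quality lines are dropped.
  gzip -dc "$fastq_gz" | head -n "$((n * 4))" |
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }'
}
```

For example, `fastq_head_to_fasta sample_1.fastq.gz 5 > reads.fa` writes five reads that can be pasted into the NCBI BLAST web form. The first reads in a file are not necessarily representative, so it may be worth repeating this with reads taken from further into the file.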

ADD COMMENT • link 7.2 years ago by Carlo Yague 9.0k

Hi Yague, thank you so much for getting back to me. Your comment definitely helps a lot! My RNA-seq data comes from the Illumina HiSeq 2500 platform and a human sample :).

ADD REPLY • link 7.2 years ago by oghzzang ▴ 50