Question: I got an extremely low alignment rate running HISAT2 and TopHat2
oghzzang (10) wrote:

Hi. I'm trying to map paired-end RNA-seq reads to GRCm38 (mm10) using HISAT2 and TopHat2, but the mapping rate is only about 0-5%.

(Sequenced on an Illumina HiSeq 2500; the sequencing fragment size is 300 bp.)

1. FastQC

1) FastQC summary

PASS Basic Statistics

PASS Per base sequence quality

PASS Per tile sequence quality

PASS Per sequence quality scores

FAIL Per base sequence content (similar to this image: https://rtsf.natsci.msu.edu/_rtsf/assets/Image/fastqc_images/TruSeqRNAPerBaseSeqContent.png)

PASS Per sequence GC content

PASS Per base N content

PASS Sequence Length Distribution

FAIL Sequence Duplication Levels

PASS Overrepresented sequences

PASS Adapter Content

2) read information

Measure Value

Filename sample_1.fastq.gz

File type Conventional base calls

Encoding Sanger / Illumina 1.9

Total Sequences 44728504

Sequences flagged as poor quality 0

Sequence length 101

%GC 50

2. HISAT2

1) command

$AnacondaBin/hisat2 \
  -p 8 \
  --rg-id=sample \
  --rg SM:sample --rg LB:LB --rg PL:Illumina --rg PU:sample \
  -x $Reference_dir/Mus_musculus/NCBI/hisatIndex/GRCm38 \
  --dta \
  --rna-strandness FR \
  -1 $Fastq_dir/sample_1.fastq.gz \
  -2 $Fastq_dir/sample_2.fastq.gz \
  -S $Working_dir/Analysis/$Analysis_dir/NCBI/Pre_Tophat/sample_pe.sam 2

2) Result

44728504 reads; of these:
  44728504 (100.00%) were paired; of these:
    44358669 (99.17%) aligned concordantly 0 times
    331704 (0.74%) aligned concordantly exactly 1 time
    38131 (0.09%) aligned concordantly >1 times
    ----
    44358669 pairs aligned concordantly 0 times; of these:
      11328 (0.03%) aligned discordantly 1 time
    ----
    44347341 pairs aligned 0 times concordantly or discordantly; of these:
      88694682 mates make up the pairs; of these:
        87830960 (99.03%) aligned 0 times
        735195 (0.83%) aligned exactly 1 time
        128527 (0.14%) aligned >1 times
1.82% overall alignment rate
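
As a sanity check on the summary above, the overall rate can be recomputed from the individual counts. This is a minimal sketch, assuming the rate is the number of aligned mates (two per concordant or discordant pair, plus the mates that aligned on their own) divided by the total number of mates. The function name `overall_alignment_rate` is hypothetical and is not part of HISAT2.

```shell
# Hypothetical helper (not part of HISAT2): recompute the overall alignment
# rate from the counts in the summary.
# Arguments, in order:
#   total pairs, concordant exactly 1 time, concordant >1 times,
#   discordant 1 time, mates aligned exactly 1 time, mates aligned >1 times
overall_alignment_rate() {
    awk -v pairs="$1" -v c1="$2" -v cm="$3" -v d="$4" -v m1="$5" -v mm="$6" 'BEGIN {
        # Each aligned pair contributes two mates; m1 and mm are single mates.
        aligned_mates = 2 * (c1 + cm + d) + m1 + mm
        printf "%.2f\n", 100 * aligned_mates / (2 * pairs)
    }'
}

# Counts taken from the summary above.
overall_alignment_rate 44728504 331704 38131 11328 735195 128527
```

With these counts the result matches the reported 1.82%, so the summary is internally consistent and nearly all mates really did fail to align.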

3. TopHat2

1) command

## The GTF and the Bowtie2 index are from https://ccb.jhu.edu/software/tophat/igenomes.shtml
$AnacondaBin/tophat2 \
  --GTF $Reference_dir/Mus_musculus/UCSC/mm10/Annotation/Archives/archive-2015-07-17-14-33-26/Genes/genes.gtf \
  --output-dir $Working_dir/Analysis/$Analysis_dir/Tophat \
  --num-threads 1 \
  $Reference_dir/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
  $Fastq_dir/sample_1.fastq.gz \
  $Fastq_dir/sample_2.fastq.gz

2) result

Left reads:

      Input     :  44728504

       Mapped   :    355987 ( 0.8% of input)

        of these:      7756 ( 2.2%) have multiple alignments (0 have >20)

Right reads:

      Input     :  44728504

       Mapped   :    347193 ( 0.8% of input)

        of these:      7342 ( 2.1%) have multiple alignments (0 have >20)

0.8% overall read mapping rate.

Aligned pairs: 159136

 of these:      1209 ( 0.8%) have multiple alignments

                 218 ( 0.1%) are discordant alignments

0.4% concordant pair alignment rate.

4. Other attempts

1) I trimmed the first 10 bp from the read 1 and read 2 FASTQ files.

--> But the alignment rate was still extremely low.
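
For reference, trimming the first bases of every read can be sketched with standard tools. This is only an illustration, assuming plain 4-line FASTQ records; the function name `trim_fastq_5prime` and the output file name in the usage line are hypothetical, and this is not necessarily the tool used for the trimming described above.

```shell
# Hypothetical helper: remove the first N characters from the sequence and
# quality lines of a FASTQ stream read from stdin (4-line records assumed).
trim_fastq_5prime() {
    n=${1:-10}
    # Lines 2 and 4 of each record (even-numbered lines) hold bases and qualities.
    awk -v n="$n" 'NR % 2 == 0 { print substr($0, n + 1); next } { print }'
}

# Usage (output name is illustrative):
#   gzip -dc "$Fastq_dir/sample_1.fastq.gz" | trim_fastq_5prime 10 | gzip -c > sample_1.trimmed.fastq.gz
```

Both mates have to be trimmed the same way so that the pairs stay in sync.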

2) I have seen this suggestion:

The reference species may differ from the sample species.

rna-seq alignment mm10 • 163 views
modified 6 weeks ago by Carlo Yague (3.9k) • written 6 weeks ago by oghzzang (10)

I have the same problem! Have you downloaded the index from HISAT2? I did, and even with mm9 I get the same alignment rate. I am using public NGS data :( which is supposed to be mouse!

written 6 weeks ago by Buffo (1.1k)

Did you check your data source? I checked mine and found that it was not mouse sequence (thanks to Carlo Yague's comment).

After mapping my data to the human reference, I got a 95% mapping rate. I built the HISAT2 index with the following pipeline:

  1. Download Reference genome
    https://ccb.jhu.edu/software/tophat/igenomes.shtml

  2. Build hisat2 index
    echo "2-1. Build Hisat2 index (Default Options)"
    $AnacondaBin/hisat2-build \
      $Reference_dir/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/hisat2_index/mm10_genome.fa \
      mm10_genome

modified 6 weeks ago • written 6 weeks ago by oghzzang (10)

Yes, of course. I have mapped the data to several related genomes, including human. In the end I will write to the corresponding author :).
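
Mapping to several candidate genomes does not need the full data set. Below is a rough sketch that aligns only the first N read pairs against each index and prints the overall rate. It assumes `hisat2` is on the PATH (or that the `HISAT2` variable points to it) and uses the `-u` option to stop after the first N pairs. The function name `screen_references` and the index paths in the usage line are hypothetical.

```shell
# Hypothetical helper: align the first N read pairs against each candidate
# HISAT2 index and print "<index><TAB><overall alignment rate>".
# Arguments: read 1 file, read 2 file, number of pairs, then one or more indexes.
screen_references() {
    reads_1=$1
    reads_2=$2
    n_pairs=$3
    shift 3
    for index in "$@"; do
        # -u stops after the first N pairs; the alignment summary goes to
        # stderr, so it is merged into the pipe and the SAM output is dropped.
        rate=$("${HISAT2:-hisat2}" -u "$n_pairs" -x "$index" \
                   -1 "$reads_1" -2 "$reads_2" -S /dev/null 2>&1 |
               awk '/overall alignment rate/ { print $1 }')
        printf '%s\t%s\n' "$index" "$rate"
    done
}

# Usage (index paths are illustrative):
#   screen_references "$Fastq_dir/sample_1.fastq.gz" "$Fastq_dir/sample_2.fastq.gz" 100000 \
#       /path/to/mouse_index /path/to/human_index
```

A large gap between the rates of two indexes is a strong hint about which species the reads come from.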

written 6 weeks ago by Buffo (1.1k)
Carlo Yague (3.9k) wrote:

Given the extremely low mapping rate, my guess would be that your data is not mouse RNA. You can try to manually pick a few reads and BLAST them.

By the way, where does your data come from?

It is also quite unusual to not have overrepresented sequences in RNA-seq data.
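
Picking a few reads for BLAST, as suggested above, can be sketched like this. It assumes a gzipped FASTQ with standard 4-line records; the function name `fastq_head_to_fasta` and the output file name in the usage line are hypothetical.

```shell
# Hypothetical helper: print the first N reads of a gzipped FASTQ file as FASTA.
# Arguments: FASTQ file (gzipped), number of reads (default 5).
fastq_head_to_fasta() {
    fastq_gz=$1
    n_reads=${2:-5}
    # Each read occupies 4 lines: header, sequence, "+", qualities.
    gzip -dc "$fastq_gz" | head -n "$((n_reads * 4))" |
        awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }'
}

# Usage (output name is illustrative):
#   fastq_head_to_fasta "$Fastq_dir/sample_1.fastq.gz" 5 > sample_reads.fa
```

The resulting FASTA records can be pasted into the NCBI BLAST web form to see which organism the best hits come from.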

modified 6 weeks ago • written 6 weeks ago by Carlo Yague (3.9k)

Hi Yague, thank you so much for getting back to me. Your comment definitely helps a lot! My RNA-seq file comes from an Illumina HiSeq 2500 platform and a human sample :).

modified 6 weeks ago • written 6 weeks ago by oghzzang (10)