Extremely low alignment rate running HISAT2 and TopHat2
4.1 years ago
oghzzang ▴ 40

Hi. I'm trying to map paired-end RNA-seq reads to GRCm38 (mm10) using HISAT2 and TopHat2, but the mapping rate is only about 0-5%.

(The data were sequenced on a HiSeq 2500, and the fragment size is 300 bp.)

1.fastqc

1) fastqc summary

PASS Basic Statistics

PASS Per base sequence quality

PASS Per tile sequence quality

PASS Per sequence quality scores

FAIL Per base sequence content (the plot looks like this image: https://rtsf.natsci.msu.edu/_rtsf/assets/Image/fastqc_images/TruSeqRNAPerBaseSeqContent.png)

PASS Per sequence GC content

PASS Per base N content

PASS Sequence Length Distribution

FAIL Sequence Duplication Levels

PASS Overrepresented sequences

PASS Adapter Content

2) read information

Measure Value

Filename sample_1.fastq.gz

File type Conventional base calls

Encoding Sanger / Illumina 1.9

Total Sequences 44728504

Sequences flagged as poor quality 0

Sequence length 101

%GC 50

2.Hisat

1) command

$AnacondaBin/hisat2 \
 -p 8 \
 --rg-id=sample \
 --rg SM:sample --rg LB:LB --rg PL:Illumina --rg PU:sample \
 -x $Reference_dir/Mus_musculus/NCBI/hisatIndex/GRCm38 \
 --dta \
 --rna-strandness FR \
 -1 $Fastq_dir/sample_1.fastq.gz \
 -2 $Fastq_dir/sample_2.fastq.gz \
 -S $Working_dir/Analysis/$Analysis_dir/NCBI/Pre_Tophat/sample_pe.sam 2

2) Result

44728504 reads; of these:

44728504 (100.00%) were paired; of these:

44358669 (99.17%) aligned concordantly 0 times

331704 (0.74%) aligned concordantly exactly 1 time

38131 (0.09%) aligned concordantly >1 times

----

44358669 pairs aligned concordantly 0 times; of these:

  11328 (0.03%) aligned discordantly 1 time

----

44347341 pairs aligned 0 times concordantly or discordantly; of these:

  88694682 mates make up the pairs; of these:

    87830960 (99.03%) aligned 0 times

    735195 (0.83%) aligned exactly 1 time

    128527 (0.14%) aligned >1 times

1.82% overall alignment rate
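As a cross-check of the summary above, the 1.82% figure can be reproduced from the other counts. This is only a sketch: the function name `overall_alignment_rate` is made up for illustration, and the formula (aligned mates divided by total mates) is inferred from the numbers in this report, not taken from the HISAT2 source.

```shell
# Recompute the "overall alignment rate" from a paired-end alignment summary.
# Arguments: total_pairs concordant_1 concordant_multi discordant_1 mates_1 mates_multi
overall_alignment_rate() {
  awk -v total="$1" -v c1="$2" -v cm="$3" -v d1="$4" -v m1="$5" -v mm="$6" 'BEGIN {
    # Each aligned pair contributes two mates; the last two counts are single mates.
    aligned_mates = 2 * (c1 + cm + d1) + m1 + mm
    printf "%.2f\n", 100 * aligned_mates / (2 * total)
  }'
}

# Counts copied from the HISAT2 result above.
overall_alignment_rate 44728504 331704 38131 11328 735195 128527
```

With the counts from this post the function prints 1.82, so the summary is internally consistent and the low rate is not a reporting glitch.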

3.Tophat

1) command

## The annotation (genes.gtf) and the Bowtie2 index are from https://ccb.jhu.edu/software/tophat/igenomes.shtml
$AnacondaBin/tophat2 \
 --GTF $Reference_dir/Mus_musculus/UCSC/mm10/Annotation/Archives/archive-2015-07-17-14-33-26/Genes/genes.gtf \
 --output-dir $Working_dir/Analysis/$Analysis_dir/Tophat \
 --num-threads 1 \
 $Reference_dir/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
 $Fastq_dir/sample_1.fastq.gz \
 $Fastq_dir/sample_2.fastq.gz

2) result

Left reads:

      Input     :  44728504

       Mapped   :    355987 ( 0.8% of input)

        of these:      7756 ( 2.2%) have multiple alignments (0 have >20)

Right reads:

      Input     :  44728504

       Mapped   :    347193 ( 0.8% of input)

        of these:      7342 ( 2.1%) have multiple alignments (0 have >20)

0.8% overall read mapping rate.

Aligned pairs: 159136

 of these:      1209 ( 0.8%) have multiple alignments

                 218 ( 0.1%) are discordant alignments

0.4% concordant pair alignment rate.

4. Other attempts

1) I trimmed the first 10 bp from the read 1 and read 2 FASTQ files.

--> The alignment rate was still extremely low.

2) I have seen this comment:

Reference species diverse

RNA-Seq MM10 alignment • 2.7k views

I have the same problem! Did you download the index from the HISAT2 site? I did, and even with mm9 I get the same alignment rate. I am using public NGS data, which is supposed to be mouse :(


Did you check your data source? I checked mine and found that it was not mouse sequence (following Carlo Yague's comment).

After I mapped my data to the human reference, I got a 95% mapping rate. I built the HISAT2 index with the following pipeline.

  1. Download Reference genome
    https://ccb.jhu.edu/software/tophat/igenomes.shtml

  2. Build hisat2 index
    echo "2-1. Build Hisat2 index (Default Options)"
    $AnacondaBin/hisat2-build \
     $Reference_dir/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/hisat2_index/mm10_genome.fa \
     mm10_genome
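When the same reads are aligned against more than one candidate reference, as described above, the overall rate can be pulled out of each HISAT2 summary for a side-by-side comparison. This is a rough sketch: it assumes the summary (which HISAT2 writes to stderr) was saved to a file, and the function name and the log file names are hypothetical.

```shell
# Extract the percentage from a HISAT2 alignment summary read on stdin.
hisat2_overall_rate() {
  awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }'
}

# Hypothetical log names, one per candidate reference (e.g. saved with "2> mouse.log"):
# for log in mouse.log human.log; do
#   printf '%s\t%s\n' "$log" "$(hisat2_overall_rate < "$log")"
# done
```

A large gap between references, like the 1.82% versus 95% reported in this thread, points to a species mismatch and not to a parameter problem.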


Yes, of course. I have mapped the data to several related genomes, including human. In the end I will write to the corresponding author :).

4.1 years ago

Given the extremely low mapping rate, my guess would be that your data is not mouse RNA. You can try manually picking a few reads and BLASTing them.

By the way, where does your data come from?

It is also quite unusual not to have overrepresented sequences in RNA-seq data.
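The suggestion to pick a few reads and BLAST them can be sketched as follows. This is a minimal example, not the poster's method: the function name `fastq_head_to_fasta` and the output file name are made up here, and it assumes a standard four-line-per-record gzipped FASTQ file.

```shell
# Print the first N reads of a gzipped FASTQ file as FASTA records.
# Arguments: path to .fastq.gz, number of reads (default 5)
fastq_head_to_fasta() {
  fastq_gz="$1"
  n_reads="${2:-5}"
  # Four lines per FASTQ record: header, sequence, "+", qualities.
  gzip -cd "$fastq_gz" | head -n "$((n_reads * 4))" |
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: replace leading "@" with ">"
         NR % 4 == 2 { print }'                     # sequence line, unchanged
}

# Example with the file from the question (output name is hypothetical):
# fastq_head_to_fasta "$Fastq_dir/sample_1.fastq.gz" 5 > sample_1_head.fa
```

The resulting FASTA can be pasted into the NCBI BLAST web form (blastn against the nucleotide collection); the organism of the top hits should show which species the library actually comes from.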


Hi Yague, thank you so much for getting back to me. Your comment definitely helps a lot! My RNA-seq data come from an Illumina HiSeq 2500 run of a human sample :).
