Question

Mapping reads to an allopolyploid species reference with STAR or HISAT2: low percentage of uniquely mapped reads

6.5 years ago

clingyun ▴ 20

Hello,

I am mapping RNA-seq reads to a reference genome. I am confident that the reads and the reference genome come from the same variety; both were downloaded from the internet. I first trimmed the paired reads with Trimmomatic: removing the TruSeq adapters, cutting LEADING and TRAILING bases with a quality score below 4, and removing low-quality sequence with SLIDINGWINDOW:4:15. I then checked the reads with FastQC. The filtered reads look fine: the adapters are removed, the average quality score per read is 37, and no read has a score below 26.

Then I ran STAR with default parameters. Result summary:

9472216 reads; of these:
  9472216 (100.00%) were paired; of these:
    6353973 (67.08%) aligned concordantly 0 times
    2570774 (27.14%) aligned concordantly exactly 1 time
    547469 (5.78%) aligned concordantly >1 times
    ----
    6353973 pairs aligned concordantly 0 times; of these:
      8552 (0.13%) aligned discordantly 1 time
    ----
    6345421 pairs aligned 0 times concordantly or discordantly; of these:
      12690842 mates make up the pairs; of these:
        12622901 (99.46%) aligned 0 times
        51272 (0.40%) aligned exactly 1 time
        16669 (0.13%) aligned >1 times
33.37% overall alignment rate

Only 27.14% of the pairs aligned concordantly exactly 1 time.
I also tried other aligners such as HISAT2. It shows "... 2570774 (27.14%) aligned concordantly exactly 1 time ...".
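To make the percentages easier to interpret, here is a small sketch that recomputes them from the counts in the summary. It assumes the overall rate counts each aligned pair as two reads and each singly aligned mate as one read, divided by the total number of reads (twice the number of pairs); the variable names are only illustrative.

```python
# Counts copied from the alignment summary quoted in the question.
total_pairs = 9472216
concordant_once = 2570774    # pairs aligned concordantly exactly 1 time
concordant_multi = 547469    # pairs aligned concordantly >1 times
discordant_once = 8552       # pairs aligned discordantly 1 time
mates_once = 51272           # single mates aligned exactly 1 time
mates_multi = 16669          # single mates aligned >1 times

# Assumption: an aligned pair contributes two reads, a lone mate contributes one.
aligned_pairs = concordant_once + concordant_multi + discordant_once
aligned_reads = 2 * aligned_pairs + mates_once + mates_multi

overall_rate = 100.0 * aligned_reads / (2 * total_pairs)
unique_concordant_rate = 100.0 * concordant_once / total_pairs

print(f"overall alignment rate: {overall_rate:.2f}%")
print(f"pairs aligned concordantly exactly once: {unique_concordant_rate:.2f}%")
```

Under that assumption the counts reproduce the 33.37% and 27.14% in the summary, so about two thirds of the reads did not align at all. The multi-mapping pairs are only 5.78% of the total.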

I don't know whether this result is common, but I think the percentage of uniquely mapped reads may be too low. The data are from an allopolyploid plant. Any comments would be a great help to me.

Thanks

Chen Lingyun

rna-seq Assembly alignment sequencing • 2.0k views

ADD COMMENT • link 6.5 years ago by clingyun ▴ 20

That is always the pain with downloaded data. I also had poor mapping recently, on downloaded human data. I required that, after trimming, both the trailing base quality and the average read quality were > 30, which removed a lot of junk (maybe degraded material or whatever). Still, 1/4 of the samples that survived that treatment had a mapping rate of 75% or less. The first thing to do would be to extract some of the unmapped read pairs and BLAST them against the NCBI nucleotide collection. Do they map to a plant pathogen or rRNA? In my samples, there was strong contamination with common hospital germs. I ended up simply removing samples with a final mapping percentage below 75%. In the end it will probably depend on your level of desperation (meaning how urgently you need exactly these downloaded data) whether the data are good enough to keep.
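As a rough illustration of the "extract some unmapped read pairs" step, the sketch below pulls reads whose own alignment and mate alignment are both missing out of SAM text and writes them as FASTA, ready to paste into a BLAST search. The function name `unmapped_pairs_to_fasta`, the `max_reads` parameter, and the toy records are made up for this example; only the SAM FLAG bits (0x4 read unmapped, 0x8 mate unmapped, 0x40/0x80 first/second in pair) come from the SAM format.

```python
def unmapped_pairs_to_fasta(sam_lines, max_reads=100):
    """Return FASTA text for reads where both the read and its mate are unmapped."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        # 0x4 = read unmapped, 0x8 = mate unmapped
        if flag & 0x4 and flag & 0x8:
            mate = "1" if flag & 0x40 else "2" if flag & 0x80 else "0"
            records.append(f">{name}/{mate}\n{seq}")
            if len(records) >= max_reads:
                break
    return "\n".join(records) + ("\n" if records else "")


# Toy, hypothetical SAM records: one unmapped pair and one mapped read.
demo_sam = [
    "@HD\tVN:1.0\n",
    "toy.1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "toy.1\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n",
    "toy.2\t99\tchr1\t10\t60\t4M\t=\t20\t14\tGGCC\tIIII\n",
]
fasta = unmapped_pairs_to_fasta(demo_sam)
print(fasta, end="")
```

On a real file the same function could be fed an open file handle, for example `unmapped_pairs_to_fasta(open("aligned.sam"))`, where `aligned.sam` is a placeholder name. A few dozen reads are usually enough to see whether the hits point to rRNA, organelles, or a contaminant.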

ADD REPLY • link 6.5 years ago by ATpoint 82k

Thank you very much. I am comparing the mapped and unmapped reads by BLASTing them against the reference genome.

One more curious thing: why do the HISAT2 and STAR output files (SAM format) include some reads that are shown as successfully mapped, while other reads seem not to be mapped at all?

For example, SRR5572144.9952480 seems to be unmapped, while SRR5572144.9952481 is mapped:

SRR5572144.9952480 141 * 0 0 * * 0 0 TTGATTCTGATTTTCAGTACGAATACGAACCGTGAAAGCGTGGCCTAACGATCCTTTAGACCTTTTGAATTTAAAGCTAGAGGTGTCA IIIFIIIIIFIIIIFFIIIIIIIIIIIIIIFFIIFIIFFIFIIIFFFFFFFFFFFFFFFBFBBFFFFFBBFFFFFFFFFBBBFBBFFB YT:Z:UP

SRR5572144.9952481 99 C_Quinoa_Scaffold_1412 1972533 60 88M = 1972566 121 ATGCAGAATTTCAAGCAACCACACACACTACAAGAATCAGGTCGAATTTCGACGGAGACAGACACGTAATTTCGTAGGAAAAAATCGG IIBFFIFFFIFIIIIIIIIIIIIIIIIIFIIIIIIIIIIFIFIIIIFIIIIFFFFFFFFFFFBBFFFBFFFFFFFFBFFFFBBFBBB7 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:88 YS:i:0 YT:Z:CP NH:i:1
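The second column of each SAM record is a bitwise FLAG, and decoding it shows why both kinds of records appear in one file. A SAM file can hold unaligned reads too, marked with the 0x4 bit, `*` as the reference name, and position 0. The sketch below decodes the two flags from the example; the table follows the FLAG bits defined in the SAM format, while the helper name `decode_flag` is just for illustration.

```python
# FLAG bits as defined in the SAM format.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


for flag in (141, 99):
    print(flag, decode_flag(flag))
```

Flag 141 decodes to a paired read that is second in its pair, with both the read and its mate unmapped, which matches the `*` reference and the `YT:Z:UP` tag. Flag 99 decodes to the first read of a properly paired alignment whose mate is on the reverse strand. So the file is not inconsistent: unaligned reads are simply kept in the output. If only aligned records are wanted, HISAT2 has a `--no-unal` option, and filtering on the 0x4 bit afterwards achieves the same thing.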

Thanks

ADD REPLY • link 6.5 years ago by clingyun ▴ 20

In other words, is there any method or criterion that shows whether a mapping result is good? Thanks!

ADD REPLY • link 6.5 years ago by clingyun ▴ 20