Hello!
I'm attempting to map strand-specific paired-end reads to the Arabidopsis genome (TAIR10) using TopHat v2.0.12. When I upload these data to JBrowse there are reads mapping to both the plus strand and minus strand, so it looks like I'm losing strand specific information.
I've tried solving the problem, and it looks like the issue occurs during the actual tophat mapping step. The command that I use is:
tophat \
--library-type fr-secondstrand \
--read-mismatches 2 \
--read-edit-dist 2 \
--max-multihits 10 \
--b2-very-sensitive \
--transcriptome-max-hits 10 \
--no-coverage-search \
--output-dir topOut_wholeRoot_polyA_rep1_trimmed \
-G TAIR10_GFF3_genes.gtf \
-p 2 \
bowtie2_TAIR10/TAIR10 \
wholeRoot_polyA_rep1_trimmed_1.fastq \
wholeRoot_polyA_rep1_trimmed_2.fastq
I ran bamToBed to take the output bam file and convert it to a bed file so I could see what was going on, and it looks like the read pairs are mapping to opposite strands. For example the bam files reads:
HWI-D00550:272:C75M0ANXX:6:1101:10732:9653 99 Chr1 34896 50 125M = 34898 127 AGCGCATACAGGAAGAAGTCCATGAGAAGCCCACCAAGCAGTTGCAGCAGCTACTGTAGCAGCGGCAATGGCAGTTATACTTGGAGGAGAAGAGCTCATTGGGGTTGATGAATCACCAGAATTCC BCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGEEEGGGGFGGGGGGGGGGGGGGGGGCGGGGGGGGCFEGCGGGGGGGGGGGEGGEGDGGGGGGGGCAGFCGGGGGGGGGGGGGGGG AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:125 YT:Z:UU XS:A:- NH:i:1
HWI-D00550:272:C75M0ANXX:6:1101:10732:9653 147 Chr1 34898 50 125M = 34896 -127 CGCATACAGGAAGAAGTCCATGAGAAGCCCACCAAGCAGTTGCAGCAGCTACTGTAGCAGCGGCAATGGCAGTTATACTTGGAGGAGAAGAGCTCATTGGGGTTGATGAATCACCAGAATTCCCG BGGGGGEGGFGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGBCCCB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:125 YT:Z:UU XS:A:- NH:i:1
While the bamToBed output reads:
Chr1 34895 35020 HWI-D00550:272:C75M0ANXX:6:1101:10732:9653/1 50 +
Chr1 34897 35022 HWI-D00550:272:C75M0ANXX:6:1101:10732:9653/2 50 -
My understanding is that for all of my downstream analyses these paired reads need to be on the same strand. Is there a problem with my TopHat command? Does anyone have any recommendations to correct this post-mapping?
Thank you for the help!
Thank you for the advice. The pairs ARE "properly paired". How can I got about downstream analyses and uploading this to JBrowse if each fragment maps to both the + and - strands? Is there something I can do after mapping to have paired reads map to the same strand?
Thank you!
I know nothing about JBrowse so can't help here. For all downstream analyses I can think of, what matters after mapping PE reads is if they are flagged as properly paired or not, and if their mapping quality is good.
But if I perform strand specific analyses with HTSeq or even bedtools operations then half of the reads will map to the wrong strand, so the read counts will be off for any genes that overlap on opposite strands. How do you account for that during analyses? Thank you for the help, these are my first paired end data sets I'm analyzing.
You are incorrect about htseq-count. From its manual: