I used bam2fastx to convert the unmapped.bam output from TopHat into new fastq files. Then I re-ran Tophat with those fastq files against a bacterial genome and got the following output:
Mapped: 656 ( 0.0% of input)
of these: 5 ( 0.8%) have multiple alignments (0 have >20)
Mapped: 1303 ( 0.0% of input)
of these: 1714 (131.5%) have multiple alignments (0 have >20)
0.0% overall read alignment rate.
Aligned pairs: 62
of these: 0 ( 0.0%) have multiple alignments
and: 0 ( 0.0%) are discordant alignments
Two things seem off to me 1) a major discrepancy between the number of mapped left versus right reads and 2) having more reads with multiple alignments than the number of mapped reads in the first place. Is that expected? I think it's a sign that something somewhere went wrong. I've read elsewhere that bam2fastx doesn't work well with paired end data because unmapped.bam is missing some samtools flags. Could that be the problem? If so, is there a workaround?
Here were the commands I used to generate these files:
samtools sort -n unmapped.bam unmapped_sort
bam2fastx -q -Q -A -o new.fastq unmapped_sort.bam
tophat -p 8 -I 500 -r 200 --mate-std-dev 80 --no-mixed -o /NewTophat/Directory OrganismAnnotation /Directory/With/Files/new.1.fastq Directory/With/Files/new.2.fastq
Versions: I used TopHat 2.0.9 and samtools 0.1.19.
Another possibility is that unmapped.bam contains a lot of reads with poor quality. When I try to omit those from the new fastq files (by running bam2fastx -q -A instead of bam2fastx -q -Q -A) and then rerun tophat, then I get an error, "could not find mate pair for read...." which suggests that a lot of the reads in unmapped.bam had one read of good quality and another mate of poor quality. So if I remove those low quality reads, tophat throws an error because it's expected paired end data but getting a mix of paired reads and reads whose mates were thrown out for being low quality. If that's the case, then surely there is some way to pull only unmapped mates of good quality from unmapped.bam?
When I run samtools flagstat unmapped_sort.bam, this is the output:
8903498 + 2048970 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:0.00%)
8903498 + 2048970 paired in sequencing
4380074 + 1096160 read1
4523424 + 952810 read2
0 + 0 properly paired (0.00%:0.00%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:0.00%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
BUT if there really are 0 properly paired reads, then how in the TopHat2 output are there 62 concordant aligned pairs? I think unmapped.bam must be missing samtools flags. Any ideas for a workaround?