I was given some TCGA BAM files and asked to perform a realignment with some specific requirements. While perusing the results of an alignment in IGV I noticed something strange. As far as I can tell, everything in the read data pop-up dialogs tells me that I'm looking at paired-end reads that mapped as pairs, except for the YT tag which is always
The two read names in a mapped pair are 100% identical, and the mates were pulled from separate FASTQ files. I'm seeing this with every read I check, and I've spot-checked reads from random places on five different chromosomes.
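Rather than relying on spot checks in IGV, the YT values could be tallied across the whole alignment. Below is a minimal sketch that assumes the alignments have been dumped to SAM text first; the function name `tally_yt` and the "missing" label are mine, not part of any tool. For reference, the Bowtie 2 manual lists UU, CP, DP and UP as the possible YT:Z values.

```python
from collections import Counter


def tally_yt(sam_lines):
    """Count YT:Z tag values over SAM alignment lines.

    Header lines (starting with '@') and blank lines are skipped.
    Records without a YT tag are counted under the label 'missing'.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        value = "missing"
        # Optional tags begin at the 12th SAM column (index 11).
        for tag in fields[11:]:
            if tag.startswith("YT:Z:"):
                value = tag[len("YT:Z:"):]
                break
        counts[value] += 1
    return counts
```

With a SAM dump of the TopHat output (for example one made with `samtools view -h`, saved under a hypothetical name such as `accepted_hits.sam`), `tally_yt(open("accepted_hits.sam"))` returns a counter showing whether the value really is the same for every record.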
Here's the tophat v2.0.9 command that I ran:
/usr/local/bin/tophat --output-dir /data/deedee/rnaseq/efb596b4 --max-multihits 2 -p 4 --b2-very-sensitive --library-type fr-unstranded /data/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome efb596b4_R1.fastq efb596b4_R2.fastq
Does anyone have any ideas about what's going on here? More background follows in case it's useful:
In the initial BAM file, the read names were a mess. They had /2 attached to the end of the read names, sometimes twice. I wrote a script to remove these /2 values from the ends of the read names. I used bedtools bamtofastq to convert these query-sorted, cleaned BAM files to a pair of FASTQ files. From there I ran the tophat command above.
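The cleanup step was roughly along these lines. This is a sketch rather than the exact script: the names `strip_mate_suffix` and `first_mismatch` are illustrative, and handling /1 as well as /2 is an assumption. The second function is a sanity check that the two FASTQ files list the same read names in the same order.

```python
import re

# One or more trailing "/1" or "/2" suffixes, e.g. "name/2" or "name/2/2".
# Treating "/1" the same way as "/2" is an assumption.
_MATE_SUFFIX = re.compile(r"(?:/[12])+$")


def strip_mate_suffix(name):
    """Remove every trailing /1 or /2 suffix from a read name."""
    return _MATE_SUFFIX.sub("", name)


def first_mismatch(r1_names, r2_names):
    """Return the index of the first pair whose cleaned names differ.

    Returns None when all compared names agree. Only the overlapping
    prefix of the two sequences is compared, so lengths should be
    checked separately.
    """
    for i, (a, b) in enumerate(zip(r1_names, r2_names)):
        if strip_mate_suffix(a) != strip_mate_suffix(b):
            return i
    return None
```

For example, `strip_mate_suffix("READ/2/2")` gives `"READ"`, and running `first_mismatch` over the header lines of the R1 and R2 files would flag the first position where the mates fall out of sync.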