While playing around with the ENCODE RNA-seq datasets, I noticed that some of the paired-end files have oddly set flags.
The following few lines are from the CSHL/wgEncodeCshlLongRnaSeqAdrenalAdult8wksAlnRep1.bam file.
PAN_0073:1:69:16755:10476#0 163 chr1 3190766 255 76M = 3190867 177 GTGGAATAATTTGTTAATTGTGAAGTGTATGGTTTTGTATTTTGAAACCAAACAACAGTAGCTGAGGTAGTTAAAT hhhghhhhhhhhhhhhhhhhhhgchhhhhhhhhhhhhghhhhhhghghhhgehghhhhhghhhhhhhedhghggef XS:A:-
PAN_0073:1:69:16755:10476#0 115 chr1 3190867 255 76M = 3190766 -177 TGAGAGAATGGAGAACCAATGTAAGGAGCCCAGACTCTTGCCATCTGGAAGCAGGCTCACCAAGTATGATGGTTTC ahhfhhhhfehehhhghghgfhhhhhhghhhchghghhhehhfggghhhhghghhghhghhhhhhhdhhhhhhhhh XS:A:-
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0 163 chr1 3195839 255 76M = 3195919 156 GCCACTAATTGAGAAGAACTATCAGAGGGAAGTTTTTCTTGGAAAGAGCCAGTCTTGACATGAAGCTTCCTACGTG fggggggggggggfcgggggfggggcffdfggggggggggggggggegggggggggfgggggeggggggggggggg XS:A:-
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0 115 chr1 3195919 255 76M = 3195839 -156 CCTTCTTTCCATGGTAGCCAGGCCTTGCCCTTTCATAAGAAGACATGTGAAGTACCATAATTATGGAGTGGCAGAG hebaee``bb[ahahhghghhgffhgehfhhhhfhhffafchcfhghhhhhghhghhhhhhhhghhhhhhhhfghh XS:A:-
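As you can see, the flag for the forward read is set to 163, which is OK, but the flag for the read on the reverse strand is set to 115 rather than the 83 I would expect. Flag 115 has the "mate reverse strand" bit (0x20) set, which claims that its mate is also mapped to the reverse strand, even though the mate (flag 163) is on the forward strand.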
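To make the flag arithmetic explicit, here is a minimal sketch in plain Python that decodes the two flag values bit by bit. The bit meanings follow the SAM specification; the names `SAM_FLAG_BITS`, `decode_flag` and `mates_consistent` are my own and purely illustrative, not part of samtools or any library.

```python
# Bit meanings as defined in the SAM specification; the short labels are my own.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def mates_consistent(flag_a, flag_b):
    """True if each record's mate-strand bit matches the other record's own strand bit."""
    return (bool(flag_a & 0x20) == bool(flag_b & 0x10)
            and bool(flag_b & 0x20) == bool(flag_a & 0x10))


print(decode_flag(163))  # ['paired', 'proper_pair', 'mate_reverse', 'second_in_pair']
print(decode_flag(115))  # ['paired', 'proper_pair', 'reverse', 'mate_reverse', 'first_in_pair']
print(mates_consistent(163, 115))  # False
print(mates_consistent(163, 83))   # True
```

So the pair 163/115 contradicts itself on the strand of the second read, whereas 163/83 would be consistent.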
Is this a feature or a bug of ENCODE data?
If it is a bug, it would be extremely helpful if somebody from the consortium could use their supercomputing powers to fix the affected files.