Question: Tags in ENCODE RNA Datasets
6.6 years ago, Frenkiboy wrote:

While playing around with the ENCODE RNA-seq datasets, I noticed that some of the paired-end files have oddly set flags.

The following few lines are from the CSHL/wgEncodeCshlLongRnaSeqAdrenalAdult8wksAlnRep1.bam file.

PAN_0073:1:69:16755:10476#0     163     chr1    3190766 255     76M     =       3190867 177     GTGGAATAATTTGTTAATTGTGAAGTGTATGGTTTTGTATTTTGAAACCAAACAACAGTAGCTGAGGTAGTTAAAT    hhhghhhhhhhhhhhhhhhhhhgchhhhhhhhhhhhhghhhhhhghghhhgehghhhhhghhhhhhhedhghggef    XS:A:-
PAN_0073:1:69:16755:10476#0     115     chr1    3190867 255     76M     =       3190766 -177    TGAGAGAATGGAGAACCAATGTAAGGAGCCCAGACTCTTGCCATCTGGAAGCAGGCTCACCAAGTATGATGGTTTC   ahhfhhhhfehehhhghghgfhhhhhhghhhchghghhhehhfggghhhhghghhghhghhhhhhhdhhhhhhhhh    XS:A:-
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0      163     chr1    3195839 255     76M     =       3195919 156     GCCACTAATTGAGAAGAACTATCAGAGGGAAGTTTTTCTTGGAAAGAGCCAGTCTTGACATGAAGCTTCCTACGTG    fggggggggggggfcgggggfggggcffdfggggggggggggggggegggggggggfgggggeggggggggggggg    XS:A:-
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0      115     chr1    3195919 255     76M     =       3195839 -156    CCTTCTTTCCATGGTAGCCAGGCCTTGCCCTTTCATAAGAAGACATGTGAAGTACCATAATTATGGAGTGGCAGAG    hebaee``bb[ahahhghghhgffhgehfhhhhfhhffafchcfhghhhhhghhghhhhhhhhghhhhhhhhfghh    XS:A:-

As you can see, the flag for the forward-strand read is 163, which is fine, but the flag for the reverse-strand read is 115. That value says that both the read and its mate are on the reverse strand, which would mean the second read in the pair is mapped to the wrong strand.
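
For reference, the flag is a bit field, so each value can be unpacked into its individual bits. The following is a minimal Python sketch using the bit meanings from the SAM specification; the names `SAM_FLAG_BITS` and `decode_flag` are made up for illustration and do not come from any library.

```python
# Bit meanings as defined in the SAM specification.
# The short names are illustrative labels, not official identifiers.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


for flag in (163, 115):
    print(flag, decode_flag(flag))
```

Decoded this way, 163 is the second read of a pair on the forward strand with its mate marked as reverse, while 115 is the first read on the reverse strand with its mate also marked as reverse.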

Is this a feature or a bug of ENCODE data?

If it is a bug, it would be extremely helpful if somebody from the consortium could use their supercomputing powers to fix them.

Tags: seq, encode, rna
6.6 years ago, matted (Boston, United States) wrote:

It looks like this is a relatively isolated problem that they've already fixed. If you look at the file annotation list where you presumably downloaded the file, for your BAM you'll see the line:

wgEncodeCshlLongRnaSeqAdrenalAdult8wksAlnRep1.bam objStatus=replaced - read 2 reverse complemented; project=wgEncode; dataType=RnaSeq;

The key point is that the file has been replaced, and the stated reason involves reverse complementing read 2, which may be what caused the contradictory flags you describe.

The next line shows the fixed BAM, wgEncodeCshlLongRnaSeqAdrenalAdult8wksAlnRep1V2.bam. Your reads are fixed in that one:

PAN_0073:1:69:16755:10476#0 179 chr1 3190766 255 76M = 3190867 177 GTGGAATAATTTGTTAATTGTGAAGTGTATGGTTTTGTATTTTGAAACCAAACAACAGTAGCTGAGGTAGTTAAAT hhhghhhhhhhhhhhhhhhhhhgchhhhhhhhhhhhhghhhhhhghghhhgehghhhhhghhhhhhhedhghggef
PAN_0073:1:69:16755:10476#0 115 chr1 3190867 255 76M = 3190766 -177 TGAGAGAATGGAGAACCAATGTAAGGAGCCCAGACTCTTGCCATCTGGAAGCAGGCTCACCAAGTATGATGGTTTC ahhfhhhhfehehhhghghgfhhhhhhghhhchghghhhehhfggghhhhghghhghhghhhhhhhdhhhhhhhhh
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0 179 chr1 3195839 255 76M = 3195919 156 GCCACTAATTGAGAAGAACTATCAGAGGGAAGTTTTTCTTGGAAAGAGCCAGTCTTGACATGAAGCTTCCTACGTG fggggggggggggfcgggggfggggcffdfggggggggggggggggegggggggggfgggggeggggggggggggg
PRESLEY_0042_FC627A8AAXX:2:95:2727:20071#0 115 chr1 3195919 255 76M = 3195839 -156 CCTTCTTTCCATGGTAGCCAGGCCTTGCCCTTTCATAAGAAGACATGTGAAGTACCATAATTATGGAGTGGCAGAG hebaee``bb[ahahhghghhgffhgehfhhhhfhhffafchcfhghhhhhghhghhhhhhhhghhhhhhhhfghh
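
One way to see the contradiction in the old flags and its absence in the new ones is to compare each read's "mate reverse" bit (0x20) with the other read's own "reverse" bit (0x10). This is only a rough sketch of such a check, not anything ENCODE or samtools provides; the function name `mate_strand_conflicts` is hypothetical.

```python
REVERSE = 0x10       # this read maps to the reverse strand
MATE_REVERSE = 0x20  # this read's mate maps to the reverse strand


def mate_strand_conflicts(flag_a, flag_b):
    """List (claiming_flag, mate_flag) pairs where one read's
    'mate reverse' bit disagrees with its mate's 'reverse' bit."""
    conflicts = []
    if bool(flag_a & MATE_REVERSE) != bool(flag_b & REVERSE):
        conflicts.append((flag_a, flag_b))
    if bool(flag_b & MATE_REVERSE) != bool(flag_a & REVERSE):
        conflicts.append((flag_b, flag_a))
    return conflicts


print(mate_strand_conflicts(163, 115))  # flags from the original BAM
print(mate_strand_conflicts(179, 115))  # flags from the replacement (V2) BAM
```

For the original pair, 115 claims a reverse-strand mate while 163 itself is not marked as reverse, so one conflict is reported. For 179 and 115 both reads are marked as reverse and each says the same of its mate, so the list is empty.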

I'm sorry to bother you, but the reads you posted still do not have the right flags. Correctly mapped paired-end reads should have one of two flag pairs: 99 and 147, or 83 and 163.

This link shows it in a nice way: http://ppotato.files.wordpress.com/2010/08/sam_output2.png

Comment written 6.6 years ago by Frenkiboy

Your statement and the link are not correct in general. "Mapped in proper pair" is solely the judgement of the aligner, per the SAM spec (the full flag description is "each segment properly aligned according to the aligner"). So two reads can map to the same strand and still be flagged as properly paired, if the aligner allows that. I assume this is strand-specific RNA-seq or something similar and that the aligner's output reflects that. Read more about the protocol and the aligner options to be sure.
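
To illustrate that the proper-pair bit (0x2) is independent of the strand bits, here is a small sketch that reports both properties for a pair of flags. The helper name `describe_pair` is made up for this example, and it only inspects the flag values, not the alignments themselves.

```python
PROPER_PAIR = 0x2  # each segment properly aligned according to the aligner
REVERSE = 0x10     # this read maps to the reverse strand


def describe_pair(flag_a, flag_b):
    """Report whether two mates share a strand and whether both
    carry the aligner's proper-pair bit."""
    same_strand = bool(flag_a & REVERSE) == bool(flag_b & REVERSE)
    proper = bool(flag_a & PROPER_PAIR) and bool(flag_b & PROPER_PAIR)
    return {"same_strand": same_strand, "proper_pair": proper}


for pair in ((99, 147), (83, 163), (179, 115)):
    print(pair, describe_pair(*pair))
```

The 99/147 and 83/163 pairs come out as opposite-strand and proper, while 179/115 comes out as same-strand and still proper, which is a valid combination as far as the flag definitions go.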

Comment written 6.6 years ago by matted