Question

Reproduce Encode/Cshl Long Rna-Seq Data Visualization Viewed In Ucsc, But Failed? [Done]

7

Entering edit mode

11.6 years ago

Puriney ▴ 330

Motivation

The ENCODE data comes out, and luckily they provide both .bam file and .bigwig file. Thus, it occurs to me that I want to give a try to reproduce the data visualization with tool: BEDtools and other related tools.

Result

I'll first upload the difference between my-version and official version: enter image description here

Top to Bottom:

Black: my-version-POSitive-strand.bigwig
Blue: Official-version-POSitive-strand.bigwig
Red: Official-version-REVerse-strand.bigwig
Grey: my-version-REVerse-strand.bigwig

From the image, we will find my-version-data and official-version-data roughly share the same peaks, however, my-version-peaks are somehow masked by certain uniform noises. And it drives me crazy.

Note that I know not all the bioinformatics works can be reproduces, but this issue dose not get involved with much algorithms, decisions, etc. Therefore, it's supposed to be reproducible, I think.

Data Set

ENCODE/CSHL long RNA-seq Data set can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/ And here I use K562-chromatin-subcellular fraction (Rep4) to explore as an example:

BAM file ready for my Data Processing;
bigWig Positive signal file ready for uploading to UCSC;
bigwig Reverse signal file ready for uploading to UCSC.

Data Processing

BAM sort

samtools sort wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort

Genome Coverage

I refer to the standard manual of BEDtools, I'll use forward strand as example, and the reverse strand signal is generated in the same way.

genomeCoverageBed -bg -ibam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort -g hg19.chromInfo -strand + >K562-Chromatin-POS-4.bedgraph

Note that I've used -strand flag to separate the two strands.

bedgraphtoBigWig

bedGraphToBigWig executive script available from UCSC exe list

bedGraphToBigWig K562-Chromatin-POS-4.bedgraph hg19.chromInfo K562-Chromatin-POS-4.bigwig

Upload to ftp and finally to UCSC genome browser.

Discussion

I was wondering which filtering step I've missed.

I've checked whether all the reads in the .bam file are unique mapped. As the reads are mapped to genome with a tool named, STAR.. According to the manual and common sense, the mapping quality in .sam file equaling 255 means unique mapped reads. Thus, all the reads in the .bam file are unique mapped after I've check the mapping quality.

Another gene difference Thus, any suggestions?

rna-seq encode visualization bedtools • 11k views

ADD COMMENT • link updated 11.6 years ago by Malachi Griffith 19k • written 11.6 years ago by Puriney ▴ 330

1

Entering edit mode

Welcome to BioStar. This is a great post. Well organized and formatted with step-by-step descriptions of what you did, what the problem is, and how you are currently thinking about the solution. Kudos for answering your own question and providing the update here!

ADD REPLY • link 11.6 years ago by Malachi Griffith 19k

Malachi Griffith · Answer 1 · 2012-10-05

3

Entering edit mode

11.6 years ago

Puriney ▴ 330

eh...hey guys, at first, -split flag comes to my mind when I used coverageBED, but based on the manual of BEDtools, I misunderstood the -split will omit all the intron reads (or maybe it is because my mother tongue is not English). Nevertheless, having added the -split flag, I managed to reproduce the data visualization of ENCODE data. Here is the image:

reproduce

The green one is the POSitive strand signal, and it's exactly the same as the official version.

The key is to add the -split option when one runs the genomeCoverageBed command.

Enjoy

ADD COMMENT • link updated 11.6 years ago by Malachi Griffith 19k • written 11.6 years ago by Puriney ▴ 330

1

Entering edit mode

The uniform noise is led by the missing -split option

ADD REPLY • link 11.6 years ago by Puriney ▴ 330

score 3 · Answer 2 · 2012-10-05

As you have figured out, the issue is with the interpretation of the -split option in genomeCoverageBed. From the documentation:

-split

Treat “split” BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).

Since your data is RNA-seq data aligned with a split read (aka splice aware) aligner, some reads consist of blocks of exon alignment separated by large introns indicated by the CIGAR "N" operator. Without the -split option, genomeCoverageBed calculates coverage across the entire interval without taking introns into account:

exon1-<intron>-exon2

With the -split option genomeCoverageBed calculates coverage using the interval of exon1 and exon2 separately and does not add coverage within the introns.