Question: Reproduce Encode/Cshl Long Rna-Seq Data Visualization Viewed In Ucsc, But Failed? [Done]
7
gravatar for Puriney
6.3 years ago by
Puriney330
New York City
Puriney330 wrote:

Motivation

The ENCODE data comes out, and luckily they provide both .bam file and .bigwig file. Thus, it occurs to me that I want to give a try to reproduce the data visualization with tool: BEDtools and other related tools.

Result

I'll first upload the difference between my-version and official version: enter image description here

Top to Bottom:

  • Black: my-version-POSitive-strand.bigwig
  • Blue: Official-version-POSitive-strand.bigwig
  • Red: Official-version-REVerse-strand.bigwig
  • Grey: my-version-REVerse-strand.bigwig

From the image, we will find my-version-data and official-version-data roughly share the same peaks, however, my-version-peaks are somehow masked by certain uniform noises. And it drives me crazy.

Note that I know not all the bioinformatics works can be reproduces, but this issue dose not get involved with much algorithms, decisions, etc. Therefore, it's supposed to be reproducible, I think.

Data Set

ENCODE/CSHL long RNA-seq Data set can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/ And here I use K562-chromatin-subcellular fraction (Rep4) to explore as an example:

Data Processing

BAM sort

samtools sort wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort

Genome Coverage

I refer to the standard manual of BEDtools, I'll use forward strand as example, and the reverse strand signal is generated in the same way.

genomeCoverageBed -bg -ibam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort -g hg19.chromInfo -strand + >K562-Chromatin-POS-4.bedgraph

Note that I've used -strand flag to separate the two strands.

bedgraphtoBigWig

bedGraphToBigWig executive script available from UCSC exe list

bedGraphToBigWig K562-Chromatin-POS-4.bedgraph hg19.chromInfo K562-Chromatin-POS-4.bigwig

Upload to ftp and finally to UCSC genome browser.

Discussion

I was wondering which filtering step I've missed.

I've checked whether all the reads in the .bam file are unique mapped. As the reads are mapped to genome with a tool named, STAR.. According to the manual and common sense, the mapping quality in .sam file equaling 255 means unique mapped reads. Thus, all the reads in the .bam file are unique mapped after I've check the mapping quality.

Another gene difference Thus, any suggestions?

ADD COMMENTlink modified 6.3 years ago by Malachi Griffith17k • written 6.3 years ago by Puriney330
1

Welcome to BioStar. This is a great post. Well organized and formatted with step-by-step descriptions of what you did, what the problem is, and how you are currently thinking about the solution. Kudos for answering your own question and providing the update here!

ADD REPLYlink written 6.3 years ago by Malachi Griffith17k
3
gravatar for Puriney
6.3 years ago by
Puriney330
New York City
Puriney330 wrote:

eh...hey guys, at first, -split flag comes to my mind when I used coverageBED, but based on the manual of BEDtools, I misunderstood the -split will omit all the intron reads (or maybe it is because my mother tongue is not English). Nevertheless, having added the -split flag, I managed to reproduce the data visualization of ENCODE data. Here is the image:

reproduce

The green one is the POSitive strand signal, and it's exactly the same as the official version.

The key is to add the -split option when one runs the genomeCoverageBed command.

Enjoy

ADD COMMENTlink modified 6.3 years ago by Malachi Griffith17k • written 6.3 years ago by Puriney330
1

The uniform noise is led by the missing -split option

ADD REPLYlink written 6.3 years ago by Puriney330
3
gravatar for Malachi Griffith
6.3 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith17k wrote:

As you have figured out, the issue is with the interpretation of the -split option in genomeCoverageBed. From the documentation:

-split

Treat “split” BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).

Since your data is RNA-seq data aligned with a split read (aka splice aware) aligner, some reads consist of blocks of exon alignment separated by large introns indicated by the CIGAR "N" operator. Without the -split option, genomeCoverageBed calculates coverage across the entire interval without taking introns into account:

exon1-<intron>-exon2

With the -split option genomeCoverageBed calculates coverage using the interval of exon1 and exon2 separately and does not add coverage within the introns.

ADD COMMENTlink written 6.3 years ago by Malachi Griffith17k

Thanks for your comments. And yes, exactly. The uniform noise is resulted by the missing -split flag. I mistook genomeCoverageBed will omit all the intron reads. I added -split because I think genomeCoverageBed and coverageBed are twins. I knew coverageBed will not omit thus I give it a try when I'm later running genomeCoverageBed. Then, everything goes well.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by Puriney330
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1261 users visited in the last hour