Reproduce Encode/Cshl Long Rna-Seq Data Visualization Viewed In Ucsc, But Failed? [Done]
2
7
Entering edit mode
9.9 years ago
Puriney ▴ 330

## Motivation

The ENCODE data comes out, and luckily they provide both .bam file and .bigwig file. Thus, it occurs to me that I want to give a try to reproduce the data visualization with tool: BEDtools and other related tools.

## Result

I'll first upload the difference between my-version and official version:

Top to Bottom:

• Black: my-version-POSitive-strand.bigwig
• Blue: Official-version-POSitive-strand.bigwig
• Red: Official-version-REVerse-strand.bigwig
• Grey: my-version-REVerse-strand.bigwig

From the image, we will find my-version-data and official-version-data roughly share the same peaks, however, my-version-peaks are somehow masked by certain uniform noises. And it drives me crazy.

Note that I know not all the bioinformatics works can be reproduces, but this issue dose not get involved with much algorithms, decisions, etc. Therefore, it's supposed to be reproducible, I think.

## Data Set

ENCODE/CSHL long RNA-seq Data set can be found here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/ And here I use K562-chromatin-subcellular fraction (Rep4) to explore as an example:

## Data Processing

### BAM sort

 samtools sort wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort 

### Genome Coverage

I refer to the standard manual of BEDtools, I'll use forward strand as example, and the reverse strand signal is generated in the same way.

 genomeCoverageBed -bg -ibam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort -g hg19.chromInfo -strand + >K562-Chromatin-POS-4.bedgraph 

Note that I've used -strand flag to separate the two strands.

### bedgraphtoBigWig

bedGraphToBigWig executive script available from UCSC exe list

bedGraphToBigWig K562-Chromatin-POS-4.bedgraph hg19.chromInfo K562-Chromatin-POS-4.bigwig

### Discussion

I was wondering which filtering step I've missed.

I've checked whether all the reads in the .bam file are unique mapped. As the reads are mapped to genome with a tool named, STAR.. According to the manual and common sense, the mapping quality in .sam file equaling 255 means unique mapped reads. Thus, all the reads in the .bam file are unique mapped after I've check the mapping quality.

Thus, any suggestions?

rna-seq encode visualization bedtools • 9.6k views
1
Entering edit mode

Welcome to BioStar. This is a great post. Well organized and formatted with step-by-step descriptions of what you did, what the problem is, and how you are currently thinking about the solution. Kudos for answering your own question and providing the update here!

3
Entering edit mode
9.9 years ago
Puriney ▴ 330

eh...hey guys, at first, -split flag comes to my mind when I used coverageBED, but based on the manual of BEDtools, I misunderstood the -split will omit all the intron reads (or maybe it is because my mother tongue is not English). Nevertheless, having added the -split flag, I managed to reproduce the data visualization of ENCODE data. Here is the image:

The green one is the POSitive strand signal, and it's exactly the same as the official version.

The key is to add the -split option when one runs the genomeCoverageBed command.

Enjoy

1
Entering edit mode

The uniform noise is led by the missing -split option

3
Entering edit mode
9.9 years ago

As you have figured out, the issue is with the interpretation of the -split option in genomeCoverageBed. From the documentation:

-split

Treat “split” BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).

Since your data is RNA-seq data aligned with a split read (aka splice aware) aligner, some reads consist of blocks of exon alignment separated by large introns indicated by the CIGAR "N" operator. Without the -split option, genomeCoverageBed calculates coverage across the entire interval without taking introns into account:

exon1-<intron>-exon2

With the -split option genomeCoverageBed calculates coverage using the interval of exon1 and exon2 separately and does not add coverage within the introns.

0
Entering edit mode

Thanks for your comments. And yes, exactly. The uniform noise is resulted by the missing -split flag. I mistook genomeCoverageBed will omit all the intron reads. I added -split because I think genomeCoverageBed and coverageBed are twins. I knew coverageBed will not omit thus I give it a try when I'm later running genomeCoverageBed. Then, everything goes well.