Question

How to analyze CAGE-Seq data?

1

7.0 years ago

heir_of_isildur88 ▴ 30

Hi all,

I'm now 6 months into the field of NGS and the analysis of sequencing data. I have been working on RNA-Seq data and have recently started to venture into CAGE-Seq data.

How do we actually map CAGE-Seq data? We did paired-end sequencing for the CAGE data and got the fastq files. After cleaning, I got clean read files for read1 and read2, but the two files are of different sizes. When I ran them through STAR, it reported that mapping could not be done because the run finished for one read file while the other had not yet finished.

Is this normal for CAGE-Seq data? Or should we map read1 only, since we are only interested in the TSS, i.e. reads sequenced from the 5' end?

I am a bit confused about how to process CAGE data here.

Please give some guidance & advice. Thank you very much.

CAGE • 3.8k views

ADD COMMENT • link updated 7.0 years ago by Charles Plessy ★ 2.9k • written 7.0 years ago by heir_of_isildur88 ▴ 30

0

After cleaning, I got the clean reads files for read1 and read2 but both of them are of different size.

Can you elaborate on the "cleaning" part?
And do you mean different read lengths or a different number of reads in R1 vs R2?
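
One quick way to tell the two cases apart is to tally both files directly. Below is a minimal Python sketch, assuming plain four-line FASTQ records; the function name `fastq_summary` and the in-memory example reads are made up for illustration.

```python
from collections import Counter
import io


def fastq_summary(handle):
    """Return (number of reads, Counter of read lengths) for a 4-line FASTQ stream."""
    n_reads = 0
    lengths = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            n_reads += 1
            lengths[len(line.strip())] += 1
    return n_reads, lengths


# Tiny in-memory stand-ins for R1 and R2 (made-up reads, for illustration only).
r1 = io.StringIO("@a/1\nACGTAC\n+\nIIIIII\n@b/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@a/2\nTTGCAA\n+\nIIIIII\n")

n1, len1 = fastq_summary(r1)
n2, len2 = fastq_summary(r2)
print("R1:", n1, "reads, lengths", dict(len1))
print("R2:", n2, "reads, lengths", dict(len2))
if n1 != n2:
    print("Read counts differ: the pairs are out of sync.")
```

For real data, the `io.StringIO` objects would be replaced by opened files, e.g. `open("R1_clean.fastq")` and `open("R2_clean.fastq")`. Differing length distributions are harmless for a paired-end aligner, whereas differing read counts are not.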

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

By cleaning I mean that I trimmed 4 base pairs off the reads; these correspond to the index of the sample each read belongs to.

Yes, I get a different number of reads for R1 and R2.

ADD REPLY • link 7.0 years ago by heir_of_isildur88 ▴ 30

0

Please post the names and versions of the programs you used, and also the exact commands. You should clean and map R1+R2 as paired files, i.e., simultaneously and keeping proper pair information.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Here is the read processing I did before mapping:

Index identification of samples, using a custom Perl script

read_skipper.pl R1_step1.fq CAC

Trim away the index

fastx_trimmer -f 4 -i R1_step1.fq -o R1_trimmed.fq -Q33

Remove reads with Q<20, using a Perl script

perl ../IndexQuality_CAGE_20.pl R1_trimmed.fq R1_trimmed.fq I.fq R1_20.fq R1_20.2.fq I_20.fq

Read cleaning using QCleaner (I still have to check what exactly it cleans, as it is in Japanese)

qcleaner_renew_v3.1.pl --i ./R1_step1_skip.fq --o R1_clean.fastq --log qclog.txt

qcleaner_renew_v3.1.pl --i ./Undetermined_S0_L001_R2_001.fastq --o R2_clean.fastq --log qclog.txt

ADD REPLY • link 7.0 years ago by heir_of_isildur88 ▴ 30

0

fastx does not preserve pairing; use Trimmomatic or BBDuk to trim adapters and low-quality bases.
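
To illustrate what "preserving pairing" means, here is a minimal Python sketch. It is not how Trimmomatic or BBDuk are implemented, only the underlying idea: both mates are read in lockstep, and a pair is written only if both mates pass, so the two outputs always hold the same reads in the same order. The function names, the 4-base crop applied to R1 only, the mean-quality cutoff of 20, and the example reads are all illustrative assumptions.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield header, seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def filter_pairs(r1, r2, out1, out2, crop_r1=4, min_q=20, min_len=1):
    """Crop the index from R1, then keep a pair only if BOTH mates pass."""
    kept = 0
    # Assumes the two inputs are still in sync (same reads, same order).
    for (h1, s1, q1), (h2, s2, q2) in zip(read_fastq(r1), read_fastq(r2)):
        s1, q1 = s1[crop_r1:], q1[crop_r1:]
        ok1 = len(s1) >= min_len and mean_quality(q1) >= min_q
        ok2 = len(s2) >= min_len and mean_quality(q2) >= min_q
        if ok1 and ok2:  # dropping one mate means dropping its partner too
            out1.write(f"{h1}\n{s1}\n+\n{q1}\n")
            out2.write(f"{h2}\n{s2}\n+\n{q2}\n")
            kept += 1
    return kept


# Made-up example: read b/1 has low quality after the crop, so pair b is dropped.
r1 = io.StringIO("@a/1\nCACGACGTACGT\n+\nIIIIIIIIIIII\n"
                 "@b/1\nCACGACGTACGT\n+\nIIII########\n")
r2 = io.StringIO("@a/2\nTTGCAATT\n+\nIIIIIIII\n"
                 "@b/2\nTTGCAATT\n+\nIIIIIIII\n")
out1, out2 = io.StringIO(), io.StringIO()
kept = filter_pairs(r1, r2, out1, out2)
print(kept, "pair(s) kept")
```

Filtering R1 and R2 in separate runs, as with a single-end tool, can drop a read from one file but not its mate from the other, which is what leaves the two files with different read counts.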

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Thank you for your suggestion. I will try it out and see if it works.

ADD REPLY • link 7.0 years ago by heir_of_isildur88 ▴ 30

score 2 · Answer 1 · 2017-04-14

2

7.0 years ago

Charles Plessy ★ 2.9k

If your CAGE data is paired-end, then I recommend aligning it as paired-end, and converting it to TSS positions only at the end.

Here is a toy example of how to process CAGE data (the nanoCAGE variant, which can be sequenced paired-end).

https://github.com/Population-Transcriptomics/C1-CAGE-preview/blob/master/OP-WORKFLOW-CAGEscan-short-reads-v2.0.ipynb

And here is a preprint showing more or less the same on a different dataset with a different workflow system.

http://biorxiv.org/content/early/2017/04/11/126474

Recent versions of CAGEr can load paired-end CAGE data in BAM or BED format.

ADD COMMENT • link 7.0 years ago by Charles Plessy ★ 2.9k

0

How do you transform aligned reads to TSS positions?

Thank you very much for your references.

ADD REPLY • link 7.0 years ago by heir_of_isildur88 ▴ 30

0

For paired-end data, my favourite approach is to convert paired alignments from BAM format, where each mate is on a separate line, to BED12 format, where each pair is on one line, using the pairedBamToBed12 tool. The 5′ end of each BED entry is the CAGE TSS. CAGEr supports loading data in BAM, BED, and other formats. I recommend reading its vignette.
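
As a rough illustration of that last step, the Python sketch below reads BED12 lines and tallies 5′ ends per position and strand. It assumes standard BED coordinates (0-based start, exclusive end), so the 5′ end is `chromStart` on the plus strand and `chromEnd - 1` on the minus strand. The function name `tss_counts` and the example records are made up; CAGEr can load the BED file directly, so this is only meant to show what "5′ end of the BED entry" means.

```python
from collections import Counter


def tss_counts(bed_lines):
    """Count 5' ends (0-based positions) per (chrom, position, strand)."""
    counts = Counter()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
        if strand == "+":
            pos = start      # leftmost base of the pair
        elif strand == "-":
            pos = end - 1    # rightmost base (BED end is exclusive)
        else:
            continue         # unstranded entries carry no 5' end information
        counts[(chrom, pos, strand)] += 1
    return counts


# Made-up BED12 records, one line per read pair.
bed12 = [
    "chr1\t100\t400\tpairA\t255\t+\t100\t400\t0,0,0\t2\t50,50\t0,250",
    "chr1\t100\t380\tpairB\t255\t+\t100\t380\t0,0,0\t2\t50,50\t0,230",
    "chr1\t500\t900\tpairC\t255\t-\t500\t900\t0,0,0\t2\t50,50\t0,350",
]
counts = tss_counts(bed12)
for (chrom, pos, strand), n in sorted(counts.items()):
    print(chrom, pos, strand, n, sep="\t")
```

Pairs A and B end at different positions but share the same 5′ end, so they count towards the same TSS, which is why the conversion to TSS positions can safely wait until after paired-end alignment.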