Question

How To Remove The Intronic Reads Before Counting

1

10.9 years ago

camelbbs ▴ 710

I have RNA-Seq data for several samples. I checked them with FastQC, and the read quality looks good (HiSeq 2000). The problem is that many reads map to intronic regions where no reference annotation (RefSeq, Ensembl, GENCODE) has an exon. We don't know what they are. We guess the problem happened in library preparation, the concentration was low. The data has already come out and we can't re-sequence, so we want to remove the reads mapped to intronic regions. Is there a method to do that? Or does anyone have an idea what the intronic reads might be? Thanks.

rnaseq rna-seq • 4.5k views

ADD COMMENT • link updated 10.9 years ago by Alex Reynolds 35k • written 10.9 years ago by camelbbs ▴ 710

3

"We don't know what they are."

Ask yourself: "why should I remove intronic reads?" Do you want to remove results that you do not understand until the experiment fits your expectations?

"We guess the problem happened in library preparation, the concentration was low."

What does low concentration have to do with getting unwanted reads? What would 'make up sequences' that are not real when the concentration is low? See also the thread "Why are there many RNA-seq hits to intronic regions?" Intronic reads might come from novel transcripts, remains of nascent RNA, lincRNA, or antisense RNA, or, if they lie close to exons, from wrong exon boundaries in the annotation.

ADD REPLY • link 10.9 years ago by Michael 54k

Ram · Accepted Answer · 2013-06-07

2

10.9 years ago

swbarnes2 14k

If you have a BED file of exonic regions, or a GTF or something similar, you can use intersectBed from BEDTools to filter your .bam for reads that fall within the desired coordinates.

ADD COMMENT • link 10.9 years ago by swbarnes2 14k

0

Thanks. Would it look like this?

intersectBed -abam s1.bam -b hg19ensembl.gtf > s1.filter.bam
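
One caveat with passing a full GTF as -b: annotation files usually contain gene and transcript records whose coordinates span the introns, so intronic reads would still overlap those records and be kept. A minimal sketch, assuming a standard nine-column GTF with the feature type in column 3, is to reduce the annotation to exon records first. The toy annotation and the file names toy.gtf and exons.bed are illustrative only, and the intersectBed step is skipped if BEDTools or s1.bam is not available.

```shell
# Toy GTF (hypothetical records) standing in for hg19ensembl.gtf
printf 'chr1\ttoy\tgene\t101\t900\t.\t+\t.\tgene_id "g1";\n' > toy.gtf
printf 'chr1\ttoy\texon\t101\t200\t.\t+\t.\tgene_id "g1";\n' >> toy.gtf
printf 'chr1\ttoy\texon\t801\t900\t.\t+\t.\tgene_id "g1";\n' >> toy.gtf

# Keep only exon records and convert them to BED
# (GTF is 1-based inclusive; BED is 0-based half-open, hence start - 1)
awk -F'\t' -v OFS='\t' '$3 == "exon" { print $1, $4 - 1, $5 }' toy.gtf > exons.bed

# Keep reads that overlap at least one exon; -u writes each read once
if command -v intersectBed >/dev/null 2>&1 && [ -f s1.bam ]; then
    intersectBed -abam s1.bam -b exons.bed -u > s1.filter.bam
fi
```

Note that a read lying partly in an exon and partly in an intron still overlaps the exon and is therefore kept by this filter.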

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 10.9 years ago by camelbbs ▴ 710

Ram · Accepted Answer · 2013-06-09

You can solve this quickly with BEDOPS. The suite includes the bedops set-operation tool and various conversion scripts that put data into BED format, which bedops can process.

Assuming your reads are in BAM format:

$ bam2bed < reads.bam \
    | bedops --not-element-of -1 - introns.bed \
    > reads-not-in-introns.bed

The file reads-not-in-introns.bed is a sorted BED file containing all reads that do not overlap intronic elements.

You can then pass this result to bedmap to do counting of reads over other region sets (whole-genome or subsets).
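
A minimal sketch of that counting step, assuming a sorted BED file of regions of interest. The files regions.bed and toy-reads.bed and their contents are hypothetical stand-ins (toy-reads.bed plays the role of reads-not-in-introns.bed). The --echo and --count options make bedmap print each region followed by the number of reads overlapping it. The bedmap call is skipped if BEDOPS is not installed.

```shell
# Hypothetical regions of interest; must be sorted (see sort-bed)
printf 'chr1\t100\t200\texonA\n' > regions.bed
printf 'chr1\t800\t900\texonB\n' >> regions.bed

# Hypothetical reads standing in for reads-not-in-introns.bed
printf 'chr1\t120\t170\tread1\n' > toy-reads.bed
printf 'chr1\t150\t200\tread2\n' >> toy-reads.bed
printf 'chr1\t850\t900\tread3\n' >> toy-reads.bed

# For each region, echo the region and append the count of overlapping reads
if command -v bedmap >/dev/null 2>&1; then
    bedmap --echo --count --delim '\t' regions.bed toy-reads.bed > counts.bed
fi
```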

Note that we assume your introns are in BED format and are sorted, e.g.:

$ sort-bed unsorted-introns.bed > introns.bed

Alternatively, if your introns are in some other format, say GTF, then BEDOPS conversion scripts will losslessly turn them into sorted BED, e.g.:

$ gtf2bed < introns.gtf > introns.bed
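
Many GTF annotations carry no explicit intron records, in which case there is no introns.gtf to convert. One possible workaround, sketched here under the assumption of a nine-column GTF with transcript and exon features in column 3, is to subtract the exons from the transcript spans with bedops --difference. The toy annotation, the to_bed helper, and all file names are hypothetical, and the BEDOPS steps are skipped if the tools are not installed.

```shell
# Toy annotation (hypothetical): one transcript with two exons
printf 'chr1\ttoy\ttranscript\t101\t900\t.\t+\t.\tgene_id "g1";\n' > toy-annotation.gtf
printf 'chr1\ttoy\texon\t101\t200\t.\t+\t.\tgene_id "g1";\n' >> toy-annotation.gtf
printf 'chr1\ttoy\texon\t801\t900\t.\t+\t.\tgene_id "g1";\n' >> toy-annotation.gtf

# Helper: print records of one feature type as 3-column BED
# (GTF is 1-based inclusive; BED is 0-based half-open)
to_bed() {
    awk -F'\t' -v OFS='\t' -v f="$1" '$3 == f { print $1, $4 - 1, $5 }' "$2"
}

to_bed transcript toy-annotation.gtf > transcripts.unsorted.bed
to_bed exon toy-annotation.gtf > exons.unsorted.bed

# Transcript spans minus all exons leaves the purely intronic intervals
if command -v bedops >/dev/null 2>&1 && command -v sort-bed >/dev/null 2>&1; then
    sort-bed transcripts.unsorted.bed > transcripts.bed
    sort-bed exons.unsorted.bed > exons.bed
    bedops --difference transcripts.bed exons.bed > introns.bed
fi
```

With this definition, a position that is exonic in any transcript is not treated as intronic, which is the conservative choice when the goal is to discard reads.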