I got RNASeq data in several samples. I checked the FastQC, seems the read quality are good (Hiseq 2000). But the problem is many reads are mapped to intronic region, and the regions have no any reference exons there (Refseq, ensembl, gencode). We don't know what they are. We guess the problem happend in library preparation, the concentration was low. Now the data has come out and we can't re-sequencing, so we want to remove the reads mapped to intronic region, is there a method to do that? Or anyone have an idea about the intronic reads. Thanks.
You can easily use BEDOPS to solve this problem quickly. It includes
bedops and various conversion scripts for putting data into BED format, which
bedops can process.
Assuming your reads are in BAM format:
$ bam2bed < reads.bam \ | bedops --not-element-of -1 - introns.bed \ > reads-not-in-introns.bed
reads-not-in-introns.bed is a sorted BED file containing all reads that do not overlap intronic elements.
You can then pass this result to
bedmap to do counting of reads over other region sets (whole-genome or subsets).
Note that we assume your introns are in BED format and are sorted, e.g.:
$ sort-bed unsorted-introns.bed > introns.bed
Alternatively, if your introns are in some other format — say, GTF — then BEDOPS conversion scripts will losslessly turn them into sorted BED, e.g.:
$ gtf2bed < introns.gtf > introns.bed