I've run intersectBed as follows
intersectBed -a introns.bed -b accepted_hits.bam -wao > result.bed
where introns.bed is a BED file containing all introns in hg38, and accepted_hits.bam is the alignment output from STAR.
Now, the average read input length as reported by STAR is 101.
However, when I look at a few lines in result.bed, I often see features whose reported overlap is a lot longer than 101 bp.
For example, here are a few lines from result.bed:
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172045136 172045237 SRR2149928_MCF10A_R1.22432598 50 + 101
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172045179 172045280 SRR2149928_MCF10A_R1.4036492 50 + 101
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172045441 172045543 SRR2149928_MCF10A_R1.16781978 50 + 102
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172044940 172052947 SRR2149928_MCF10A_R1.8875135 50 + 7906
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172044945 172052952 SRR2149928_MCF10A_R1.4292036 50 + 7906
As you can see, a few features have an extremely large overlap length, namely the last two in this example. I'm not sure why this is the case. Why are these reads there if the average read length is 101, and is it okay to discard them in downstream analysis?
The length of the read is different from the length of the reference it covers: if a read spans an intron junction (an N operator in its CIGAR string), the covered reference span can be very large (even larger than the intron itself).
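To make this concrete, here is a minimal sketch of how the reference span is computed from a CIGAR string. The CIGAR shown is hypothetical, but it reproduces the numbers in your fourth line: a 101-bp read split as 50M7906N51M covers 50 + 7906 + 51 = 8007 reference bases (172044940 to 172052947) while still consuming only 101 read bases.

```python
import re

# CIGAR operations that consume the reference: M, D, N, =, X.
# I (insertion), S (soft clip), H (hard clip), P (padding) do not.
REF_OPS = set("MDN=X")

def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in REF_OPS)

# A hypothetical spliced alignment: 101-bp read across a 7906-bp gap.
print(reference_span("50M7906N51M"))  # 8007 reference bases
# An unspliced 101-bp read covers exactly 101 reference bases.
print(reference_span("101M"))         # 101
```

So intersectBed is reporting the overlap between the intron and the full reference span of the alignment, which for spliced reads can greatly exceed the read length.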
My question is: why does such a long read exist if STAR says the average input read length is 101?