I've run intersectBed as follows
intersectBed -a introns.bed -b accepted_hits.bam -wao > result.bed
where introns.bed is a BED file containing all introns in hg38, and accepted_hits.bam is the alignment output from STAR.
Now, the average read input length as reported by STAR is 101.
However, when I look at a few lines in result.bed, I often see features whose reported overlap is a lot longer than 101 bp.
For example, here are a few lines from result.bed:
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172045136 172045237 SRR2149928_MCF10A_R1.22432598 50 + 101
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172045179 172045280 SRR2149928_MCF10A_R1.4036492 50 + 101
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172045441 172045543 SRR2149928_MCF10A_R1.16781978 50 + 102
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172044940 172052947 SRR2149928_MCF10A_R1.8875135 50 + 7906
chr5 172045022 172052928 intron:NM_005990:18 . - refGene intron . ID=intron:NM_005990:18;Parent=NM_005990 chr5 172044945 172052952 SRR2149928_MCF10A_R1.4292036 50 + 7906
As you can see, a few features have an extremely large overlap length, namely the last two in this example. I'm not sure why this is the case. Why are these reads there if the average read length is 101, and is it okay to discard them in downstream analysis?
The length of the read is different from the length of the reference it covers: if a read spans an intron junction (an N operator in its CIGAR string), the covered reference span can be very large (even larger than the intron itself).
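To make this concrete, here is a minimal sketch of how the reference span is computed from a CIGAR string. The CIGAR shown is hypothetical, but it reproduces the numbers in your fourth line: a 101-bp read split as 50M7906N51M covers 50 + 7906 + 51 = 8007 reference bases (172044940 to 172052947) while still consuming only 101 read bases.

```python
import re

# CIGAR operations that consume the reference: M, D, N, =, X.
# I (insertion), S (soft clip), H (hard clip), P (padding) do not.
REF_OPS = set("MDN=X")

def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in REF_OPS)

# A hypothetical spliced alignment: 101-bp read across a 7906-bp gap.
print(reference_span("50M7906N51M"))  # 8007 reference bases
# An unspliced 101-bp read covers exactly 101 reference bases.
print(reference_span("101M"))         # 101
```

So intersectBed is reporting the overlap between the intron and the full reference span of the alignment, which for spliced reads can greatly exceed the read length.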
My question is: why does such a long read exist if STAR says the average input read length is 101?