STAR extract reads supporting junction from bam file
0
0
Entering edit mode
4.2 years ago
snishtala03 ▴ 70

Hello,

I am working with a viral genome and using STAR to find splice junctions. I have 150 bps paired end reads and my reference genotype is ~3500 bps. This is my command -

STAR --genomeDir XXX/GenomeInd/ --runThreadN 42 --readFilesIn R1_001.fastq R2_001.fastq --genomeSAindexNbases 5 --sjdbOverhang 149 --outSAMtype BAM SortedByCoordinate --bamRemoveDuplicatesType UniqueIdentical --outFilterType BySJout --twopassMode Basic --peOverlapNbasesMin 10 --outFileNamePrefix xxx/xxx

My aim here is to construct a histogram of distribution of read lengths (overhangs) on either ends of the junction. I use the CIGAR string to construct this, however, I am not able to find the same number of reads crossing the junction that STAR is reporting.

This is how I am generating alignment using CIGAR strings in R -

get_alignment = function(cigar,read){
cig = strsplit(trimws(gsub("([A-Z])", "\\1 \\2", as.character(cigar)),"both"), "\\s+")[[1]]
start_read = 0
new_read = ''
for ( i in 1: length(cig)){
if (length(grep('M', cig[i])) > 0){
  m = as.numeric(gsub('M', '',cig[i]))
  new_read = paste0(new_read, substr(read, start = start_read+1, stop = start_read + m))
  start_read = start_read + m
}
else if (length(grep('S', cig[i])) > 0){
  s = as.numeric(gsub('S', '',cig[i]))
  start_read = start_read + s
}
else if(length(grep('D', cig[i])) > 0){
  d = as.numeric(gsub('D', '',cig[i]))
  new_read = paste0(new_read, paste(rep('-',d), collapse = ''))
}
else if(length(grep('N', cig[i])) > 0){
  n = as.numeric(gsub('N', '',cig[i]))
  new_read = paste0(new_read, paste(rep('-',n), collapse = ''))
}
else if(length(grep('I', cig[i])) > 0){
  ins = as.numeric(gsub('I', '',cig[i]))
  start_read = start_read + ins
}
else{
  print(paste(read,"------",cigar,"-------",cig[i],"     check cigar!"))
}
} 
 return(new_read)
}

I would really appreciate your help in pointing out where I am going wrong or if there is anything I am missing.

Thanks!

RNA-Seq STAR splicing • 1.1k views
ADD COMMENT
0
Entering edit mode

Is there a reason you're using length(grep()) instead of the more straightforward grepl()?

ADD REPLY
0
Entering edit mode

No reason, thanks for pointing that out :) I can make that change.

ADD REPLY

Login before adding your answer.

Traffic: 2889 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6