Hello,
I am working with a viral genome and using STAR to find splice junctions. I have 150 bps paired end reads and my reference genotype is ~3500 bps. This is my command -
STAR --genomeDir XXX/GenomeInd/ --runThreadN 42 --readFilesIn R1_001.fastq R2_001.fastq --genomeSAindexNbases 5 --sjdbOverhang 149 --outSAMtype BAM SortedByCoordinate --bamRemoveDuplicatesType UniqueIdentical --outFilterType BySJout --twopassMode Basic --peOverlapNbasesMin 10 --outFileNamePrefix xxx/xxx
My aim here is to construct a histogram of distribution of read lengths (overhangs) on either ends of the junction. I use the CIGAR string to construct this, however, I am not able to find the same number of reads crossing the junction that STAR is reporting.
This is how I am generating alignment using CIGAR strings in R -
get_alignment = function(cigar,read){
cig = strsplit(trimws(gsub("([A-Z])", "\\1 \\2", as.character(cigar)),"both"), "\\s+")[[1]]
start_read = 0
new_read = ''
for ( i in 1: length(cig)){
if (length(grep('M', cig[i])) > 0){
m = as.numeric(gsub('M', '',cig[i]))
new_read = paste0(new_read, substr(read, start = start_read+1, stop = start_read + m))
start_read = start_read + m
}
else if (length(grep('S', cig[i])) > 0){
s = as.numeric(gsub('S', '',cig[i]))
start_read = start_read + s
}
else if(length(grep('D', cig[i])) > 0){
d = as.numeric(gsub('D', '',cig[i]))
new_read = paste0(new_read, paste(rep('-',d), collapse = ''))
}
else if(length(grep('N', cig[i])) > 0){
n = as.numeric(gsub('N', '',cig[i]))
new_read = paste0(new_read, paste(rep('-',n), collapse = ''))
}
else if(length(grep('I', cig[i])) > 0){
ins = as.numeric(gsub('I', '',cig[i]))
start_read = start_read + ins
}
else{
print(paste(read,"------",cigar,"-------",cig[i]," check cigar!"))
}
}
return(new_read)
}
I would really appreciate your help in pointing out where I am going wrong or if there is anything I am missing.
Thanks!
Is there a reason you're using
length(grep())
instead of the more straightforwardgrepl()
?No reason, thanks for pointing that out :) I can make that change.