I know how to check for strandedness in IGV; however, I am trying to determine this for hundreds of samples prepared by ~5 different groups (using IGV from the command line has been extremely difficult for me on my work computer). I have a suspicion that some of these are unstranded while most are stranded (I did a spot check using 5 CRAM files in the IGV GUI: 4 were stranded, 1 was not). With that, I generated htseq counts for all the samples, getting both unstranded and stranded counts for each.
I thought it would make sense to use sex genes to see if samples were prepared with a stranded or unstranded protocol. That being said, I organized the samples by gender and have isolated counts for the following sex genes: XIST, RPS4Y1, RPS4Y2, DDX3Y, and USP9Y. I'm seeing that those 4 that were stranded (3 females, 1 male) have counts in the XIST gene, which makes sense, but the females also have a few counts in some of the Y-chromosome genes. I'm seeing more nonsensical data when I look at the unstranded counts for these 4, so I feel like at least I'm on the right track with the library construction protocol. I'm not sure, however, why there are any counts at all in the Y-chromosome genes for these 3 females.
Anyway, it's a long-winded, two-part question, but I wanted to ask if anyone had a good way to determine strandedness on a large set of data using raw counts, and how I can make sense of the counts I'm seeing for some of the females. Is this perhaps contamination? Thanks for the help!
If you want to do that efficiently and systematically, then I would take these CRAMs, convert some reads (maybe 1 million or so) back to FASTQ, and then quantify them against a transcriptome with salmon, as salmon has an automated mode to determine strandedness. Check its documentation and its `-l A` argument, as well as previous posts on strandedness inference. Just using raw counts appears cumbersome.