How to extract aligned reads that contain splice junctions from STAR
1
2
Entering edit mode
21 months ago
ysantos101 ▴ 20

Hello.

In an attempt to filter out DNA contamination from an RNA-seq experiment, I am looking for a way to extract ONLY aligned reads that contain Annotated splices.

While of course this will eliminate much of the data, the idea is that the only aligned reads that we can be absolutely sure originated from mRNA rather than DNA will contain an annotated splice.

I understand that STAR produces the splice output 'SJ.out.tab'- while this is somewhat useful, I need a way to find the read ID for each entry here.

In short, I would like the read ID for each read that contributes to the 'Number of splices: Annotated (sjdb) | ' field within 'Log.final.out'.

Cheers, Yale

RNA-Seq STAR Splice Site Splice Junctions • 1.6k views
2
Entering edit mode
21 months ago

You can find spliced reads by looking a the CIGAR string - reads with 'N' in them are spliced.

So you can do something like:

samtools view -h my_reads.bam | awk '$0 ~ /^@/ ||$5 ~ /N/' | samtools view -b > filtered.bam


The first view statement converts the bam to sam and includes the header (-h). The awk statement fileters reads that start with @ (therefore are header), or contain N in the 5th column (the CIGAR string), then the second view statement converts back to bam.

0
Entering edit mode

Thank you very much!

Please let me know if I am interpreting this incorrectly- but I believe this will provide a list of all annotated and unannotated splices.

I am really only interested in the annotated ones- and I'm not sure how to identify those that STAR counts as annotated.

Thank you very much for your time!

-Yale

0
Entering edit mode

I'm afraid, as far as I am aware there is no easy way to do this for a bam file that has both in.

You could write a custom script to do this, for example using pysam.

• Start by reading in the the SJ.tab.out file and go through one junction at a time.
• Fetch all reads that overlap a junction, and then for each of these reads, check if it contains that splice junction
• There is a function in pysam that allows you to return tuples of reference co-ordinates for alinged parts of the read .get_aligned_blocks() or something - you would need to go through this. You be looking for cases where one block ended on an intron start position, and the next block ended on an intron end position.
• Once you've found such reads output them (and their mates)

An alternative would be to run STAR telling it only to output spliced reads if they match a known junction. I can't remember what the switch is called, but it does exist. Then run it through the filter in my first answer to isolate the spliced reads.

0
Entering edit mode

Great, thanks for the direction.

Have a nice day!