How to extract aligned reads that contain splice junctions from STAR
1
2
Entering edit mode
4.2 years ago
ysantos101 ▴ 20

Hello.

In an attempt to filter out DNA contamination from an RNA-seq experiment, I am looking for a way to extract ONLY aligned reads that contain Annotated splices.

While of course this will eliminate much of the data, the idea is that the only aligned reads that we can be absolutely sure originated from mRNA rather than DNA will contain an annotated splice.

I understand that STAR produces the splice output 'SJ.out.tab'- while this is somewhat useful, I need a way to find the read ID for each entry here.

In short, I would like the read ID for each read that contributes to the 'Number of splices: Annotated (sjdb) | ' field within 'Log.final.out'.

Thanks in advance!!

Cheers, Yale

RNA-Seq STAR Splice Site Splice Junctions • 4.1k views
ADD COMMENT
2
Entering edit mode
4.2 years ago

You can find spliced reads by looking a the CIGAR string - reads with 'N' in them are spliced.

So you can do something like:

samtools view -h my_reads.bam | awk '$0 ~ /^@/ || $5 ~ /N/' | samtools view -b > filtered.bam

The first view statement converts the bam to sam and includes the header (-h). The awk statement fileters reads that start with @ (therefore are header), or contain N in the 5th column (the CIGAR string), then the second view statement converts back to bam.

ADD COMMENT
0
Entering edit mode

Thank you very much!

Please let me know if I am interpreting this incorrectly- but I believe this will provide a list of all annotated and unannotated splices.

I am really only interested in the annotated ones- and I'm not sure how to identify those that STAR counts as annotated.

Thank you very much for your time!

-Yale

ADD REPLY
0
Entering edit mode

I'm afraid, as far as I am aware there is no easy way to do this for a bam file that has both in.

You could write a custom script to do this, for example using pysam.

  • Start by reading in the the SJ.tab.out file and go through one junction at a time.
  • Fetch all reads that overlap a junction, and then for each of these reads, check if it contains that splice junction
  • There is a function in pysam that allows you to return tuples of reference co-ordinates for alinged parts of the read .get_aligned_blocks() or something - you would need to go through this. You be looking for cases where one block ended on an intron start position, and the next block ended on an intron end position.
  • Once you've found such reads output them (and their mates)

An alternative would be to run STAR telling it only to output spliced reads if they match a known junction. I can't remember what the switch is called, but it does exist. Then run it through the filter in my first answer to isolate the spliced reads.

ADD REPLY
0
Entering edit mode

Great, thanks for the direction.

Have a nice day!

ADD REPLY
0
Entering edit mode

I don't get, why you are filtering reads that start with @ (therefore are header) here? Could you please clarify that?

ADD REPLY
0
Entering edit mode

You need to include header lines in the output so that it can be converted back to BAM. if you just filtered lines that had an N in the cigar string, you would have no header, and wouldn't be able to convert back to valid bam.

ADD REPLY

Login before adding your answer.

Traffic: 2468 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6