Question

Help manually processing strand in paired-end reads when fishing for lariats in bulk RNA-seq

0

Entering edit mode

2.1 years ago

txema.heredia ▴ 250

Hi,

I'm trying to detect lariat loops in RNA-seq data (circular RNAs produced during splicing). I am following the methods used in Pineda & Bradley 2018

Pineda & Bradley 2018 figure 1A

Branchpoint detection algorithm:
Our branchpoint detection algorithm was based on the split-read alignment strategy used in Mercer et al. (2015).

1- Prefilter reads:
First, filter out reads with >5% Ns or other ambiguous characters. Next, sequentially invoke Bowtie2 as follows for the transcriptome and genome: bowtie2 -x <index file for transcriptome or genome> --end-to-end --sensitive --score-min L,0,-0.24 -k 1 --n-ceil L,0,0.05 -U <FASTQ file of reads>. Finally, discard the aligned reads and use the unaligned reads as input for the next step.

2- Map 5′ splice site sequences to reads:
First, build a Bowtie index for a FASTA file of prefiltered reads. Next, build a FASTA file holding 5′ splice site sequences (the first 20 nt of each intron or, alternately, the 20 nt downstream from each 5′ splice site, including the 5′ splice site itself). Finally, map 5′ splice site sequences to reads as follows: bowtie2 -x <index file for reads> --end-to-end --sensitive --k 10000 --no-unal -f -U <FASTA file of 5′ splice site sequences>. 3- Restrict to reads aligned to a 5′ splice site and trim reads:
First, restrict to alignments between 5′ splice site sequences and reads with no mismatches and no indels. Second, restrict to reads that align to a single 5′ splice site. Third, trim off the portion of each read starting at the 5′ splice site alignment and continuing to the end of the read. Finally, restrict to trimmed reads of >=20-nt length.

4- Map trimmed reads to 3′ splice site sequences
First, build a FASTA file holding the 3′ splice site sequences (the last 250 nt of each intron or, alternately, the 250 nt upstream of each 3′ splice site, including the 3′ splice site itself). Next, build a Bowtie index for these sequences. Finally, map trimmed reads to 3′ splice sites as follows: bowtie2 -x <index file for 3′ splice sites> --end-to-end --sensitive -k 10 --no-unal -f -U <FASTA file of trimmed reads>.

5- Infer branchpoint positions from split-read alignments
First, restrict to trimmed read alignments with five or fewer mismatches, <=10% mismatch rate, and at most a single indel of <=3-nt length in the 3′ splice site-aligning portion of the read. Second, restrict to alignments that score as well as the best-scoring alignment for each read (e.g., remove lower-scoring alignments). Third, restrict to reads with inverted alignments (e.g., where the “left” half of the read aligns near the 3′ splice site, while the “right” half of the read aligns to the 5′ splice site). Fourth, restrict to reads for which the 5′ and 3′ splice site-aligning portions of the read map to splice sites within a single gene. Fifth, compute the branchpoint position as the last nucleotide of the trimmed read alignment. Sixth, restrict to reads with a mismatch at the inferred branchpoint position. Finally, assemble a final set of 5′ splice site–branchpoint pairs.

My samples were prepared with CORALL mRNA-Seq v2 Library Prep Kit and sequenced on an Illumina NovaSeq 6000. Thus, I have paired-end stranded reads. As far as I understand, this means that Read1 from the pair should map to the same strand as the gene of interest, while Read2 maps to the opposite strand.

Step 1:

I used a previous mapping of the data using STAR and I obtained the unmapped reads using samtools view -b -f 4 ${file}. I could use also flag 0x8 (next segment in the template unmapped) to keep only reads where both pairs are unmapped. However, I expect the lariat-branchpoint-containing-reads will fit in a single read of the pair while the other read could map properly to the middle of the intron within the lariat loop (or to an exon depending on the splicing pattern).

I transformed the unmapped-reads bam file into a fastq file with samtools fasta ${sample}_unmapped.bam. Each read ends in /1 or /2 depending on being readpair 1 or 2.

Step 2:

I first generated a list of introns. Previously I generated a list of collapsed exons using DEXSeq's script dexseq_prepare_annotation.py on my annotation file to get a list of all "chunks of the gene" deemed as an exon in at least 1 transcript of that gene. I subtracted these exons from whole gene coordinates, producing a list of regions that are intronic in all isoforms.

I am fully aware that this is a can of worms. What happens when 2 transcripts use the same "core" of an exon but with different 5'/3' splicing sites? To detect lariats, one requisite is that my reads map uniquely to a single intronic sequence. If I have 2 versions of the intron, depending on the exon being used, I could have reads mapping to both, which is incompatible with the lariat-detection method. Do you have any idea on how to deal with this?

From this list of introns I have to obtain:

the 20bp sequence after to the 5' splicing site
- + strand: intron start site -> start + 20 .
- - strand: intron end site - 20 -> end.
the 250bp sequence before to the 3' splicing site
- + strand: intron end site - 250 -> end.
- - strand: intron start site -> start + 250.

And then generate the intronic sequences with

bedtools getfasta -name -fi ${refdir}/hg38.fa -bed gencode.v38.annotation.introns_5prima.gtf

In order to easily identify the intronic sequences, I edited field #2 from these .gtf files so the header of each read in the .fa file contains a name that looks like this:

$ head introns_5prima.fasta 
>ENSG00000223972.5_1_2_chr1_12228_12612_+::chr1:12227-12248
GTAAGTAGTGCTTGTGCTCAT
>ENSG00000223972.5_2_3_chr1_12722_12974_+::chr1:12721-12742
GTGAGAGGAGAGTAGACAGTG
>ENSG00000223972.5_3_4_chr1_13053_13220_+::chr1:13052-13073
GCAAGCCTGGCTGCCTCCAGC
>ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004
ACCAGCTTGTTGAAGAGATCC
>ENSG00000227232.5_2_3_chr1_15039_15795_-::chr1:15774-15795
CAAGCTCCAGGGCCCGCTCAC

The format is:

ensembl_id
previous exon number (by sequence order, not checking the strand)
next exon number
chromosome
start position of the 20bp or 250bp chunk
end position
strand

The info after : doesn't appear in the bam/sam file after alignment.

Finally, I mapped the introns-5'.fastq into the sample-reads index with bowtie2:

mkdir -p bt_index
bowtie2-build ${sample}_unmapped.fasta bt_index/${sample}_unmapped
bowtie2 -x bt_index/${sample}_unmapped --end-to-end --sensitive --k 10000 --no-unal --no-sq -p 8 -f -U ../introns_5prima.fasta >  ${sample}_unmapped_TO_introns5prima.sam

I decided to use the flag --no-sq to remove 10,788,016 header @SQ lines from a sam file with 17,167,454 total reads.

$ head 22-157/22-157_unmapped_TO_introns5prima.sam  | grep -v ^@ 
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  0   A00383:379:HH35YDRX2:1:2259:2645:2378/2 43  1   21M *   0   0   ACCAGCTTGTTGAAGAGATCC   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  256 A00383:379:HH35YDRX2:1:2143:29812:1141/2    43  255 21M *   0   0   ACCAGCTTGTTGAAGAGATCC   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  272 A00383:379:HH35YDRX2:1:2277:32452:8876/1    33  255 21M *   0   0   GGATCTCTTCAACAAGCTGGT   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  272 A00383:379:HH35YDRX2:1:2143:29812:1141/1    22  255 21M *   0   0   GGATCTCTTCAACAAGCTGGT   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  256 A00383:379:HH35YDRX2:1:2271:17960:11303/2   111 255 21M *   0   0   ACCAGCTTGTTGAAGAGATCC   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  256 A00383:379:HH35YDRX2:1:2113:9444:9424/2 22  255 21M *   0   0   ACCAGCTTGTTGAAGAGATCC   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  272 A00383:379:HH35YDRX2:1:2259:2645:2378/1 40  255 21M *   0   0   GGATCTCTTCAACAAGCTGGT   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU
ENSG00000227232.5_1_2_chr1_14502_15004_-::chr1:14983-15004  272 A00383:379:HH35YDRX2:1:2271:17960:11303/1   19  255 21M *   0   0   GGATCTCTTCAACAAGCTGGT   IIIIIIIIIIIIIIIIIIIII   AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YT:Z:UU

Step 3

Here is where I get lost and I need help.

First, restrict to alignments between 5′ splice site sequences and reads with no mismatches and no indels.

I can do this easily by keeping only reads where

flag 0x4 is 0 (read is mapped)
the CIGAR string is 21M
the flags NM:i:0 and MD:Z:21 are present.

From 17,167,454 total reads I end up with 6,297,864 reads after applying these filters

Second, restrict to reads that align to a single 5′ splice site.

To do this, I was planning to sort the sam file by field #3 (Reference sequence name = sample-unmapped-read id) and keep only reads with 1 alignment.

After applying this I am left with 123,148 intronic sequences mapping to the lariat-reads.

However, at this point I started pondering about how to process the strand information. I've been thinking about this for a couple of days and I got to the point where I feel I am not smart enough to understand/solve this on my own. If I have information about :

the strand of the intron (although the sequence used comes from the + strand). Information contained in the read name (field 1 of the bam file)
the sample-read is paired and stranded and comes from Readpair 1 or 2
the intron-sequence maps to the sample-read in the forward or reverse strand. Flag 0x10

Because the reads originate from a RNA lariat loop, R1 has to match the strand of the gene/intron. Did I get that right? However, whatever the strand of the annotated gene is, the intron-sequences that I generated use the + strand. Thus:

If the intron-sequence maps to a sample-read from R1 (RNAME ends in /1)
- The strand information from the intron-sequence name and the mapping flag 0x10 should match
  - + strand intron & mapping to + strand ( 0x10 == FALSE )
  - - strand intron & mapping to - strand ( 0x10 == TRUE )
If the intron-sequence maps to a sample-read from R2 (RNAME ends in /2)
- The strand information from the intron-sequence name and the mapping flag 0x10 should be opposite
  - + strand intron & mapping to - strand ( 0x10 == TRUE )
  - - strand intron & mapping to + strand ( 0x10 == FALSE )

And any other combination would imply a spurious alignment that cannot be trusted to originate from a lariat and should be removed from further analysis.

Should I do care about properly mapping to the strand at this point?

Third, trim off the portion of each read starting at the 5′ splice site alignment and continuing to the end of the read. Finally, restrict to trimmed reads of >=20-nt length.

Here comes a new big issue with strand. As far as I understand, if the original sample-read is stranded and it comes from 1st mate of the read pair, then its sequence should be in the same strand as the lariat, and follow the end_of_intron-BranchPoint-start_of_intron pattern.

However, if the sample-read comes from R2, its sequence will belong to the complementary strand of the lariat. Meaning that the pattern would be reversed and it would be start_of_intron-BranchPoint-end_of_intron.

In that case, I should trim the reads (in gawk) like

R1
- substr( $1, 13, mapping_pos-1 )
- Keep the beginning of the read (start at nucleotide 13 because R1 contains 12nt with an UMI identifier) up to mapping start position. Remove from mapping start to the end.
R2
- substr( $1, mapping_pos - 1 + 21, length($1) )
- Keep the end of the read (from mapping start position + mapping length). Remove from the start of the read to the end of the mapping (start+length).

Am I getting this right? Am I overthinking it? Did I get lost in some point? Did I forget to consider extra cases?

Thanks in advance,
Txema

lariat splicing rna-seq • 1.1k views

ADD COMMENT • link 2.1 years ago by txema.heredia ▴ 250

score 0 · Answer 1 · 2023-06-20

I kept working on this, but I'm facing an issue in the last step. The method describes

Second, restrict to alignments that score as well as the best-scoring alignment for each read (e.g., remove lower-scoring alignments).

And the mapping command used was

bowtie2 -x <index file for 3′ splice sites> --end-to-end --sensitive -k 10 --no-unal -f -U <FASTA file of trimmed reads>

However, bowtie2's current version help says:

$ bowtie2 --help
[...]
Reporting:
  (default)          look for multiple alignments, report best, with MAPQ
   OR
  -k <int>           report up to <int> alns per read; MAPQ not meaningful
   OR
  -a/--all           report all alignments; very slow, MAPQ not meaningful
[...]

Is this consequence of a "recent" change in bowtie2? Is there a way to obtain some sort of mapping quality of each alignment when using -k ?

In bowtie2's manual it states:

https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

Optional fields. Fields are tab-separated. bowtie2 outputs zero or more of these optional fields for each alignment, depending on the type of the alignment:

AS:i:<N> Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if SAM record is for an aligned read.

XS:i:<N> Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than AS:i.

YS:i:<N> Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.

Is XS reporting the best score of *any read in **this reference location or of this read in any* reference location ?

Should I just keep reads where AS == XS ?

However, by following the method's last filtering step:

Sixth, restrict to reads with a mismatch at the inferred branchpoint position.

I should require my reads to have at least 1 mismatch. Which would mean AS <= -6. Thus, it is possible for a read to match perfectly to some random location in the genome, and cause the proper mapping to the proper gene to be ignored.