How to find rna strand direction before alignment?
5
0
Entering edit mode
8 weeks ago
pubsurfted ▴ 40

Hello. I'm working on rna seq pipeline and I would like to find the strand direction of the data before doing alignment via hisat2.

RNA-Seq • 992 views
4
Entering edit mode
8 weeks ago
Jack Tierney ▴ 280

If you want to work out strandedness within a pipeline the maybe how_are_we_stranded_here would work? I believe it does pretty much exactly what @benformatics and @swbarnes2 said above re aligning and figuring out

1
Entering edit mode

That’s pity they didn’t mention our work(GuessmyLT) in their paper.

2
Entering edit mode
8 weeks ago
Juke34 7.5k

You can give a try to "Guess My LibraryType": GUESSmyLT

0
Entering edit mode

Hello, thank you so much for responding. I had downloaded the GUESSmyLT docker image on my pop os. I am using the following command to run it:

    #!/usr/bin/env bash

cd ~/Desktop/fastq/

for infile in *_1.fastq
do
base=$(basename${infile} _1.fastq)

docker run quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT \
GUESSmyLT --reads ${infile}${base}_2.fastq \
--reference Glycine_max.Glycine_max_v2.1.dna.toplevel.fa --mode genome \
--annotation Glycine_max.Glycine_max_v2.1.55.gff3 --organism euk

done


BUT it keeps giving me the error:

edit: solved the previous error, now have this new one.

Error. Cannot open --reference '/home/x/Desktop/fastq/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa'. Make sure it exists.

The file does exist in that directory.

How do i go around solving this?

0
Entering edit mode

Do you have a copy of this genome in IGV? Rather than using some specialized tool, you should just do your HISAT alignment on the top 10,000 reads as unstranded and then look at the alignments in IGV. It should be super easy to tell from that.

0
Entering edit mode

Thank you for your reply. How would i incorporate your suggestion into a pipeline? I dont think its feasible.

1
Entering edit mode

You want the pipeline to make a suggestion (so a script decides the stranded-ness)?

What I do is subset around the first 10-100k reads from the two read files (assuming here standard RNA-seq). Then I align them to my genome + GTF (I use the aligner STAR). Then I have a small R script that counts the number of overlaps (gene counts) across the transcriptome. I split the counts by those fragments aligning in the sense and anti-sense direction. (You could do it a bit more advanced if you have some custom RNA-seq library kits). In theory. the reads could be FF, FR, RF, RR but in general that is not applicable to most situations. However, if you think it is you could code that in and separate the reads from each pair by strand to verify.

Anyway for the standard case for instance like take a random gene:

SRSF2 Plus-strand fragment counts: 498 Minus-strand fragment counts: 512

I take the ratio of all these counts genome wide (+/-). If the fraction is 0.4-0.6 I assume an unstranded library. If the fraction is < 0.4 then it is stranded-antisense and if it is >0.6 then it is stranded-sense. In this case it's close to 50% so this unstranded.

Also the standard/most common Illumina RNA-seq library kit produces anti-sense fragments so that is what you will likely see the most.

You should run a few test cases but this works for me in 99% of situations.

0
Entering edit mode

if its possible for you, could you please share a link to your script so I could try running it?

Best wishes.

0
Entering edit mode

Once again it is more or less what does GUESSmyLT

0
Entering edit mode

It is more or less what does GUESSmyLT

0
Entering edit mode

Did you load your data within the container? e.g. if your data are all within the current folder:

  docker run -v ${PWD}:/data quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT \ --reads /data/${infile} \${base}_2.fastq \
--reference /data/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa --mode genome \
--annotation /data/Glycine_max.Glycine_max_v2.1.55.gff3 --organism euk


If you can list all foder where stand the different data and add all those path using -v docker option

1
Entering edit mode
8 weeks ago

Read the method section of the publication associated with your SRA files where they should list the kit used for the library preparation. Then based on the protocol you should be able to surmise what strand orientation is expected.

Otherwise as @swbarnes2 already stated. It is much easier to align a small subset of the FASTQ and identify the strand bias that way.

1
Entering edit mode
8 weeks ago
KoppesEA ▴ 80

https://rseqc.sourceforge.net/ infer_experiment.py

0
Entering edit mode

OP said he didn't want to align. That script takes a bad and a bam as input.

0
Entering edit mode

Right, I mean you can't get strandedness without aligning. The first two answers just align a subset and align reads and the second actually uses rseqc infer_experiment.py.

1
Entering edit mode

I'm not sure they wanted to do it without aligning just before their Hisat2 alignment step. Well that's how I read it at least.

0
Entering edit mode
8 weeks ago

Ask the people who did the library prep what kit they used.

Otherwise, just align and figure it out.

1
Entering edit mode

I'm using publicly available data from sra database. 90% of the time no strand information is given