How to find rna strand direction before alignment?
5
0
Entering edit mode
21 months ago
pubsurfted ▴ 40

Hello. I'm working on rna seq pipeline and I would like to find the strand direction of the data before doing alignment via hisat2.

RNA-Seq • 3.0k views
ADD COMMENT
5
Entering edit mode
21 months ago
Jack Tierney ▴ 400

If you want to work out strandedness within a pipeline the maybe how_are_we_stranded_here would work? I believe it does pretty much exactly what @benformatics and @swbarnes2 said above re aligning and figuring out

ADD COMMENT
1
Entering edit mode

That’s pity they didn’t mention our work(GuessmyLT) in their paper.

ADD REPLY
2
Entering edit mode
21 months ago
Juke34 8.7k

You can give a try to "Guess My LibraryType": GUESSmyLT

ADD COMMENT
0
Entering edit mode

Hello, thank you so much for responding. I had downloaded the GUESSmyLT docker image on my pop os. I am using the following command to run it:

    #!/usr/bin/env bash

cd ~/Desktop/fastq/


for infile in *_1.fastq
do
   base=$(basename ${infile} _1.fastq)

    docker run quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT \
    GUESSmyLT --reads ${infile} ${base}_2.fastq \
    --reference Glycine_max.Glycine_max_v2.1.dna.toplevel.fa --mode genome \ 
    --annotation Glycine_max.Glycine_max_v2.1.55.gff3 --organism euk

done

BUT it keeps giving me the error:

edit: solved the previous error, now have this new one.

Error. Cannot open --reference '/home/x/Desktop/fastq/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa'. Make sure it exists.

The file does exist in that directory.

How do i go around solving this?

ADD REPLY
0
Entering edit mode

Do you have a copy of this genome in IGV? Rather than using some specialized tool, you should just do your HISAT alignment on the top 10,000 reads as unstranded and then look at the alignments in IGV. It should be super easy to tell from that.

ADD REPLY
0
Entering edit mode

Thank you for your reply. How would i incorporate your suggestion into a pipeline? I dont think its feasible.

ADD REPLY
1
Entering edit mode

You want the pipeline to make a suggestion (so a script decides the stranded-ness)?

What I do is subset around the first 10-100k reads from the two read files (assuming here standard RNA-seq). Then I align them to my genome + GTF (I use the aligner STAR). Then I have a small R script that counts the number of overlaps (gene counts) across the transcriptome. I split the counts by those fragments aligning in the sense and anti-sense direction. (You could do it a bit more advanced if you have some custom RNA-seq library kits). In theory. the reads could be FF, FR, RF, RR but in general that is not applicable to most situations. However, if you think it is you could code that in and separate the reads from each pair by strand to verify.

Anyway for the standard case for instance like take a random gene:

SRSF2 Plus-strand fragment counts: 498 Minus-strand fragment counts: 512

I take the ratio of all these counts genome wide (+/-). If the fraction is 0.4-0.6 I assume an unstranded library. If the fraction is < 0.4 then it is stranded-antisense and if it is >0.6 then it is stranded-sense. In this case it's close to 50% so this unstranded.

Also the standard/most common Illumina RNA-seq library kit produces anti-sense fragments so that is what you will likely see the most.

You should run a few test cases but this works for me in 99% of situations.

ADD REPLY
0
Entering edit mode

if its possible for you, could you please share a link to your script so I could try running it?

Best wishes.

ADD REPLY
0
Entering edit mode

Once again it is more or less what does GUESSmyLT

ADD REPLY
0
Entering edit mode

It is more or less what does GUESSmyLT

ADD REPLY
0
Entering edit mode

Did you load your data within the container? e.g. if your data are all within the current folder:

  docker run -v ${PWD}:/data quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT \
    --reads /data/${infile} ${base}_2.fastq \
    --reference /data/Glycine_max.Glycine_max_v2.1.dna.toplevel.fa --mode genome \ 
    --annotation /data/Glycine_max.Glycine_max_v2.1.55.gff3 --organism euk

If you can list all foder where stand the different data and add all those path using -v docker option

ADD REPLY
1
Entering edit mode
21 months ago

Read the method section of the publication associated with your SRA files where they should list the kit used for the library preparation. Then based on the protocol you should be able to surmise what strand orientation is expected.

Otherwise as @swbarnes2 already stated. It is much easier to align a small subset of the FASTQ and identify the strand bias that way.

ADD COMMENT
1
Entering edit mode
21 months ago
KoppesEA ▴ 80

https://rseqc.sourceforge.net/ infer_experiment.py

ADD COMMENT
0
Entering edit mode

OP said he didn't want to align. That script takes a bad and a bam as input.

ADD REPLY
0
Entering edit mode

Right, I mean you can't get strandedness without aligning. The first two answers just align a subset and align reads and the second actually uses rseqc infer_experiment.py.

ADD REPLY
1
Entering edit mode

I'm not sure they wanted to do it without aligning just before their Hisat2 alignment step. Well that's how I read it at least.

ADD REPLY
0
Entering edit mode
21 months ago

Ask the people who did the library prep what kit they used.

Otherwise, just align and figure it out.

ADD COMMENT
1
Entering edit mode

I'm using publicly available data from sra database. 90% of the time no strand information is given

ADD REPLY

Login before adding your answer.

Traffic: 1339 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6