Which strand specific option need to be used with HISAT2?
2
2
Entering edit mode
4.7 years ago
Vasu ▴ 710

I'm working with RNA-seq data. For the analysis I will be using HISAT2 to align sequencing reads to the human genome GRCh38. Samples are subjected to strand-specific RNA sequencing using poly-A selection protocol and sequenced on the Illumina HiSeq 2000 system. They are paired-end reads.

I would like to know how the Hisat2 command need to be given for strand-specific option.

--rna-strandness should be RF or FR


Thank you

RNA-Seq hisat2 strandspecific • 9.8k views
0
Entering edit mode

It is very convenient to use the pseudo mapper salmon with the -l A option. Salmon evaluates all forms of strandness at the same time with the "A" option and will reveal in the final log lib_format_counts.json file the kind of strandness in a question of minutes, requiring a very low amount of RAM. You can download the indexes used by salmon from this page

Look the content of my lib_format_count.json files using a paired dUTP stranded library (used most of the time). The nomenclature for the king of strandness can be read HERE. Additional information about fragment library types can be also found here

{
"read_files": "[ FASTQ/S737/S737_1.fq.gz, FASTQ/S737/S737_2.fq.gz]",
"expected_format": "ISR",
"compatible_fragment_ratio": 1.0,
"num_compatible_fragments": 28060711,
"num_assigned_fragments": 28060711,
"num_frags_with_concordant_consistent_mappings": 27382742,
"num_frags_with_inconsistent_or_orphan_mappings": 727783,
"strand_mapping_bias": 0.0,
"MSF": 0,
"OSF": 0,
"ISF": 0,
"MSR": 0,
"OSR": 0,
"ISR": 27382742,
"SF": 390621,
"SR": 337162,
"MU": 0,
"OU": 0,
"IU": 0,
"U": 0
}

8
Entering edit mode
4.7 years ago
apeltzer ▴ 150

For Hisat2 there is two things to take care of:

If its single ended data and forward stranded you need to set: -rna-strandness F If its paired end data and forward stranded you need to set: -rna-strandness FR

Similar, if its reverse stranded for SE data: -rna-strandness R or (Paired End, reverse stranded) -rna-strandness RF

You may have to check in more detail what the protocol actually does.

0
Entering edit mode

Is there a possibility to know whether it is forward stranded or reverse stranded from fatsq files I have (sample1_1.fastq, sample1_2.fatsq) ?

2
Entering edit mode

If you don't know if for sure, you can try using RSeQC to infer the strandedness:

http://rseqc.sourceforge.net

0
Entering edit mode

Hi,

Which Bam file I should use? Regarding bed file can I use the one which is given in the Rseqc webiste (https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/)

0
Entering edit mode

Which option in RSeQC will infer strandedness? I have single-end data and don't know if it's stranded. I run HISAT2 but only used the -x -U -S options. Thanks!

2
Entering edit mode
4.7 years ago
JJ ▴ 630

Most are fr-firststrand (dUTP based methods). But yes it's good to check with infer_experiment.py from RSeQC. For fr-firststrand you should use RF.

0
Entering edit mode

ok. Which Bam file I should use? Regarding bed file can I use the one which is given in the Rseqc webiste (https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/)

1
Entering edit mode

You feed in the mapping - the bam file - e.g., align with HISAT2 first without setting the strangeness (or do ... it doesn't really matter for this application) and then feed it into infer_experiment.py from RSeQC. Bed file - well this depends on the reference you have used - you need to pick the right one. Alternatively, convert the gtf or gff file corresponding to your reference (e.g., from Ensembl or Gencode) to bed e.g. using galaxy.

0
Entering edit mode

Thanks for the reply. Regarding BED file - I didn't use any reference gtf for alignment with HISAt2, I just gave the GRCh38 reference genome index. But I will be using the gencode new version gtf in Stringtie step. So Can I convert that gencode gtf to bed and use it for the infer_experiment.py ?

0
Entering edit mode

If you use the gencode reference, you can download the same annotation also as gff3 which you can then convert to bed using galaxy. Alternatively, use this tool (I haven't tried it myself though but came across it) for gtf to bed conversion.

0
Entering edit mode

Yes, but gtf2bed doesn't give 12 column bed file. And RSeqc needs 12 column bed file right? Or I can also do it this way how to get Homo_sapiens.GRCh38.84.bed file

0
Entering edit mode

As I said, I have not tried that tool. I downloaded the gff3 from the gencode website and converted it with galaxy to bed (you find it under conversion tools). This worked fine for me. RSeQC can work with it. However, use the awk command of Devon if you don't like galaxy :)

0
Entering edit mode

On another note, you should build the index with a gtf file - not only based on the reference sequence. See protocol for how to do this.

0
Entering edit mode

I'm using the standard index from HISAT2 website. So, do I need to build the index with the gencode gtf file first and then use it for alignment and then convert the gtf to bed and then use the infer_experiment.py. Am I right?

0
Entering edit mode

You can use the standard index of course which already contains the transcripts (if you use the proper one - of course) - as far as I know it's based on Ensembl. I build my own so I have the newest version and the corresponding gtf / gff3 files from gencode. So yes, if you want to use the gencode annotation, I would use it 'all the way through'. Having said that it wouldn't make much difference just for the RSeQC I suppose but my guess is that you want to do something further down with the alignment where you need an annotation - so then I would stick with the same annotation for all steps.

0
Entering edit mode

Yes, Thank you. I downloaded the gff3 file from gencode and In Galaxy (https://toolshed.g2.bx.psu.edu/) I dont see any gff3tobed converter. Could you please help me with this.

1
Entering edit mode
0
Entering edit mode

Ok. I did the conversion. And used the infer_experiment.py This is what I see

Total 9230 usable reads were sampled

This is PairEnd Data Fraction of reads failed to determine: 0.0000 Fraction of reads explained by "1++,1--,2+-,2-+": 0.0047 Fraction of reads explained by "1+-,1-+,2++,2--": 0.9953

Does this mean fr-firststrand which RF for --rna strandness in HISAt2?

0
Entering edit mode

Ok I got the answer from your old post [RSeQC Output from infer_experiment.py - what does it mean?] Thanks a lot for the help.