What is the preferred script for HISAT2 genome-guided analysis using a draft genome? --dta or --exon?
2.8 years ago
Farbod ★ 3.3k

Dear Biostars, Hi,

I have RNA-seq data from a fish (3 cond1 and 3 cond2 samples as biological replicates), and I have done a Trinity de novo assembly and DEG analysis on these data. Now the draft genome of that species has been released, and I also want to run a genome-guided DEG analysis to compare the results.

With help from @Kevin and other Biostars users, I selected the HISAT2 -> StringTie -> Ballgown pipeline.

At the first step, I have indexed my genome:

 ./hisat2-build -p 6  '/home/salmon-genome-2018/GCF_SSa_v1.0_genomic.fna'  ht2_base_salmon_genome


But it seems that there are several options/switches I can add to the HISAT2 mapping command.

My first command for one of the replicates (C1) was:

./hisat2 -p 6 -x ht2_base_salmon_genome -1 '/RNA_Seq_Data/C1_clean_left.fq' -2 '/RNA_Seq_Data/C1_clean_right.fq' -S '/RNA_Seq_Data/C1.sam' &> C1.sam.info


and 6 SAM files were created. But then I found this in the StringTie manual:

"be sure to run HISAT2 with the --dta option for alignment, or your results will suffer."

I asked here, and @Vijay Lakhujani thought that using --dta is a better idea.

Then I used this script and re-ran all 6 mappings:

./hisat2 -p 6 --dta -x ht2_base_salmon_genome -1 '/RNA_Seq_Data/C1_clean_left.fq' -2 '/RNA_Seq_Data/C1_clean_right.fq' -S '/RNA_Seq_Data/C1.sam' &> C1.sam.info
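
For reference, the same --dta alignment can be looped over all six replicates. This is only a sketch: apart from C1, the sample names are hypothetical, and the `samtools sort` step is an assumption added here because StringTie expects coordinate-sorted alignments. By default the script only prints the commands; setting DRY_RUN=0 executes them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; execute it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

INDEX=ht2_base_salmon_genome    # index basename from the hisat2-build step
DATA=/RNA_Seq_Data              # read directory used in the post
SAMPLES="C1 C2 C3 T1 T2 T3"     # hypothetical names; only C1 appears in the post

align_all() {
  for s in $SAMPLES; do
    # --dta is a standalone flag; -x must be followed directly by the index basename.
    run ./hisat2 -p 6 --dta -x "$INDEX" \
      -1 "$DATA/${s}_clean_left.fq" -2 "$DATA/${s}_clean_right.fq" \
      -S "$DATA/${s}.sam"
    # Assumption: sort to BAM, since StringTie expects coordinate-sorted input.
    run samtools sort -@ 6 -o "$DATA/${s}.bam" "$DATA/${s}.sam"
  done
}

align_all
```

Reviewing the printed commands first makes it easier to spot a misplaced option before six long alignments are started.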


Now, there is another hint in the StringTie manual:

It is highly recommended to use the reference annotation information when mapping the reads, which can be either embedded in the genome index (built with the --ss and --exon options, see HISAT2 manual), or provided separately at run time (using the --known-splicesite-infile option of HISAT2).

Q: What is the standard/preferred HISAT2 command for mapping? What should I do now: re-run all 6 mappings after adding --ss and --exon to my previous script? How can I find splice site information for this newly released genome?

~Thanks

RNA-Seq alignment HISAT2 genome guided • 1.9k views

@Farbod: The quote you posted above gives you pointers on what to do. You can:

• Either recreate the genome index with the --ss and --exon options and then re-align your data.
• Or provide a file of known splice sites via --known-splicesite-infile and re-align the data using the current genome index.
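
The two options could look roughly like the sketch below. It assumes an annotation in GTF format is available (a GFF3 file would first need converting, for example with gffread) and uses the extraction scripts shipped with HISAT2. The file names annotation.gtf, splice_sites.txt, exons.txt and the index name ht2_salmon_genome_tran are hypothetical. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; execute it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Same as run, but the command's stdout is written to the file named first.
run_to() {
  local out=$1; shift
  echo "+ $* > $out"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@" > "$out"; fi
}

GENOME=/home/salmon-genome-2018/GCF_SSa_v1.0_genomic.fna  # FASTA from the post
GTF=annotation.gtf      # hypothetical: annotation in GTF format
SS=splice_sites.txt     # hypothetical output file names
EXON=exons.txt
READS=/RNA_Seq_Data

# Extract splice sites and exons with the helper scripts that ship with HISAT2.
extract_annotation() {
  run_to "$SS" hisat2_extract_splice_sites.py "$GTF"
  run_to "$EXON" hisat2_extract_exons.py "$GTF"
}

# Option 1: embed the annotation in a new index, then align against it.
option_rebuild_index() {
  run ./hisat2-build -p 6 --ss "$SS" --exon "$EXON" "$GENOME" ht2_salmon_genome_tran
  run ./hisat2 -p 6 --dta -x ht2_salmon_genome_tran \
    -1 "$READS/C1_clean_left.fq" -2 "$READS/C1_clean_right.fq" -S "$READS/C1.sam"
}

# Option 2: keep the existing index and pass the splice sites at run time.
option_runtime_splice_sites() {
  run ./hisat2 -p 6 --dta -x ht2_base_salmon_genome \
    --known-splicesite-infile "$SS" \
    -1 "$READS/C1_clean_left.fq" -2 "$READS/C1_clean_right.fq" -S "$READS/C1.sam"
}

extract_annotation
case "${OPTION:-runtime}" in
  rebuild) option_rebuild_index ;;
  runtime) option_runtime_splice_sites ;;
esac
```

Option 2 avoids rebuilding the index, which matters because building an index with --ss and --exon tends to need considerably more memory than a plain index. Either way the reads have to be aligned again.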

Dear @genomax, Hi

I do not have any "file of known splice sites". So in this case, you mean I should create a new genome index using --ss and --exon and then map all the reads again using --dta, yes?

Can we say this is the preferred/standard approach for genome-guided analysis with HISAT2?


I don't think so. When you don't provide the file of known splice sites, it will (probably) use a default set of potential splice sites. The reason for advising this kind of option (the same goes for --ss and --exon) is that the mapping can be more specific, since alignments that do not coincide with a known splice site can be filtered out (it might even speed up the alignment step for the same reason).

The consequence is that you will get fewer novel genes (i.e. genes not present in the GFF file) or models with alternative splice sites. It all depends on what your goal is.


Dear @lieven.sterck, hi and thanks.

It seems that your idea is different from @genomax's.

You believe that, as I do not have "the file of known splice sites", I should use the SAM files obtained with --dta and proceed to the next step. Correct?


I don't think I differ a lot from what genomax is telling you. If it is in the manual, I would consider (re-)running it with --ss and --exon activated (for the genome index building).

I merely wanted to point out that if, for some reason, you don't have the required info (known splice sites) or you do not want to rerun the mapping (building a new index), you could proceed with what you have. If you have access to the genome annotation file, I would consider using it, provided of course that the gene prediction is any good. If it is of low quality, then proceed without it.
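
A quick first look at an annotation file is to count its feature types: if it contains no exon records, no splice sites can be extracted from it. The sketch below assumes a tab-separated GFF/GTF file with the feature type in column 3. The small demo file is made up so that the snippet runs on its own, and it says nothing about the real annotation. The GCF_ prefix of the FASTA name suggests a RefSeq assembly, and these are often distributed with a GFF file, but that is an assumption to verify on the download page.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count how often each feature type (column 3) occurs in a GFF/GTF file.
feature_counts() {
  awk -F'\t' '!/^#/ && NF >= 3 { c[$3]++ } END { for (t in c) print c[t], t }' "$1" \
    | sort -k1,1nr
}

# Number of exon records; 0 means no splice sites can be derived from the file.
exon_count() {
  awk -F'\t' '!/^#/ && $3 == "exon" { n++ } END { print n + 0 }' "$1"
}

# Made-up GFF3 lines for demonstration only; replace with the real annotation.
demo=$(mktemp)
{
  printf '##gff-version 3\n'
  printf 'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1\n'
  printf 'chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1\n'
  printf 'chr1\tdemo\texon\t100\t300\t.\t+\t.\tParent=t1\n'
  printf 'chr1\tdemo\texon\t600\t900\t.\t+\t.\tParent=t1\n'
} > "$demo"

feature_counts "$demo"
echo "exon records: $(exon_count "$demo")"
rm -f "$demo"
```

Counts alone do not show whether the gene predictions are good, so this is only a sanity check before extracting splice sites.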


Thank you,

How can I tell whether there is any known splice site information for this "whole genome shotgun sequence"?

Its structure is:

chromosome 1
chromosome 2
...
chromosome 33
chromosome 34
and many "unplaced genomic scaffold" entries


Since you don't have a file with known transcript models you don't have the splice sites file.

You can use StringTie to create new transcript models. Without known transcripts there could be a large number of false positives that you would need to deal with. Since you have Trinity-assembled transcripts, you could compare those with the StringTie-generated ones and see if you can reconcile them into a usable dataset.
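
Creating new transcript models without a reference annotation could be sketched as follows: assemble each sample, merge the assemblies, then re-estimate abundances against the merged set in the table format Ballgown reads. The sample names other than C1, the sorted BAM inputs and the output paths are assumptions, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command; execute it only when DRY_RUN=0 (the default is a dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

DATA=/RNA_Seq_Data
SAMPLES=(C1 C2 C3 T1 T2 T3)   # hypothetical names; only C1 appears in the post

assemble_and_merge() {
  local gtfs=()
  # 1. Assemble transcripts per sample; no -G because no annotation is used.
  for s in "${SAMPLES[@]}"; do
    run stringtie "$DATA/$s.bam" -p 6 -o "$s.gtf"
    gtfs+=("$s.gtf")
  done
  # 2. Merge the per-sample assemblies into one non-redundant transcript set.
  run stringtie --merge -o merged.gtf "${gtfs[@]}"
  # 3. Re-estimate abundances against the merged set (-e) and write
  #    Ballgown input tables (-B) into one directory per sample.
  for s in "${SAMPLES[@]}"; do
    run stringtie "$DATA/$s.bam" -e -B -p 6 -G merged.gtf -o "ballgown/$s/$s.gtf"
  done
}

assemble_and_merge
```

The merged.gtf from step 2 is the set of models that would then be compared with the Trinity transcripts.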


Hi @genomax, you mean that with this genome, which is not well annotated, the genome-guided approach is not so valuable, correct?

Of course, they have run some RNA-seq in their genome sequencing project, too (would you please have a look?).

By "since I have Trinity assembled transcripts", do you mean I can use this so-called genome-guided approach, check for DEGs and probable alternative splicing events that overlap between the two methods, and consider those as true DEGs?