Question

Need help changing RNA-seq code from histat aligner into Star aligner please.

0

Entering edit mode

3.8 years ago

screadore ▴ 20

I need help converting histat aligner into star aligner for this script. Please let me know what I could do. Link to the full code is here: https://github.com/JustinGibbons/USF_Omics_Hub/blob/master/RNA-Seq_Pipeline/rna_seq_pipeline.py

def write_out_hisat_array_slurm_file(arguments_file,outfile="hisat_array_commands.sh"):
"""Uses data from input file to write out the hisat commands as an array job"""

arguments_dic=read_key_word_input_file(arguments_file,write_output_dir=True)
#hisat_array_name="hisat_array"
#sam_array="sam_array" I don't think I need this for right now
samtools_sort_array_name="samtools_sort_array"
#bam_array="bam_array" I don't think I need this for anything right now
task_id="$SLURM_ARRAY_TASK_ID"
workdir=arguments_dic["main_output_dir"]
print workdir
#Create the directory for the alignment files
alignment_dir="Alignments"
os.mkdir(workdir+"/"+alignment_dir)
#Dictionary pointing from sample names to reads
if "Data_Paired_End" not in arguments_dic:
    data_paired_end=True
    read_dict=find_matching_reads(get_fastq_files(arguments_dic["sample_dir"]))
else:
    data_paired_end=arguments_dic["Data_Paired_End"]
    if data_paired_end=="TRUE" or data_paired_end=="true" or data_paired_end=="True" or data_paired_end=="T":
        data_paired_end=True
        read_dict=find_matching_reads(get_fastq_files(arguments_dic["sample_dir"]))
    else:
        data_paired_end=False
        read_dict=find_unmatched_reads(get_fastq_files(arguments_dic["sample_dir"]))

#List of slurm options to be written on their own lines
    slurm_commands=create_slurm_header(workdir=arguments_dic["main_output_dir"],
                                    time=":".join([arguments_dic["hours"],arguments_dic["minutes"],"00"]),
                                    ntasks=arguments_dic["threads"],
                                    mem=arguments_dic["mem"],
                                    mail_user=arguments_dic["email_to_send_output_to"],submit_to_rra=arguments_dic["submit_to_rra"])

slurm_commands2=create_slurm_header(workdir=arguments_dic["main_output_dir"],
                                    time=":".join([arguments_dic["hours"],arguments_dic["minutes"],"00"]),
                                    ntasks=arguments_dic["threads"],
                                    mem=arguments_dic["mem"],
                                    mail_user=arguments_dic["email_to_send_output_to"],submit_to_rra=arguments_dic["submit_to_rra"])

hisat_command_input,sam_files=construct_hisat_array_command(index_files_path_and_basename=arguments_dic["genome_index"],
                                                    read_dict=read_dict,
                                                    number_of_processors=arguments_dic["threads"],
                                                    input_dict=arguments_dic,alignment_dir=alignment_dir,data_paired_end=data_paired_end)


if data_paired_end==True:
    read1_array,read2_array,hisat_output_array,hisat_command,deref_hisat_outfiles=hisat_command_input
elif data_paired_end==False:
    unpaired_reads_array,hisat_output_array,hisat_command,deref_hisat_outfiles=hisat_command_input
#print hisat_command_input
bam_array,samtools_sort_command,bam_files=construct_samtools_commands(sam_files,deref_hisat_outfiles,threads=arguments_dic["threads"])
arguments_dic["alignment_files"]=bam_files
job_size,cufflinks_input_array_string,cufflinks_outputdirs_array_string,cufflinks_command=construct_cufflinks_array_command(arguments_dic)
gff_file=arguments_dic["gff"]
if "gff_annotation" in arguments_dic:
    id_value=arguments_dic["gff_annotation"]
else:
     id_value="gene_id"
featurecounts_command_list=construct_feature_counts_command(gff_file,bam_files,id_value=id_value)
featurecounts_command=" ".join(featurecounts_command_list)
normalization_groups=arguments_dic["normalization_groups"]
normalization_method=arguments_dic["normalization_method"]
cuffnorm_command_list=construct_cuffnorm_command(gff_file,bam_files,alignment_dir+"/",groupings=normalization_groups,normalization_method=normalization_method)
cuffnorm_command=" ".join(cuffnorm_command_list)
narray=len(sam_files)
##May want to include a warning here if only one job is allowed to run at a time
narray_arg="#SBATCH --array=0-"+str(narray-1)+"%"+str(narray)
slurm_commands.append(narray_arg)
slurm_commands.append("#SBATCH --output=rna-seq_pipeline.out")
slurm_commands.append("module purge")
slurm_commands.append("module load apps/hisat2/2.1.0")
slurm_commands.append("module load apps/samtools/1.3.1")
slurm_commands.append("module load apps/cufflinks/2.2.1")

slurm_commands2.append("#SBATCH --output=rna-seq_pipeline2.out")
slurm_commands2.append("module load apps/cufflinks/2.2.1")
slurm_commands2.append("module load apps/subread/1.6.3")

RNA-Seq • 1.4k views

ADD COMMENT • link 3.8 years ago by screadore ▴ 20

2

Entering edit mode

This is a huge mess of a python script that writes a script to be submitted to slurm based on input parameters. You're better off using the sample script file in the github repo and replacing hisat2 with STAR in that. Changing this script is going to take a TON of time as it is insanely specific to the hisat2 pipeline and the environment it's running. It's spaghetti code at its best (worst?).

ADD REPLY • link 3.8 years ago by Ram 43k

2

Entering edit mode

most of this script has nothing to do with HISAT2 or STAR and is just the developer reinventing what pipeline frameworks do

ADD REPLY • link 3.8 years ago by Jeremy Leipzig 22k

2

Entering edit mode

Exactly. This is what obliterates any sense of portability in the scripts - they're tied too closely to the tools AND the environment. There's no plug and play here.

ADD REPLY • link 3.8 years ago by Ram 43k

0

Entering edit mode

Thank you very much for the reply. Could you give me some insight into how to go about replacing the hisat2 with STAR in the sample script file? I'd greatly appreciate it. Thank you again.

ADD REPLY • link 3.8 years ago by screadore ▴ 20

0

Entering edit mode

Sorry, that is not a level of help I'm willing to offer. Please read the hisat2 and STAR manual and see what changes need to be made.

ADD REPLY • link 3.8 years ago by Ram 43k

1

Entering edit mode

Just use nf-core RNAseq - it can do both

ADD REPLY • link 3.8 years ago by Asaf 10k

0

Entering edit mode

The only issue is for the lab we need to use the pipeline provided above.

ADD REPLY • link 3.8 years ago by screadore ▴ 20

1

Entering edit mode

Dare I ask why? If you are replacing the core program, it is not the same pipeline anymore.

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

I honestly don't know it's one of my colleagues issue and I'm just trying to help her figure it out.

ADD REPLY • link 3.8 years ago by screadore ▴ 20

1

Entering edit mode

I would try to figure out why she is doing this in the first place. If change one step, downstream steps may break. You could be troubleshooting for a while and it may be irrelevant in the end.

ADD REPLY • link 3.8 years ago by igor 13k

1

Entering edit mode

If that's the case, your colleague should be the one making the post here. Any effort put into making this script work with STAR is wasted, as you'll spend 10x the time understanding this script as you will into switching to STAR.

ADD REPLY • link 3.8 years ago by Ram 43k

1

Entering edit mode

then you'll need to look at the construct_hisat_array_command and write a corresponding construct_star_array_command and obviously change the module command

ADD REPLY • link 3.8 years ago by Jeremy Leipzig 22k