Question

Python scripts to map multiple pair end files to the genome

1

Entering edit mode

6.9 years ago

Bioinfonext ▴ 460

I read python basic codes and I am fimiliar with for loops, But not able to understand how to make a python programme to run hisat2 on multiple pair-end reads.

I want to run hisat2 mapping tool on mutiple pair end reads to map these on Genome.

I already have prepared genome index.

1)my input will be hisat2 location and genome index, pair-end read file 2) How should I give multiple hisat2 arguments and how to store sam file in a separate folder for each pair-end reads.

Can you help to make this python script ?

Location of hisat2:
 /home/yog/software/hisat2-2.0.4/hisat2

Location of genome index:
data/Analysis/Radish_index

Location of pair end files:

 /data/Analysis/mapped/

multiple pair end files like this:

216_5W_Ca1_R1.fq

216_5W_Ca1_R2.fq

218_5W_Ca1_R1.fq

218_5W_Ca1_R1.fq



216_5W_Co1_R1.fq

216_5W_Co1_R2.fq

218_5W_Co1_R1.fq

218_5W_Co1_R1.fq

Now I am running hisat2 on individual files by using this commonad:

/home/yog/software/hisat2-2.0.4/hisat2 -p 8 -x /data/Analysis/Radish_index  --fr -1 /data/Analysis/mapped/216_5W_Ca1_R1.fq -2 /data/Analysis/mapped/216_5W_Ca1_R2.fq -S 216_5W_Ca1.sam

RNA-Seq • 4.3k views

ADD COMMENT • link updated 6.9 years ago by Philipp Bayer 8.3k • written 6.9 years ago by Bioinfonext ▴ 460

score 7 · Accepted Answer · 2017-05-19

7

Entering edit mode

6.9 years ago

Philipp Bayer 8.3k

It would be easier to do this in bash:

for r1 in  /data/Analysis/mapped/*R1.fq
do 
  #replace R1 by R2
  r2=${r1/R1/R2}
  # snip off the fq and add sam to the output file name
  out=${r1%%_R1.fq}.sam
  /home/yog/software/hisat2-2.0.4/hisat2 -p 8 -x /data/Analysis/Radish_index  --fr -1 $r1 -2 $r2 -S $out
done

In Python it's similar:

import glob
import os
for r1 in glob.glob('/data/Analysis/mapped/*R1.fq'):
  r2 = r1.replace('R1', 'R2')
  out = r1.replace('_R1.fq', '.sam')
  os.popen('/home/yog/software/hisat2-2.0.4/hisat2 -p 8 -x /data/Analysis/Radish_index  --fr -1 %s -2 %s -S %s'%(r1, r2, out))

None of those are multithreaded - you can use the subprocess library instead of os.popen in Python so that each job runs in the background instead of it running each job one after the other, and you can end the command using & in bash to run all of the jobs at the same time, like this:

/home/yog/software/hisat2-2.0.4/hisat2 -p 8 -x /data/Analysis/Radish_index  --fr -1 $r1 -2 $r2 -S $out &

Here I'm also not doing anything special to capture the standard error or the standard output, it would be better for you to have a look at all of the standard error.

ADD COMMENT • link 6.9 years ago by Philipp Bayer 8.3k

0

Entering edit mode

@Philipp Bayer Thanks, this bash script is awesome as I tried it on my data. But, my paired-end files are named in the following pattern:

sample1_lane1_R1_001.pass_sample1.fq     sample1_lane1_R2_001.pass_sample1.fq
sample2_lane1_R1_001.pass_sample2.fq     sample2_lane1_R2_001.pass_sample2.fq

I am trying to edit the script to get the following pattern in the name of sam and bam output, but did not work:

sample1_lane1_pass_sample1.sam
sample2_lane1_pass_sample2.sam

However, the editing of :

SAM=${r1%%_R1_.fq}.sam

/home/yog/software/hisat2-2.0.4/hisat2 -p 8 -x /data/Analysis/Radish_index  --fr -1 $r1 -2 $r2 -S $out

to

 SAM=${r1%%_R1_.fq}.sam

 BAM=${r1%%_R1_.fq}.bam

 /home/yog/software/hisat2-2.0.4/hisat2 -p 8 -x /data/Analysis/Radish_index  --fr -1 $r1 -2 $r2 -S  $SAM | samtools view -bS > $BAM

worked. :)

ADD REPLY • link 6.5 years ago by bioinfo8 ▴ 230

0

Entering edit mode

I figured out:

  SAM=${r1%.fq}.sam
  BAM=${r1%.fq}.bam

Ref: How to remove the extension of a file?

ADD REPLY • link 6.5 years ago by bioinfo8 ▴ 230