Python script to automatically read in and parse R1 & R2 fastq.gz files, and name the output files, when given an input directory of raw data files

Hi,

I've inherited a project with a custom Python script (it counts barcodes in a specific way) that is called from a batch submission job on my institution's server.

Here is the Linux command used to call the script on the server:

python3 /path/to/files/customscript.py --R1=R1filename.fastq.gz --R2=R2filename.fastq.gz --name=Sample1 2>&1 > Sample1

Our current workflow is to call the script manually for each individual sample (10 total), specifying the filenames for read 1 and read 2 as well as the name for the output (we make a bash script with 10 commands). I'm hoping to streamline that process.

My goal is to modify the customscript.py script so that when I call it I only need to specify the input and output directories, something like this:

python3 /scratch/path/to/files/customscript.py  -inputdir fastq.gzfiles_dir -outputdir output_files

The new code would iterate over all the filenames in the input directory, find both read files from the same sample (e.g. "sample1_R1.fastq.gz" and "sample1_R2.fastq.gz"), name the output according to the text in the filename up to the "_R1" or "_R2" suffix ("sample1"), and pass both files to the custom script.

I'm not exactly sure where to start; I'm pretty new to Python and fumbling through. I've been playing around with the os, glob and pathlib modules and think I'm on the right track, but I'm hoping someone has worked on this type of thing before.
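To make the goal concrete, below is a rough, untested sketch of what I'm imagining. The process_sample() function is just a placeholder for whatever customscript.py currently does per sample, the -inputdir/-outputdir flags are the ones I proposed above, and the _R1/_R2 suffixes are assumed from our filenames.

import argparse
from pathlib import Path

def process_sample(r1, r2, name, outputdir):
    # placeholder for the existing per-sample barcode-counting logic
    print(f"would process {name}: {r1.name} + {r2.name} -> {outputdir / name}")

parser = argparse.ArgumentParser()
parser.add_argument("-inputdir", required=True)
parser.add_argument("-outputdir", required=True)
args = parser.parse_args()

inputdir = Path(args.inputdir)
outputdir = Path(args.outputdir)

# find every R1 file, derive the sample name, and look up the matching R2
for r1 in sorted(inputdir.glob("*_R1.fastq.gz")):
    name = r1.name[:-len("_R1.fastq.gz")]   # "sample1_R1.fastq.gz" -> "sample1"
    r2 = inputdir / f"{name}_R2.fastq.gz"
    if r2.exists():
        process_sample(r1, r2, name, outputdir)
    else:
        print(f"no R2 file found for {r1.name}, skipping")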

Thanks in advance.


My goal is to modify the customscript.py script so that when I call it I can just specify the input and output directories

You said there are 10 directories, meaning 10 commands. Given your inexperience with Python, do you really think it is worth your time - which in this case also means our time - to come up with an automated way just so you can type one command instead of ten? You could have run it 10 times over by now, and it will probably be 100 times over by the time the script gets modified.


There are 10 samples, so 20 files in one directory. It will help our workflow in the future, as this analysis is ongoing, and it could be reused in other analyses that we will be implementing.


If you add the iteration over the files into customscript.py, wouldn't that make it harder to parallelize these jobs? Presumably it's easier to just use shell scripting and parameter expansion, where a one-liner can generate your 10 submission lines, which can then be batch submitted (see the sketch below). I understand sometimes things need to be pythonized, but I'm not sure that's the case here.
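For example, something along these lines (untested; the paths are placeholders for your actual directories) would write the ten submission lines to a file:

# one submission line per R1 file, using parameter expansion to strip the suffix
for r1 in /path/to/fastq.gz_files/*_R1.fastq.gz; do
    sample=$(basename "${r1%_R1.fastq.gz}")   # sample1_R1.fastq.gz -> sample1
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    echo "python3 /path/to/files/customscript.py --R1=$r1 --R2=$r2 --name=$sample > $sample 2>&1"
done > submit_all.sh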


That is a good point about parallelization; I certainly want to retain that. I didn't think about using shell scripting to accomplish this, and I hadn't heard of parameter expansion before, so I will look into that. Thanks for the suggestions!


I've been playing around with the os, glob and pathlib modules and think I'm on the right track

Show us the code, please.


I haven't gotten very far, but I've tried the two examples below. Both list all of the files, including the hidden ones (I think that's what they are; they have a ._ prefix before the filename), and the glob version prints the whole path before each filename. The listdir function returns them as a list, so I think I like that one better.

import os
from pathlib import Path

# 1: os.listdir() returns a list of bare filenames (hidden ._ files included)
path = "/path/to/fastq.gz_files"
all_fastq = os.listdir(path)
print(all_fastq)

# 2: Path.glob() yields full paths; '*' matches every entry, hidden files too
directory = "/path/to/fastq.gz_files"
files = Path(directory).glob('*')
for file in files:
    print(file)
