How to assign keys and values from files in a directory using Python
2.5 years ago
caro-ca ▴ 20

Hi! I want to map Illumina paired-end reads against a reference genome. I have a directory in which I only need the files that end with paired_R1.fastq.gz and paired_R2.fastq.gz. I am writing a script in which the paired_R1 files are the keys and the paired_R2 files are the values; however, I am having difficulty assigning the keys and values in a for loop. I understand that file1 and file2 are not defined, but I don't know how to turn the files matched by endswith() into a key and a value, respectively.

import os

if __name__ == '__main__':
    path = os.getcwd()
    dir_files = os.listdir(path)
    for file in dir_files:
        if file.endswith("_paired_R1.fastq.gz"):
            file = file1
        if file.endswith("_paired_R2.fastq.gz"):
            file = file2


What is the expected output? I am sure this can be done with a one-liner via the command line.

I will use TEPID to map the paired reads against the reference genome. The TEPID command is tepid-map -1 SRR4209894_paired_R1.fastq.gz -2 SRR4209894_paired_R2.fastq.gz -n SRR4209894 -x /../S288C/S288C -y /../S288C/S288C_reference_sequence_R64-2-1_20150113.X15_01_65525S -p 36 -s 350, so I need to pass each pair of read files from my directory to it.

Use a simple bash script.

for r1file in *_paired_R1.fastq.gz
do
    tepid-map -1 "${r1file}" -2 "${r1file/_R1/_R2}" -n "${r1file%%_*}" -x /../S288C/S288C -y /../S288C/S288C_reference_sequence_R64-2-1_20150113.X15_01_65525S -p 36 -s 350
done

See the Bash manual's section on shell parameter expansion to understand how the ${} expansions work.

my_key = "hey there"
my_value = "ho there"
my_dict = {}
my_dict[my_key] = my_value

The problem is that instead of a variable, I want to assign keys and values to files in a directory.
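
For file names rather than literal strings, the same assignment works once each R1 name is turned into the name of its R2 partner. Here is a minimal sketch, assuming the two suffixes from the question and that both files of a pair sit in the same directory; the function name r1_to_r2 is made up for illustration.

```python
import os

R1_SUFFIX = "_paired_R1.fastq.gz"
R2_SUFFIX = "_paired_R2.fastq.gz"


def r1_to_r2(filenames):
    """Return {R1 name: R2 name} for each paired R1 file whose R2 partner exists."""
    names = set(filenames)
    pairs = {}
    for name in sorted(names):
        if name.endswith(R1_SUFFIX):
            # Derive the partner's name by swapping the suffix.
            partner = name[: -len(R1_SUFFIX)] + R2_SUFFIX
            if partner in names:
                pairs[name] = partner  # R1 is the key, R2 is the value
    return pairs


if __name__ == "__main__":
    for r1, r2 in r1_to_r2(os.listdir(os.getcwd())).items():
        print(r1, r2)
```

Files with any other ending are ignored, and an R1 file without a matching R2 file is skipped rather than raising an error.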

2.5 years ago
Brice Sarver ★ 3.7k

There are good suggestions in the comments, but (reading between the lines) I think you're having problems because you're building a dictionary where your key:value pairs are the R1 and R2 reads.

What about storing as a tuple and unpacking? You know what needs to be appended to form the read pairs (i.e., _paired_R1.fastq.gz). Grab the stem, then assign the reads based on that.

import os
import re

results = {}
dir_files = os.listdir(".")
# modify here as needed - you want to grab the file's stem;
# lots of ways to do this.
# I've inferred here from your code above, but a simple x.split()
# will work depending on your stem.
file_stems = [
    re.sub(r"_paired_R1\.fastq\.gz$", "", x) for x in dir_files
    if x.endswith("_paired_R1.fastq.gz")
]
# build a tuple with the R1 and R2 names
for stem in file_stems:
    R1 = stem + "_paired_R1.fastq.gz"
    R2 = stem + "_paired_R2.fastq.gz"
    results[stem] = (R1, R2)


The rest is pretty straightforward. Iterate over the dictionary and unpack each pair with R1, R2 = results[stem], where stem is the current key. The names can then be passed to subprocess.call() or similar.
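
As a rough sketch of that last step, the loop below unpacks each tuple and builds one tepid-map call per sample. The flags and the (elided) index paths are copied verbatim from the command in the question, not checked against TEPID's documentation. The helper names build_tepid_command and run_all are hypothetical, and I've used subprocess.run() with an argument list in place of subprocess.call(). By default it only returns the commands so you can inspect them first.

```python
import subprocess


def build_tepid_command(stem, r1, r2):
    """Build the tepid-map argument list; flags and paths copied from the question."""
    return [
        "tepid-map", "-1", r1, "-2", r2, "-n", stem,
        "-x", "/../S288C/S288C",
        "-y", "/../S288C/S288C_reference_sequence_R64-2-1_20150113.X15_01_65525S",
        "-p", "36", "-s", "350",
    ]


def run_all(results, dry_run=True):
    """Build (and optionally run) one command per stem in a {stem: (R1, R2)} dict."""
    commands = []
    for stem, (r1, r2) in results.items():
        cmd = build_tepid_command(stem, r1, r2)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if tepid-map exits non-zero
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Example dictionary in the same shape as `results` above.
    example = {
        "SRR4209894": (
            "SRR4209894_paired_R1.fastq.gz",
            "SRR4209894_paired_R2.fastq.gz",
        )
    }
    for cmd in run_all(example):
        print(" ".join(cmd))
```

Call run_all(results, dry_run=False) once the printed commands look right. Passing a list instead of a single shell string avoids quoting problems with unusual file names.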

EDIT: wrapping list comprehension to avoid cutoff.