prefix extraction and preparation for mapping and variant calling
1
0
Entering edit mode
10 days ago
Human • 0

hello humans,

I am struggling with a bash script that should actually work as far as I can see. I need to extract prefixes of 262 files in a directory that contains reads. I will map them for later variance calling etc. However, whilst trouble shooting because the mapping always stuck, I found that actually not all prefixes got extracted properly. When I print the output to a .txt file I see that just like 200 sth are properly extracted and some with mistakes as well. And I don't know why... I think the problem might be this prefix extraction step, but maybe also my mapping line is not correct. maybe someone here knows how to fix this.

So my input files look like this. 262 files, with the naming convention:

AH-00001_S117_L001_R1_P_trimmed.001.fastq
AH-00001_S117_L001_R2_P_trimmed.001.fastq
KewS01_S470_L001_R1_P_trimmed.001.fastq
KewS01_S470_L001_R2_P_trimmed.001.fastq


(...)

My script looks like this :

#!/bin/bash
(..)
#SBATCH --array=1-262

filenameFWD=$(ls -1 *_L001_R1_P_trimmed.001.fastq | tail -n +${SLURM_ARRAY_TASK_ID} | head -1)
filenameREV=$(ls -1 *_L001_R2_P_trimmed.001.fastq | tail -n +${SLURM_ARRAY_TASK_ID} | head -1)

prefixFWD=$(echo$filenameFWD | sed 's/_L001_R1_P_trimmed\.001\.fastq$//') prefixREV=$(echo $filenameREV | sed 's/_L001_R2_P_trimmed\.001\.fastq$//')

ngm -r Reference.fasta -1 ${prefixFWD}L001_R1_P_trimmed.001.fastq -2${prefixREV}L001_R1_P_trimmed.001.fastq -t 12 -o Test1.sam

samtools view -bS Test1.sam | samtools sort -o Test1.bam
samtools index Test1.bam


why doesn't it work properly? And also, would you recommend to use set -o errexit set -o nounset in the beginning of the script?

bash • 535 views
0
Entering edit mode
tail -n +${SLURM_ARRAY_TASK_ID}  whoa ADD REPLY 1 Entering edit mode you should use a workflow manager like snakemake or nextflow. Your futur self will thank you ADD REPLY 0 Entering edit mode what's wrong about that huh? :D ADD REPLY 0 Entering edit mode well, how do you manage things when 10 fastqs out of the 262 fastqs failed ? ADD REPLY 0 Entering edit mode what would you suggest ? ADD REPLY 1 Entering edit mode you should use a workflow manager like snakemake or nextflow. Your futur self will thank you ADD REPLY 1 Entering edit mode 10 days ago Joe 20k Don't rely on parsing the output of ls. You can essentially just do: for file in *.fastq ; do name="${file%_*_*_*_*}"
#do stuff with the name here
done


Note this specific iteration will only work if there are always the same number of underscores, but you can still use the same approach with sed or other tools.

You don't need to capture R1 and R2 if you're then specifying the suffix in your command (which is probably why it isn't working). Either capture both, or capture just the common stem and be explicit with the suffixes when you use them downstream,.

0
Entering edit mode

hey, thanks. But could you explain what you mean? bash and sequence analysis workflows are rather new for me. how would you write the prefix suffix part to achieve what I tried to achieve? Thanks:)

0
Entering edit mode

You already tried it here:

ngm -r Reference.fasta -1 ${prefixFWD}L001_R1_P_trimmed.001.fastq -2${prefixREV}L001_R1_P_trimmed.001.fastq -t 12 -o Test1.sam


You are trying to smash the variable $prefixFWD to a string (suffix). Your syntax is a bit wrong, but the general idea is reasonably sound. The steps are: 1. Take KewS01_S470_L001_R1_P_trimmed.fastq and strip part of the name away. 2. The bash operation I used above ("${file%_*_*_*_*}") applied to these strings will yield: KewS01_S470.
3. You can stitch strings together afterward: "$prefixFWd"_suffix.ext This uses bash parameter expansion. I would suggest spending some time getting familiar with it if you plan to use bash for this. https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html ADD REPLY 0 Entering edit mode hey , thank you but how would I write the whole code then ? Do I have to define a suffix for FWD and REV or is it automatically stored in the suffix for each file respectively ? so can the code look like this? #!/bin/bash (...) #SBATCH --array=1-262 for file in *.fastq ; do prefix="${file%_*_*_*_*}"

done

ngm -r TEASHIRT.fasta -1 "$prefix"_suffix.ext -t 12 -o Test1.sam samtools view -bS Test1.sam | samtools sort -o Test1.bam samtools index Test1.bam  ADD REPLY 0 Entering edit mode Or what do you think about this: #!/bin/bash (...) #SBATCH --array=1-262 # Load required modules module load NextGenMap module load SAMtools/1.3.1 # Use the SLURM_ARRAY_TASK_ID to get the current file file=$(ls *fastq | sed -n ${SLURM_ARRAY_TASK_ID}p) # Extract the prefix of the filename prefix=$(basename ${file} .fastq) # Map reads to the reference genome using NextGenMap ngm -r TEASHIRT.fasta -1${prefix}_suffix_1.fastq -2 ${prefix}_suffix_2.fastq -t 12 -o Test1.sam # Convert the SAM file to BAM format and sort it samtools view -bS Test1.sam | samtools sort -o Test1.bam # Index the sorted BAM file samtools index Test1.bam  ADD REPLY 0 Entering edit mode You are explicitly specifying the suffixes in your commands, so it's entirely up to you if you stick to this, or you wish to define them earlier and re-call them. For example here: ngm -r TEASHIRT.fasta -1${prefix}_suffix_1.fastq ...


You are explicitly defining the first suffix ("_suffix_1.fastq") in the command call.