Question

Variant Calling - for multiple samples need advice

7.3 years ago

David_emir ▴ 490

Hello Friends,

I have around 108 FASTQ files (paired-end), with 3 technical replicates per sample, and I am really confused about how to analyse them. This is the procedure I am following:

1. Concatenate the technical replicates of each sample into two FASTQ files, i.e. one for the forward and one for the reverse reads (samp1_1.fq.gz and samp1_2.fq.gz ... samp108_1.fq.gz and samp108_2.fq.gz).

2. Alignment: map to the reference genome: bwa mem -M ref input_1 input_2 > aligned_reads.sam

I am stuck at this step. Should I map all 108 samples individually, collect aligned_reads1.sam to aligned_reads108.sam, and merge them before sorting the SAM file by coordinate and converting it to BAM? Or at what point should I merge these files?

I may be wrong, but right now I am thoroughly confused. If you have a script with which I can run all these samples in one go, it would be a great help. I may sound stupid, but trust me, I am clueless in this case.

Thanks a lot, David Emir

Variant Calling pipeline • 2.2k views

updated 7.3 years ago by WouterDeCoster 47k • written 7.3 years ago by David_emir ▴ 490

score 2 · Answer 1 · 2016-12-22

Hi David,

What would be the purpose of having technical replicates here? Were they generated intentionally, or did they simply result from sequencing on multiple lanes of a sequencer? In the latter case, it's definitely sensible to merge the fastq files. I don't think technical replicates are commonly used for variant calling, and I don't really see the point of them here.
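If the replicates are just lanes of the same library, the merge could look roughly like the sketch below. The lane-level naming (`sample1_L001_R1.fq.gz` and so on) and the function name `merge_lanes` are assumptions for illustration, not something taken from the question. Concatenated gzip files are themselves a valid gzip stream, so no decompression is needed.

```shell
# Hypothetical input layout: <sample>_L001_R1.fq.gz, <sample>_L002_R1.fq.gz, ...
# Output follows the <sample>.R1.fq.gz / <sample>.R2.fq.gz pattern.
merge_lanes() {
    sample="$1"   # e.g. sample1
    for read in R1 R2; do
        # The glob expands in sorted order, so R1 and R2 are
        # concatenated in the same lane order and stay paired.
        cat "${sample}"_L*_"${read}".fq.gz > "${sample}.${read}.fq.gz"
    done
}
```

Calling `merge_lanes sample1` in the directory holding the lane files would write `sample1.R1.fq.gz` and `sample1.R2.fq.gz`.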

As Brian wrote, you definitely want to align those samples separately. But obviously you don't want to type the command 108 times, right?

One solution (the easiest) would be a for loop. I'll assume that you use the following name pattern for your (merged) files to give you an example:

sample1.R1.fq.gz
sample1.R2.fq.gz
sample2.R1.fq.gz
sample2.R2.fq.gz
and so on

The for loop first lists all fq.gz files and extracts the sample-specific part of each name. Then, for each sample in turn, it performs the alignment, pipes the output to BAM, and sorts it:

for sample in $(ls *.fq.gz | cut -f1 -d'.' | sort -u)
do
    # Align the read pair, convert SAM to BAM, then sort by coordinate
    bwa mem ref "${sample}.R1.fq.gz" "${sample}.R2.fq.gz" | samtools view -b - | samtools sort - > "${sample}.bam"
done
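
Before starting 108 alignments, it can help to check which sample names the loop will actually iterate over. The sketch below is a variant of the `ls | cut | sort -u` step that uses a shell glob instead. The function name `list_samples` is hypothetical, and it assumes the `sample1.R1.fq.gz` naming pattern with no dots inside the sample name.

```shell
# Print one line per sample found in the current directory.
list_samples() {
    for f in *.fq.gz; do
        # Skip the literal pattern if no file matches the glob
        [ -e "$f" ] || continue
        # Drop everything from the first '.' onward: sample1.R1.fq.gz -> sample1
        printf '%s\n' "${f%%.*}"
    done | sort -u
}
```

Running `list_samples` in the data directory should print each sample name once, regardless of how many read files belong to it.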

Mind that you can customise the command by adding more threads for samtools sort or bwa mem to speed things up, depending on the IT infrastructure you have available.
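
As a rough sketch of what adding threads could look like: `bwa mem` takes `-t` for the number of threads and `samtools sort` takes `-@` for additional threads. The helper name `align_cmd` and the default of 4 threads are assumptions for illustration. The function only prints the pipeline so it can be inspected first.

```shell
# Print the alignment pipeline for one sample without running it.
align_cmd() {
    sample="$1"
    threads="${2:-4}"   # assumption: 4 threads unless told otherwise
    # -t: threads for bwa mem; -@: additional threads for samtools sort
    printf 'bwa mem -t %s ref %s.R1.fq.gz %s.R2.fq.gz | samtools view -b - | samtools sort -@ %s - > %s.bam\n' \
        "$threads" "$sample" "$sample" "$threads" "$sample"
}
```

`align_cmd sample1 8` shows the command for review, and `align_cmd sample1 8 | sh` would execute it.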

For more advanced tools/pipelines, you can have a look at a gnu-parallel solution:

ls *.fq.gz | cut -f1 -d'.' | sort -u | parallel -j 4 'bwa mem ref {}.R1.fq.gz {}.R2.fq.gz  | samtools view -b - | samtools sort - > {}.bam'

GNU parallel offers many more customization options; the command above runs 4 processes simultaneously.
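
When combining parallel jobs with multi-threaded tools, one simple rule of thumb is to keep jobs multiplied by threads per job at or below the number of available cores. The helper below is a hypothetical sketch of that arithmetic, not a GNU parallel feature.

```shell
# Number of simultaneous jobs so that jobs * threads_per_job <= cores.
parallel_jobs() {
    cores="$1"
    threads="$2"
    n=$(( cores / threads ))   # integer division rounds down
    if [ "$n" -lt 1 ]; then
        n=1                    # always run at least one job
    fi
    echo "$n"
}
```

On a system with GNU coreutils, `parallel -j "$(parallel_jobs "$(nproc)" 4)"` would then size the job count for tools each running 4 threads.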

score 0 · Answer 2 · 2016-12-22

Hello Friends,

I have around 108 FASTQ files (paired-end), with 3 technical replicates per sample, and I am really confused about how to analyse them. This is the procedure I am following: 1. Concatenate the technical replicates of each sample into two FASTQ files, i.e. one for the forward and one for the reverse reads (samp1_1.fq.gz and samp1_2.fq.gz ... samp108_1.fq.gz and samp108_2.fq.gz).

Why would you ever do that? It completely defeats the purpose of technical replicates. Where, or from whom, did you get the advice to do that?

Alignment: map to the reference genome: bwa mem -M ref input_1 input_2 > aligned_reads.sam

I am stuck at this step. Should I map all 108 samples individually, collect aligned_reads1.sam to aligned_reads108.sam, and merge them before sorting the SAM file by coordinate and converting it to BAM? Or at what point should I merge these files?

Why would you merge them? You have 108 samples. Obviously, you need to map them all individually and consider them all individually. Would you like to be lumped in with 107 other random people and considered in bulk, and given some kind of random diagnosis that possibly applies to a majority of the samples you happened to be grouped with?

I may be wrong, but right now I am thoroughly confused. If you have a script with which I can run all these samples in one go, it would be a great help. I may sound stupid, but trust me, I am clueless in this case.

Thanks a lot, David Emir

....I won't comment, other than to say that I agree with you. I pity your patients. I am not usually this negative, but I think you're a danger to society.