Question: Variant Calling - for multiple samples need advice
David_emir (India) wrote, 2.3 years ago:

Hello Friends,

I have around 108 FASTQ files (paired-end), with 3 technical replicates for each sample, and I am confused about how to analyse them. This is the procedure I am following:

  1. Concatenate the technical replicates of each sample into two FASTQ files, one for the forward and one for the reverse reads (samp1_1.fq.gz and samp1_2.fq.gz ... samp108_1.fq.gz and samp108_2.fq.gz).

  2. Alignment: map to the reference genome: bwa mem -M ref input_1 input_2 > aligned_reads.sam

I am stuck at this step. Should I map all 108 samples individually, collect aligned_reads1.sam to aligned_reads108.sam, merge them, then sort the merged SAM file by coordinate and convert it to BAM? If not, at what point should I merge these files?

I may be wrong, but right now I am really confused. If you have a script with which I can run all these samples in one go, it would be a great help. I may sound stupid, but trust me, I am clueless in this case.

Thanks a lot, David Emir

WouterDeCoster (Belgium) wrote, 2.3 years ago:

Hi David,

What would be the purpose of having technical replicates here? Were those generated intentionally, or merely by sequencing on multiple lanes of a sequencer? In the latter case, it's definitely sensible to merge the fastq files. I don't think technical replicates are commonly used for variant calling, and I don't really see the point of them.
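
For the multi-lane case, here is a minimal sketch of such a merge. The function name merge_fastq and the lane-level file names in the usage comment are hypothetical. It relies on the fact that concatenated gzip files still form a valid gzip stream, so nothing needs to be decompressed. The R1 and R2 files must be concatenated in the same lane order so that the mates stay in sync.

```shell
# Hypothetical helper: concatenate gzipped FASTQ files into one output file.
# Usage: merge_fastq OUTPUT INPUT1 [INPUT2 ...]
merge_fastq() {
    out=$1
    shift
    # gzip members can be joined with plain cat; the result is valid gzip
    cat "$@" > "$out"
}

# Example with hypothetical lane-level names (same lane order for R1 and R2):
# merge_fastq sample1.R1.fq.gz sample1_L001.R1.fq.gz sample1_L002.R1.fq.gz
# merge_fastq sample1.R2.fq.gz sample1_L001.R2.fq.gz sample1_L002.R2.fq.gz
```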

As Brian wrote, you would definitely want to align each of those samples separately. But obviously you don't want to type the command 108 times, right?

One solution (the easiest) would be a for loop. I'll assume that you use the following name pattern for your (merged) files to give you an example:

sample1.R1.fq.gz
sample1.R2.fq.gz
sample2.R1.fq.gz
sample2.R2.fq.gz
and so on

The for loop first lists all fq.gz files and extracts the sample-specific part of each name, then performs the alignment for each sample in turn, piping the output to BAM and sorting it:

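# Collect the unique sample prefixes (the part of each file name before the first dot)
for sample in $(ls *.fq.gz | cut -f1 -d'.' | sort -u)
do
    # Align the read pair, convert SAM to BAM, and sort into one BAM per sample
    bwa mem ref "${sample}.R1.fq.gz" "${sample}.R2.fq.gz" | samtools view -b - | samtools sort - > "${sample}.bam"
done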
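
As a possible variation on the name extraction, the sketch below uses a shell glob and suffix stripping instead of ls and cut. The function name list_samples is hypothetical, and it assumes the sampleN.R1.fq.gz / sampleN.R2.fq.gz naming pattern given above. It also tolerates sample names that contain a dot and skips any sample whose R2 file is missing.

```shell
# Hypothetical helper: print every sample prefix that has both mate files.
# Usage: list_samples [DIRECTORY]   (defaults to the current directory)
list_samples() {
    dir=${1:-.}
    for r1 in "$dir"/*.R1.fq.gz; do
        [ -e "$r1" ] || continue        # no match: the glob stays literal
        sample=${r1##*/}                # drop the directory part
        sample=${sample%.R1.fq.gz}      # drop the R1 suffix
        if [ -e "$dir/$sample.R2.fq.gz" ]; then
            printf '%s\n' "$sample"
        else
            printf 'warning: no R2 file for %s\n' "$sample" >&2
        fi
    done
}
```

The output can replace the ls pipeline in the loop, for example `for sample in $(list_samples); do ...; done`.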

Mind that you can customise the command by adding more threads for samtools sort or bwa mem to speed things up, depending on the IT infrastructure you have available.
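
A sketch of what adding threads could look like, using the -t option of bwa mem and the -@ option of samtools sort. The function name align_cmd is hypothetical, and "ref" stands for the reference index as in the loop above. The function only prints the command for one sample, so you can inspect it before running anything.

```shell
# Hypothetical helper: print (not run) the alignment pipeline for one sample.
# Usage: align_cmd SAMPLE THREADS
align_cmd() {
    sample=$1
    threads=$2
    # -t sets the bwa mem thread count; -@ sets additional samtools sort threads
    printf 'bwa mem -t %s ref %s.R1.fq.gz %s.R2.fq.gz | samtools view -b - | samtools sort -@ %s - > %s.bam\n' \
        "$threads" "$sample" "$sample" "$threads" "$sample"
}

# Inspect the command first, then pipe it to sh to execute it:
# align_cmd sample1 4
# align_cmd sample1 4 | sh
```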

For more advanced pipelines, you can have a look at a GNU parallel solution:

ls *.fq.gz | cut -f1 -d'.' | sort -u | parallel -j 4 'bwa mem ref {}.R1.fq.gz {}.R2.fq.gz  | samtools view -b - | samtools sort - > {}.bam'

This runs 4 processes simultaneously, and GNU parallel offers many more customization options.


If he uses GATK, he needs to add the @RG (read group) header line while aligning. He can take the read-group information from the FASTQ files, or he can add it after aligning, but then he still needs to extract it from the FASTA/FASTQ files or get it from the service provider.

(comment written 2.3 years ago by Medhat)
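
A minimal sketch of adding the read group at alignment time through the -R option of bwa mem, which accepts a complete @RG header line with \t written literally. The function name read_group is hypothetical, and setting ID, SM and LB to the sample name with PL:ILLUMINA is only an assumption for illustration. The real values should come from the FASTQ headers or the sequencing provider, as noted in the comment above.

```shell
# Hypothetical helper: build an @RG header line for bwa mem -R.
# Assumption: ID, SM and LB all reuse the sample name, platform is Illumina.
read_group() {
    sample=$1
    # '\\t' prints a literal backslash followed by t; bwa converts it to a TAB
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\n' "$sample" "$sample" "$sample"
}

# Usage inside the alignment loop:
# bwa mem -R "$(read_group "$sample")" ref "${sample}.R1.fq.gz" "${sample}.R2.fq.gz"
```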
Brian Bushnell (Walnut Creek, USA) wrote, 2.3 years ago:

> Hello Friends,
>
> i am having around 108 fastq files (paired-end) with 3 technical replicates each sample. now i am really confused to analyse these files. I am following the following procedure 1. Concatenate the samples (tech replicates) into two fastq files, i.e. one for a forward and another for a reverse seq. (samp1_1.fq.gz and samp1_2.fq.gz ... samp108_1.fq.gz & samp108_2.fq.gz)

Why would you ever do that? It completely defeats the purpose of technical replicates. Who or where did you get the advice to do that?

> Alignment: map to the reference genome: bwa mem -M ref input_1 input_2 > aligned_reads.sam
>
> Now I am having issues in this step. Should I individually map all 108 samples, gather aligned_reads1.sam to aligned_reads108.sam, and merge them to sort the SAM file by coordinate and convert it to BAM? Or when should I merge these files?

Why would you merge them? You have 108 samples. Obviously, you need to map them all individually and consider them all individually. Would you like to be lumped in with 107 other random people and considered in bulk, and given some kind of random diagnosis that possibly applies to a majority of the samples you happened to be grouped with?

> I may be wrong but right now I am literally confused. If you guys have a script where I can run these samples in one go, it will be of great help for me. I may sound stupid, but trust me, I am clueless in this case.
>
> Thanks a lot, David Emir

....I won't comment, other than to say that I agree with you. I pity your patients. I am not usually this negative, but I think you're a danger to society.


Brian, eat a snickers. It's safe to assume that most of us were a danger to society until we weren't. Let's assume he's not in a position directly influencing patients.

(comment written 2.3 years ago by WouterDeCoster)

Hmm, maybe I was a bit harsh, sorry :)

(comment written 2.3 years ago by Brian Bushnell)

Thanks a lot for your "COMMENTS", Brian, I am delighted by your words!!! I don't want to argue. I may be wrong, because this is not my field. I got this idea from many sources:

How To Merge Two Fastq.Gz Files?
https://sourceforge.net/p/bio-bwa/mailman/message/31052880/
http://seqanswers.com/forums/showthread.php?t=23207
Fastq Files From Different Flowcells
combining fasta files
How To Merge Two Fastq.Gz Files?

:) Calling me a danger to society is way too rude, Brian!!! Happy new year in advance and have a great year ahead!!! David Emir

(comment written 2.3 years ago by David_emir)