Variant Calling - for multiple samples need advice
7.3 years ago
David_emir ▴ 490

Hello Friends,

I have around 108 FASTQ files (paired-end), with 3 technical replicates per sample, and I am really confused about how to analyse them. This is the procedure I am following:

  1. Concatenate the technical replicates of each sample into two FASTQ files, i.e. one for the forward and one for the reverse reads (samp1_1.fq.gz and samp1_2.fq.gz ... samp108_1.fq.gz and samp108_2.fq.gz).

  2. Alignment: map to the reference genome with bwa mem -M ref input_1 input_2 > aligned_reads.sam

Now I am stuck at this step: should I map all 108 samples individually, gather aligned_reads1.sam to aligned_reads108.sam, merge them, and then sort the SAM file by coordinate and convert it to BAM? Or at what point should I merge these files?

I may be wrong, but right now I am thoroughly confused. If you have a script with which I can run all these samples in one go, it would be a great help. I may sound stupid, but trust me, I am clueless in this case.

Thanks a lot , David Emir

Variant Calling pipeline • 2.2k views
7.3 years ago

Hi David,

What would be the purpose of having technical replicates here? Were those generated intentionally, or merely by sequencing on multiple lanes of a sequencer? In the latter case, it's definitely sensible to merge the fastq files. I don't think technical replicates are commonly used for variant calling, and I don't really see the point of them.
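
If the replicates are just the same library spread over several lanes, the merge could look something like this. It is only a minimal sketch: the `_rep1`, `_rep2`, ... naming and the `merge_replicates` helper are my own assumptions, so adapt the pattern to your files. It relies on the fact that gzip files can be concatenated directly, without decompressing them first.

```shell
# Hypothetical input naming: sample1_rep1.R1.fq.gz, sample1_rep2.R1.fq.gz, ...
# merge_replicates is an illustrative helper, not part of any tool.
merge_replicates() {
    sample="$1"   # e.g. sample1
    read="$2"     # R1 or R2
    # Concatenated gzip streams form a valid .gz file.
    # The glob is expanded in sorted order, identically for R1 and R2,
    # so the read pairs stay in the same order in both merged files.
    cat "${sample}"_rep*."${read}".fq.gz > "${sample}.${read}.fq.gz"
}
```

Calling `merge_replicates sample1 R1` and then `merge_replicates sample1 R2` would give the `sample1.R1.fq.gz` / `sample1.R2.fq.gz` names used in the loop further down.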

As Brian wrote, you would definitely want to align those samples separately. But obviously you don't want to type the command 108 times, right?

One solution (the easiest) would be a for loop. I'll assume that you use the following name pattern for your (merged) files to give you an example:

sample1.R1.fq.gz
sample1.R2.fq.gz
sample2.R1.fq.gz
sample2.R2.fq.gz
and so on

The for loop first lists all fq.gz files and extracts the sample-specific part of each name, then runs the alignment for one sample at a time, piping the output to BAM and sorting it:

for sample in $(ls *.fq.gz | cut -f1 -d'.' | sort -u)
do
    # "ref" is the prefix of your bwa index
    bwa mem ref "${sample}.R1.fq.gz" "${sample}.R2.fq.gz" | samtools view -b - | samtools sort - > "${sample}.bam"
done

Mind that you can customise the command by adding more threads for samtools sort or bwa mem to speed things up, depending on the IT infrastructure you have available.
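
For example, something along these lines. This is a sketch only: `align_sample` is a made-up helper and the default of 4 threads is arbitrary. `-t` is the thread option of bwa mem, and `-@` sets the number of additional threads for samtools sort.

```shell
# Align one sample with a configurable number of threads.
# align_sample is a hypothetical helper; "ref" is the bwa index prefix.
align_sample() {
    sample="$1"
    ref="$2"
    threads="${3:-4}"   # arbitrary default, tune it to your machine
    bwa mem -t "$threads" "$ref" "${sample}.R1.fq.gz" "${sample}.R2.fq.gz" \
        | samtools view -b - \
        | samtools sort -@ "$threads" -o "${sample}.bam" -
}
```

Inside the for loop you would then call `align_sample "$sample" ref 8` instead of the plain pipeline.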

For more advanced tools/pipelines, you can have a look at a gnu-parallel solution:

ls *.fq.gz | cut -f1 -d'.' | sort -u | parallel -j 4 'bwa mem ref {}.R1.fq.gz {}.R2.fq.gz  | samtools view -b - | samtools sort - > {}.bam'

This comes with many more customization options; as written, it runs 4 alignments simultaneously.


If he uses GATK, he needs to add the read group (@RG) while aligning, so he needs to work it out from the fastq files. Alternatively, he can add it after aligning, but he still needs to extract it from the fasta/q files or get it from the service provider.
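
A rough sketch of how that could be done at alignment time with the `-R` option of bwa mem. The assumptions here are mine: the `read_group` helper is made up, the reads are assumed to carry Illumina-style headers (`@instrument:run:flowcell:lane:...`), `PL:ILLUMINA` is a guess at the platform, and the ID/PU layout is just one possible convention. Only the first read is inspected, so a fastq merged from several lanes will only report the first lane.

```shell
# Build an @RG string for "bwa mem -R" from the first read header of a fastq.gz.
# read_group is a hypothetical helper; the tag layout is one possible convention.
read_group() {
    fastq="$1"
    sample="$2"
    # Fields 3 and 4 of an Illumina-style header are flowcell and lane.
    unit=$(gzip -dc "$fastq" | head -n 1 | cut -d':' -f3,4 | tr ':' '.')
    # The tabs are printed as a literal backslash followed by t,
    # which is the form bwa mem -R accepts.
    printf '@RG\\tID:%s.%s\\tSM:%s\\tPL:ILLUMINA\\tPU:%s.%s\n' \
        "$sample" "$unit" "$sample" "$unit" "$sample"
}
```

It could then be used as `bwa mem -R "$(read_group sample1.R1.fq.gz sample1)" ref sample1.R1.fq.gz sample1.R2.fq.gz`.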

7.3 years ago

Hello Friends,

i am having around 108 fastq files(Paired ends) with 3 technical replicates each sample. now i am really confused to analyse these files.I am following the following procedure 1. Concatenate the samples (tech replicates) into two fastq files, i.e. one for a forward and another for a reveres seq.(samp1_1.fq.gz and samp1_2.fq.gz ...........samp108_1.fq.ga & samp108_2.fq.gz)

Why would you ever do that? It completely defeats the purpose of technical replicates. Who gave you that advice, or where did you get it?

Alignment – Map to Reference genome..> bwa mem -M ref input_1 input_2 > aligned_reads.sam

Now i am having issues in this step, should i individually map all 108 samples and gather the aligned_reads1.sam to aligned_reads108.sam and merge to Sort SAM file by coordinate and convert to BAM? or when should i merge these files?

Why would you merge them? You have 108 samples. Obviously, you need to map them all individually and consider them all individually. Would you like to be lumped in with 107 other random people and considered in bulk, and given some kind of random diagnosis that possibly applies to a majority of the samples you happened to be grouped with?

I may be wrong but right now i am literally confused . If you guys have a script where in i can run these samples in a go will be of a great help for me. I may sound stupid, but trust me i am clueless in this case.

Thanks a lot , David Emir

....I won't comment, other than to say that I agree with you. I pity your patients. I am not usually this negative, but I think you're a danger to society.


Brian, eat a snickers. It's safe to assume that most of us were a danger to society until we weren't. Let's assume he's not in a position directly influencing patients.


Hmm, maybe I was a bit harsh, sorry :)


Thanks a lot for your "COMMENTS", Brian, I am delighted by your words!!! I don't want to argue; I may be wrong, because this is not my field. I got this idea from several places:

How To Merge Two Fastq.Gz Files?
https://sourceforge.net/p/bio-bwa/mailman/message/31052880/
http://seqanswers.com/forums/showthread.php?t=23207
Fastq Files From Different Flowcells
combining fasta files

:) Calling someone dangerous to society is way too rude, Brian!!! Happy new year in advance, and have a great year ahead!!! David Emir!
