Hi Guys, I´m looking for your expertise on SNP identification of pooled samples.
More details: I want to Identify SNPs relevant on insecticide-resistant A. aegypti mosquitoes using pooled exome sequencing of treated vs. untreated samples.
My experimental design: Treatment group (insecticide-resistant): Replicates: HLR1, HLR2 (each = 25-pooled mosquitoes)
Control group (untreated): Replicates: SP1, SP2 (each = 25-pooled mosquitoes)
Sequencing: Exome capture (custom probes), Illumina PE, 40× depth
Bioinformatic pipeline used until now:
- Quality control: FastQC
- Alignment: BWA
- BAM processing: samtools view/sort/index + Picard AddOrReplaceReadGroups/MarkDuplicates
- Variant calling/filtering: GATK (VCF files generated)
My specific quesiton: How can I compare pooled replicates (HLR1/HLR2 vs. SP1/SP2) to identify SNPs relevant in treated samples? Is it possible to perform the pairwise comparison usng the VCF I just obtained?How can I prioritize relevant SNPs?
This is the code I´ve used:
#Indexing using BWA
bwa index VectorBase-68_AaegyptiLVP_AGWG_Genome.fasta
#Mapping with BWA
bwa mem -t 30 -M /mnt/disc2/grupobcei/ewas/index/VectorBase-68_AaegyptiLVP_AGWG_Genome.fasta _R1_subsample.fastq _R2_subsample.fastq > _subsample_BWA.sam
#Samtools SAM to BAM requirements
samtools view -S -b HLR1_subsample_BWA.sam > HLR1_subsample_BWA.bam
samtools sort -o HLR1_sbs_BWA_sortd.bam HLR1_subsample_BWA.bam
samtools index HLR1_sortd_index.bam
#Picards AddOrReplaceReadGroups
/mnt/disc2/grupobcei/java/jdk-17.0.12/bin/java -jar /mnt/disc2/grupobcei/picard/picard.jar AddOrReplaceReadGroups I=HLR1_sbs_BWA_sortd.bam O=HLR1_BWA_rg.bam SO=coordinate CREATE_INDEX=true RGID=HLR1 RGLB=lib1 RGPL=illumina RGPU=HLR1 RGSM=sample1
#Picards MarkDuplicates
/mnt/disc2/grupobcei/java/jdk-17.0.12/bin/java -jar /mnt/disc2/grupobcei/picard/picard.jar MarkDuplicates I=HLR1_BWA_rg.bam O=HLR1_BWA_-nodups.bam M=HLR1_BWA.metrics REMOVE_DUPLICATES=TRUE
#Variant Callint Using GATK
/mnt/disc2/grupobcei/java/jdk-17.0.12/bin/java "-Xmx30G" -jar /home/administrador/programas/gatk-4.6.1.0/gatk-package-4.6.1.0-local.jar HaplotypeCaller -I HLR1_BWA_-nodups.bam -R /mnt/disc2/grupobcei/ewas/index/VectorBase-68_AaegyptiLVP_AGWG_Genome.fasta -O HLR1.variants.vcf --native-pair-hmm-threads 30
#Variant Filtering Using GATK
nohup /mnt/disc2/grupobcei/java/jdk-17.0.12/bin/java -jar /mnt/disc2/grupobcei/gatk/gatk-4.6.1.0/gatk-package-4.6.1.0-local.jar VariantFiltration -R /mnt/disc2/grupobcei/ewas/index/VectorBase-68_AaegyptiLVP_AGWG_Genome.fasta -V HLR1.variants.vcf --filter-name FAIL --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || DP < 10 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" -O HLR1_filt_10x.vcf
1) using
nohup
is most always a bad idea . Learn screen/tmux, learn make/nextflow/snakemake2) AddOrReplaceReadGroups should be useless if you use the option '-R' of bwa
Hi!
thanks
Your server should have a program called
tmux
that should allow you to work in a session that would be unaffected by internet disconnections: https://hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/You can add the read groups during the
bwa mem
alignment step: https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups You then don't needAddOrReplaceReadGroups
step.I understand now. Any experience on the pooled exome seq?