Hello ยก! I want to map the reads from fastq files obtained from illumina sequencing (150 bp) over a reference genome (RIMD2210633.fasta), I want to analyze a specific region (pathogenic island) over the genome reference and review if the studied genome present that zone, and plot the mapping sequences over that region.
The VPaI-7 (the pathogenic island 7 ) is available in the chromosome 2 (chr2) from the reference genome: RIMD 2210633 (RIMD2210633.fasta), it is located from the 1390967 to 1467005 bp (chr2). If you want to see the reference: https://pmc.ncbi.nlm.nih.gov/articles/PMC2491623/ VPaI-7 (T3SS-2), from gene_id VPA1312 (1390967 bp) to VPA1395 (1467005 bp), described in Boyd et al., (2005). Molecular analysis of the emergence of pandemic Vibrio parahaemolyticus.
The reference genome (RIMD2210633.fasta) present 2 chromosomes (chr1 and chr2), with 3.2 Mb for chr1, and 1.8 MB for chr2.
grep ">" index/RIMD2210633.fasta
> chr1
> chr2
the sequences start with > as any fasta, but I don't how do it in this page !!!!
The studied genome is the Vp_1362 correspond to Vp_1362_R1.fastq and Vp_1362_R2.fastq files. The fastq files were trimmed with fastx_toolkit, and then implemented with the next pipeline.
I generate an environment in conda called mapping and install the next programs and version:
bedtools 2.31.1
bwa 0.7.19
samtools 1.20
ucsc-bedgraphtobigwig 482
bcftools 1.20
and then I ran the pipeline:
indexing the reference genome
bwa index index/RIMD2210633.fasta
create fai file
samtools faidx index/RIMD2210633.fasta
Map the Vp_1362 genome reads to reference genome (RIMD 2210633)
bwa mem -t 6 index/RIMD2210633.fasta Vp_1362_R1.fastq Vp_1362_R2.fastq | samtools sort -o output.sorted.bam
samtools index output.sorted.bam
Call variants only in the VPaI-7 region
bcftools mpileup -Ou -r chr2:1390967-1467005 -f index/RIMD2210633.fasta output.sorted.bam | bcftools call --ploidy 1 -mv -Oz -o VPaI7.Vp_1362.vcf.gz
tabix -p vcf VPaI7.CAIM1362.vcf.gz
Extract VPaI-7 region from reference genome
samtools faidx index/RIMD2210633.fasta chr2:1390967-1467005 > RIMD.ref.region.fa
Apply variants to generate Vp_1362 consensus over the VPaI-7 region
bcftools consensus -f RIMD.ref.region.fa VPaI7.Vp_1362.vcf.gz > VPaI7.Vp1362.fasta
I want to know if the pipeline is the correct?.
And I just want verified if the reads (the short reds) were mapped over that zone ?. maybe generate a plot of VCF !!!
And if the consensus file (VPaI7.Vp1362.fasta) correspond to the evaluated genome Vp_1362 ??, I suppose that its from the studied genome (and NOT from the reference genome !!!!)
Any suggestion to the pipeline ?
Thanks
If you want to visualise the mapping; take a look at using IGV. It just requires the genome you used to map the reads and the BAM output (needs to be indexed
samtools index *.bam).Your pipeline is trying to generate a consensus sequence based on the VCF. To do this the sequences in the VCF (your chromosomes) generally have to correspond to that of the fasta input. Considering you extracted your region of interest prior to generating the consensus, then fed that the bcftools consensus, the name of the sequence has changed (usually to the coordinates using faidx) and therefore will not be found in the VCF. I.e. bcftools will not find anything in your fasta file that corresponds to sequences in the VCF. So if nothing has changed in the 'consensus' sequence, as I suspect, try generating the consensus before extracting the region. However I don't know what the sequences in the VCF look like when using mpileup with a specific region so maybe it isn't an issue.