How to Detect Presence of HPV-33 in Sample Data (FASTQ, BAM, VCF on GRCh37)
9 months ago
adam • 0

Hello everyone,

I have a tissue sample for which I have sequencing data available in several formats - FASTQ, BAM, and VCF. The alignment has been done against the GRCh37 reference genome.

I am interested in finding out whether this sample contains sequences from Human Papillomavirus type 33 (HPV-33). I have been looking into various bioinformatics tools and methods, but I am a bit uncertain about the best way to proceed.

For the FASTQ files, I was considering using a tool like Bowtie2 or BWA to align the reads against the HPV-33 reference genome, but I am unsure if this is the optimal approach.

For the BAM and VCF files, would it be more appropriate to use a tool that can identify viral integration sites? I've heard of tools like VirStrain and VIcaller, but I haven't used them before.

I would be grateful for any advice or suggestions. In particular, I'm interested in:

- Recommended methods for detecting HPV-33 in FASTQ, BAM, and VCF files
- Any specific tools or databases that might be helpful
- Any quality control steps or considerations that I should keep in mind

Any specific workflow examples would be extremely helpful.

Thank you for your help!

sequence hpv alignment • 612 views

You don't mention the source of your sequencing data - RNA? DNA? WES? Cell lines vs. tissue?

9 months ago

I used to do this a lot. The best approach is to avoid withholding information from the mapper: concatenate the human genome FASTA and the viral FASTA with cat before you start, so that every read can be placed on whichever sequence it matches best.

cat human.fa virus.fa > human_plus_virus.fa
# re-index the combined reference, e.g. bwa index human_plus_virus.fa
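
One thing worth checking after the cat step is that no sequence name appears twice in the combined file, since duplicate names tend to confuse indexers and downstream tools. A minimal sketch using only standard Unix tools; the function name duplicate_fasta_names is made up for illustration:

```shell
# Print every sequence name that occurs more than once in a FASTA read from stdin.
# Only the first word of each header line (after ">") is treated as the name.
duplicate_fasta_names() {
  grep '^>' | sed 's/^>//; s/[[:space:]].*//' | sort | uniq -d
}

# Typical use (prints nothing when all names are unique):
#   duplicate_fasta_names < human_plus_virus.fa
```

If this prints anything, rename the clashing sequences before building the index.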

Then align, convert to a coordinate-sorted BAM, run samtools index, and use samtools idxstats to get an idea of the number of reads mapping to each sequence, including the viral one.
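
To pull the viral count out of the idxstats table, something like the sketch below may help. It assumes the usual four tab-separated idxstats columns (sequence name, length, mapped reads, unmapped reads). The function name viral_read_summary and the contig name HPV33 are placeholders; use whatever name your viral FASTA header actually has.

```shell
# Summarise `samtools idxstats` output read from stdin.
# $1 is the name of the viral contig (an assumption: it must match your FASTA header).
viral_read_summary() {
  awk -F'\t' -v virus="$1" '
    $1 != "*"   { total += $3 }           # mapped reads summed over all named contigs
    $1 == virus { hits = $3; len = $2 }   # remember the viral contig row
    END {
      printf "viral_reads=%d total_mapped=%d", hits, total
      if (len > 0) printf " reads_per_kb=%.2f", hits * 1000 / len
      printf "\n"
    }'
}

# Typical use (needs samtools and a sorted, indexed BAM):
#   samtools idxstats sample.bam | viral_read_summary HPV33
```

A non-zero count here is only a starting point; those reads still need the filtering described next.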

Next: filter, filter, filter, by mapping quality, number of mismatches, etc.
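
As a rough illustration of that filtering, here is a sketch that works on SAM text, keeping records with a minimum MAPQ and a maximum edit distance taken from the NM tag (which bwa mem writes). The function name filter_sam and the thresholds 30 and 2 are arbitrary choices for the example, not recommended values.

```shell
# Keep SAM records with MAPQ >= $1 and edit distance (NM tag) <= $2.
# Header lines (starting with @) pass through unchanged.
# Records that carry no NM tag are dropped.
filter_sam() {
  awk -F'\t' -v minq="$1" -v maxnm="$2" '
    /^@/ { print; next }
    {
      nm = -1
      for (i = 12; i <= NF; i++)            # optional tags start at column 12
        if ($i ~ /^NM:i:/) nm = substr($i, 6) + 0
      if ($5 + 0 >= minq && nm >= 0 && nm <= maxnm) print
    }'
}

# Typical use (needs samtools and an indexed BAM; HPV33 is a placeholder contig name):
#   samtools view -h sample.bam HPV33 | filter_sam 30 2 | samtools view -b -o hpv33.filtered.bam -
```

The MAPQ part alone can also be done with samtools view -q; the awk version is shown so that both criteria sit in one place.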

You might also want to include other viral suspects in your custom reference genome. From memory, human cancer references can include about 10 viruses in addition to the human reference genome.

9 months ago

"I am interested in finding out whether this sample contains sequences from Human Papillomavirus type 33"

Map to the human genome, discard the mapped reads, and then map the remaining reads to the papillomavirus genome.

Something like this (not tested):

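# -f 12 keeps pairs where both mates are unmapped; -u passes them on as uncompressed BAM
# samtools fastq writes the pairs interleaved; bwa mem -p reads interleaved pairs from stdin
bwa mem human.fa R1.fq.gz R2.fq.gz | samtools view -u -f 12 - | samtools fastq -n - | bwa mem -p virus.fa - | samtools view -F 4 -O BAM -o virus.bam -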
9 months ago
GenoMax 141k

You can use bbsplit.sh from the BBMap suite, which maps the reads to both genomes at the same time. This ensures that multimappers are handled appropriately. You will need to start from the FASTQ data.

For more, see: BBSplit syntax for generating builds for the reference genome and how to call different builds.
