Question

Need opinion on correct bioinfo analysis; NGS on virus competition exp

0

Entering edit mode

4.0 years ago

Adrian Pelin ★ 2.6k

Hi all,

We took Vaccinia virus, dsDNA virus which has about 200 genes and removed roughly 30 genes from it, let call this deleted_Vaccinia. We then cloned these genes back in deleted_Vaccinia individually, so we have 30 viruses, each with a different viral gene. We pooled these together and passaged them on cancer cells as a competition experiment and did NGS DNA-seq on input and output.

I am analyzing this data, and the way I did it is to estimate reads per individual genes using Seal. Then I calculated TPM (https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/):

1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.

What I was wondering, should I have done an RNA-seq type analysis instead? Map to wild-type Vaccinia with HISAT and used StringTie to quantify each gene, deleted and not deleted?

The purpose is to find which gene, out of the 30, contributes to viral fitness.

Let me know what you think, I hope I explained properly what we did, please ask questions.

Abundace RNA-Seq • 985 views

ADD COMMENT • link 4.0 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

Were those 30 samples sequenced separately so you know which set of reads came from which virus?

ADD REPLY • link 4.0 years ago by GenoMax 141k

0

Entering edit mode

Those 30 viruses were pooled. The only difference between them is which of the 30 genes they encode. Overall, 170 genes are the same between the 30 viruses, and each virus has one gene as a "tag" that it expresses and may make it a better replicating virus or worse. So the gene coding sequence in reads is what I am looking for.

ADD REPLY • link 4.0 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

Ah you are you only looking to see which gene(s) get enriched at the end of the experiment.

What is the result of your analysis? How did you handle 170 genes which are bound to show multi-mapping? Which of these modes did you use with seal.sh? Can you post the entire command line you used?

first: Use the first best-matching reference sequence.
toss: Consider unmapped.
random: Select one best-matching reference sequence randomly.
all: Use all best-matching reference sequences.

Are you seeing a range of genes surviving? Or are there clear winners?

Thinking aloud. You could split the virus into 170+30 genes. Use that as a reference with

Set “refnames=t”. This will report results on a per-reference-file basis rather than a per-sequence basis,

ADD REPLY • link 4.0 years ago by GenoMax 141k

0

Entering edit mode

Amazing question!

seal.sh:

${bbmap}/seal.sh -Xmx40g threads=16 ref=Copenhagen_M35027_CDS_v2.fasta in=reads/${i}_AdaptTrim_R#.fastq.gz stats=seal_stats/${i}

In the fasta file... each fasta contig ">" is a gene from ATG to stop. So there are 200 contigs in the fasta file for each gene.

ADD REPLY • link 4.0 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

Also, to your other question, how did I handle the 170 genes. I didn't do anything special... I am not sure how much should I care about them. I care about how abundance (fold change) of each one of the 30 genes changes between input and output. The 170 genes are present in 100% of viruses, so they shouldn't change at all... perhaps they are some sort of a control?

ADD REPLY • link 4.0 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

The 170 genes are present in 100% of viruses, so they shouldn't change at all...

That is a logical assumption but is holding true? Are you getting clear differences in stats for the genes re-inserted? Default ambig= option is random so that should distribute the reads among 170 equally (if they multimap).

ADD REPLY • link 4.0 years ago by GenoMax 141k