Question

Map human cell line RNA-seq data to virus

0

Entering edit mode

8.7 years ago

frida.danielsson ▴ 50

Hi,

I have RNA-seq data for a cell line that was transformed by viral transfection with the SV40 virus, I would like to calculate the expression levels of the viral gene large-T in these samples. I have started out by downloading the viral genome from ncbi and merged it with the human genome by adding it to the FASTA file as an extra chromosome. I have built a bowtie index and mapped the reads with bowtie, and now I wonder how to proceed and evaluate how many reads have mapped to the viral genome.

Frida

virus Cancer RNA-Seq Large-T SV40 • 3.0k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.7 years ago by frida.danielsson ▴ 50

1

Entering edit mode

The third column of SAM/BAM file tells the reference sequence name. You can filter based on third column and count the reads. Lets say your "viral chromosome" name in fasta file is ">SV40", the following command gives you the number of reads mapped to "viral chromosome"

samtools view input.bam | awk '{ if ($3=="SV40") print }' | wc -l

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by GouthamAtla 12k

0

Entering edit mode

Hi,

What is the different between

samtools index test_sorted.bam test_sorted.bai

and

samtools idxstats BJb20_sorted.bam | head -n 26

The last one just lists all the chromosomes.

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.6 years ago by frida.danielsson ▴ 50

0

Entering edit mode

8.6 years ago

frida.danielsson ▴ 50

Hi again, it worked now when I changed the name of the chromosome in the fasta file and reran the mapping, but interestingly I get different results when I use the command

samtools idxstats BJb20_sorted.bam | head -n 26

..to see the number of mapped reads to each chromosome (26 in total with my added SV40 virus), can anyone help me understand the difference between these commands?

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by frida.danielsson ▴ 50

Ram · Accepted Answer · 2015-08-15

2

Entering edit mode

8.7 years ago

Charles Plessy ★ 2.9k

Here is an even shorter awk command:

awk '$3=="SV40"'

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.7 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Thanks a lot! And do you know how I could be even more specific and find out the number of reads mapped to a specific gene in the viral genome?

ADD REPLY • link 8.7 years ago by frida.danielsson ▴ 50

0

Entering edit mode

If you know the coordinates of your gene of interest in the the SV40 genome, you can indicate this region to the samtools view command (after sorting and indexing the BAM file). You can find details in the samtools manpage. Note that it will also count partial overlaps. If you need to be more precise, for instance to take splicing into account, then it may be better to use generic transcriptome software (tophat, ...) in which you would add the SV40 RNAs of interest to the reference annotations (RefSeq, GENCODE, ...).

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Thanks, how would the command look then?

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by frida.danielsson ▴ 50

1

Entering edit mode

If the name of the viral chromosome in the FASTA file that was indexed for your aligner was "SV40", and if the coordinates of interest were from position 100 to position 200, then the command would look like:

samtools view input.bam SV40:100-200 | awk '$3=="SV40"' | wc -l

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Thanks, the added "chromosome" has the name gi|9628421|ref|NC_001669.1| and when I try the command you suggested

samtools view input.bam gi|9628421|ref|NC_001669.1|:100-200 | awk '$3=="gi|9628421|ref|NC_001669.1|"' | wc -l

It says "region "gi" specifies an unknown reference name...", is it some way to get around this or is it easier to just change the name in the fasta header?

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by frida.danielsson ▴ 50

0

Entering edit mode

The pipe characters need to be quoted or escaped for your command to work, otherwise they will be interpreted as shell pipes. This is why your samtools command ends at "gi", before the first pipe character, and this explains the error message, since no chromosome is called "gi" in your dataset.

samtools view input.bam 'gi|9628421|ref|NC_001669.1|:100-200' | awk ...

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.7 years ago by Charles Plessy ★ 2.9k