Map human cell line RNA-seq data to virus
2
0
Entering edit mode
8.7 years ago

Hi,

I have RNA-seq data for a cell line that was transformed by viral transfection with the SV40 virus, I would like to calculate the expression levels of the viral gene large-T in these samples. I have started out by downloading the viral genome from ncbi and merged it with the human genome by adding it to the FASTA file as an extra chromosome. I have built a bowtie index and mapped the reads with bowtie, and now I wonder how to proceed and evaluate how many reads have mapped to the viral genome.

Frida

virus Cancer RNA-Seq Large-T SV40 • 3.0k views
ADD COMMENT
1
Entering edit mode

The third column of SAM/BAM file tells the reference sequence name. You can filter based on third column and count the reads. Lets say your "viral chromosome" name in fasta file is ">SV40", the following command gives you the number of reads mapped to "viral chromosome"

samtools view input.bam | awk '{ if ($3=="SV40") print }' | wc -l
ADD REPLY
0
Entering edit mode

Hi,

What is the different between

samtools index test_sorted.bam test_sorted.bai

and

samtools idxstats BJb20_sorted.bam | head -n 26

The last one just lists all the chromosomes.

ADD REPLY
2
Entering edit mode
8.7 years ago
Charles Plessy ★ 2.9k

Here is an even shorter awk command:

awk '$3=="SV40"'
ADD COMMENT
0
Entering edit mode

Thanks a lot! And do you know how I could be even more specific and find out the number of reads mapped to a specific gene in the viral genome?

ADD REPLY
0
Entering edit mode

If you know the coordinates of your gene of interest in the the SV40 genome, you can indicate this region to the samtools view command (after sorting and indexing the BAM file). You can find details in the samtools manpage. Note that it will also count partial overlaps. If you need to be more precise, for instance to take splicing into account, then it may be better to use generic transcriptome software (tophat, ...) in which you would add the SV40 RNAs of interest to the reference annotations (RefSeq, GENCODE, ...).

ADD REPLY
0
Entering edit mode

Thanks, how would the command look then?

ADD REPLY
1
Entering edit mode

If the name of the viral chromosome in the FASTA file that was indexed for your aligner was "SV40", and if the coordinates of interest were from position 100 to position 200, then the command would look like:

samtools view input.bam SV40:100-200 | awk '$3=="SV40"' | wc -l
ADD REPLY
0
Entering edit mode

Thanks, the added "chromosome" has the name gi|9628421|ref|NC_001669.1| and when I try the command you suggested

samtools view input.bam gi|9628421|ref|NC_001669.1|:100-200 | awk '$3=="gi|9628421|ref|NC_001669.1|"' | wc -l

It says "region "gi" specifies an unknown reference name...", is it some way to get around this or is it easier to just change the name in the fasta header?

ADD REPLY
0
Entering edit mode

The pipe characters need to be quoted or escaped for your command to work, otherwise they will be interpreted as shell pipes. This is why your samtools command ends at "gi", before the first pipe character, and this explains the error message, since no chromosome is called "gi" in your dataset.

samtools view input.bam 'gi|9628421|ref|NC_001669.1|:100-200' | awk ...
ADD REPLY
0
Entering edit mode
8.6 years ago

Hi again, it worked now when I changed the name of the chromosome in the fasta file and reran the mapping, but interestingly I get different results when I use the command

samtools idxstats BJb20_sorted.bam | head -n 26

..to see the number of mapped reads to each chromosome (26 in total with my added SV40 virus), can anyone help me understand the difference between these commands?

ADD COMMENT

Login before adding your answer.

Traffic: 2104 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6