Question: Map human cell line RNA-seq data to virus
0
gravatar for frida.danielsson
4.2 years ago by
European Union
frida.danielsson40 wrote:

Hi,

I have RNA-seq data for a cell line that was transformed by viral transfection with the SV40 virus, I would like to calculate the expresseion levels of the viral gene large-T in these samples. I have started out by downloading the viral genome from ncbi and merged it with the human genome by adding it to the FASTA file as an extra chromosome. I have built a bowtie index and mapped the reads with bowtie, and now I wonder how to proceed and evaluate how many reads have mapped to the viral genome. 

 

Frida

cancer rna-seq virus large-t sv40 • 1.7k views
ADD COMMENTlink modified 4.1 years ago • written 4.2 years ago by frida.danielsson40
1

The third column of SAM/BAM file tells the reference sequence name. You can filter based on third column and count the reads. Lets say your "viral chromosome" name in fasta file is ">SV40", the following command gives you the number of reads mapped to "viral chromosome"

samtools view input.bam | awk '{ if ($3=="SV40") print }' | wc -l 
ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by geek_y9.9k

Hi,

 

What is the different between 

samtools index test_sorted.bam test_sorted.bai

and 

samtools idxstats BJb20_sorted.bam | head -n 26

the last one just lists all the chromosomes..

 

 

 

ADD REPLYlink written 4.1 years ago by frida.danielsson40
2
gravatar for Charles Plessy
4.2 years ago by
Charles Plessy2.7k
Japan
Charles Plessy2.7k wrote:

Here is an even shorter awk command:

awk '$3=="SV40"'
ADD COMMENTlink written 4.2 years ago by Charles Plessy2.7k

Thanks a lot! And do you know how I could be even more specific and find out the number of reads mapped to a specific gene in the viral genome?

ADD REPLYlink written 4.2 years ago by frida.danielsson40

If you know the coordinates of your gene of interest in the the SV40 genome, you can indicate this region to the samtools view command (after sorting and indexing the BAM file). You can find details in the samtools manpage. Note that it will also count partial overlaps. If you need to be more precise, for instance to take splicing into account, then it may be better to use generic transcriptome software (tophat, ...) in which you would add the SV40 RNAs of interest to the reference annotations (RefSeq, GENCODE, ...).

ADD REPLYlink written 4.2 years ago by Charles Plessy2.7k

Thanks, how would the command look then?

ADD REPLYlink written 4.2 years ago by frida.danielsson40
1

If the name of the viral chromosome in the FASTA file that was indexed for your aligner was "SV40", and if the coodinates of interest were from position 100 to position 200, then the command would look like:

samtools view input.bam SV40:100-200 | awk '$3=="SV40"' | wc -l
ADD REPLYlink written 4.2 years ago by Charles Plessy2.7k

Thanks, the added "chromosome" has the name "gi|9628421|ref|NC_001669.1|" and when I try the command you suggested 

samtools view input.bam gi|9628421|ref|NC_001669.1|:100-200 | awk '$3=="gi|9628421|ref|NC_001669.1|"' | wc -l

 

It says "region "gi" specifies an unknown reference name...", is it some way to get around this or is it easier to just change the name in the fasta header?

 

ADD REPLYlink written 4.2 years ago by frida.danielsson40

The pipe characters need to be quoted or escaped for your command to work, otherwise they will be interpreted as shell pipes. This is why your samtools command ends at "gi", before the first pipe character, and this explains the error message, since no chromosome is called "gi" in your dataset.

samtools view input.bam 'gi|9628421|ref|NC_001669.1|:100-200' | awk ...

 

ADD REPLYlink written 4.2 years ago by Charles Plessy2.7k
0
gravatar for frida.danielsson
4.1 years ago by
European Union
frida.danielsson40 wrote:

Hi again, it worked now when I changed the name of the chromosome in the fasta file and reran the mapping, but interestingly I get different results when I use the command

samtools idxstats BJb20_sorted.bam | head -n 26

 

..to see the number of mapped reads to each chromosome (26 in total with my added SV40 virus), can anyone help me understand the difference between these commands?

 

 

ADD COMMENTlink written 4.1 years ago by frida.danielsson40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1733 users visited in the last hour