Endogenous Retrovirus Expression from RNA-Seq
2
1
Entering edit mode
8.2 years ago
tucanj ▴ 100

Hi,

I'm looking to quantify the different types of human endogenous retroviruses (genes/transcripts/or just reads from each virus) from RNA seq data. How would I align my data, seeing that ERV genes are not in UCSC table (group: genes and gene predictions) or Entrez gff?

Thanks!

alignment genome RNA-Seq • 5.7k views
6
Entering edit mode
8.2 years ago

The simplest way might be to extract the sequences from the repeatmasker tracks. You'll want to filter the masked sites by some of the score metrics just so you don't get a ton of HERVs that aren't likely to be expressed. Below is a little shell script to do that (it's a modified version of a test script I wrote for something else):

#!/bin/bash

#Extract and clean up
tar xf chromFa.tar.gz
tar xf chromOut.tar.gz
rm *hap*.fa
rm -rf *hap*
rm -f hg19.out hg19.fa

#Merge fasta files
for f in *.fa
do
cat $f >> hg19.fa rm$f
done

#Merge and filter repeats
for d in ls -d */
do
sed 's/[()]//g' $d/*.out | awk '{if(NR>3) { if($9=="C") {
start = $14 left =$12
} else {
start = $12 left =$14
}
len=$13+left cov=($13-start+1)/len
if($11=="LTR/ERV1" ||$11=="LTR/ERVK") {
if(cov>=0.98 && $2<=2.0 &&$4<=1.0 && $3<=1.0) print$0;
}}}' >> hg19.out
rm -rf $d done #Convert to RepeatMasker output to BED format awk '{if($9=="C"){strand="-"}else{strand=$9};printf("%s\t%i\t%s\t%s\t.\t%s\n",$5,$6-1,$7,11,strand)}' hg19.out > hg19.bed #Extract sequences in fasta format bedtools getfasta -name -s -fi hg19.fa -bed hg19.bed -fo /dev/stdout | fold -w 80 > hg19.repeats.fa  You'll need bedtools for this to work. I'd just index and align against hg19.repeats.fa and you'll have an idea of how much HERV you have in your RNAseq samples. Edit: BTW, do this in an empty directory! ADD COMMENT 0 Entering edit mode Awesome! Thanks for your response. Will this script work with hg38 if I replace hg19? ADD REPLY 1 Entering edit mode You'd need to modify it a bit, but in essence it should work (I only did something like this on hg19). I presume something like this would suffice: #!/bin/bash wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.out.gz #Extract and clean up gunzip chromFa.tar.gz gunzip chromOut.tar.gz awk '{if(NR>3) { if(9=="C") {
start = $14 left =$12
} else {
start = $12 left =$14
}
len=$13+left cov=($13-start+1)/len
if($11=="LTR/ERV1" ||$11=="LTR/ERVK") {
if(cov>=0.98 && $2<=2.0 &&$4<=1.0 && $3<=1.0) print$0;
}}}' hg38.fa.out > hg38.out

#Convert to RepeatMasker output to BED format
awk '{if($9=="C"){strand="-"}else{strand=$9};printf("%s\t%i\t%s\t%s\t.\t%s\n",$5,$6-1,$7,$11,strand)}' hg38.out > hg38.bed

#Extract sequences in fasta format
bedtools getfasta -name -s -fi hg38.fa -bed hg38.bed -fo /dev/stdout | fold -w 80 > hg38.repeats.fa


Note that what's above won't delete anything, so it's safe to run in a directory with other stuff.

0
Entering edit mode

Seems to work. Could you explain why you only extract LTR/ERV1 and LTR/ERVK (if I am not mistaken)? I am looking for all HERVs (and compare abundance), so would there be a way to extract all of them?

0
Entering edit mode

This is just a slightly modified version of something else that only needed those. You could just change it to include the entirety of the ERV family.

0
Entering edit mode

OK. Furthermore, won't this script just extract the long terminal repeats and not the coding regions? In that sense it would not do what I wanted

0
Entering edit mode

At least in hg19, the coding region still has the same class designation (e.g., "LTR/ERVK"), just the name has -int appended. I haven't a clue if this is the case in hg38 too. As a general rule, when someone gives you some code take that as a template that you may need to modify to do exactly what you need.

0
Entering edit mode

Agreed to take as code as template. I just wasn't sure if it was even modifiable to answer the question. Thanks!

0
Entering edit mode

Hi Devon,

This is great help to the question, but I think its still good to use entire unmasked genome for mapping,

For the analysis of HERVs we always consider uniquely mapped reads. If we only fetch HERV sequences and mapping reads only onto them then there is a problem.

e.g.

SVA elements contain LTR of HERVK, so the reads from there would get mapped on LTR5, which would deviate the results when we estimate the expression level of HERVK.

0
Entering edit mode

Yeah, the original version of this script contained SVA sequences (among other things). I'm assuming that OP knows what he/she is doing and can just modify that as appropriate.

0
Entering edit mode

Yes I figured had to map to full genome. Follow up question: Because HERVs are replicated so many times, will you actually get uniquely mapped reads? HOMER suggests taking one random position, regardless of uniqueness (http://homer.salk.edu/homer/ngs/analyzeRNA.html)

0
Entering edit mode

Yes you will get uniquely mapped reads because each copy of retro-elements is sequentially different to each other. I analyze it by read counts from bamcov of bedtools on tophat2 mapped bam file providing retro-elements co-ordinates.

2
Entering edit mode
8.2 years ago
george.ry ★ 1.2k

Following a similar idea to Devon Ryan, the way I did/do this was to use the output from RepeatMasker (I masked myself to actually know what parameters were used in the masking and to not mask simple repeats etc) and convert the .out file into a GTF file that can be used alongside a genomic GTF file in my standard htseq-count workflow. You can even cat them together to do a combined analysis.

One thing to think about is that REs can be intergenic or intragenic, and especially with LTR-derived sequences, can also be gene promoters etc. Consider that the expression of intragenic and close intergenic elements will often follow the expression of their surrounding / neighbouring gene. If you're looking for independent expression regulation of the REs, this is not what you're after! In this instance you'll need to avoid the REs that overlap gene areas (possibly extended for a kb or so) to avoid this overlap.

The RepeatMasker .out file also splits LTR elements into separate entries for the LTRs and for the internal sequences (often something like LTR|LTR|int|LTR|LTR). Depending on how your workflow handles reads that span different entities in a BED/GTF file, you will also need to merge these together in some way in order to correctly report expression.

1
Entering edit mode

I entirely disagree with you. REs can be exonised but still their expression and function is dependent on chromatin conformation and structure of REs. It depends on that whether the transcription of REs are driven by themselves or they are readthrough (whether they are drivers or passengers).

You cannot eliminate REs depending on their distance to genes for RNA-seq data analysis

If HERVs are full length e.g. LTR-Gag-Pol-Env-LTR, then LTR provides the site for transcription factor binding and HERV particles express, doesn't matter whats their position respective to genes

0
Entering edit mode

I'm not saying that he necessarily should, but it absolutely needs consideration. Depending on what's being investigated it may or may not matter in the first instance, in which case ignore me - are we just looking at total expression that could lead to some kind of peptide or protein (who cares where it's coming from), or actually for truly independent regulation?

Obviously the chromatin and the integrity of the element impacts expression, but it is incorrect to say that proximity doesn't. The R^2 of nearest gene vs LTR element expression sits at about 0.325 with a quick calculation, which is indeed reduced if you avoid intragenic and highly proximal elements (down to ~0.18). As you say, a full length element is far more likely to control its own expression, but it is also the case that the majority of elements masked are not intact, yet would still be investigated in a workflow similar to this, so that argument is largely academic.

Following on from that, read through and splicing probably accounts for a lot / most of this, as you say, but that's precisely the point I'm making. Taking mouse Tlr7 as an example I have to hand, this has 6 transcripts in the ensembl database, one of which splices straight into the middle of an ERVL-MaLR fragment and terminates as a result. If indeed the transcript is as it appears, any reads aligning to this particular element likely solely report the expression coming from the Tlr7 promoter. Taking an extreme case, but to make the point plain, the human L1TD1 gene is derived, as the name suggests, from a LINE, and its protein-coding regions are still largely masked by a standard RepeatMasker run. Reads hitting these masked regions report the expression of the L1TD1 promoter.

Without considering cases such as these, or at least accepting the possibility that it even needs considering, I'm sorry, but you're being careless.

0
Entering edit mode

Whatever, but one should not filter out retroelements on the basis of their proximity with genes, moreover, l1td1 is known to be retroposed gene, and there are many retroelements which provide poly A signal, even bidirectional promoter, this doesn't mean that you would filter out every element near to genes, anyway , it's my opinion about the q n a .

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0040532

http://www.nature.com/nature/journal/v516/n7530/abs/nature13760.html

http://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-015-0146-5

are good literatures showing the analysis of RNA-seq data.

http://www.nature.com/nature/journal/v516/n7531/abs/nature13804.html ThIs is the paper we published in nature, in method section Its written that how the expression of retroelement was estimated.

hth

0
Entering edit mode

Yeah, I included L1TD1 as an example after seeing you were a stem cell person ;)

It was mainly meant to be a point to think about, and perhaps I phrased my post poorly in any case, as I don't actually myself avoid 'gene areas' when doing this - I do a bedtools subtract with exons, which removes ~2.7% of RepeatMasker annotations for GRCh38.78 and ~1.7% for GRCm38.38.