Question

Annotation of L1 from short read (single-strand) RNA

8 months ago

Kilian • 0

Hi,

I would like to annotate sequences from single-end fastq files (225 bp read length) with regard to LINE-1 structure, i.e. whether a read comes from ORF1 or ORF2.

I tried aligning the fastq files and then annotating them (TEtranscripts), but this approach failed, possibly because I only have 500 to 1000 reads in total. The reads were not aligned to the genome even when permitting -k 5000 with bowtie2.

I would greatly appreciate tips on annotating short-read fastq sequences.

Example of sequence: AGAGAATCTCAGGTGCAGAAGATTCCATAGAGGACATCGACACAACAGTCAAAGAAAATACAAAATGCAAAAGGATCCTAACTCAAAACATCCAGGTAATCCAGGACACAATGAGAAGACCAAACCTACGGATAATAGGAATTGATGAGAATGAAGATTTTCAACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATATCTCGTATGCCGTCTTC

I also tried L1Base and looked into repeatblast, but I am not sure whether either would be a suitable approach.

I hope this somewhat makes sense since I am also new to this.

short-read Line1 fastq annotation • 406 views

updated 8 months ago by rfran010 ▴ 900 • written 8 months ago by Kilian • 0

I ran your read through BLAT on the UCSC Genome Browser and it aligns nicely, so the read itself is fine. Your difficulty with bowtie2 most likely comes from not setting it to "local" mode: by default bowtie2 aligns end-to-end and expects the whole read to match. In your example read, only 165 bases align to an L1 element with BLAT. If you search (ctrl+F) for the Illumina adapter sequence "AGATCGGAAGAGC", you'll see that it begins after about 165 bases. In other words, after 165 cycles (165 bp) the fragment was completely sequenced, and the remainder of the 225 cycles read through into the adapter on the other end.
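
The adapter check described above can be reproduced with a few lines of Python. This is only a minimal sketch: it uses the example read from the question and the adapter prefix quoted above, and it looks for an exact match only. The function name `trim_adapter` is made up for illustration; dedicated trimmers also handle sequencing errors and partial adapters at the read end, which this does not.

```python
# Illumina adapter prefix quoted in the reply above.
ADAPTER = "AGATCGGAAGAGC"

# Example read copied from the question, split over lines for readability.
READ = (
    "AGAGAATCTCAGGTGCAGAAGATTCCATAGAGGACATCGACACAACAGTC"
    "AAAGAAAATACAAAATGCAAAAGGATCCTAACTCAAAACATCCAGGTAAT"
    "CCAGGACACAATGAGAAGACCAAACCTACGGATAATAGGAATTGATGAGA"
    "ATGAAGATTTTCAACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACT"
    "GACCAATATCTCGTATGCCGTCTTC"
)


def trim_adapter(read: str, adapter: str = ADAPTER) -> str:
    """Return the read up to the first exact adapter match.

    If the adapter is not found, the read is returned unchanged.
    """
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


insert = trim_adapter(READ)
print(f"read length: {len(READ)}, length before adapter: {len(insert)}")
```

The part of the read before the adapter is the actual fragment, and that is the only part an aligner can be expected to place on the genome.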

If you either (1) trim the adapter from your reads before alignment or (2) run bowtie2 in "local" mode (--local), your alignment problems should go away. I recommend trimming in this case rather than relying on local mode. You also shouldn't need to change the "-k" parameter unless you want to report more than one alignment per read, and for your goal that wouldn't really make sense, since each alignment of a read should fall in the same structural unit.
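
The two options above could be set up roughly as follows. This sketch only builds and prints the command lines; it does not run anything. It assumes cutadapt as the trimming tool (the reply does not name one), and the file names, index prefix, and minimum length of 20 are hypothetical placeholders. The flags used are `-a`, `-m`, `-o` for cutadapt and `--local`, `-x`, `-U`, `-S` for bowtie2.

```python
import shlex
from typing import List


def cutadapt_cmd(fastq_in: str, fastq_out: str,
                 adapter: str = "AGATCGGAAGAGC", min_len: int = 20) -> List[str]:
    """Option 1: trim the 3' adapter and drop reads shorter than min_len."""
    return ["cutadapt", "-a", adapter, "-m", str(min_len),
            "-o", fastq_out, fastq_in]


def bowtie2_cmd(index_prefix: str, fastq_in: str, sam_out: str,
                local: bool = False) -> List[str]:
    """Single-end bowtie2 call; local=True adds --local (option 2)."""
    cmd = ["bowtie2"]
    if local:
        cmd.append("--local")
    cmd += ["-x", index_prefix, "-U", fastq_in, "-S", sam_out]
    return cmd


# Hypothetical file names and index prefix, for illustration only.
print(shlex.join(cutadapt_cmd("reads.fastq", "reads.trimmed.fastq")))
print(shlex.join(bowtie2_cmd("genome_index", "reads.trimmed.fastq", "aligned.sam")))
print(shlex.join(bowtie2_cmd("genome_index", "reads.fastq", "aligned_local.sam", local=True)))
```

The first two printed commands correspond to trimming followed by a default end-to-end alignment; the third is the alternative of aligning untrimmed reads in local mode.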

If you really just want to identify whether a read comes from ORF1 or ORF2, it would probably be best to get a file of consensus sequences for the various LINE-1 families that have these regions annotated, and then align to that directly instead of to the genome.
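
Once a read is aligned to an annotated consensus, assigning it to ORF1 or ORF2 comes down to an interval overlap. The sketch below illustrates that step under stated assumptions: the coordinates in `FEATURES` are placeholders, not real LINE-1 annotation, and must be replaced with the annotation of whichever consensus you align to. The names `FEATURES`, `overlap`, and `assign_feature` are made up for this example.

```python
from typing import Dict, Optional, Tuple

# PLACEHOLDER coordinates (1-based, inclusive) on a hypothetical consensus.
# Replace them with the annotation that comes with your consensus sequence.
FEATURES: Dict[str, Tuple[int, int]] = {
    "5UTR": (1, 900),
    "ORF1": (901, 1900),
    "ORF2": (2000, 5800),
    "3UTR": (5801, 6000),
}


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def assign_feature(aln_start: int, aln_end: int,
                   features: Dict[str, Tuple[int, int]] = FEATURES) -> Optional[str]:
    """Return the feature with the largest overlap, or None if nothing overlaps."""
    best = max(features, key=lambda name: overlap(aln_start, aln_end, *features[name]))
    return best if overlap(aln_start, aln_end, *features[best]) > 0 else None


# A 165 bp alignment at a made-up position on the placeholder consensus.
print(assign_feature(1000, 1164))
```

Taking the largest overlap is one simple way to handle reads that span a boundary; reporting every overlapped feature would be an equally reasonable choice.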

8 months ago by rfran010 ▴ 900