Is it possible/recommended to trim RNA-Seq reads to a specific length
4.6 years ago
komal.rathi ★ 3.9k

Hi everyone,

I have paired-end RNA-sequencing samples where the mates in the two paired files are of unequal lengths:

For example:

R1:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 1:N:0:NGACCA
CTCGTGGGGGGGCCGGGCCACCCCTCCCACGGCGCGACCGCTNNCCN
+
AAAAAEEEEEAEEEEEEEEAEEEEEEEEEEEAAEEE/EEAA/##6E#


R2:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 2:N:0:NGACCA
CTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNCCGCGCGGCACCCCCCCGTCGCCGGGGCGGGGG
+
AAA############################################################/####/E/E<EA<E/EEEEEEEE/A/E<AEEAEEEE//


Using trim_galore hasn't made any difference:

R1:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 1:N:0:NGACCA
CTCGTGGGGGGGCCGGGCCACCCCTCCCACGGCGCGACCGC
+
AAAAAEEEEEAEEEEEEEEAEEEEEEEEEEEAAEEE/EEAA


R2:

@NB501069:25:HY3KCBGXX:1:11101:16049:1105 2:N:0:NGACCA
CTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNCCGCGCGGCACCCCCCCGTCGCCGGGGCGGG
+
AAA############################################################/####/E/E<EA<E/EEEEEEEE/A/E<AEEAEEEE


This is another sample:

Original file R1:

@NB501069:23:HYV7KBGXX:1:11101:19650:1064 1:N:0:CTTGTA
CCCAGNCTGGAGTGCAGTGGCATTGTCATAGCTCACTATAACCTCAAATTCCTCAACTCAAATGATCCTCCCACCTCAGCCTCCCAAGTAGCTAGGACTAC
+
AAA6A#EEAEEEAAAAEEEAEEEE/A/E/EE<EEE/EE/EEEEEEEEEEEEAEEEEEEEEEEEE/EEEEE</E<EEEEAEEEE/AEEE<EE/AEEEEEEE<
@NB501069:23:HYV7KBGXX:1:11101:1659:1064 1:N:0:GTTGTA
CAGGGTTGGAAGAGCTGGCCTCGCCTTTCGGCTCCTTTCTCGTCTTGGCCGCGCCGCGGCGTAGGTCCAGCTTGAGCTGCTGGTTCTGCTGGAGCAGGGTG
+
AAAAAEEEEEEAEEEEEEEEEEE<EAEEEEEAE/EEEEAEEAEEEE/EEEEEA//EE<EAEA//EEEAEEE/E<//</A6E<EEE<EE6AAEAE6<AEEE/
@NB501069:23:HYV7KBGXX:1:11101:3487:1064 1:N:0:CTTGTA
AAGAATCAGCAGCCAATCCTCAAAGTTTAAATCATTTAAGGAAATGGGGAAACAAAATTCCAGGTAAATAACAAGACTGAAAAACTAGATTTAAAATAGTG
+
AAAAA6EEEAEEEEEEEEEEEEEEE6EAEEEEEEEEEEEAEEEEAEAA<EEEEEEEEEEEEEEEEEAEEE/EEEEEEEEEAAEEE<AEAEE/EEEEEEEA/
@NB501069:23:HYV7KBGXX:1:11101:12495:1064 1:N:0:CTTGTA
CATTATTTGGAATTCCTGCGACTGTTTCCCTATCAGTATCCTCTGCTGGCCTCTTTACAGTTTTGCATTCTGCTGTGCCATTTGTAGACCGAACGTC
+
AAAAAAEEEAAEEEEEEEEE<EEAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEAEE<AAEEEAEE


First few reads after trimming R1:

@NB501069:23:HYV7KBGXX:1:11101:19650:1064 1:N:0:CTTGTA
CCCAGNCTGGAGTGCAGTGGCATTGTCATAGCTCACTATAACCTCAAATTCCTCAACTCAAATGATCCTCCCACCTCAGCCTCCCAAGTAGCTAGGACTAC
+
AAA6A#EEAEEEAAAAEEEAEEEE/A/E/EE<EEE/EE/EEEEEEEEEEEEAEEEEEEEEEEEE/EEEEE</E<EEEEAEEEE/AEEE<EE/AEEEEEEE<
@NB501069:23:HYV7KBGXX:1:11101:3487:1064 1:N:0:CTTGTA
AAGAATCAGCAGCCAATCCTCAAAGTTTAAATCATTTAAGGAAATGGGGAAACAAAATTCCAGGTAAATAACAAGACTGAAAAACTAGATTTAAAATAGT
+
AAAAA6EEEAEEEEEEEEEEEEEEE6EAEEEEEEEEEEEAEEEEAEAA<EEEEEEEEEEEEEEEEEAEEE/EEEEEEEEEAAEEE<AEAEE/EEEEEEEA
@NB501069:23:HYV7KBGXX:1:11101:12495:1064 1:N:0:CTTGTA
CATTATTTGGAATTCCTGCGACTGTTTCCCTATCAGTATCCTCTGCTGGCCTCTTTACAGTTTTGCATTCTGCTGTGCCATTTGTAGACCGAACGTC


Looking at the distribution of the read lengths, the majority of them are 100 bp long.

My goal is to retrieve fusions from the RNA-Seq data. I am able to run STAR-Fusion on this despite the unequal mate lengths, but I am unable to run chimeraScan for this exact reason.
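
For reference, a length distribution like the one described above can be tallied with a few lines of Python. This is only a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record; the function name `read_length_counts` and the toy records are made up for illustration and are not from my data.

```python
from collections import Counter
import io

def read_length_counts(handle):
    """Count sequence lengths in a FASTQ stream with 4 lines per record."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            counts[len(line.rstrip("\n"))] += 1
    return counts

# Toy input standing in for a real file opened with open("sample_R1.fastq")
example = io.StringIO(
    "@r1\nACGTACGT\n+\nEEEEEEEE\n"
    "@r2\nACGT\n+\nEEEE\n"
    "@r3\nACGTACGT\n+\nEEEEEEEE\n"
)
counts = read_length_counts(example)
for length, n in sorted(counts.items()):
    print(length, n)
```

Running it separately on the R1 and R2 files shows whether the two mates differ in length across the whole run or only in a handful of reads.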

Is it possible to trim the reads in such a way as to create mates of equal lengths using a trimming tool? More importantly, would that approach be recommended?

Thanks!

trim RNA-Seq chimeraScan STAR-Fusion • 1.3k views

komal.rathi: Hopefully not all of your R2 reads look like that (I assume these are just the first few reads). What was done to this data to get it into this state (on-sequencer trimming?), and how? Are those N's a result of masking the adapter? If the R1 reads are indeed trimmed, then you may have short inserts in this data.


I have edited my question to reflect that I had trimmed the reads using trim_galore.


"RNA-sequencing samples where the mates in the two paired files are of unequal lengths"

I suspect that this data was pre-trimmed (on the sequencer or in BaseSpace), which is why you have reads of unequal length. If the majority or all of your R2 reads have N's over more than 50% of the read like that, then this appears to be pretty bad data (unless the bases have been deliberately masked). I am not sure it can be used or trusted to find fusions.
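
One way to check whether this is a property of the whole file rather than of the first few reads is to compute the share of reads in which N's make up more than half of the bases. The sketch below is only an illustration: the helper names `n_fraction` and `share_of_n_rich_reads`, the 0.5 default threshold, and the toy sequences are assumptions for the example, not part of any existing tool.

```python
def n_fraction(seq):
    """Return the fraction of bases in seq that are N (case-insensitive)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0

def share_of_n_rich_reads(seqs, threshold=0.5):
    """Return the share of reads whose N fraction exceeds the threshold."""
    fractions = [n_fraction(s) for s in seqs]
    if not fractions:
        return 0.0
    return sum(f > threshold for f in fractions) / len(fractions)

# Toy sequences: one mostly-N read and two ordinary ones
toy_reads = [
    "CTC" + "N" * 60 + "CCGCGCGG",
    "ACGTACGTACGT",
    "ACGTNNACGTAC",
]
share = share_of_n_rich_reads(toy_reads)
print(f"{share:.2f} of reads are more than 50% N")
```

On real data the sequences would come from the second line of every FASTQ record in the R2 file.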


Yeah, I guess this question needs more information than I have provided. I need to talk to the biologists who generated this data. I will clarify some things and add the details to the question.

4.6 years ago
mforde84 ★ 1.3k

fastx_trimmer (part of the FASTX-Toolkit) - http://hannonlab.cshl.edu/fastx_toolkit/

It will get the job done. Why not give it a try? If you want to test it out, run an alignment and differential expression analysis with both the original and the trimmed reads, and see how well they compare. Also, the 5' end of the read in RNA-seq is noisier than the 3' end, so uniformly trimming from the 5' end may actually improve alignments, though probably not by much.
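
If you would rather see the logic spelled out, here is a minimal Python sketch of trimming both mates to one fixed length while keeping the two files in sync. It is not how the FASTX-Toolkit is implemented. It assumes uncompressed FASTQ with four lines per record and mates in the same order in both files; it keeps the first `length` bases of each read (so it cuts from the 3' end) and drops any pair in which either mate is shorter than `length`. The names `read_fastq` and `trim_pairs` and the toy reads are invented for the example.

```python
import io

def read_fastq(handle):
    """Yield (header, seq, qual) tuples from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def trim_pairs(r1, r2, out1, out2, length):
    """Write pairs trimmed to `length` bases; drop pairs with a shorter mate."""
    kept = dropped = 0
    for (h1, s1, q1), (h2, s2, q2) in zip(read_fastq(r1), read_fastq(r2)):
        if len(s1) < length or len(s2) < length:
            dropped += 1  # dropping both mates keeps the files paired
            continue
        out1.write(f"{h1}\n{s1[:length]}\n+\n{q1[:length]}\n")
        out2.write(f"{h2}\n{s2[:length]}\n+\n{q2[:length]}\n")
        kept += 1
    return kept, dropped

# Toy paired input; real use would pass file objects opened with open(...)
r1 = io.StringIO("@a 1\nACGTACGTAC\n+\nEEEEEEEEEE\n@b 1\nACGT\n+\nEEEE\n")
r2 = io.StringIO("@a 2\nTTTTGGGGCCCC\n+\nAAAAAAAAAAAA\n@b 2\nTTTTGGGGCCCC\n+\nAAAAAAAAAAAA\n")
out1, out2 = io.StringIO(), io.StringIO()
kept, dropped = trim_pairs(r1, r2, out1, out2, length=8)
print(kept, dropped)
```

Dropping the whole pair when one mate is too short is a design choice for tools that expect every mate to have the same length; it costs some reads, so compare the kept and dropped counts before committing to a length.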