Should you include Trimmomatic unpaired reads when assembling by alignment?
3.2 years ago
RBright21 ▴ 10

Should you include the unpaired reads produced by adaptor/quality trimming with Trimmomatic when carrying out assembly by alignment? I had thought you should, so as not to lose what could be valuable data, but I have recently read that some people discard unpaired reads before analysis.

Is it dependent on the application, the amount of data you have or just personal preference?

Thanks

alignment assembly sequence • 1.8k views

What do you mean by assembly by alignment? Genome-guided assembly in Trinity? Normally the fraction of unpaired sequences should be very small, so it should do no harm to leave them out. If you get a large fraction of unpaired reads, you might want to investigate your raw data more closely.

Hi Michael

Apologies, I'm still very new to bioinformatics, so I'm not always sure of the terminology. I meant using an alignment tool like Bowtie2 to align reads against a reference sequence.

Thanks for your input though. What would you consider to be a large fraction of unpaired reads? More than 1%? More than 10%? Or again, does it depend on the application and the data you have?

Thanks

OK, I see. And yes, it depends... So are you doing RNA-seq or DNA-seq, and do you want to map back to the genome? Are you using Illumina paired-end sequences?

1. If you are doing RNA-seq, you should use a recent aligner like HISAT or STAR.
2. What counts as a large fraction depends on the data; the important aspect is how much of your library is considered "bad". In high-quality Illumina output, I would expect to see well under 1% adapter content and no more than 1-2% of reads filtered in total. In the end you should have a lot of reads, so it would hardly matter for transcript counts. On the other hand, STAR would still be able to align correctly if a little adapter sequence remains at the ends of reads.
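
To put a number on "filtered in total" for your own run, one rough approach is to count read pairs before and after trimming. The sketch below is only an illustration: the function names are made up for this example, it assumes standard 4-line FASTQ records (not line-wrapped), and the counts in the usage lines are hypothetical.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record.

    Works on plain or gzip-compressed files (detected by the .gz suffix).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def unpaired_fraction(n_input_pairs, n_surviving_pairs):
    """Fraction of input read pairs that did not survive trimming as an intact pair."""
    if n_input_pairs <= 0:
        raise ValueError("n_input_pairs must be positive")
    return 1.0 - n_surviving_pairs / n_input_pairs


# Hypothetical counts: 1,000,000 input pairs, 985,000 pairs with both mates kept.
fraction = unpaired_fraction(1_000_000, 985_000)
print(f"Pairs lost or broken by trimming: {fraction:.1%}")
```

In practice you would pass `count_fastq_reads` the raw R1 file and the forward "paired" output file from Trimmomatic to get the two counts. Trimmomatic also prints a similar summary to its log at the end of a run, so this is mainly useful for checking files after the fact.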

Thanks so much for your quick response again. It is viral DNA (enriched during library prep to remove non-target sequences), and I am working with Illumina paired-end reads generated on a MiSeq.

Should I consider using the "keepBothReads" option in Trimmomatic if I see discard rates above 1-2% (or routinely, for that matter)? Or does a discard rate above that suggest that my parameters are too stringent, or just that my data are not good?

So are you trying to identify virus strains, or sample composition as in viral metagenome analysis? If the former, I would remove everything that is "irregular" just to be on the safe side: remove all low-quality reads, clip adapters, then keep only properly paired reads. A single viral genome is so small in comparison to your library size that you will get incredible depth anyway. On the other hand, proper identification will depend on only a few nucleotide variations, so you may want to do aggressive QC and have everything as "clean" as possible.
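
As a rough sketch of that "trim, then keep only intact pairs" workflow, the Python below just builds the two command lines (Trimmomatic in PE mode, then Bowtie2 with `-1`/`-2` only, so the unpaired files are never passed via `-U`). The helper names, file naming scheme, adapter file and trimming thresholds are all assumptions for illustration, not recommended settings; pick the adapter FASTA that matches your library kit.

```python
import shlex


def trimmomatic_pe_cmd(r1, r2, prefix, adapters="TruSeq3-PE.fa", jar="trimmomatic.jar"):
    """Build a Trimmomatic PE command; returns (command, (paired_R1, paired_R2)).

    PE mode takes four outputs in this order:
    forward paired, forward unpaired, reverse paired, reverse unpaired.
    """
    paired_1 = f"{prefix}_R1.paired.fastq.gz"
    unpaired_1 = f"{prefix}_R1.unpaired.fastq.gz"
    paired_2 = f"{prefix}_R2.paired.fastq.gz"
    unpaired_2 = f"{prefix}_R2.unpaired.fastq.gz"
    cmd = [
        "java", "-jar", jar, "PE", r1, r2,
        paired_1, unpaired_1, paired_2, unpaired_2,
        f"ILLUMINACLIP:{adapters}:2:30:10",  # adapter clipping (illustrative values)
        "SLIDINGWINDOW:4:20",                # quality trimming (illustrative values)
        "MINLEN:50",                         # drop short reads (illustrative value)
    ]
    return cmd, (paired_1, paired_2)


def bowtie2_paired_cmd(index, paired_1, paired_2, sam_out):
    """Build a Bowtie2 command that aligns only the intact pairs."""
    return ["bowtie2", "-x", index, "-1", paired_1, "-2", paired_2, "-S", sam_out]


# Hypothetical sample and index names.
trim_cmd, (p1, p2) = trimmomatic_pe_cmd("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")
align_cmd = bowtie2_paired_cmd("virus_ref", p1, p2, "sample.sam")
for cmd in (trim_cmd, align_cmd):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed lines can be pasted into a shell, or the lists passed to `subprocess.run(cmd, check=True)`. The two unpaired files are still written by Trimmomatic, so you can always go back and add them later if coverage turns out to be a problem.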

I am trying to identify the virus strain, so your suggestion is really helpful. I do have lots of data and it's only a 36 kb genome, so when I align my reads I get lots of coverage, and I haven't yet found an instance where leaving out the unpaired reads has been detrimental. When I first started experimenting with the data I left the unpaired reads out, but then I saw so many people advise against discarding valuable data that I kept them. I wonder if I was confusing metagenomic with single-strain analysis.

Thanks so much for your help with this. I really appreciate it and feel much more confident moving forward with my analysis.
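
For anyone wanting to sanity-check the "plenty of depth anyway" point on their own data, a back-of-envelope estimate is reads × read length / genome length. The sketch below uses the 36 kb genome from this thread; the read count and read length are hypothetical, and it assumes every read maps to the target, so treat the result as an upper bound.

```python
def expected_depth(n_reads, read_length, genome_length):
    """Approximate mean depth, assuming all reads map uniformly to the genome."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return n_reads * read_length / genome_length


GENOME_LENGTH = 36_000  # 36 kb genome, as stated in the thread
READ_LENGTH = 150       # hypothetical read length after trimming
N_READS = 200_000       # hypothetical number of reads in intact pairs

with_all = expected_depth(N_READS, READ_LENGTH, GENOME_LENGTH)
# Same estimate after discarding a further 2% of reads.
with_loss = expected_depth(int(N_READS * 0.98), READ_LENGTH, GENOME_LENGTH)
print(f"depth with all reads: {with_all:.0f}x, after dropping 2%: {with_loss:.0f}x")
```

With numbers in this range the estimate stays in the hundreds either way, which is consistent with the unpaired reads making little practical difference for a genome this small.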
