Question

Minimum length in fastq quality trimmer with FASTX toolkit

7.1 years ago

Lucila ▴ 20

Hi all,

I am filtering and trimming my RNA-seq data with the fastq quality trimmer from the FASTX toolkit, and I am wondering which minimum length I should use; that is, below which length it is advisable to discard reads after trimming. My samples were sequenced on Illumina and my reads are 50bp long.

Thank you for all your comments!

RNA-Seq Quality filtering • 3.1k views

ADD COMMENT • link 7.1 years ago by Lucila ▴ 20

Depends on what you want/need to do with your sequences. With sequences being so short, you're probably better off simply trimming by quality.

ADD REPLY • link 7.1 years ago by st.ph.n ★ 2.7k

@st.ph.n - it sounds like the reads are already being trimmed by quality, but agreed, more details are needed to give a useful answer. So, please describe:

1) The organism (clade, ploidy, genome size, GC content, etc)
2) The experiment (RNA-seq quantification? assembly? isoform identification? variant-calling?  "RNA-seq" is not very specific)
3) The preparation (PCR-free? low input? expected coverage? ribo-depletion? ribo-depletion efficiency?)
4) The goal might be useful.

Also, FASTX is the bottom of the barrel in bioinformatics software. It's slow and uses non-optimal algorithms; I suggest you use virtually anything else.

ADD REPLY • link 7.1 years ago by Brian Bushnell 20k

Thanks Brian,

1- The organism is Triatoma infestans, an insect vector belonging to subfamily Triatominae (Hemiptera, Reduviidae). We do not have the genome, so we do not have information about its size and GC content. This organism is diploid.
2- Regarding the experiment, we want to quantify gene expression differences between conditions using RNA-seq on the Illumina platform. We want to build a de novo assembly using the reads from all libraries, annotate the transcripts using GO terms and BLAST searches, and finally map the reads from each library back to the assembly to perform the gene expression analysis.
3- We used TruSeq SBS sequencing kits version 3 (Illumina) to sequence 10 libraries. We obtained between 10 and 30 million single-end reads (50 bp length) per library.
4- As mentioned before, we want to compare gene expression levels between different conditions.

Can you suggest an alternative to the FASTX tool?

If you need more info, please let me know. Lucila

ADD REPLY • link updated 7.1 years ago by Brian Bushnell 20k • written 7.1 years ago by Lucila ▴ 20

Both seqtk and BBDuk use the optimal Phred algorithm for quality-trimming, and both are very fast. I'd recommend either of them, though for paired reads I think BBDuk is more convenient. Adapter-trimming is probably at least as useful as quality-trimming, and BBDuk allows adapter-trimming simultaneously with quality-trimming. Neither kind of trimming is overly important for 50bp shotgun reads, though - both become more important with full-length reads (150bp+).
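To make the idea of Phred-style trimming (and the minimum-length cutoff asked about in the question) concrete, here is a minimal Python sketch. It keeps the stretch of the read that maximizes the sum of (max_error - error probability) over its bases, then discards the read if that stretch is shorter than min_len. This is an illustration only, not the actual code of seqtk or BBDuk; the function name phred_trim, the parameters max_error, min_len and offset, and the values used in the example are all assumptions chosen for the demonstration, not recommendations.

```python
def phred_trim(seq, qual, min_len, max_error=0.05, offset=33):
    """Keep the best-scoring segment of a read; return None if it is too short.

    Each base scores (max_error - p_err), where p_err = 10 ** (-Q / 10).
    The kept segment is the contiguous run with the highest total score
    (a maximum-sum-subarray scan). All parameter values are illustrative.
    """
    best_start, best_end, best_score = 0, 0, 0.0
    start, score = 0, 0.0
    for i, ch in enumerate(qual):
        q = ord(ch) - offset              # Phred score, assuming Phred+33 encoding
        p_err = 10 ** (-q / 10)           # probability that the base call is wrong
        score += max_error - p_err
        if score < 0:                     # running segment is no longer worth keeping
            score = 0.0
            start = i + 1
        elif score > best_score:          # new best segment ends at this base
            best_score = score
            best_start, best_end = start, i + 1
    if best_end - best_start < min_len:   # minimum-length filter after trimming
        return None
    return seq[best_start:best_end], qual[best_start:best_end]


# Toy read: two low-quality bases (Q2) at each end, eight Q40 bases in the middle.
read_seq = "ACGTACGTACGT"
read_qual = "##IIIIIIII##"
print(phred_trim(read_seq, read_qual, min_len=5))    # ('GTACGTAC', 'IIIIIIII')
print(phred_trim(read_seq, read_qual, min_len=25))   # None: trimmed read is too short
```

The second call shows the effect of the minimum length: the same trimmed read is kept or discarded depending only on the cutoff, so the right value depends on what the downstream step (assembly or mapping) can still use.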

As st.ph.n indicated, you can't get a good assembly with single-ended 50bp reads, especially for a diploid. But you can still get an assembly, which may be sufficient for differential expression; many of the genes won't be complete in a single contig, but their sequence will still be present, and you may be able to annotate them with function. Then comparison between different conditions could be useful. However, you'd be better off sequencing a sample deeply with 2x150bp reads (or ideally, 2x250) with relatively long inserts (say, 600bp average, with a wide range) for the de novo assembly, then using the 50bp data for quantification.

You can determine the GC content of your organism (to a first approximation) from the reads. For example, using BBDuk:

bbduk.sh in=reads.fq gchist=gchist.txt

Notably, if that graph has two peaks, that may indicate your insect has a bacterial symbiont with a different GC content (cockroaches do, for example). Those could be assembled independently.
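For readers who want to see what such a GC histogram contains, here is a rough Python sketch that computes the GC fraction of each read and counts reads per bin. It is not how BBDuk computes gchist; it assumes plain four-line FASTQ records, and the names gc_fraction and gc_histogram and the bin count are made up for this example.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of G/C among the A, C, G, T bases of seq (None if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def gc_histogram(fastq_lines, bins=20):
    """Count reads per GC bin, assuming 4-line FASTQ records (sequence on line 2)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:                    # only the sequence line of each record
            continue
        gc = gc_fraction(line.strip())
        if gc is None:                    # read made entirely of N or other symbols
            continue
        counts[min(int(gc * bins), bins - 1)] += 1   # GC of 1.0 goes in the last bin
    return [counts.get(b, 0) for b in range(bins)]


# Tiny in-memory example with three reads (GC = 1.0, 0.0 and 0.5) and four bins.
toy_fastq = [
    "@r1", "GGCC", "+", "IIII",
    "@r2", "ATAT", "+", "IIII",
    "@r3", "ACGT", "+", "IIII",
]
print(gc_histogram(toy_fastq, bins=4))   # [1, 0, 1, 1]
```

On real data the function can be given an open file handle instead of a list; two clearly separated humps in the resulting counts would correspond to the two-peak situation described above.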

ADD REPLY • link 7.1 years ago by Brian Bushnell 20k

Thank you Brian for your response. Regarding the histogram, which software do you recommend for plotting it?

I have seen more than one peak with FastQC, so it could be interesting to assemble them independently. Which tool allows me to separate the reads according to their GC content?

Sorry for all my beginner's questions.

Best, Lucila.

ADD REPLY • link 7.1 years ago by Lucila ▴ 20

You can filter reads by GC content with BBDuk's mingc and maxgc flags, but that's a pretty crude way of filtering, since the peaks will overlap. It's usually better to filter the assembled contigs by GC, since they are longer.
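As a sketch of what filtering contigs by GC could look like, the Python below reads FASTA records and splits them into those inside and outside a GC window. The function names read_fasta and split_by_gc are invented for this example, and the 0.4 to 0.6 window is an arbitrary placeholder; real thresholds would have to come from the peaks seen in your own GC histogram.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:                  # emit the last record
        yield name, "".join(chunks)


def split_by_gc(records, min_gc, max_gc):
    """Split records into (inside, outside) the inclusive GC window."""
    inside, outside = [], []
    for name, seq in records:
        s = seq.upper()
        acgt = sum(s.count(base) for base in "ACGT")
        gc = (s.count("G") + s.count("C")) / acgt if acgt else 0.0
        (inside if min_gc <= gc <= max_gc else outside).append((name, seq))
    return inside, outside


# Toy contigs: c1 is GC-rich, c2 is AT-rich, c3 sits at 50% GC.
toy_fasta = [">c1", "GGCC", "GCGC", ">c2", "ATATATAT", ">c3", "ACGT"]
kept, rest = split_by_gc(read_fasta(toy_fasta), min_gc=0.4, max_gc=0.6)
print(kept)   # [('c3', 'ACGT')]
print(rest)   # [('c1', 'GGCCGCGC'), ('c2', 'ATATATAT')]
```

Because contigs are much longer than 50bp reads, their GC fraction is a far less noisy estimate, which is why the two groups separate more cleanly at this stage than at the read level.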

As for plotting, I normally use Excel.

ADD REPLY • link 7.0 years ago by Brian Bushnell 20k

Assuming you have longer paired-end reads with which to create your de novo assembly, I suggest you use Trinity, incorporating your shorter (presumably single-end) reads in the assembly. As for trimming by quality, you can incorporate the Trimmomatic step into your Trinity command.

ADD REPLY • link 7.1 years ago by st.ph.n ★ 2.7k

Thanks for your answer, st.ph.n. Unfortunately, we only have these single-end reads of 50 bp length. Do you think it is possible to make a de novo assembly with them? We selected the cheaper option, but maybe it was not the best....