Question

RNA seq FASTX quality trimming

0


8.2 years ago

Rahul ▴ 30

Hello,

I have filtered my Illumina paired-end reads (forward library: 24 million reads; reverse library: 24 million reads) with the FASTX-Toolkit quality filter (fastq_quality_filter), requiring a quality score of at least Q20 in 90 percent of the bases. (75 bp reads, insert size 200 bp)

But after filtering, I am left with around 18 million reads in the forward library and 20 million reads in the reverse library, so there is a difference of 2 million reads between the two libraries. Can I use these libraries for transcriptome assembly, given that the numbers of reads are unequal?

Regards Rahul

RNA-Seq Assembly next-gen alignment rna-seq • 2.9k views

Updated 8.2 years ago by Brian Bushnell 20k • written 8.2 years ago by Rahul ▴ 30

score 4 · Accepted Answer · 2016-03-01



8.2 years ago

Brian Bushnell 20k

fastx-toolkit is not pair-aware and should never be used for paired reads. There are many modern tools (such as BBDuk, which I wrote) that properly handle paired reads, and will give you paired reads as output, along with singletons in which the mate was discarded.
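To make "pair-aware" concrete, here is a minimal Python sketch of the idea: both mates of a pair are judged together, so the output is a set of intact pairs plus singletons whose mate was discarded. This is not how BBDuk or any other tool is implemented. The function names, the `(name, sequence, quality)` tuple layout, and the Phred+33 offset are assumptions made for illustration, and the default thresholds simply mirror the Q20/90% settings from the question.

```python
def passes_filter(qual, min_q=20, min_pct=90, offset=33):
    """True if at least min_pct percent of bases have Phred quality >= min_q.

    Assumes Phred+33 encoded quality strings (hypothetical default).
    """
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return good * 100 >= min_pct * len(qual)


def filter_pairs(pairs, **thresholds):
    """Split (read1, read2) pairs into kept pairs and singletons.

    Each read is a (name, sequence, quality_string) tuple.
    A pair is kept only if both mates pass; if exactly one mate
    passes, it is reported as a singleton instead of being lost.
    """
    kept_pairs, singletons = [], []
    for r1, r2 in pairs:
        ok1 = passes_filter(r1[2], **thresholds)
        ok2 = passes_filter(r2[2], **thresholds)
        if ok1 and ok2:
            kept_pairs.append((r1, r2))
        elif ok1:
            singletons.append(r1)
        elif ok2:
            singletons.append(r2)
    return kept_pairs, singletons


if __name__ == "__main__":
    demo = [
        (("a/1", "ACGT", "IIII"), ("a/2", "ACGT", "IIII")),  # both mates pass
        (("b/1", "ACGT", "IIII"), ("b/2", "ACGT", "####")),  # only mate 1 passes
    ]
    pairs, singles = filter_pairs(demo)
    print(len(pairs), "pair(s),", len(singles), "singleton(s)")
```

Because the decision is made per pair, the two paired output files always contain the same number of reads in the same order, which is what most assemblers expect.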

Q20 is too high for RNA-seq filtering (or pretty much anything), anyway - that will increase the bias of your output. Trimming to, say, Q10 is a much better idea.


0


FastX is not for paired-end data; it's for single-end data.

You can also try Cutadapt.

Reply • 8.2 years ago by #### ▴ 220

1


Can I use Trimmomatic or PRINSEQ for paired-end reads?

Thanks

Reply • 8.2 years ago by Rahul ▴ 30

score 2 · Accepted Answer · 2016-03-01



8.2 years ago

GouthamAtla 12k

The simple answer is "Yes, you can". Just check how the program you are going to use treats singleton reads (i.e. the 2 million extra reads in one of the files) and how they should be supplied as input.

P.S. My answer was to the original question, which asked whether singletons can be used for assembly along with paired-end reads. The context (and title?) of the question changed later.
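If the two files have already been filtered independently, one way to see which reads are singletons is to re-match mates by read name. The Python sketch below illustrates that idea only. The function names and the `(name, sequence, quality)` tuple layout are hypothetical, and the naming rule is an assumption: mates are expected to be called `ID/1` and `ID/2`, or to share the first whitespace-separated token of the header.

```python
def pair_key(name):
    """Reduce a read name to the ID shared by both mates.

    Assumption: mates are named 'ID/1' and 'ID/2', or
    'ID 1:N:0:...' and 'ID 2:N:0:...'. Other naming schemes
    would need a different rule.
    """
    key = name.split()[0]
    if key.endswith("/1") or key.endswith("/2"):
        key = key[:-2]
    return key


def resync(forward, reverse):
    """Return (pairs, singletons) from two independently filtered read lists.

    Each read is a (name, sequence, quality_string) tuple.
    Pairs come out in the order of the forward list.
    """
    reverse_by_key = {pair_key(read[0]): read for read in reverse}
    pairs, singletons = [], []
    for read in forward:
        mate = reverse_by_key.pop(pair_key(read[0]), None)
        if mate is None:
            singletons.append(read)          # forward read lost its mate
        else:
            pairs.append((read, mate))
    singletons.extend(reverse_by_key.values())  # reverse reads left unmatched
    return pairs, singletons


if __name__ == "__main__":
    fwd = [("a/1", "AC", "II"), ("b/1", "AC", "II")]
    rev = [("b/2", "GT", "II"), ("c/2", "GT", "II")]
    pairs, singles = resync(fwd, rev)
    print([p[0][0] for p in pairs], [s[0] for s in singles])
```

This holds the whole reverse library in memory, which may be heavy for tens of millions of reads; it is meant to show the logic rather than to replace a dedicated pair-aware tool.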


0


Thank you very much for commenting on my query. I am using SOAPdenovo-Trans (on the iPlant Collaborative site) to assemble the reads with default parameters.

I got a CEGMA completeness report of around 50% when I tried assembly (scaffolding) with the trimmed and quality-filtered reads. When I instead tried assembly with the raw reads, I got a CEGMA completeness of 81%. Hence, I am confused about whether I am giving the right input. Even after proper cleanup steps, my results are still not up to the mark.

Reply • 8.2 years ago by Rahul ▴ 30

0


I don't think that's the best practice, though...

Reply • 8.2 years ago by Brian Bushnell 20k

1


I edited my answer. The original question was different. It was about using singleton reads in assembly.

Reply • 8.2 years ago by GouthamAtla 12k

score 2 · Accepted Answer · 2016-03-01



8.2 years ago

Antonio R. Franco ★ 5.1k

If you are using Illumina data, try comparing the results you get using fastq_quality_trimmer. It may keep your files synchronized with the same number of sequences, simply because it shortens sequences instead of discarding them, and so preserves more high-quality sequence.
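The difference between filtering and trimming can be sketched in a few lines of Python. This is a simplified illustration of trimming from the 3' end, not the actual implementation of fastq_quality_trimmer. The function name, the default thresholds, and the Phred+33 offset are assumptions.

```python
def trim_3prime(seq, qual, min_q=20, min_len=25, offset=33):
    """Trim bases below min_q from the 3' end of a read.

    Returns (trimmed_seq, trimmed_qual), or None if the read is
    shorter than min_len after trimming. Low-quality bases inside
    the read are left alone; only the tail is cut back.
    Assumes Phred+33 encoded quality strings.
    """
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_3prime("ACGTACGT", "IIIII###", min_len=4))
```

A read with a poor tail is kept in shortened form rather than dropped, so fewer mates go missing. Reads that fall below the minimum length are still discarded, so trimming each file separately does not guarantee that the two files stay synchronized.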


0


Thanks for your valuable comments and suggestions...

A) After assembling scaffolds from the trimmed and quality-filtered reads, I get the following ratio: Average_number_of_contigs_per_scaffold: 1.0

B) For the untrimmed raw reads: Average_number_of_contigs_per_scaffold: 1.2-1.4

C) The assembly in the published paper shows around: Average_number_of_contigs_per_scaffold: 1.9

I don't know whether the problem in my scaffolds is due to the input reads or to something else.

Any suggestion will be highly appreciated...

Regards Rahul

Reply • 8.2 years ago by Rahul ▴ 30

0


If you look at the Assemblathon 2 contest, you will notice that the number of contigs depends on the source of the DNA. In Assemblathon 2 you will read that some assemblers work better with the fish and not with the boa, while the opposite happens with other assemblers. The source of the DNA, and in particular its complexity and number of repeated sequences, plays a key role in the formation of contigs and scaffolds. If the statistics in the published paper rely on, or were obtained with, a different genome, I believe you cannot compare them.

Reply • 8.2 years ago by Antonio R. Franco ★ 5.1k

0


The authors of the published paper used SOAPdenovo2 for assembly. I am trying an assembly with SOAPdenovo-Trans on the same published reads, with almost the same parameters except for the quality trimming parameters.

Reply • 8.2 years ago by Rahul ▴ 30