Question: Trimming RNA Seq data - Invalid nucleotide sequence
3.4 years ago, curious wrote:

Hi,

I downloaded lots of SRA files (ChIP-seq, RNA-seq, DNase-seq etc.) from the Roadmap project. I'm converting them to FASTQ format (fastq-dump with the --split-files option) and then doing some preprocessing to keep them consistent.

Since the read lengths coming out of these experiments differ, I'm trimming the reads to 36 bp with fastx_trimmer. It works fine for FASTQs from the ChIP-seq SRAs. However, the FASTQs from RNA-seq (ABI SOLiD platform) have this format (first 8 lines):

@SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50
T.11.0223.0120.1020110202.0.0010.0.20.0201.2.021021
+SRR179594.1 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_404_F3 length=50
!!@B!@A;B!BB:B!BB=A/%>(/%!A!.6%A!/!%'!%5.%!)!/()%-%
@SRR179594.2 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_408_F3 length=50
T.20.3101.000021200002230.2.0312.0.13.0313.0.220003
+SRR179594.2 mendel_20110320_FRAG_BC_Ryan_RNA_Seq_2_58_408_F3 length=50
!!>B!<B:>!@@*?3-;%A9?A%'+!B!51,A!=!<'!:'.:!(!)-'*>5

Using fastx_trimmer on this to keep the first 36bp is throwing an error:

fastx_trimmer: found invalid nucleotide sequence (T.11.0223.0120.1020110202.0.0010.0.20.0201.2.021021) on line 2

Understandably, this is because the reads are not in the usual ACGTN alphabet. How should I go about trimming these RNA-seq reads?

rna-seq sequence fastq • 1.4k views

I guess you need to use abi-dump instead of fastq-dump if the data is from ABI. I have never used it myself, so this is just a thought.

modified 3.4 years ago • written 3.4 years ago by geek_y

It is likely that you need to run the trimmer with the -Q33 option.

written 3.4 years ago by Antonio R. Franco

Thanks for that, I didn't realize the SRA Toolkit had support for ABI-specific files.

written 3.4 years ago by curious

This is not actually useful. abi-dump will extract your sequences into separate FASTA and quality files, but these eventually have to be joined again into a single FASTQ file for use in many applications. The dots, which mean that the base call was of very poor quality, will remain the same.
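
The joining step mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of what any particular tool does: it assumes each record is a ">name" line followed by a single data line, that both files list the reads in the same order, that "#" lines are comments, and that the quality file holds space-separated integers with negative values for missing calls. The function names are made up for this example. The output mirrors the layout of the records in the question, where the quality string carries one extra leading "!" for the primer base; other tools may expect a different convention.

```python
def records(lines):
    """Yield (name, data) pairs from FASTA-like lines (one data line per record)."""
    name = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            name = line[1:]
        else:
            yield name, line


def join_to_fastq(csfasta_lines, qual_lines):
    """Pair color-space sequences with their qualities and yield FASTQ records."""
    for (name, seq), (qname, quals) in zip(records(csfasta_lines), records(qual_lines)):
        if name != qname:
            raise ValueError(f"record order mismatch: {name} vs {qname}")
        # Assumption: negative scores mark missing calls; clamp them to 0.
        scores = [max(int(q), 0) for q in quals.split()]
        if len(scores) != len(seq) - 1:
            raise ValueError(f"{name}: expected one quality value per color")
        # Phred+33 encoding, with a leading '!' placeholder for the primer base.
        qual_str = "!" + "".join(chr(min(s, 93) + 33) for s in scores)
        yield f"@{name}\n{seq}\n+\n{qual_str}\n"
```

Passing two open file handles (or two lists of lines) to join_to_fastq and writing out each yielded string gives a single FASTQ file. As the comment says, the dots are carried over unchanged.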

modified 3.4 years ago • written 3.4 years ago by Antonio R. Franco
3.4 years ago, mastal511 wrote:

You need to convert the reads from SOLiD colorspace to basespace (ACGTN).

Actually, I think a better idea is to use the SRA Toolkit

http://www.ncbi.nlm.nih.gov/books/NBK158900/

to convert the file into csfasta and quals files, and then use software that will support SOLiD data.

A previous post discussing software for SOLiD is here:

Which Programs Are You Relying On For Solid Data Analysis?


It is a very, very bad idea to convert ABI colorspace sequences to basespace sequences.

If you investigate why, you will discover that an error in a colorspace sequence changes only the color at that one position, whereas the rest of the colorspace sequence (before and after the error) does not change at all.

However, after converting the sequence to basespace, all the bases after the miscalled color change.

That means you can still compare sequences in colorspace when one or several errors are present, whereas it is impossible to do so after conversion.
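
To make the point concrete, here is a small Python sketch of the decoding. It assumes the standard two-base encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), and a read that starts with the primer base. The function name and the two toy reads are invented for illustration.

```python
BASES = "ACGT"  # index = 2-bit code: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Translate a color-space read (primer base + colors 0-3) into bases.

    The primer base is used only to start the decoding and is not part of
    the returned sequence. After a missing call ('.') the remaining bases
    cannot be determined, so they are reported as N.
    """
    prev = BASES.index(read[0])
    out = []
    for i, color in enumerate(read[1:]):
        if color not in "0123":
            out.append("N" * (len(read) - 1 - i))
            break
        prev ^= int(color)  # next base = previous base XOR color
        out.append(BASES[prev])
    return "".join(out)


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


good = "T0120312"
bad = "T0130312"  # same read with a single miscalled color

color_diffs = mismatches(good, bad)
base_diffs = mismatches(decode_colorspace(good), decode_colorspace(bad))
print(color_diffs, base_diffs)
```

The two toy reads differ at one color, but their decoded sequences differ at every base from that position onward, which is exactly the problem described above.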

written 3.4 years ago by Antonio R. Franco

Thanks, Antonio. How do I go about processing the colorspace sequences if conversion is a bad idea? I appreciate the response, I'm new to this area.

written 3.4 years ago by curious

That is a serious problem with SOLiD data. For example, if you are going to map those sequences using TopHat, you need to use the "old" bowtie1 version and not the newer bowtie2, because bowtie2 does not support colorspace mapping.

The same happens with many other programs.

In addition, notice the many dots in your sequences. This is typical of SOLiD data, and it is hard to manage. You cannot use a trimmer program without erasing too much data.

I was working a year ago with SOLiD data, and I eventually quit working with them.
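
For the fixed-length trimming asked about in the question, one option that does not need a base-space trimmer is a few lines of Python. The sketch below is a minimal illustration, not a tested pipeline. It assumes 4-line FASTQ records like the ones in the question, with the sequence starting with a primer base followed by the color calls, and a quality string that either includes a placeholder for the primer (as in the question's records) or does not. The function names are made up for this example, and whatever you run downstream must still accept colorspace input.

```python
def trim_colorspace_record(record, n_colors=36):
    """Keep the primer base plus the first n_colors colors of one record.

    `record` is a (header, sequence, plus_line, quality) tuple. The header
    is returned unchanged, so any 'length=...' annotation in it will no
    longer match the trimmed read.
    """
    header, seq, plus, qual = record
    primer, colors = seq[0], seq[1:]
    if len(qual) == len(seq):
        # Quality string has a placeholder for the primer base.
        qual_out = qual[: n_colors + 1]
    elif len(qual) == len(colors):
        # Quality string covers the colors only.
        qual_out = qual[:n_colors]
    else:
        raise ValueError(f"{header}: sequence and quality lengths disagree")
    return header, primer + colors[:n_colors], plus, qual_out


def missing_fraction(seq):
    """Fraction of color calls (primer excluded) that are '.' (no call)."""
    colors = seq[1:]
    return colors.count(".") / len(colors) if colors else 0.0
```

Calling missing_fraction on each read before or after trimming gives a quick idea of how many positions are dots, which may help decide whether a read is worth keeping at all.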

written 3.4 years ago by Antonio R. Franco

OK, thanks for the insight, Antonio. That really helps.

written 3.4 years ago by curious