Question: Generating Read Length?
3.2 years ago by
Providence College, Providence, RI
nikelle.petrillo100 wrote:

Hello! I am new to RNA-seq. I quality-trimmed my FASTQ reads with the FASTX-Toolkit using a Phred score threshold of 30. I would like to figure out the read lengths after the quality trim (bases with a Phred score < 30 were removed). Any help would be much appreciated.

-Nikelle 

read length rna-seq phred quality • 1.4k views
ADD COMMENTlink modified 3.2 years ago by Chris Miller20k • written 3.2 years ago by nikelle.petrillo100

This kind of trimming is very severe and likely unnecessary. I assume you are aligning to a reference genome. If you lost a lot of bases after trimming, you are probably throwing away a lot of good data.

See this thread for a paper referenced in there about trimming using quality: Which Phred value to use in trimming

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax65k

You might want to consult www.rnaseq.wiki for a really nice description of the steps involved in processing RNA-seq data. IIRC, they cover read trimming in the appropriate section.

ADD REPLYlink written 3.2 years ago by Chris Miller20k
3.2 years ago by
State College, PA, USA
Biomonika (Noolean)3.0k wrote:

Run FastQC before and after trimming. 

ADD COMMENTlink written 3.2 years ago by Biomonika (Noolean)3.0k

Thank you very much! One other question. Before trimming, I had sequence length = 100. After trimming, I had a sequence length of 3-100. Can you make any sense of this? I am having a hard time figuring out which length it is measuring.

ADD REPLYlink written 3.2 years ago by nikelle.petrillo100

That means some reads were trimmed to a length of 3 (eliminating 97 bases), and the remaining reads have lengths ranging from 3 to 100. See my comment on your original question.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by genomax65k
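
To check that range directly, here is a minimal sketch that reports the shortest and longest read in a FASTQ file. It assumes the file has exactly 4 lines per record (header, sequence, "+", quality), which is not guaranteed for every FASTQ file. The helper name read_length_range and the sample reads are made up for illustration.

```shell
#!/bin/sh
# read_length_range is a hypothetical helper name, not part of any toolkit.
# Assumes 4 lines per FASTQ record; line 2 of each record is the sequence.
read_length_range() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (count == 0 || n < min) min = n   # track the shortest read
             if (n > max) max = n                 # track the longest read
             count++
         }
         END { if (count) printf "reads=%d min=%d max=%d\n", count, min, max }' "$1"
}

# Tiny made-up FASTQ file, for demonstration only.
sample_fastq=$(mktemp)
printf '%s\n' \
    '@r1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
    '@r2' 'ACG'        '+' 'III' \
    '@r3' 'ACGTAC'     '+' 'IIIIII' > "$sample_fastq"

read_length_range "$sample_fastq"
```

On a trimmed file like the one described above, min=3 and max=100 would correspond to the "3-100" that FastQC reports.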

Thanks! 

ADD REPLYlink written 3.2 years ago by nikelle.petrillo100
3.2 years ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

How about this:

sed -n '1~4p' filename.fastq | perl -ne 'chomp;print length($_) . "\n"' | sort -n | uniq -c >length.dist
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Chris Miller20k

Hi, Thanks Chris. If I were to input this, what would this be generating?

ADD REPLYlink written 3.2 years ago by nikelle.petrillo100

Let's break it down:

sed -n '1~4p' filename.fastq 

Gives you every 4th line of the file (the sequence line)

perl -ne 'chomp;print length($_) . "\n"'

outputs the length of that line

sort -n | uniq -c

condenses it into a table of counts like this:

  3  98
123  99
 22  100

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Chris Miller20k

Thanks very much, that was helpful. What are the two different columns in the table?

ADD REPLYlink written 3.2 years ago by nikelle.petrillo100
Count and read length (see 'man uniq')
ADD REPLYlink written 3.2 years ago by Chris Miller20k

Chris, the fourth line of FASTQ is the quality score, not sequence (but it should be trimmed to the same length as the sequence string, so results should be the same).

ADD REPLYlink written 3.2 years ago by harold.smith.tarheel4.3k
The command gives every 4th line, starting with the 1st. Come to think of it, that should be 2~4p, right, because of the header?
ADD REPLYlink written 3.2 years ago by Chris Miller20k
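
Putting the correction above together, here is a minimal sketch of the length distribution taken from the sequence line (line 2 of each record) rather than the header. It swaps in awk for the sed and perl steps, since the first~step address form in sed is a GNU extension. It assumes 4 lines per record; the helper name read_length_dist and the sample reads are made up for illustration.

```shell
#!/bin/sh
# read_length_dist is a hypothetical helper name, not part of any toolkit.
# Prints one row per distinct read length: count first, then length.
read_length_dist() {
    # NR % 4 == 2 selects line 2 of each 4-line record (the sequence).
    awk 'NR % 4 == 2 { print length($0) }' "$1" | sort -n | uniq -c
}

# Tiny made-up FASTQ file, for demonstration only.
demo_fastq=$(mktemp)
printf '%s\n' \
    '@read1' 'ACGTA' '+' 'IIIII' \
    '@read2' 'ACG'   '+' 'III' \
    '@read3' 'ACGTA' '+' 'IIIII' > "$demo_fastq"

# Two rows expected: one read of length 3, two reads of length 5.
read_length_dist "$demo_fastq"
```

For a real file, something like `read_length_dist filename.fastq > length.dist` would write the same kind of table shown earlier in the thread.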