Generating Read Length?
8.7 years ago

Hello! I am new to RNA-seq. I quality-trimmed my FASTQ sequences with the FASTX-Toolkit using a Phred score threshold of 30. I would like to figure out the read length after the quality trim (bases with a Phred score < 30 were removed). Any help would be much appreciated.

-Nikelle

RNA-Seq phred-quality read-length • 3.6k views

This kind of trimming is very severe and likely unnecessary. I assume you are aligning to a reference genome. If you lost a lot of bases after trimming, you are probably throwing away a lot of good data.

See this thread, which references a paper on quality-based trimming: Which Phred value to use in trimming


You might want to consult www.rnaseq.wiki for a really nice description of the steps involved in processing RNA-seq data. IIRC, they cover read trimming in the appropriate section.

8.7 years ago

Run FastQC before and after trimming.


Thank you very much! One other question. Before trim, I had sequence length = 100. After trimming, I had a sequence length of 3-100. Can you make any sense of this? I think I am having a hard time figuring out what length it is measuring.


That means some reads were trimmed down to a length of 3 (losing 97 bases), so the remaining reads range in length from 3 to 100. See my comment on your original question.
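
To check that range directly, a minimal sketch along these lines reports the number of reads and the shortest, longest and mean read length in a FASTQ file. It assumes the standard four-line-per-record layout (sequences not wrapped across lines); the function name `read_length_summary` and the file name in the usage line are placeholders, not anything from the FASTX-Toolkit.

```shell
# Summarise read lengths in a four-line-per-record FASTQ file.
# Assumption: sequences are not wrapped across multiple lines.
read_length_summary() {
    awk 'NR % 4 == 2 {                 # line 2 of each record is the sequence
             n = length($0)
             if (count == 0 || n < min) min = n
             if (n > max) max = n
             total += n
             count++
         }
         END {
             if (count > 0)
                 printf "reads=%d min=%d max=%d mean=%.1f\n", count, min, max, total / count
         }' "$1"
}
```

Usage would be something like `read_length_summary trimmed.fastq`, run once on the raw file and once on the trimmed file to compare.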


Thanks!

8.7 years ago

How about this:

sed -n '1~4p' filename.fastq | perl -ne 'chomp;print length($_) . "\n"' | sort -n | uniq -c >length.dist

Hi, Thanks Chris. If I were to input this, what would this be generating?


Let's break it down:

sed -n '1~4p' filename.fastq

Gives you every 4th line of the file (the sequence line)

perl -ne 'chomp;print length($_) . "\n"'

outputs the length of that line

sort -n | uniq -c

condenses it into a table of counts like this:

  3  98
123  99
 22  100

Thanks very much, that was helpful. What are the two different columns in the table?


Count and read length (see man uniq)


Chris, the fourth line of FASTQ is the quality score, not sequence (but it should be trimmed to the same length as the sequence string, so results should be the same).

The command gives every 4th line, starting with the 1st. Come to think of it, that should be 2~4p, right, because of the header?
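
For reference, here is a sketch of the same idea with the offset corrected so that it reads the sequence line (line 2 of each four-line record) rather than the header. It uses awk's `NR % 4 == 2` test instead of the GNU sed `first~step` address, so it should also work where GNU sed is not available. Note that it prints the read length first and the count second, the reverse of the `uniq -c` layout. The function name `read_length_dist` and the file names are placeholders, and it assumes unwrapped FASTQ records.

```shell
# Tabulate read lengths from a four-line-per-record FASTQ file.
# Output: one line per distinct length, "<length><TAB><count>", sorted by length.
read_length_dist() {
    awk 'NR % 4 == 2 { counts[length($0)]++ }      # sequence lines only
         END { for (len in counts) printf "%d\t%d\n", len, counts[len] }' "$1" |
        sort -n
}
```

Usage would be something like `read_length_dist filename.fastq > length.dist`.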