Question: filtering reads based on read counts
gravatar for sumithrasank75
4.8 years ago by
United States
sumithrasank75130 wrote:

I have parsed out a fastq file and have a text file like

read_id      read-count

aaaa              10

bbbb               1000

ccccc             10000

dddd                1 

and so on. It is clear that some read_ids which have very large or very small read_counts are outliers. Can you suggest a scheme I could use in R/python/pandas with which I can systematically filter for reads which lie within a range of values (based on mean or median of the read_counts)? 



next-gen R • 1.0k views
ADD COMMENTlink modified 4.4 years ago by Biostar ♦♦ 20 • written 4.8 years ago by sumithrasank75130

Is the read ID the ID from the sequencer or is this some sort of ID given to a specific sequence? In other words, how were these numbers generated and what's the biological context? We'd need answers to those questions to give you a reliable answer.

ADD REPLYlink written 4.8 years ago by Devon Ryan94k

The reads are aligned with bowtie to a library of oligos which have IDs given in the read_id column. The read_count values are an  example of the large range of values. Example : sequence dddd is covered on 1 time but sequence ccccc is covered 10000 times .

ADD REPLYlink written 4.8 years ago by sumithrasank75130

Right, but what then do the oligos represent? These could be assembled contigs, in which case the raw count is less interesting that the median depth. Alternatively, these could be small RNAs, in which case the high counts are biologically meaningful. One could conceive of additional possibilities. You still need to give us more details, since the answer depends on the biological context of the data.

ADD REPLYlink written 4.8 years ago by Devon Ryan94k
gravatar for arnstrm
4.8 years ago by
Ames, IA
arnstrm1.7k wrote:

You need to be doing digital normalization of the reads. I've used Khmer before and it does this job very nicely.

See this article about Digital Noramlization:

EDIT: from you latest comment, it looks like you want to normalize RNAseq kind of data (counts table). There are plenty of RNA seq R packages that you can use it for this purpose. The method that I suggested work directly on the raw fastq file to remove overly redundant reads from the input file.


ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by arnstrm1.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1634 users visited in the last hour