Question

Fastq Files - Orientation And Duplicates

11.7 years ago

nupurgupta0806 ▴ 30

I have a bunch of FASTQ files which look like this:

@QDBBN:28:14
AAGGATTCATACATGGGTTCTGGTTCTGGTTCTAT...
+
?8?*?ABC/*?EDEDDD9D>CCD=C=...
@QDBBN:31:96
AAGGCATGACATCATGG...
+
D=?*&^#*ABABE7C...

They are from IonTorrent machines. Is there any way I can determine whether they are forward or reverse reads? Also, would it be useful for me to remove duplicates at this point? (It is usually done post-alignment, but I thought it might be useful.)

fastq duplicates • 4.5k views

updated 11.7 years ago by Jorge Amigo 14k • written 11.7 years ago by nupurgupta0806 ▴ 30

score 2 · Answer 1 · 2012-08-31

Removing duplicates is not always recommended. You have not said much about your goal, so you will basically have to decide for yourself whether this is the right way to do it.

About the direction: AFAIK you will need to map the reads before you can say which direction they are in, unless you did directional sequencing. Even then it is difficult, as directional sequencing does not always turn out the way one expected.

You should say a bit more about both your input data and what you are trying to do in the experiment.

Assa

score 1 · Answer 2 · 2012-09-01

The only tool I know that removes duplicates from FASTQ files directly is fastx_collapser, from the FASTX-Toolkit. But the best algorithm I know for duplicate detection is the one in Picard tools, which deals only with BAM files, so in case you're looking for a proper software suggestion I would definitely go for Picard.

I understand that one would be tempted to reduce the resources needed for mapping by removing duplicates beforehand, hence reducing the number of reads to deal with. But mapping takes into account a lot more than duplicate detection does: some reads that look like duplicates could be mapped to different locations, and a duplicate removal algorithm run afterwards has an easier time distinguishing which reads are in fact duplicates and which ones aren't. The best suggestion I can give you is what you already knew: it is much better to map first and remove duplicates afterwards.

PS: the strand direction of the reads can only be determined by the mapping process.
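To illustrate what "determined by the mapping process" means in practice: once the reads are aligned, each SAM record carries a FLAG field, and per the SAM specification bit 0x10 marks a read aligned to the reverse strand while bit 0x4 marks an unmapped read. The sketch below reads the strand from SAM text lines; the function names are made up for this example and it assumes plain tab-separated SAM text rather than binary BAM.

```python
# Sketch: read the strand of each aligned read from the SAM FLAG field.
# Per the SAM specification: 0x4 = read unmapped, 0x10 = read on reverse strand.
# Function names are hypothetical, chosen for this illustration only.

def strand_from_flag(flag):
    """Return '+', '-' or '.' (unmapped) for an integer SAM FLAG value."""
    if flag & 0x4:
        return "."
    return "-" if flag & 0x10 else "+"


def strands_from_sam_lines(lines):
    """Yield (read_name, strand) for each alignment line, skipping '@' headers."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the read name (QNAME), column 2 is the FLAG.
        yield fields[0], strand_from_flag(int(fields[1]))
```

Something like `strands_from_sam_lines(open("aligned.sam"))` would then give one strand call per alignment, where `aligned.sam` is a placeholder file name.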

score 0 · Answer 3 · 2012-08-31

I have some biologics which are useful to me. I have the associated FASTQ files. I need to know their sequence, so that I can manufacture them. First, I will look for the existence of certain barcodes in the FASTQ data. And then look for attached 'tails' to these. These are sequences of interest to us. They won't be mapped - these are not organic genetic data. I want to identify the 'tails' that occur most frequently. So I might have to check from both ends anyway to get correct results.
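A rough sketch of that barcode-then-tail search, checking each read both as given and reverse-complemented, could look like the following. Everything here is an assumption for illustration: the function names, a fixed tail length, exact barcode matching (no tolerance for sequencing errors), and plain 4-line FASTQ records.

```python
from collections import Counter

# Translation table used to complement DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq_sequences(lines):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def tail_after_barcode(seq, barcode, tail_len):
    """Return the tail_len bases right after the barcode, or None.

    The read is tried as given first, then reverse-complemented, so the
    barcode is found regardless of the orientation of the read.
    """
    for candidate in (seq, reverse_complement(seq)):
        pos = candidate.find(barcode)
        if pos != -1:
            start = pos + len(barcode)
            tail = candidate[start:start + tail_len]
            if len(tail) == tail_len:
                return tail
    return None


def count_tails(sequences, barcode, tail_len):
    """Count how often each tail follows the barcode across all reads."""
    counts = Counter()
    for seq in sequences:
        tail = tail_after_barcode(seq, barcode, tail_len)
        if tail is not None:
            counts[tail] += 1
    return counts
```

Calling `count_tails(read_fastq_sequences(open("reads.fastq")), barcode, tail_len).most_common(10)` would list the ten most frequent tails, with `reads.fastq`, `barcode` and `tail_len` standing in for the real inputs.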

So in a case like this, perhaps I should remove exact duplicates of the whole read: these might be an artifact of PCR amplification and would skew the results if not removed.
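Removing exact whole-read duplicates directly from FASTQ could be sketched roughly as below. This is only an illustration under assumptions: 4-line FASTQ records, duplicates defined as byte-identical sequences, the first occurrence kept, and hypothetical function names. It keeps the copy number of each unique read, so results can be compared with and without collapsing.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from 4-line FASTQ records."""
    block = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            # block[0] is '@name', block[1] the bases, block[3] the qualities.
            yield block[0][1:], block[1], block[3]
            block = []


def collapse_exact_duplicates(records):
    """Keep the first read for each distinct sequence and count its copies.

    Returns a list of (name, sequence, quality, copies) in order of first
    appearance (dicts preserve insertion order in Python 3.7+).
    """
    first_seen = {}
    copies = {}
    for name, seq, qual in records:
        if seq not in first_seen:
            first_seen[seq] = (name, qual)
            copies[seq] = 1
        else:
            copies[seq] += 1
    return [
        (name, seq, qual, copies[seq])
        for seq, (name, qual) in first_seen.items()
    ]
```

Only reads identical over their full length are merged here; reads that differ by a single base call are treated as distinct.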

What do you think?