Fastq Files - Orientation And Duplicates
11.7 years ago

I have a bunch of FASTQ files that look like this:

@QDBBN:28:14
AAGGATTCATACATGGGTTCTGGTTCTGGTTCTAT...
+
?8?*?ABC/*?EDEDDD9D>CCD=C=...
@QDBBN:31:96
AAGGCATGACATCATGG...
+
D=?*&^#*ABABE7C...

They are from IonTorrent machines. Is there any way I can determine whether the reads are forward or reverse? Also, would it be useful to remove duplicates at this point? (That is usually done after alignment, but I thought it might be useful here.)

fastq duplicates • 4.5k views
11.7 years ago
Assa Yeroslaviz ★ 1.8k

Removing duplicates is not always recommended. You have not said much about your goal, so you will basically have to decide for yourself whether it is the right thing to do.

About the direction: AFAIK you will need to map the reads before you can say which direction they are in, unless you did directional sequencing. Even then it is difficult, as the result of directional sequencing is not always what one expected.

You should say a bit more about both your input data and what you are trying to do in the experiment.

Assa

11.7 years ago

The only tool I know of that removes duplicates from FASTQ files directly is fastx_collapser, from the FASTX-Toolkit. But the best algorithm I know of for duplicate detection is the one in Picard tools, which deals only with BAM files, so if you are looking for a proper software suggestion I would definitely go for Picard.

I understand the temptation to reduce the resources needed for mapping by removing duplicates first, so that there are fewer reads to deal with. But mapping takes far more into account than a duplicate search does: during mapping, some reads that look like duplicates may be placed at different locations, so a duplicate-removal step run afterwards has an easier time telling which reads really are duplicates and which are not. The best suggestion I can give is what you already knew: it is much better to map first and remove duplicates afterwards.

PS: the strand direction of the reads can only be determined by the mapping process.

11.7 years ago

I have some biologics which are useful to me. I have the associated FASTQ files. I need to know their sequence, so that I can manufacture them. First, I will look for the existence of certain barcodes in the FASTQ data. And then look for attached 'tails' to these. These are sequences of interest to us. They won't be mapped - these are not organic genetic data. I want to identify the 'tails' that occur most frequently. So I might have to check from both ends anyway to get correct results.
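
The barcode-then-tail search described above could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions that are not in the post: the barcode `AAGG`, the fixed tail length of 4, the toy reads, and the function names `read_fastq`, `reverse_complement` and `count_tails` are all hypothetical. Barcodes are matched exactly here; real reads may need matching that tolerates sequencing errors. Each read is checked as given and as its reverse complement, since the orientation is not known in advance.

```python
from collections import Counter
from io import StringIO

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header.strip()[1:], seq, qual


def count_tails(sequences, barcode, tail_length):
    """Count the tail_length bases that follow `barcode`, in either orientation."""
    counts = Counter()
    for seq in sequences:
        for oriented in (seq, reverse_complement(seq)):
            pos = oriented.find(barcode)
            if pos == -1:
                continue
            start = pos + len(barcode)
            tail = oriented[start:start + tail_length]
            if len(tail) == tail_length:
                counts[tail] += 1
            break  # count each read at most once
    return counts


# Toy data: the third read carries the barcode on the opposite strand.
toy_fastq = StringIO(
    "@r1\nAAGGATTCAT\n+\nIIIIIIIIII\n"
    "@r2\nAAGGATTCGG\n+\nIIIIIIIIII\n"
    "@r3\nGTCATGCCTT\n+\nIIIIIIIIII\n"
)
sequences = [seq for _, seq, _ in read_fastq(toy_fastq)]
counts = count_tails(sequences, barcode="AAGG", tail_length=4)
print(counts.most_common())  # [('ATTC', 2), ('CATG', 1)]
```

Taking the first barcode hit in the forward orientation before trying the reverse complement is an arbitrary choice for this sketch; a barcode that can occur by chance inside a tail would need a stricter rule.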

So in a case like this, perhaps I should remove exact duplicates of the whole read: these might be an artifact of PCR amplification, and would skew the results if not removed.
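
As an illustration of what removing exact whole-read duplicates would mean here, the sketch below collapses identical sequences while keeping a copy count for each one. It is not the FASTX-Toolkit or Picard implementation; the function name `collapse_exact_duplicates` and the toy reads are hypothetical. Keeping the counts makes it possible to compare tail frequencies with and without collapsing, which may help in judging whether the duplicates are skewing the result.

```python
from collections import Counter


def collapse_exact_duplicates(sequences):
    """Collapse identical reads.

    Returns (unique, copies): `unique` lists each distinct sequence once,
    in first-seen order; `copies` maps each sequence to how often it occurred.
    """
    copies = Counter(sequences)
    # Counter is a dict subclass, so keys keep insertion order (Python 3.7+).
    return list(copies), copies


reads = ["AAGGATTC", "AAGGATTC", "AAGGCATG", "AAGGATTC"]
unique, copies = collapse_exact_duplicates(reads)
print(unique)              # ['AAGGATTC', 'AAGGCATG']
print(copies["AAGGATTC"])  # 3
```

Note that an exact-match comparison treats two reads that differ by a single sequencing error as distinct, and it cannot tell a PCR duplicate from a second genuine copy of the same molecule.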

What do you think?
