Question: Filter fastq/sam/bam for reads
5.2 years ago by
hlsz.laszlo 20 wrote:

Dear All,

I'm analyzing ChIP-seq data, and I'm having some trouble filtering out the "good" reads. Briefly, I have a fastq file, from which I selected the reads that have the 5' barcode sequence with no mismatches. Because the barcode sequence is not unique enough, some reads align well even with the barcode still attached.

I'm trying to extract the reads that carry an artificial barcode. So I aligned the barcoded reads and the barcode-trimmed reads separately to the hg19 genome, requiring exact matches. Then, to get the reads whose 5' barcode is not endogenous, I need to remove the reads that align exactly with the barcode from the set of reads that align exactly after trimming.

Is there an easy way to do this? I'm a bit confused.




filtering chip-seq reads • 2.1k views

I think you're not the only one confused... Can you make your question clearer? (an example maybe?)

written 5.2 years ago by Asaf 6.1k

So, the goal is to retain the reads in a fastq file that have a non-endogenous eight base pairs at the 5' end. The first step is to create a fastq file that contains only reads with the 5' barcode. The next step is to align these reads both with and without the 5' barcode sequence (BC trimmed), allowing only perfect matches. If you take the reads that align when trimmed and drop the IDs of the reads that also align with the BC, you get rid of the endogenous "barcode" sequences.

My problem is how to remove those reads... I managed to gather all read IDs that I want to keep. 

modified 5.2 years ago • written 5.2 years ago by hlsz.laszlo 20

I don't understand why some reads would have the BC and some wouldn't. Shouldn't they all contain the barcode?

If you have a list of IDs that you want to extract from a SAM file, you can do it with a simple script, or probably with Galaxy.
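
A "simple script" of this kind might look like the sketch below. It assumes an uncompressed, text-format SAM input and a set of read IDs that match the QNAME column exactly; the function name `filter_sam_by_ids` is made up for illustration.

```python
def filter_sam_by_ids(sam_lines, keep_ids):
    """Yield SAM header lines plus alignments whose read name is in keep_ids."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        # QNAME (the read ID) is the first tab-separated column.
        qname = line.split("\t", 1)[0]
        if qname in keep_ids:
            yield line
```

With real files this would be driven by something like `out.writelines(filter_sam_by_ids(open("in.sam"), ids))`, where the file name and `ids` are placeholders.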

written 5.2 years ago by Asaf 6.1k

5.2 years ago by
Istvan Albert ♦♦ 81k, University Park, USA, wrote:

First I'll say that this really does not sound quite right. 

It is very unlikely that you could fully align reads that contain a barcode. Even though, say, a six-base k-mer is not that unique on its own, when joined to the sequence from an existing location in the genome it forms a highly unique construct that is very unlikely to match exactly. If you are aligning it partially (locally), that is a different issue altogether, but those alignments will be more difficult to interpret correctly.

(IMHO if you can fully align your reads, it means that you don't actually have a barcode there.)

In general when splitting by barcode you need to identify the barcodes and split by those and not by aligning with or without barcodes.
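
Selecting reads by the barcode itself, rather than by alignment, can be sketched as follows. This assumes plain four-line FASTQ records and an exact, mismatch-free match at the 5' end; the function names and the barcode passed in are illustrative, not taken from the thread.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) from four-line FASTQ records."""
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            return
        yield tuple(field.rstrip("\n") for field in record)


def select_barcoded(records, barcode):
    """Keep reads that start with the barcode and trim it from seq and qual."""
    n = len(barcode)
    for header, seq, plus, qual in records:
        if seq.startswith(barcode):
            yield header, seq[n:], plus, qual[n:]
```

Reads that do not start with the barcode are simply dropped here; a real demultiplexer would usually route them to a separate file.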

As for the answer to your question, search for "extract fastq" on this site; you'll get hits like this:

How To Extracting Fastq Sequence For Given Fastq Ids And Fastq File
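
For reference, the kind of extraction those posts describe can be sketched in a few lines. This is not the code from the linked post; it assumes four-line FASTQ records and that the read ID is the first whitespace-delimited token of the header line, without the leading `@`.

```python
from itertools import islice


def extract_fastq_by_ids(lines, keep_ids):
    """Yield the lines of every four-line FASTQ record whose ID is in keep_ids."""
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            break
        # Header looks like "@read_id optional description".
        fields = record[0][1:].split()
        if fields and fields[0] in keep_ids:
            yield from record
```

If the IDs collected from the alignments carry a suffix such as `/1`, they would need to be normalised to the same form before the lookup.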



Sorry if I wasn't fully clear. The ligation of the barcode (which is not the Illumina adapter or index) was very inefficient. Of the reads in the raw fastq file (~20 million, 100 bp), only a minority (~3 million) contain the barcode. Moreover, this barcode does not seem to be unique: ~1 million reads aligned perfectly to hg19 with the barcode still attached, and this is the group I want to remove from my fastq. I know that the number of remaining reads is low, but it is worth trying.

I collected the IDs of the reads that matched perfectly with the barcode attached and the IDs of the reads that aligned perfectly after I trimmed the barcode. Then I used Microsoft Access (not sure if it is the best tool) to print the trimmed-read IDs that have no match in the BC ID group (to get the reads containing an artificial "barcode").
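
The same comparison of two ID lists can be done without a database, as a set difference. The sketch below assumes each list is a plain text file with one read ID per line; the function names are illustrative.

```python
def read_ids(path):
    """Load one read ID per line from a text file, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def ids_to_keep(trimmed_ids, barcoded_ids):
    """IDs that aligned after trimming but did not align with the barcode on."""
    return set(trimmed_ids) - set(barcoded_ids)
```

The result is an unordered set, which is enough for a membership test when filtering the fastq afterwards.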

I'll try what you suggested.



modified 3 months ago by RamRS 24k • written 5.2 years ago by hlsz.laszlo 20

Well, like I said, it does not matter whether the barcode itself matches the genome perfectly.

The issue here is why a barcode+read would also match the genome exactly. There is no simple explanation I can come up with for how 5% of your reads could come from a genomic location that, after being extended artificially with a barcode, would still match perfectly.

I suspect that you think the matches are perfect when in fact they are not; the barcode portion could be all mismatches or soft clipped.
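
One way to check this suspicion is to inspect the CIGAR string and the edit distance of each alignment. The sketch below treats an alignment as perfect only if it is mapped, the CIGAR is a single match operation covering the whole read (so no soft clipping or indels), and the `NM` tag is 0. It assumes the aligner writes the `NM:i:` tag; when the tag is absent the read is conservatively reported as not perfect.

```python
import re


def is_perfect_match(sam_line):
    """True if a SAM alignment is end-to-end with zero edit distance."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, seq = int(fields[1]), fields[5], fields[9]
    if flag & 4:
        return False  # read is unmapped
    # A single M (or =) operation: no clipping, insertions or deletions.
    if not re.fullmatch(r"\d+[M=]", cigar):
        return False
    if int(cigar[:-1]) != len(seq):
        return False
    # NM is the edit distance to the reference; 0 means no mismatches.
    return "NM:i:0" in fields[11:]
```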

ADD REPLYlink written 5.2 years ago by Istvan Albert ♦♦ 81k