Question

Remove reads from fastq file

0

7.2 years ago

varsha619 ▴ 90

Hi, could someone please help me remove reads that come from a specific genomic location from a FASTQ file? So far I have only found methods for removing reads from a specific chromosome in an aligned SAM file using samtools, or from a FASTQ file using sequence IDs. I would like to remove PCR contaminants from my FASTQ files by specifying genome coordinates. I appreciate your help!

sequencing • 3.2k views

ADD COMMENT • link updated 7.2 years ago by GenoMax 141k • written 7.2 years ago by varsha619 ▴ 90

0

See cutadapt, Trimmomatic, or FASTX-Toolkit for removing adapters/primers.

ADD REPLY • link 7.2 years ago by st.ph.n ★ 2.7k

score 1 · Answer 1 · 2017-02-03

1

7.2 years ago

harold.smith.tarheel ★ 4.9k

FASTQ files do not contain coordinates, so it is not possible to remove data based on that parameter. You would need to align and then filter, or filter by the sequence with one of the adapter-trimming tools (e.g., BBDuk or Trimmomatic).
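As a rough illustration of the align-then-filter route: once the reads are aligned, the names of the reads that fall in the unwanted region can be collected from the alignment and used to drop those records from the original FASTQ. The sketch below assumes an uncompressed FASTQ with exactly four lines per record and a set of read names to drop that was obtained elsewhere; the function names `read_name` and `remove_reads_by_name` are hypothetical and not part of any tool mentioned here.

```python
import io


def read_name(header_line):
    """Return the read name from a FASTQ header line: the text after '@' up to
    the first whitespace, with a trailing /1 or /2 mate suffix removed."""
    name = header_line[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def remove_reads_by_name(fastq_in, fastq_out, names_to_drop):
    """Copy records from fastq_in to fastq_out, skipping reads in names_to_drop.

    Assumes strict 4-line records (no wrapped sequence lines).
    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    while True:
        header = fastq_in.readline()
        if not header:          # end of file
            break
        if not header.strip():  # tolerate stray blank lines between records
            continue
        record = [header] + [fastq_in.readline() for _ in range(3)]
        if read_name(header) in names_to_drop:
            dropped += 1
        else:
            fastq_out.writelines(record)
            kept += 1
    return kept, dropped


if __name__ == "__main__":
    # Tiny in-memory example; with real data, pass open file handles instead.
    src = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
    dst = io.StringIO()
    print(remove_reads_by_name(src, dst, {"r2"}))
```

For paired-end data the same set of names would need to be applied to both mate files so the pairs stay in sync.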

ADD COMMENT • link 7.2 years ago by harold.smith.tarheel ★ 4.9k

0

@harold.smith.tarheel, that makes sense. For example, can something like "samtools view -b input.bam chr1:1-100 > output.bam" be used to remove those sequences from the original file, instead of extracting the region to a new file?

ADD REPLY • link 7.2 years ago by varsha619 ▴ 90

1

From the manual:

-U FILE Write alignments that are not selected by the various filter options to FILE. When this option is used, all alignments (or all alignments intersecting the regions specified) are written to either the output file or this file, but never both.

It looks like you're using the syntax from an older version of SAMtools; I recommend updating to the current version.
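To make the behaviour described in the manual excerpt concrete, here is a small sketch of the same partitioning idea: every alignment ends up in exactly one of two groups, depending on whether it overlaps any requested region. This only mimics the described behaviour and is not how samtools is implemented. It assumes BED-style 0-based, half-open intervals; the tuple layout and the function names `overlaps` and `partition_by_regions` are hypothetical.

```python
def overlaps(aln, region):
    """True if an alignment (name, chrom, start, end) overlaps a region
    (chrom, start, end); all intervals are 0-based and half-open."""
    _, chrom, start, end = aln
    r_chrom, r_start, r_end = region
    return chrom == r_chrom and start < r_end and r_start < end


def partition_by_regions(alignments, regions):
    """Split alignments into (selected, unselected).

    'selected' plays the role of the -o file and 'unselected' the role of
    the -U file: each alignment goes to one list, never both.
    """
    selected, unselected = [], []
    for aln in alignments:
        if any(overlaps(aln, region) for region in regions):
            selected.append(aln)
        else:
            unselected.append(aln)
    return selected, unselected


if __name__ == "__main__":
    alns = [("a", "chr1", 0, 50), ("b", "chr1", 100, 150), ("c", "chr2", 10, 60)]
    inside, outside = partition_by_regions(alns, [("chr1", 0, 100)])
    print([a[0] for a in inside], [a[0] for a in outside])
```

Note that a region string such as chr1:1-100 on the samtools command line is 1-based and inclusive, whereas a BED file is 0-based and half-open, so the same stretch of sequence is written differently in the two forms.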

ADD REPLY • link 7.2 years ago by harold.smith.tarheel ★ 4.9k

0

@harold.smith.tarheel, just to clarify, I used: samtools view in.sorted.bam -b -h -o inRegions.bam -U outRegions.bam -L Regions.bed. So here the -o file contains the reads in the "chr:start-stop" regions, while the -U file excludes those regions and retains the rest? Thank you for your help!

ADD REPLY • link 7.2 years ago by varsha619 ▴ 90

score 1 · Answer 2 · 2017-02-03

Instead of depending on genome coordinates, you may want to use clumpify.sh from the BBMap suite to identify duplicates (optical, PCR, and other kinds) independently of alignment. Then, depending on the severity of the issue, decide what to do with them (just mark or remove). See this post for additional details on how to use the tool: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
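To give a feel for what alignment-independent duplicate detection means, here is a much-simplified sketch that flags a read as a duplicate when an identical sequence has already been seen. This is not Clumpify's algorithm; it is a swapped-in, exact-match stand-in for illustration only, and it ignores things the real tool can take into account, such as mismatches between copies, read pairing, and flow-cell position for optical duplicates. The function name `mark_exact_duplicates` and the record layout are hypothetical.

```python
def mark_exact_duplicates(records):
    """Given (name, sequence) pairs, return (name, is_duplicate) pairs.

    The first read with a given sequence is kept as the original; any later
    read with the same sequence (case-insensitive) is flagged as a duplicate.
    """
    seen = set()
    flagged = []
    for name, seq in records:
        key = seq.upper()
        flagged.append((name, key in seen))
        seen.add(key)
    return flagged


if __name__ == "__main__":
    reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTTT")]
    for name, is_dup in mark_exact_duplicates(reads):
        print(name, "duplicate" if is_dup else "unique")
```

Whether to only mark or actually remove the flagged reads is the same judgment call described above and depends on how severe the duplication is.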