Question: Remove reads from fastq file
3.6 years ago, varsha619 wrote:

Hi, could someone please help me remove reads that come from a specific genomic location from a FASTQ file? So far I have only found methods for removing reads from a specific chromosome in an aligned SAM file using samtools, or for removing reads from a FASTQ file by sequence ID. I would like to remove PCR contaminants from my FASTQ files by giving specific genome coordinates. I appreciate your help!

sequencing • 1.6k views

See Cutadapt, Trimmomatic, or the FASTX-Toolkit for removing adapter and primer sequences.

Reply written 3.6 years ago by st.ph.n
3.6 years ago, harold.smith.tarheel (United States) wrote:

FASTQ files do not contain coordinates, so it is not possible to remove data based on that parameter. You would need to align and then filter, or filter by the sequence with one of the adapter-trimming tools (e.g., BBDuk or Trimmomatic).
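
A minimal sketch of the last step of the align-then-filter route, in Python. It assumes the reads have already been aligned and that the names of the reads overlapping the unwanted region have been collected into a set (for example from the first column of the SAM records for that region). The function name drop_reads_by_name is hypothetical, and the sketch assumes standard four-line FASTQ records.

```python
import io

def drop_reads_by_name(fastq_in, fastq_out, names_to_drop):
    """Copy four-line FASTQ records from fastq_in to fastq_out, skipping
    records whose read name is in names_to_drop. Returns (kept, dropped)."""
    kept = dropped = 0
    while True:
        header = fastq_in.readline()
        if not header:          # end of file
            break
        if not header.strip():  # tolerate stray blank lines between records
            continue
        seq = fastq_in.readline()
        plus = fastq_in.readline()
        qual = fastq_in.readline()
        # Read name: header without '@', up to the first whitespace,
        # with an old-style /1 or /2 mate suffix removed.
        name = header[1:].split()[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        if name in names_to_drop:
            dropped += 1
        else:
            fastq_out.write(header + seq + plus + qual)
            kept += 1
    return kept, dropped

# Tiny in-memory demonstration with made-up reads.
demo = "@r1\nACGT\n+\nIIII\n@r2 extra\nTTTT\n+\nIIII\n@r3/1\nGGGG\n+\nIIII\n"
out = io.StringIO()
counts = drop_reads_by_name(io.StringIO(demo), out, {"r2", "r3"})
print(counts)
print(out.getvalue(), end="")
```

For paired-end data the same set of names would be applied to both mate files so that the pairs stay in sync.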


@harold.smith.tarheel, that makes sense. For example, can something like "samtools view -b input.bam chr1:1-100 > output.bam" be used to remove those sequences from the original file, rather than extracting the region to a new file?

Reply written 3.6 years ago by varsha619

From the manual:

-U FILE Write alignments that are not selected by the various filter options to FILE. When this option is used, all alignments (or all alignments intersecting the regions specified) are written to either the output file or this file, but never both.

It looks like you're using the syntax from an older version of SAMtools; I recommend updating to the current version.
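
To make the "either the output file or this file, but never both" wording concrete, here is a toy Python model of that split. It is only an illustration of the partitioning idea, not samtools' implementation: the function names, the tuple layout, and the example reads are all made up, and intervals are assumed to be 0-based and half-open, as in BED.

```python
def overlaps(aln_start, aln_end, reg_start, reg_end):
    """Two half-open intervals overlap if each starts before the other ends."""
    return aln_start < reg_end and reg_start < aln_end

def partition_by_regions(alignments, regions):
    """alignments: list of (name, chrom, start, end).
    regions: list of (chrom, start, end).
    Returns (selected, unselected); each alignment lands in exactly one list,
    mirroring the output file and the -U file respectively."""
    selected, unselected = [], []
    for name, chrom, start, end in alignments:
        hit = any(
            chrom == r_chrom and overlaps(start, end, r_start, r_end)
            for r_chrom, r_start, r_end in regions
        )
        (selected if hit else unselected).append(name)
    return selected, unselected

# Made-up alignments and a single region covering chr1:0-100.
alignments = [
    ("readA", "chr1", 10, 60),
    ("readB", "chr1", 150, 200),
    ("readC", "chr2", 10, 60),
]
regions = [("chr1", 0, 100)]
in_regions, out_regions = partition_by_regions(alignments, regions)
print(in_regions, out_regions)
```

In this model, the list of unselected reads is what you would keep if the goal is to discard everything overlapping the listed regions.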

Reply written 3.6 years ago by harold.smith.tarheel

@harold.smith.tarheel, just to clarify, I used "samtools view in.sorted.bam -b -h -o inRegions.bam -U outRegions.bam -L Regions.bed". So here the -o file contains the reads in the "chr:start-stop" regions, while the -U file excludes those reads and retains the rest? Thank you for your help!

Reply written 3.6 years ago by varsha619

3.6 years ago, genomax (United States) wrote:

Instead of depending on genome coordinates, you may want to use clumpify.sh from the BBMap suite to identify duplicates (optical, PCR, and other kinds) independently of any alignment. Then, depending on the severity of the issue, decide what to do with them (just mark them or remove them). See this post for additional details on how to use the tool: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files
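
As a rough illustration of what "independent of alignments" can mean, the Python sketch below groups reads that share exactly the same sequence. This is not Clumpify's algorithm, and it makes no attempt to tell optical duplicates from PCR duplicates; the function name and the example reads are hypothetical.

```python
from collections import defaultdict

def group_exact_duplicates(records):
    """records: iterable of (name, sequence).
    Returns a dict mapping each sequence seen more than once
    to the list of read names that carry it."""
    by_seq = defaultdict(list)
    for name, seq in records:
        # Compare case-insensitively so soft-masked bases still match.
        by_seq[seq.upper()].append(name)
    return {seq: names for seq, names in by_seq.items() if len(names) > 1}

# Made-up reads: r1, r3 and r4 share one sequence.
reads = [
    ("r1", "ACGTACGT"),
    ("r2", "TTTTCCCC"),
    ("r3", "acgtacgt"),
    ("r4", "ACGTACGT"),
]
dups = group_exact_duplicates(reads)
print(dups)
```

Whether to only flag such groups or to keep a single representative per group is the same mark-or-remove decision described above.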
