Question: isolate reads from a paired-ended fastq files that fall within a certain bp range
gravatar for nibhelim
12 months ago by
United States
nibhelim0 wrote:

hi folks, I have pair ended fastq files with 50bp reads (01_R1 and 01_R2). I need to make a new pair of fastq files that contains just the reads that fulfill this requirement: the reads of R2 should be within a certain range (let's say no more than 100bp) far from the respective read in R1. Also no restrictions should be applied on the orientation or overlap (I want all the reads no matter the orientation and no matter if they also overlap) ideally I'd like to use Bowtie2 for this any suggestion on how to approach this (an help with a command line will be really appreciated since I'm a newbie)?

bowtie2 chip-seq alignment 3c • 397 views
ADD COMMENTlink modified 12 months ago by ATpoint36k • written 12 months ago by nibhelim0
gravatar for swbarnes2
12 months ago by
United States
swbarnes27.9k wrote:

You can use samtools or bedtools to get the read2 read names that you want, then picard tools to filter the original bam by those read names.

ADD COMMENTlink written 12 months ago by swbarnes27.9k

hi swbarnes2, can you be a bit more specific on what are the 'read names' that I should isolate? sorry for the basic questions I'm a newbie in seq

ADD REPLYlink written 12 months ago by nibhelim0
gravatar for ATpoint
12 months ago by
ATpoint36k wrote:

You could do it like this. Might take a little while as it involves a sorting step:

## Get read names where the mates have insert size less than 100 (can be customized changing `LEN`).
## Essentially it filters reads with an insert size less than LEN and then extracts its read names
samtools sort -n -O SAM your.bam \
  | tee >(samtools view -o your_sorted.bam) \
  | awk -v LEN=LEN '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0) print $1 | "uniq"} > read_names.txt

## Get those reads that match the read names and write as fasta:
samtools view your_sorted.bam \
  | cat <(samtools view -H Hx2_rep2_dedup.bam) <(grep -wFf read_names.txt /dev/stdin) \
  | samtools fastq -1 reads_1.fastq.gz -2 reads_2.fastq.gz -

The initial sorting step is necessary to 1) get the unique read names and more importantly 2) bam2fastq conversion should be done on name-sorted files as aligners expect random order of fastq which we make sure using name sort in the beginning. Go through the code step-by-step and try to get the concepts, ask if you need help :)

genomax mentioned from BBMap for step two. Sounds like the tool should be able to do that. I gave it a quick test and after few seconds the tool threw an error and stopped writing the fastq files which is why I will not include it here. Would require investigating that error but will go to bed now :)

ADD COMMENTlink modified 12 months ago • written 12 months ago by ATpoint36k

I suggest using with readnames.txt for step 2 from BBMap suite.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax85k

Thanks for the help! is this approach taking all the reads despite their direction/distance? I would like to preserve all the paired reads that are concordant or not (-> <-, -> ->, <- <-) and also partially/totally overlapping.

ADD REPLYlink written 11 months ago by nibhelim0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1070 users visited in the last hour