Question

isolate reads from a paired-ended fastq files that fall within a certain bp range

0

Entering edit mode

4.8 years ago

nibhelim • 0

hi folks, I have pair ended fastq files with 50bp reads (01_R1 and 01_R2). I need to make a new pair of fastq files that contains just the reads that fulfill this requirement: the reads of R2 should be within a certain range (let's say no more than 100bp) far from the respective read in R1. Also no restrictions should be applied on the orientation or overlap (I want all the reads no matter the orientation and no matter if they also overlap) ideally I'd like to use Bowtie2 for this any suggestion on how to approach this (an help with a command line will be really appreciated since I'm a newbie)?

ChIP-Seq 3c alignment bowtie2 • 1.1k views

ADD COMMENT • link updated 4.8 years ago by ATpoint 82k • written 4.8 years ago by nibhelim • 0

score 0 · Answer 1 · 2019-06-27

0

Entering edit mode

4.8 years ago

swbarnes2 14k

You can use samtools or bedtools to get the read2 read names that you want, then picard tools to filter the original bam by those read names.

ADD COMMENT • link 4.8 years ago by swbarnes2 14k

0

Entering edit mode

hi swbarnes2, can you be a bit more specific on what are the 'read names' that I should isolate? sorry for the basic questions I'm a newbie in seq

ADD REPLY • link 4.8 years ago by nibhelim • 0

score 0 · Answer 2 · 2019-07-08

You could do it like this. Might take a little while as it involves a sorting step:

## Get read names where the mates have insert size less than 100 (can be customized changing `LEN`).
## Essentially it filters reads with an insert size less than LEN and then extracts its read names
LEN=100
samtools sort -n -O SAM your.bam \
  | tee >(samtools view -o your_sorted.bam) \
  | awk -v LEN=LEN '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0) print $1 | "uniq"} > read_names.txt

## Get those reads that match the read names and write as fasta:
samtools view your_sorted.bam \
  | cat <(samtools view -H Hx2_rep2_dedup.bam) <(grep -wFf read_names.txt /dev/stdin) \
  | samtools fastq -1 reads_1.fastq.gz -2 reads_2.fastq.gz -

The initial sorting step is necessary to 1) get the unique read names and more importantly 2) bam2fastq conversion should be done on name-sorted files as aligners expect random order of fastq which we make sure using name sort in the beginning. Go through the code step-by-step and try to get the concepts, ask if you need help :)

genomax mentioned filterbyname.sh from BBMap for step two. Sounds like the tool should be able to do that. I gave it a quick test and after few seconds the tool threw an error and stopped writing the fastq files which is why I will not include it here. Would require investigating that error but will go to bed now :)