Question: isolate reads from a paired-ended fastq files that fall within a certain bp range
0
gravatar for nibhelim
8 weeks ago by
nibhelim0
United States
nibhelim0 wrote:

hi folks, I have pair ended fastq files with 50bp reads (01_R1 and 01_R2). I need to make a new pair of fastq files that contains just the reads that fulfill this requirement: the reads of R2 should be within a certain range (let's say no more than 100bp) far from the respective read in R1. Also no restrictions should be applied on the orientation or overlap (I want all the reads no matter the orientation and no matter if they also overlap) ideally I'd like to use Bowtie2 for this any suggestion on how to approach this (an help with a command line will be really appreciated since I'm a newbie)?

bowtie2 chip-seq alignment 3c • 269 views
ADD COMMENTlink modified 6 weeks ago by ATpoint21k • written 8 weeks ago by nibhelim0
0
gravatar for swbarnes2
8 weeks ago by
swbarnes26.2k
United States
swbarnes26.2k wrote:

You can use samtools or bedtools to get the read2 read names that you want, then picard tools to filter the original bam by those read names.

ADD COMMENTlink written 8 weeks ago by swbarnes26.2k

hi swbarnes2, can you be a bit more specific on what are the 'read names' that I should isolate? sorry for the basic questions I'm a newbie in seq

ADD REPLYlink written 8 weeks ago by nibhelim0
0
gravatar for ATpoint
6 weeks ago by
ATpoint21k
Germany
ATpoint21k wrote:

You could do it like this. Might take a little while as it involves a sorting step:

## Get read names where the mates have insert size less than 100 (can be customized changing `LEN`).
## Essentially it filters reads with an insert size less than LEN and then extracts its read names
LEN=100
samtools sort -n -O SAM your.bam \
  | tee >(samtools view -o your_sorted.bam) \
  | awk -v LEN=LEN '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0) print $1 | "uniq"} > read_names.txt

## Get those reads that match the read names and write as fasta:
samtools view your_sorted.bam \
  | cat <(samtools view -H Hx2_rep2_dedup.bam) <(grep -wFf read_names.txt /dev/stdin) \
  | samtools fastq -1 reads_1.fastq.gz -2 reads_2.fastq.gz -

The initial sorting step is necessary to 1) get the unique read names and more importantly 2) bam2fastq conversion should be done on name-sorted files as aligners expect random order of fastq which we make sure using name sort in the beginning. Go through the code step-by-step and try to get the concepts, ask if you need help :)

genomax mentioned filterbyname.sh from BBMap for step two. Sounds like the tool should be able to do that. I gave it a quick test and after few seconds the tool threw an error and stopped writing the fastq files which is why I will not include it here. Would require investigating that error but will go to bed now :)

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by ATpoint21k
1

I suggest using filterbyname.sh with readnames.txt for step 2 from BBMap suite.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by genomax70k

Thanks for the help! is this approach taking all the reads despite their direction/distance? I would like to preserve all the paired reads that are concordant or not (-> <-, -> ->, <- <-) and also partially/totally overlapping.

ADD REPLYlink written 6 weeks ago by nibhelim0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 869 users visited in the last hour