isolate reads from a paired-ended fastq files that fall within a certain bp range
2
0
Entering edit mode
4.8 years ago
nibhelim • 0

hi folks, I have pair ended fastq files with 50bp reads (01_R1 and 01_R2). I need to make a new pair of fastq files that contains just the reads that fulfill this requirement: the reads of R2 should be within a certain range (let's say no more than 100bp) far from the respective read in R1. Also no restrictions should be applied on the orientation or overlap (I want all the reads no matter the orientation and no matter if they also overlap) ideally I'd like to use Bowtie2 for this any suggestion on how to approach this (an help with a command line will be really appreciated since I'm a newbie)?

ChIP-Seq 3c alignment bowtie2 • 1.1k views
ADD COMMENT
0
Entering edit mode
4.8 years ago

You can use samtools or bedtools to get the read2 read names that you want, then picard tools to filter the original bam by those read names.

ADD COMMENT
0
Entering edit mode

hi swbarnes2, can you be a bit more specific on what are the 'read names' that I should isolate? sorry for the basic questions I'm a newbie in seq

ADD REPLY
0
Entering edit mode
4.8 years ago
ATpoint 82k

You could do it like this. Might take a little while as it involves a sorting step:

## Get read names where the mates have insert size less than 100 (can be customized changing `LEN`).
## Essentially it filters reads with an insert size less than LEN and then extracts its read names
LEN=100
samtools sort -n -O SAM your.bam \
  | tee >(samtools view -o your_sorted.bam) \
  | awk -v LEN=LEN '{if ($9 <= LEN && $9 >= -(LEN) && $9 != 0) print $1 | "uniq"} > read_names.txt

## Get those reads that match the read names and write as fasta:
samtools view your_sorted.bam \
  | cat <(samtools view -H Hx2_rep2_dedup.bam) <(grep -wFf read_names.txt /dev/stdin) \
  | samtools fastq -1 reads_1.fastq.gz -2 reads_2.fastq.gz -

The initial sorting step is necessary to 1) get the unique read names and more importantly 2) bam2fastq conversion should be done on name-sorted files as aligners expect random order of fastq which we make sure using name sort in the beginning. Go through the code step-by-step and try to get the concepts, ask if you need help :)

genomax mentioned filterbyname.sh from BBMap for step two. Sounds like the tool should be able to do that. I gave it a quick test and after few seconds the tool threw an error and stopped writing the fastq files which is why I will not include it here. Would require investigating that error but will go to bed now :)

ADD COMMENT
1
Entering edit mode

I suggest using filterbyname.sh with readnames.txt for step 2 from BBMap suite.

ADD REPLY
0
Entering edit mode

Thanks for the help! is this approach taking all the reads despite their direction/distance? I would like to preserve all the paired reads that are concordant or not (-> <-, -> ->, <- <-) and also partially/totally overlapping.

ADD REPLY

Login before adding your answer.

Traffic: 2941 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6