I have a paired-end sequencing sample, 18-R-001_R1._fastq.gz and 18-R-001_R2._fastq.gz. How can I get the overlap of R1 and R2, and filter out the reads where R1 and R2 match perfectly (without mismatches)?
You can use BBMerge from BBTools or FLASH to do the actual merging of the reads allowing for no mismatches.
You can then use reformat.sh from the BBMap suite to filter your data where the merged read is exactly the same length as R1/R2, e.g. you could set minlength=n+1, where n = length of R1/R2 (I am assuming your reads are all of identical length to begin with and have not been trimmed). That will filter out all reads where R1 and R2 perfectly match (your requirement).
If R1/R2 perfectly match but have an insert shorter than the read length, those reads would also be removed by the filter above.
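To make the length filter concrete, here is a minimal sketch of the same idea using plain awk rather than reformat.sh. It assumes the merged reads are in standard 4-line FASTQ records (no wrapped sequence lines); the function name filter_min_length and the file names in the usage comment are made up for illustration.

```shell
# Keep only FASTQ records whose sequence is at least min_len bases long.
# Assumes 4-line FASTQ records on stdin; writes the kept records to stdout.
filter_min_length() {
    min_len="$1"
    awk -v min="$min_len" '
        NR % 4 == 1 { header = $0 }   # read name line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality line completes the record
            if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0
        }
    '
}

# Hypothetical usage for 2x150 bp reads (n = 150, so the minimum length is 151):
# gzip -dc merged.fastq.gz | filter_min_length 151 | gzip > filtered.fastq.gz
```

With n = 150, a merged read of exactly 150 bp (R1 and R2 covering each other completely) is dropped, and so is anything shorter, which is the side effect described above.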
Thank you for your reply; maybe my wording was unclear. For example, for a 2x150 bp read pair with a 50 bp overlap in the middle, I want this 50 bp overlapping sequence, and I want the 50 bp from R1 to match the 50 bp from R2 exactly. If R1 and R2 do not match exactly in the overlap, the pair should be removed.
This can be done with bbmerge, which genomax linked to.
$ bbmerge.sh in1=18-R-001_R1._fastq.gz in2=18-R-001_R2._fastq.gz out=overlap.fastq.gz pfilter=1 trimnonoverlapping=t
in1 and in2 define the input files
out defines the output file
pfilter=1 leads to merging only if there is no mismatch
trimnonoverlapping=t trims the parts of the reads that do not overlap
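One way to sanity-check the result is to look at the length distribution of the reads in overlap.fastq.gz, since each output read should be only the overlapping region. The helper below is a small sketch with a made-up name (length_histogram); it assumes 4-line FASTQ records and is not part of BBTools.

```shell
# Print "length<TAB>count" for every sequence length in a 4-line FASTQ stream,
# sorted by length.
length_histogram() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) print len "\t" count[len] }' | sort -n
}

# Usage with the output file from the command above:
# gzip -dc overlap.fastq.gz | length_histogram
```

For the 2x150 bp example with a 50 bp overlap, such a pair would be expected to show up as a 50 bp read in this table.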
Hello 190444373,
could you please explain why you think this is a good idea?
fin swimmer
Maybe my wording was unclear, so it looks a little bad.