Hi
I want to use bowtie2 and align the sequenced reads(length=76) to hg38. I have trimmed to get 30 bp long reads and performed alignment. Reads that mapped to multiple regions in the genome were to be extended by 5 bp and then re-mapped. This process will be repeated until either all reads were uniquely mapped to the reference genome or until the reads were extended to their entirety (76 bp). I want to put all these steps
for sample in `ls ./*R1.fastq`
do
dir="./"
base=$(basename $sample "_R1.fastq")
for (i=46;i>=1;i-=5)
do
bowtie2 -x index -q ${base}_R1.fastq -5 0 -3 $i -p 64 --score-min L,-0.6,-0.2 --very-sensitive --no-unal --no-hd --no-sq -S ${base}.sam_$i_trimmed
grep -v XS ${base}.sam_$i_trimmed > Uniq_alignd_${base}.sam_$i_trimmed
grep XS ${base}.sam_$i_trimmed | cut -f1 >multi_aligned_ids_$i_trimmed
seqtk subseq $sample_R1.fastq multi_aligned_ids_$i_trimmed > input_$i_for_bowtie.fastq
done
However Now I am not sure how to use fastq_output file (input_$i_for_bowtie.fastq) from seqtk to be used in bowtie
This iterative trimming method is been used for HI-C Data analysis (Imakaev, et al., 2012). They have a Python module to perform this task. I just want to try it myself using linux.
Not really helping, but fyi: Hi-C data are processed like that because the read may span a ligation junction and might therefore mess up the alignment, but this is not the case for ATAC-seq. Better make sure you properly trim the Nextera adapter
CTGTCTCTTATACACATCT
. More steps are typically not necessary.The reason I chose to do iterative mapping is after adapter trimming only 56% of the reads are getting Uniquely aligned to the reference genome. Where as around 10-15% are getting aligned to multiple locations. I thought If I could somehow retain those as well that would add up to my alignment percentage.
Does your reference genome contain chrM?
Yes. Thought I would grep them out after alignment.
That is reasonable. Still, that mapping rate is poor and unusual for a good ATAC-seq library. Published data?
Yeah. GSE74912