I am on paleogenomic data. I have BAM files from two different datasets that were processed/sequenced different (one was partially UDG treated, one was not UDG treated at all - the latter was sequenced on a newer machine).
Most of my comparisons between the two datasets are plagued by batch effects. One idea my adviser asked me to try is clip the first 5-10 bp off all reads within all BAM files. The problem is neither of us know how to do that.
Anyone here know of a way to trim reads within a BAM file?
I do have the FASTQs. Adapters were trimmed. The idea here is to uniformly clip additional bases to make sure the damaged bases are removed.
We are not quite sure what is causing the batch effect. It could be quality score related. There is a lot more variation in the partially UDG set than in the later untreated set. When I look at rates of allele sharing, I get more sharing between batches than between individuals with similar ancestry between the two batches.
Try something like this on your FASTQ files: fastx-toolkit: fastx-trimmer , it'll let you remove the first/last N bases with
And then realign into BAM files and see if it helps.
Also you should run your FASTQ files through some tools like FastQC and aggregate your results with MultiQC. This will give you statistics on things like per base sequence content, over-represented sequences, quality score distributions etc.
An additional suggestion: You should identify variants that are shared within a batches, ideally a few examples like that ( of batch specific alleles.) Load up your BAM files in IGV, and go to the position of the variants. Do these variants tend to fall mostly in the first 5-10bp, or are they all over the reads? That might help you determine if clipping is appropriate. You could probably script it too by getting all reads spanning a batch specific variant from the BAM files, and computing the diffence between the allele position and the aligned read start distance, and see if the distribution is flat or skewed one way or another.