So I have been tasked with analyzing some sequence data even though I have no clue what I'm doing. I was given data from 11 samples (S1, S2, etc.), each with a singles file as well as a forward read file (R1) and a reverse read file (R2). In addition, each sample was run on two different lanes (L001 and L002), so for every file there is a corresponding file from the other lane. The data had already been through quality control with Scythe and Sickle before I got it, and everything is in FASTQ format.
So my first step was to map these files to the reference. I did this using BWA-MEM. I aligned the R1 and R2 files of a given sample and lane together against the reference, then did the same for the R1 and R2 of that sample's other lane, then aligned the singles file for each lane on its own. So for every sample I ended up with 4 SAM files mapped to the reference (e.g. S1 L001 R1&R2, S1 L002 R1&R2, S1 L001 single, S1 L002 single), for all 11 samples.
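For reference, my mapping commands looked roughly like this (the exact filenames are placeholders, not my real paths, and I'm assuming the reference was already indexed):

```shell
# Index the reference once (only needed the first time).
bwa index ref.fa

# Paired-end reads for one sample/lane: R1 and R2 aligned together.
bwa mem ref.fa S1_L001_R1.fastq S1_L001_R2.fastq > S1_L001.sam

# Singles file for the same sample/lane, aligned on its own.
bwa mem ref.fa S1_L001_single.fastq > S1_L001_single.sam
```

I repeated this for L002 and for every other sample.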
Next I used samtools to convert the SAM files to BAM, and to restrict each BAM file to only the reads that actually mapped to the reference genome.
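Concretely, the conversion/filtering step was along these lines (again, filenames are placeholders):

```shell
# -b  = output BAM instead of SAM
# -F 4 = exclude reads with the "unmapped" flag set (0x4),
#        i.e. keep only reads mapped to the reference.
samtools view -b -F 4 S1_L001.sam > S1_L001.bam
```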
Now here is where the trouble begins. Next I used samtools to merge all of a sample's files together, for instance everything from S1, including both lanes for both the singles and the paired R1&R2 files. I ran:
samtools merge S1_merged.bam singleS1_L001.bam singleS1_L002.bam S1_L001.bam S1_L002.bam.
Then I tried to use MarkDuplicatesWithMateCigar in Picard to mark the duplicates in the single merged file (S1_merged.bam). But when I did, it gave me the error "this program requires inputs in coordinate SortOrder." It seems my merged BAM wasn't coordinate-sorted.
So I tried to sort the merged bam file using samtools sort. I did
samtools sort S1_merged.bam -o S1_sorted.bam
which gave me a ton of intermediate files instead of a single sorted BAM. I tried redoing it with the "-m 20G" option added and it gave me 6 files instead.
So then I merged these six sorted files into "S1_sorted.bam" using samtools merge and tried MarkDuplicatesWithMateCigar again. I ran
java -jar $PICARD MarkDuplicatesWithMateCigar I=S1_sorted.bam O=S1_marked.bam M=S1_marked_metrics.txt
And it told me "Exception in thread "main" picard.PicardException: Found a samRecordWithOrdinal with sufficiently large clipping that we may have missed including it in an early duplicate marking iteration. Please increase the minimum distance to at least 120bp." So I tried again with "MINIMUM_DISTANCE=120" added, and it didn't even give me an error; it just spat back what looked like a usage listing of all the available options. I tried using MarkDuplicates instead of MarkDuplicatesWithMateCigar and it did the same thing.
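The retry invocation was approximately this (same placeholder paths as before):

```shell
# Same MarkDuplicatesWithMateCigar call as above, with the minimum
# distance Picard's error message asked for added on.
java -jar $PICARD MarkDuplicatesWithMateCigar \
    I=S1_sorted.bam \
    O=S1_marked.bam \
    M=S1_marked_metrics.txt \
    MINIMUM_DISTANCE=120
```

Instead of running, this just printed the option listing.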
I'm really at a loss here, guys. Should I have sorted each file before merging all the lanes and singles? Was it a mistake to merge the six sorted output files back together afterward? Am I missing something?
Any help would be greatly appreciated.