3.4 years ago
sunnykevin97
Hi,
I find a lot of duplicates in my SAM file. How do I remove them? Picard and samtools are unable to remove them. Is there an AWK command?
I found one command, but every time I have to give it a specific duplicate entry (671), and there are more duplicates in the SAM file. How do I automate the process?
awk 'BEGIN { i = 0; } /^@/ { if (/671/) { if (i++ < 1) { print; } } else { print } } /^[^@]/ { print }' AvA_.sam > AvA_fixed_.sam
Any suggestions?
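The command in the question only touches lines starting with @, which suggests the duplicates are repeated header lines rather than duplicate reads. A minimal sketch under that assumption: instead of naming one entry (671) at a time, keep the first copy of every distinct header line. The function name dedup_sam_header is hypothetical, not part of any package, and alignment records pass through untouched, so this is not a substitute for read-level duplicate marking.

```shell
# Hypothetical helper (not from any package): reads SAM text on stdin and
# writes it to stdout with exact-duplicate header lines removed.
dedup_sam_header() {
    awk '
        # Header lines start with "@": print each distinct one only once.
        /^@/ { if (!seen[$0]++) print; next }
        # Alignment records pass through unchanged.
        { print }
    '
}

# Usage (file names taken from the question):
#   dedup_sam_header < AvA_.sam > AvA_fixed_.sam
```

Header lines that differ in any field (for example the same SN with a different LN) are kept, since only byte-identical lines count as duplicates here.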
My 2p - unless you are experienced, don't use sed or awk for manipulating vcf or sam files. There's almost certainly a more robust and less risky package that you could use.
Any suggestions? I'm unable to find any such packages that do the job. Do you have anything in mind?
samtools markdup with the -r option. You don't need to remove duplicates: if you mark them, that is enough for downstream tools. BamUtil has a dedup option. Try that too.
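For reference, samtools markdup expects mate score tags added by fixmate -m and position-sorted input, so it is normally run as the last step of a short chain. A sketch of that chain, assuming a paired-end BAM as input; the wrapper name markdup_remove and the intermediate file names are hypothetical, while the subcommands and flags are the documented samtools ones.

```shell
# Hypothetical wrapper: remove duplicate reads from a BAM with samtools.
# $1 = input BAM (any sort order), $2 = output BAM (position-sorted).
markdup_remove() {
    in_bam=$1
    out_bam=$2
    tmp=${out_bam%.bam}   # prefix for intermediate files

    # 1. Group reads by name so mates are adjacent.
    samtools collate -o "$tmp.collate.bam" "$in_bam" &&
    # 2. Add the mate score tags that markdup relies on.
    samtools fixmate -m "$tmp.collate.bam" "$tmp.fixmate.bam" &&
    # 3. Sort by coordinate.
    samtools sort -o "$tmp.sorted.bam" "$tmp.fixmate.bam" &&
    # 4. Mark duplicates; -r removes them instead of only flagging them.
    samtools markdup -r "$tmp.sorted.bam" "$out_bam" &&
    rm -f "$tmp.collate.bam" "$tmp.fixmate.bam" "$tmp.sorted.bam"
}

# Usage (file names are placeholders):
#   markdup_remove in.bam dedup.bam
```

Dropping -r in step 4 keeps the duplicate reads and only sets the duplicate flag on them.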
Bamutil works fine.
What's a duplicate for you? A normal way to remove reads with FLAG=duplicate would be
samtools view -F 1024 in.bam
It might be worth looking into the BBMap package: there is a sub-program called dedupe.sh, or you can even get there by using BBDuk, I assume.
dedupe.sh only removes duplicates from FASTA or FASTQ files. Neither program can remove duplicates from SAM files.
that's true indeed. my bad :/
(and yes, converting SAM to FASTQ just to run dedupe is too much of a workaround)
All the SAM files were generated using BWA-MEM.
After de novo assembly, I chose the assembled contigs as the reference and mapped the trimmed FASTQ reads to them.
I found a lot of duplicates only with the Velvet and ABySS contigs, not with SPAdes.
My overall objective is to construct a meta-assembly by combining all the assemblies into one. That's why I'm generating SAM --> BAM files to feed into gam-ngs, which generates the final meta-assembly.
I'm generating a meta-assembly because I have a fragmented genome assembly and I'd like to improve the continuity of the scaffolds.
With a fragmented assembly, annotating the genome is so troublesome that it fails.