I am a programmer and I just got this position to help on a research project. However, I don't have a background in DNA sequencing, although I am reading about it and learning more. The only way to get this project running is to understand the data well.
This is part of a SAM file. What does NNN... mean? As I understand it, when a FASTQ file is aligned to a reference genome, the output is a BAM file, which can be converted to a SAM file. The goal is to extract the unmapped reads from the SAM file, convert them back to FASTQ, and compare them with another reference genome. I'm still on the first step, extracting the unmapped reads from the SAM file, but I've noticed the record below and I don't understand it. Is this something I can delete, or does it mean something?
chr17_77390200_77390383_0:0:0_1:0:0_54817 117 chr17 77379231 0 * = 77379231 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 22222222222222222222222222222222222222222222222222
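For reference, the second column of the record above (117) is the SAM FLAG field, a bitmask in which bit 0x4 marks a read as unmapped, and N is the standard code for a base that could not be determined. Below is a minimal Python sketch of decoding the flag and picking out unmapped records. It assumes a plain, tab-delimited SAM file; the function names (`decode_flag`, `parse_record`, `unmapped_records`) are hypothetical helpers made up for illustration, not part of any library.

```python
# Standard SAM FLAG bits, as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_record(line):
    """Split one alignment line into the fields used here.

    Assumes tab-delimited columns in the standard SAM order.
    """
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "seq": fields[9],
        "qual": fields[10],
    }


def unmapped_records(lines):
    """Yield parsed records whose FLAG has the unmapped bit (0x4) set."""
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        record = parse_record(line)
        if record["flag"] & 0x4:
            yield record


# The record from the question, rebuilt with tab separators.
EXAMPLE = "\t".join([
    "chr17_77390200_77390383_0:0:0_1:0:0_54817", "117", "chr17",
    "77379231", "0", "*", "=", "77379231", "0", "N" * 50, "2" * 50,
])

if __name__ == "__main__":
    for rec in unmapped_records([EXAMPLE]):
        all_n = set(rec["seq"]) == {"N"}  # True if no base was called
        print(rec["qname"], decode_flag(rec["flag"]), "all N:", all_n)
```

Under these assumptions, 117 decodes to 1 + 4 + 16 + 32 + 64, so the record is a paired read that is itself flagged as unmapped. On a real file, `unmapped_records(open(path))` would iterate over the file line by line rather than loading it all into memory.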
Thanks a lot. I was able to get the desired output, but the runtime is very slow and Picard seems to eat a lot of memory. I will try the tools you mentioned.