I can't understand my sam files
3 months ago
ManuelDB ▴ 30

I am learning about the different NGS file formats. Most of them are quite easy to understand, at least in their general aspects. However, when I look at the SAM files generated in my lab, I can't easily make sense of the different fields.

The first line of the body of one of my SAM files looks like this:

M00321:561:000000000-JM5F9:1:2107:12468:12982   65      1       14588   9       117M    =       14588   0       CCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTC CCCCCGCFFEEG7@@FFFGGFFCF<<FFGGFGFEEFD<FEFGGF@FGGFGEEFFGGFFF<EAFGGGGGG7@@@EF,C<CECEGCFGCCFE:<C>FFFCFF99:<,8<:*C@C:7*CF   NM:i:0  MD:Z:117        MC:Z:117M       AS:i:117        XS:i:112        RG:Z:1_2        XA:Z:15,-102516461,117M,1;9,+14699,117M,2;2,-114356309,117M,3;12,-90921,117M,3;


And this is a table explaining a SAM file

The things I don't understand are:

1. The second column (FLAG) makes no sense to me according to the next table.
2. What exactly does the third column mean (in my example, the number 1)? It should be a string, shouldn't it?
3. Finally, why does my sequence have one "white space", then more sequence, then a couple of @, more sequence, and then a couple of << plus other characters that are not part of the sequence?
FWIW, I'd check the command line you used to do the alignment. Based on the contents of that line of the SAM file, I suspect there may be an error.

3 months ago
GenoMax 115k

Second column (FLAG) makes no sense to me according to the next table

65 = 1 + 64: the read is paired (1) and it is the first read in the pair (64).

https://samformat.info/sam-format-flag is another site you can use.
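If you would rather decode the flag yourself, here is a minimal Python sketch. The bit values and their meanings come from the SAM specification; the names `SAM_FLAG_BITS` and `decode_flag` are made up for this illustration and are not part of any library.

```python
# FLAG bits as defined in the SAM specification.
# SAM_FLAG_BITS and decode_flag are hypothetical names used only for this sketch.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the meanings of all bits that are set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(65))  # ['read paired', 'first in pair']
```

The FLAG is a sum of powers of two, so each property of the read can be tested independently with a bitwise AND.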

What exactly does the third column mean (in my example, the number 1)? It should be a string, shouldn't it?

That is the reference sequence name (RNAME). It is a string; in this case the chromosome is simply named 1.

why my sequence has one "white space",

I don't think so. What follows the space is a separate column holding the Phred quality scores (QUAL). You can compare the length of the sequence with the length of the quality string.
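As a rough sketch of that check in Python: split the line on tabs, name the 11 mandatory columns, and compare the lengths of SEQ and QUAL. The record below is a shortened, made-up example rather than your real read, and `SAM_COLUMNS` and `parse_sam_line` are hypothetical names for this illustration. It assumes the usual Phred+33 encoding of QUAL described in the SAM specification.

```python
# The 11 mandatory SAM columns, in order (from the SAM specification).
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    record["TAGS"] = fields[11:]  # optional fields such as NM:i:0
    return record


# Shortened, made-up record for illustration only (8 bases instead of 117).
example = "read1\t65\t1\t14588\t9\t8M\t=\t14588\t0\tCCGTCACC\tCCCCCGCF\tNM:i:0"
rec = parse_sam_line(example)

# SEQ and QUAL are two separate columns of equal length.
print(len(rec["SEQ"]), len(rec["QUAL"]))  # 8 8

# Each QUAL character encodes one Phred score as ASCII code minus 33.
phred = [ord(char) - 33 for char in rec["QUAL"]]
print(phred)  # [34, 34, 34, 34, 34, 38, 34, 37]
```

Characters such as `@`, `<` and `,` are therefore just encoded quality values, one per base, and not part of the sequence itself.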

