Hello,
I am trying to generate four column files for methylation analysis from my NGS alignments against the human genome (single and paired-end). The file needs to contain the following information for any read that aligns:
1 - chr#
2 - start position of alignment
3 - stop position of alignment
4 - strand (+/-)
And nothing else.
Currently I am trying to awk the bam files to obtain the info, and it seems to work well enough except for strand info. I am either obtaining all positive or all negative strand reads, even though I can see I have alignments to both. I don't know if this is a problem with my awk command or what. If I recall correctly, the PE reads all came out positive strand and the single reads in another run all came out negative. (possibly it was the reverse pattern)
So I have two questions. First, is my awk command roughly correct? Do you have better suggestions or any idea what I am doing wrong. Second, can you think of any other way to generate this 4 column file? (comma separated preferably)
Here is the awk for the PE reads:
samtools view -f 0x0002 -X yourbamfilehere | gawk '{ printf "%s,%s,%s,%s\n", $3, $4, $4 + length($10), (index($2,"r") == 0) }' > yourcsvfilenamehere
Thanks so much! JN
Thank you so much!