Finding the number of aligned nucleotides per read in a sam file
2
0
Entering edit mode
5.5 years ago
ioannis ▴ 50

Hi,

I have tried a "home-made" method in order to extract the number of nucleotides per read out of a SAM file but is not working correctly for all the reads due to some deletions (I guess).

grep -e "pattern" my.sam | cut -f 2,3,4,5 > output.txt

To explain, I search for reads with a pattern and then extract some information from every read such as Chromosome, Position, Strand, and Sequence. Then I use R software to count the number of characters of the "Sequence" column and I get the number of nucleotides per read. However, the sequence sometimes might contain a deletion "-" which counts as a character and I get some misplaced reads. I don't want to get rid of these reads. My alignment parameters allow only 1 mismatch so I expect to get a single deletion or addition if that matters.

Is there any way to use the cigar strings of each read and extract the correct number of aligned nucleotides per read?

Thanks for your time, Ioannis

SAM CIGAR stings • 1.8k views
ADD COMMENT
0
Entering edit mode

However, the sequence sometimes might contain a deletion "-" which counts as a character a

that's not a SAM file. The specification doesn't allow a hyphen. https://samtools.github.io/hts-specs/SAMv1.pdf

String *|[A-Za-z=.]+ segment SEQuence

ADD REPLY
0
Entering edit mode

I see, but maybe I didn't make it clear. In the Sequence column I do not get any "-" or any other symbols but I get some aligned reads with chromosome and position that do not much the genomic sequence by one nucleotide.

Example: Genomic sequence: AGTCTAGTACCC Aligned sequence: AGCTAGTACCC

In that case there is a single deletion and I wonder if within the SAM file this information can be found and somehow extracted.

ADD REPLY
3
Entering edit mode
5.5 years ago

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and looping over the cigar string:

java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.getCigar()!=null).forEach(R->println(R.getReadName()+"\t"+R.getContig()+"\t"+R.getStart()+"\t"+R.getCigar().getCigarElements().stream().filter(CE->CE.getOperator().isAlignment()).mapToInt(CE->CE.getLength()).sum()));' input.bam
ADD COMMENT
1
Entering edit mode

Sorry for taking some time to answer but I am trying to understand how everything works without knowing java basics. The script you shared is working perfectly.

Thanks Pierre

ADD REPLY
0
Entering edit mode
5.1 years ago
jilguero888 ▴ 20

Hi, you can use the raw output from samtools mpileup (without options), and on the 5th column you have the aligned nucleotides at every position.

Get all the nucleotides from a bam file that is aligning to specific position in the genome

ADD COMMENT
1
Entering edit mode

the question was "number of aligned bases per reads" not "number of reads per position"

ADD REPLY

Login before adding your answer.

Traffic: 1895 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6