Question: Finding the number of aligned nucleotides per read in a sam file
0
gravatar for ioannis
23 months ago by
ioannis30
ioannis30 wrote:

Hi,

I have tried a "home-made" method in order to extract the number of nucleotides per read out of a SAM file but is not working correctly for all the reads due to some deletions (I guess).

grep -e "pattern" my.sam | cut -f 2,3,4,5 > output.txt

To explain, I search for reads with a pattern and then extract some information from every read such as Chromosome, Position, Strand, and Sequence. Then I use R software to count the number of characters of the "Sequence" column and I get the number of nucleotides per read. However, the sequence sometimes might contain a deletion "-" which counts as a character and I get some misplaced reads. I don't want to get rid of these reads. My alignment parameters allow only 1 mismatch so I expect to get a single deletion or addition if that matters.

Is there any way to use the cigar strings of each read and extract the correct number of aligned nucleotides per read?

Thanks for your time, Ioannis

sam cigar stings • 692 views
ADD COMMENTlink modified 19 months ago by jilguero88810 • written 23 months ago by ioannis30

However, the sequence sometimes might contain a deletion "-" which counts as a character a

that's not a SAM file. The specification doesn't allow a hyphen. https://samtools.github.io/hts-specs/SAMv1.pdf

String *|[A-Za-z=.]+ segment SEQuence

ADD REPLYlink written 23 months ago by Pierre Lindenbaum130k

I see, but maybe I didn't make it clear. In the Sequence column I do not get any "-" or any other symbols but I get some aligned reads with chromosome and position that do not much the genomic sequence by one nucleotide.

Example: Genomic sequence: AGTCTAGTACCC Aligned sequence: AGCTAGTACCC

In that case there is a single deletion and I wonder if within the SAM file this information can be found and somehow extracted.

ADD REPLYlink written 23 months ago by ioannis30
2
gravatar for Pierre Lindenbaum
23 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and looping over the cigar string:

java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.getCigar()!=null).forEach(R->println(R.getReadName()+"\t"+R.getContig()+"\t"+R.getStart()+"\t"+R.getCigar().getCigarElements().stream().filter(CE->CE.getOperator().isAlignment()).mapToInt(CE->CE.getLength()).sum()));' input.bam
ADD COMMENTlink written 23 months ago by Pierre Lindenbaum130k
1

Sorry for taking some time to answer but I am trying to understand how everything works without knowing java basics. The script you shared is working perfectly.

Thanks Pierre

ADD REPLYlink written 23 months ago by ioannis30
0
gravatar for jilguero888
19 months ago by
jilguero88810
United States
jilguero88810 wrote:

Hi, you can use the raw output from samtools mpileup (without options), and on the 5th column you have the aligned nucleotides at every position.

Get all the nucleotides from a bam file that is aligning to specific position in the genome

ADD COMMENTlink written 19 months ago by jilguero88810

the question was "number of aligned bases per reads" not "number of reads per position"

ADD REPLYlink written 19 months ago by Pierre Lindenbaum130k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1888 users visited in the last hour