Question

paired-end reads, idxstats, and samtools and grep

0

Entering edit mode

19 months ago

Bogdan ★ 1.4k

Dear all,

I would appreciate having your insight on the following situation when dealing with a bam file that contains paired end reads of ChIP-seq and a specific spike-in (sequence A).

I do see a discrepancy between the number of sequenceA reported by "samtools idxstats" versus "grep sequenceA". The question is why ?

A. shall I work with the entire bam file, that contained both Read1 and Read2 of paired-end pair (ALL reads), and spike-ins sequence A, and samtools idxstats produces the following output regarding the number of these sequences in the file:

*sequence_A :: 313668

B. Shall I separate the BAM file into 2 bam files that contain either FWD or REV reads, the results of samtools idxstats show 1/2 of the number of sequence A :

<> FWD reads :

"sequence _A : 156713

<> REV reads :

" sequence_A : 156955"

C. So far, so good. The question arises when I check the number of Sequence_A in the BAM files with ALL reads, FWD or REV reads, by using "grep" the numbers are :

<> ALL reads :

samtools view ALL_READS.bam | grep "Sequence A" | wc -l :: 130481

<> FWD reads :

samtools view FWD_READS.bam | grep "Sequence A" | wc -l :: 0

<> REV reads :

samtools view REV_READS.bam | grep "Sequence A" | wc -l :: 130481

**

The code that I am using is shown below : 

     samtools view -b -F 4 $FILE -o "${FILE%.bam}.map.bam"

     samtools sort "${FILE%.bam}.map.bam" -o "${FILE%.bam}.map.sort.bam"
     samtools index "${FILE%.bam}.map.sort.bam"

     samtools idxstats "${FILE%.bam}.map.sort.bam" | tee "${FILE%.bam}.map.sort.idxstats" > /dev/null
     samtools view -h "${FILE%.bam}.map.sort.bam" -o "${FILE%.bam}.map.sort.sam"

      ####################################################################################
     #################################################################################### to separate into FORWARD strand

     samtools view -F 0x10  "${FILE%.bam}.map.sort.bam" -bh -o "${FILE%.bam}.map.sort.fwd.bam"

     samtools index "${FILE%.bam}.map.sort.fwd.bam"
     samtools idxstats "${FILE%.bam}.map.sort.fwd.bam" | tee "${FILE%.bam}.map.sort.fwd.idxstats.txt" > /dev/null

     ####################################################################################
     #################################################################################### to separate into REVERSE strand

     samtools view -f 0x10 "${FILE%.bam}.map.sort.bam" -bh -o "${FILE%.bam}.map.sort.rev.bam"

     samtools index "${FILE%.bam}.map.sort.rev.bam"
     samtools idxstats "${FILE%.bam}.map.sort.rev.bam" | tee "${FILE%.bam}.map.sort.rev.idxstats.txt" > /dev/null

**

samtools • 1.8k views

ADD COMMENT • link updated 19 months ago by LauferVA 4.2k • written 19 months ago by Bogdan ★ 1.4k

score 0 · Answer 1 · 2022-09-01

0

Entering edit mode

19 months ago

swbarnes2 14k

You don't say how large the discrepancy is, but surely part of the point of using an aligner is that it will align reads with errors, while a grep won't count reads with errors.

ADD COMMENT • link 19 months ago by swbarnes2 14k

0

Entering edit mode

I believe in this case sequenceA would refer to e.g. a chromosome that the reads are aligned to (I say this because that is what idxstats does, it reports the total reads on each chromosome basically)

ADD REPLY • link 19 months ago by cmdcolin ★ 3.8k

0

Entering edit mode

Yes, indeed, we use sequence A as an "additional chromosome" beside the usual chromosomes 1, ... 22, X and Y.

ADD REPLY • link 19 months ago by Bogdan ★ 1.4k

0

Entering edit mode

A question for our biostar administrators : is it OK if I post the same question on StackOverflow in order to learn other people's opinions ? Thank you.

ADD REPLY • link 19 months ago by Bogdan ★ 1.4k

0

Entering edit mode

If you feel that you will get useful answers then feel free to post at SO. Please reference this thread in your SO post so people are aware of the cross-posting. Perhaps add a link for the cross-posted thread here in original post as well.

I am not certain what exactly you are asking above. You only looked for forward and reverse reads but did you account for secondary/ supplementary alignments? Use of grep is simply to look for word sequence A nothing more correct?

ADD REPLY • link 19 months ago by GenoMax 141k

0

Entering edit mode

Thank you, yes, a good point about excluding secondary or supplementary alignments.

ADD REPLY • link 19 months ago by Bogdan ★ 1.4k

0

Entering edit mode

bogdan - what ethnicity ? are the samples of a different eth than the genomic ref?

VAL

ADD REPLY • link 19 months ago by LauferVA 4.2k

0

Entering edit mode

Hi Vincent, it is a good question. I know that we work with human samples, but I am not sure about sample ethnicity. An information that I have not included in my message was that Sequence A, Sequence B, etc are barcode sequences that we find in .bam files. In this particular case, we would not expect SNPs in their sequences.

ADD REPLY • link 19 months ago by Bogdan ★ 1.4k

0

Entering edit mode

Bogdan,

this issue has clung to my memory. have you reached resolution?

VAL

ADD REPLY • link 19 months ago by LauferVA 4.2k

0

Entering edit mode

Hi Vincent, yes, I will post the answers a bit later :)

ADD REPLY • link 19 months ago by Bogdan ★ 1.4k

0

Entering edit mode

oke we are waiting. :-)

ADD REPLY • link 19 months ago by LauferVA 4.2k