Question

How to remove reads with only Ns from bam/sam/cram file?

0

Entering edit mode

3.9 years ago

erik.stricker • 0

When using my current sam file (which I generated from a cram file, which in turn was generated from a bam file) in my analysis I kept getting following error:

terminate called after throwing an instance of 'std::out_of_range'
what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 51)

or

terminate called after throwing an instance of 'std::out_of_range'
what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 101)

After some digging, I found out that the following read caused the problem:

HWI-ST1222:6:1101:1083:200937#0 141     *       0       0       *       *       0       0       NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN     BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB     SM:Z:TCGA-DE-A0Y3-10A-01D-A10R  RG:Z:120617_SN1222_0117_BD1414ACXX_6

I have no clue how the sequencer could even generate such a read but I assume that the fact that it includes only Ns makes my analysis fall apart.

So my question is: How can I filter out reads with only Ns from the sam file?

By the way, following neighboring read gave no trouble:

HWI-ST1222:6:1101:1083:200937#0 77 * 0 0 * * 0 0 CAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB SM:Z:TCGA-DE-A0Y3-10A-01D-A10R RG:Z:120617_SN1222_0117_BD1414ACXX_6

I am currently using whole-genome sequencing data to detect human endogenous retroviruses (HERVs) with the Specific Transposable Element Aligner (HERV-K).

samtools bam cram sam remove reads • 1.8k views

ADD COMMENT • link updated 3.9 years ago by Anand Rao ▴ 630 • written 3.9 years ago by erik.stricker • 0

0

Entering edit mode

The simplest solution would be to remove unmapped reads as samtools view -F 4 -o out.bam in.bam

ADD REPLY • link 3.9 years ago by ATpoint 82k

0

Entering edit mode

Not an optimal solution as I expect to find relevant sequences particularly in the unmapped reads.

ADD REPLY • link 3.9 years ago by erik.stricker • 0

0

Entering edit mode

If you don't want to remove all unmapped reads then try the following which should only remove reads that have one or more N's (Note: this will remove reads that may have mapped otherwise but have an N). Not tested but should work. From BBTools package (samtools should be in your $PATH):

reformat.sh in=your.bam out=filtered.bam maxns=0

Adding mappedonly=t will achieve the same thing as @ATPoint's solution. You could also set maxns=N where N is a higher number like the last example you posted (if you want to keep that read).

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Entering edit mode

The problem is that the read length can be variable and therefore I cannot determine a maximum number of Ns that I would want to exclude.

ADD REPLY • link 3.9 years ago by erik.stricker • 0

0

Entering edit mode

That is something you are going to need to decide for any solution (even if it is something you custom code). I would say try using reformat.sh with 0,1,2 etc N's to see how many alignments you lose.

ADD REPLY • link 3.9 years ago by GenoMax 141k

h.mon · Answer 1 · 2020-06-17

0

Entering edit mode

3.9 years ago

h.mon 35k

You should perform some quality filtering prior to doing downstream analyses - any very light quality filtering would remove these all N reads.

Is this old sequencing data? It seems to be encoded with Illumina 1.5+ Phred format, as all the N bases have a B quality, which would be 3 for Illumina 1.5+ (the lowest possible quality), but 34 for Illumina 1.8+, which is of very good quality.

ADD COMMENT • link 3.9 years ago by h.mon 35k

0

Entering edit mode

Do you have any tool you would recommend for quality filtering?

This is the sequencing information:

@RG     ID:120617_SN1222_0117_BD1414ACXX_6      SM:TCGA-DE-A0Y3-10A-01D-A10R    LB:TCGA-DE-A0Y3-10A-01D-A10R    PL:Illumina_HiSeq2000   CN:Harvard_GCC_02
@PG     ID:bwa  PN:bwa  VN:0.5.9-r16    CL:-l40-k2-n3
@PG     ID:OLB  PN:OLB  VN:OLB-1.9.3    CL:setupBclToQseq.py

ADD REPLY • link updated 3.9 years ago by h.mon 35k • written 3.9 years ago by erik.stricker • 0

0

Entering edit mode

bbduk.sh from BBMap suite. A guide is available here.

ADD REPLY • link 3.9 years ago by GenoMax 141k

score 0 · Answer 2 · 2020-06-17

I have removed stretches of Ns in my FASTQ input file, as part of my FASTQ pre-procesing pipeline, BEFORE any mapping step,

The software suite is: HTStream

The specific tool I've used is: hts_NTrimmer

Install and usage instructions are at this link

The generic version of my very simple-to-use syntax is:

hts_NTrimmer -U $IN -F -g -f -p $OUT -L $LOG

You may be interested in other tools in this suite and their application to any other pre-processing requirements, in which case the link to their bundled use case can be found online at a relatively recent (March 2019) Bioinformatics Workshop page.