How to remove reads with only Ns from bam/sam/cram file?
2
0
Entering edit mode
3.9 years ago

When using my current sam file (which I generated from a cram file, which in turn was generated from a bam file) in my analysis I kept getting following error:

terminate called after throwing an instance of 'std::out_of_range'
what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 51)

or

terminate called after throwing an instance of 'std::out_of_range'
what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 101)

After some digging, I found out that the following read caused the problem:

HWI-ST1222:6:1101:1083:200937#0 141     *       0       0       *       *       0       0       NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN     BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB     SM:Z:TCGA-DE-A0Y3-10A-01D-A10R  RG:Z:120617_SN1222_0117_BD1414ACXX_6

I have no clue how the sequencer could even generate such a read but I assume that the fact that it includes only Ns makes my analysis fall apart.

So my question is: How can I filter out reads with only Ns from the sam file?

By the way, following neighboring read gave no trouble:

HWI-ST1222:6:1101:1083:200937#0 77 * 0 0 * * 0 0 CAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB SM:Z:TCGA-DE-A0Y3-10A-01D-A10R RG:Z:120617_SN1222_0117_BD1414ACXX_6

I am currently using whole-genome sequencing data to detect human endogenous retroviruses (HERVs) with the Specific Transposable Element Aligner (HERV-K).

samtools bam cram sam remove reads • 1.8k views
ADD COMMENT
0
Entering edit mode

The simplest solution would be to remove unmapped reads as samtools view -F 4 -o out.bam in.bam

ADD REPLY
0
Entering edit mode

Not an optimal solution as I expect to find relevant sequences particularly in the unmapped reads.

ADD REPLY
0
Entering edit mode

If you don't want to remove all unmapped reads then try the following which should only remove reads that have one or more N's (Note: this will remove reads that may have mapped otherwise but have an N). Not tested but should work. From BBTools package (samtools should be in your $PATH):

reformat.sh in=your.bam out=filtered.bam maxns=0

Adding mappedonly=t will achieve the same thing as @ATPoint's solution. You could also set maxns=N where N is a higher number like the last example you posted (if you want to keep that read).

ADD REPLY
0
Entering edit mode

The problem is that the read length can be variable and therefore I cannot determine a maximum number of Ns that I would want to exclude.

ADD REPLY
0
Entering edit mode

That is something you are going to need to decide for any solution (even if it is something you custom code). I would say try using reformat.sh with 0,1,2 etc N's to see how many alignments you lose.

ADD REPLY
0
Entering edit mode
3.9 years ago
h.mon 35k

You should perform some quality filtering prior to doing downstream analyses - any very light quality filtering would remove these all N reads.

Is this old sequencing data? It seems to be encoded with Illumina 1.5+ Phred format, as all the N bases have a B quality, which would be 3 for Illumina 1.5+ (the lowest possible quality), but 34 for Illumina 1.8+, which is of very good quality.

ADD COMMENT
0
Entering edit mode

Do you have any tool you would recommend for quality filtering?

This is the sequencing information:

@RG     ID:120617_SN1222_0117_BD1414ACXX_6      SM:TCGA-DE-A0Y3-10A-01D-A10R    LB:TCGA-DE-A0Y3-10A-01D-A10R    PL:Illumina_HiSeq2000   CN:Harvard_GCC_02
@PG     ID:bwa  PN:bwa  VN:0.5.9-r16    CL:-l40-k2-n3
@PG     ID:OLB  PN:OLB  VN:OLB-1.9.3    CL:setupBclToQseq.py
ADD REPLY
0
Entering edit mode

bbduk.sh from BBMap suite. A guide is available here.

ADD REPLY
0
Entering edit mode
3.9 years ago
Anand Rao ▴ 630

I have removed stretches of Ns in my FASTQ input file, as part of my FASTQ pre-procesing pipeline, BEFORE any mapping step,

The software suite is: HTStream

The specific tool I've used is: hts_NTrimmer

Install and usage instructions are at this link

The generic version of my very simple-to-use syntax is:

hts_NTrimmer -U $IN -F -g -f -p $OUT -L $LOG

You may be interested in other tools in this suite and their application to any other pre-processing requirements, in which case the link to their bundled use case can be found online at a relatively recent (March 2019) Bioinformatics Workshop page.

ADD COMMENT

Login before adding your answer.

Traffic: 2806 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6