Question

Removing chrEBV reads from bam file

0

3.1 years ago

Marco Pannone ▴ 790

Hello!

I am trying to visualize some .wig files on the UCSC Genome Browser, but I keep getting the following error message while uploading my custom track:

Error line 39996036 of somefile.gz: 'chrEBV' is not a valid sequence name in hg38

I have already tried to filter out all the reads corresponding to chrEBV from the starting .bam files, using the procedure suggested in Remove mitochondrial reads from BAM files.

Unfortunately, I still get the same error when trying to upload the new .wig file.

How can I make sure to get rid of all the chrEBV reads and finally manage to visualize my data on the Genome Browser?

Thanks!

bam genome • 2.4k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.1 years ago by Marco Pannone ▴ 790

0

I have already tried to filter out from the starting .bam files all the read

How did you do that? What was the command line? What is the output of `samtools idxstats`?

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

0

My command line was:

samtools idxstats input.bam | cut -f 1 | grep -v chrEBV | xargs samtools view -b input.bam > output_filtered.bam

ADD REPLY • link 3.1 years ago by Marco Pannone ▴ 790

0

It looks OK. And how did you create the wig?

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

0

I first created .bigWig files with the following command line:

bamCoverage -b input.bam -o output.bw -of bigwig -bs 20 -p 6 --effectiveGenomeSize 2747877777 --normalizeUsing RPKM -e 76 --centerReads

and then converted it from .bigWig to .wig, using this command line:

bigWigToWig input.bw output.wig

bamCoverage does not allow me to create a .wig file directly, which is why I follow this two-step procedure. Also, when uploading tracks to the UCSC Genome Browser, I prefer .wig.gz files over .bigWig, since .bigWig files first have to be uploaded to a web server and then provided as a URL (as far as I understand).

ADD REPLY • link 3.1 years ago by Marco Pannone ▴ 790

1

Yes, I think ATpoint is right. If bamCoverage uses the SAM header dictionary to create the wig, it will create some records containing the chrEBV chromosome.
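If re-filtering the BAM is not convenient, the leftover chrEBV records could also be dropped from the wig text itself. This is only a minimal sketch: it assumes the wig uses either fixedStep/variableStep declarations or bedGraph-style four-column lines (possibly preceded by `#bedGraph section` comment lines), and `drop_contig_from_wig` is a hypothetical helper name, not part of any UCSC or deepTools tool.

```shell
# Hypothetical helper: remove every section of one contig from wig text.
# $1 = contig name; reads wig text on stdin, writes filtered text on stdout.
drop_contig_from_wig() {
  awk -v contig="$1" '
    # A step declaration decides whether the data lines after it are skipped.
    $1 == "fixedStep" || $1 == "variableStep" {
      skip = 0
      for (i = 2; i <= NF; i++) if ($i == "chrom=" contig) skip = 1
      if (!skip) print
      next
    }
    # Comment line that may precede a bedGraph-style section.
    $1 == "#bedGraph" && $2 == "section" {
      skip = (index($3, contig ":") == 1)
      if (!skip) print
      next
    }
    # Track and browser lines are always kept.
    $1 == "track" || $1 == "browser" { print; next }
    # bedGraph-style data line: chrom start end value.
    NF == 4 { if ($1 != contig) print; next }
    # Remaining lines are step data: keep them unless inside a skipped section.
    !skip
  '
}

# Possible usage (not run here):
#   drop_contig_from_wig chrEBV < output.wig | gzip > output_noEBV.wig.gz
```

Comparing the contig name exactly, rather than with a pattern, avoids accidentally dropping other sequences whose names merely contain the same string.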

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

0

Yes, indeed his approach worked fine. Thanks anyway for the reply and the time dedicated to my issue :)

ADD REPLY • link 3.1 years ago by Marco Pannone ▴ 790

score 2 · Answer 1 · 2021-03-12

2

3.1 years ago

ATpoint 82k

That will only remove the reads aligned to chrEBV; it will not remove chrEBV from the header. Depending on which tool you use, chrEBV will still appear in the wig with a coverage of zero.

Try (untested):

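# input.bam must be coordinate-sorted and indexed for the region query to work.
# The second grep also strips the chrEBV @SQ line from the header.
samtools idxstats input.bam \
| cut -f 1 \
| grep -v 'chrEBV' \
| xargs samtools view -h input.bam \
| grep -v 'chrEBV' \
| samtools view -b -o output_filtered.bam -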
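A more targeted variant of the same idea, as a sketch only: a plain `grep -v 'chrEBV'` drops every line that mentions the name anywhere, whereas the awk filter here looks only at the `@SQ` header entry, the reference name (column 3), and the mate reference name (column 7). `strip_contig_from_sam` is a hypothetical helper name; optional tags such as `SA` are left untouched, which is an assumption that may not suit every dataset.

```shell
# Hypothetical helper: remove one contig from SAM text (header and records).
# $1 = contig name; reads SAM on stdin, writes SAM on stdout.
strip_contig_from_sam() {
  awk -v contig="$1" -F'\t' '
    /^@SQ/ {
      # Drop the @SQ line whose SN: field names the contig exactly.
      for (i = 2; i <= NF; i++) if ($i == "SN:" contig) next
      print
      next
    }
    # Keep every other header line unchanged.
    /^@/ { print; next }
    # Drop reads aligned to the contig, or whose mate is aligned to it.
    $3 == contig || $7 == contig { next }
    { print }
  '
}

# Possible usage with a real BAM (not run here):
#   samtools view -h input.bam \
#     | strip_contig_from_sam chrEBV \
#     | samtools view -b -o output_filtered.bam -
```

Checking afterwards with `samtools view -H output_filtered.bam` should show no `SN:chrEBV` entry left in the header.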

ADD COMMENT • link 3.1 years ago by ATpoint 82k

0

This worked for me! Now the .wig files upload on the UCSC Genome Browser without any error message. Thanks a lot for that!

Just out of curiosity, how is it possible that I ended up with chrEBV reads in my dataset? They belong to the Epstein-Barr virus, right? The starting biological material of the experiment is a human cell culture, so I wonder whether this is a sign of cell culture contamination. Still, I did not expect the GRCh38 genome assembly to also include genomic regions belonging to EBV.

ADD REPLY • link 3.1 years ago by Marco Pannone ▴ 790

2

Including common contaminants in the reference removes false-positive reads that would otherwise map to the wrong position. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0097876

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

1

Adding to this, here is a nice explanation as well: C: What is the story behind chrEBV?

It basically captures EBV contamination: giving EBV-derived reads their true source sequence (chrEBV) to align to prevents them from being falsely aligned to the human genome. This is a so-called "decoy" sequence.
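To get a feeling for how many reads the decoy actually absorbed, the `samtools idxstats` output can be summarised. This is a rough sketch, assuming the usual four tab-separated idxstats columns (name, length, mapped, unmapped); `contig_read_share` is a hypothetical helper name.

```shell
# Hypothetical helper: report mapped reads on one contig and its share (%)
# of all mapped reads. $1 = contig name; stdin = `samtools idxstats` output.
contig_read_share() {
  awk -v contig="$1" -F'\t' '
    { total += $3 }                 # column 3 = mapped read count
    $1 == contig { hits = $3 }
    END {
      if (total > 0) printf "%s\t%d\t%.4f\n", contig, hits, 100 * hits / total
      else printf "%s\t0\t0.0000\n", contig
    }
  '
}

# Possible usage (not run here):
#   samtools idxstats input.bam | contig_read_share chrEBV
```

How large a share counts as meaningful depends on the sample, so the number is only a starting point for judging contamination or EBV-carrying cells.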

ADD REPLY • link 3.1 years ago by ATpoint 82k