Question

Remove fungi from human genome assemblies

0

Entering edit mode

4.1 years ago

David ▴ 230

Hi,

I´m mapping a set of ont long reads to the human genome:

I´m using this genome version from gencode: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh38.p13.genome.fa.gz

See description here: https://www.gencodegenes.org/human/

I´m using a skin sample so i expect to find bacteria ... and fungi from Malazessia genus for exemple. But i get no hits since they are filtered when using minimap against the HG reference above.

I´m wondering how are you delaing with these false positives sequences (in such case fung) in human assemblies ??

Thanks

minimap dnaseq • 1.5k views

ADD COMMENT • link updated 4.0 years ago by Biostar 20 • written 4.1 years ago by David ▴ 230

0

Entering edit mode

But i get no hits since they are filtered when using minimap against the HG reference above.

Are you certain? Why should they be filtered?

I am just speculating but I assume you prepped this sample to get fragments that were as long as possible. Perhaps non-human DNA was removed in that process? Fungi have tough cell walls so they could have been excluded?

Have you tried to use a software package like this to see if you are able to find other sequences in your entire data set?

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Entering edit mode

The fungi sequences are filtered together with the human genome, the reason why is because the human assemblies contain sequences that unfortunatly belong to contaminates (such as bacterial and fungi). These are hard to filter so was wondering.

The following post from Brian bushnell (bbmap package) discuses in 3) the fungi contamination. His suggestion is to map fungi databases against human genome and mask these sequences from the Human assembly.

http://seqanswers.com/forums/archive/index.php/t-42552.html

Just wondering if others have used a similar approach.

ADD REPLY • link 4.1 years ago by David ▴ 230

1

Entering edit mode

Brian is referring to short reads which are much more likely to be multi-map/false map.

What is the average read size in your case? At nanopore sized read length you should see something (if you expect non-human sequences) in your data. If you are seeing nothing then my hunch is still for sample prep eliminating non-human sequences. I guess you are not interested in human data for this experiment?

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Entering edit mode

Yes that´s a metagenomics experiment on skin, i´m not interested in human , but bacteria and fungi. For some reasons fungi are filtered out:

This is the command line:

# Map ONT  reads to human genome
minimap2  -t10 -ax map-ont human_p38.3.fa   ONT.reads.fq | samtools view --threads 10 -Sb - > mapping.human.bam

# Get nanopore reads unmapped to human genome
samtools view --threads 10 -b -f 4 -F 0x900  mapping.human.bam >  human.unmapped.bam
samtools sort...
samtools fastq...

#Get ONT reads mapped to the human genome
samtools view --threads 10 -b -F4 -F0x900 human.mapping.bam > human.mapped.bam
samtools sort...
samtools fastq...

Is that ok ?

ADD REPLY • link 4.1 years ago by David ▴ 230

0

Entering edit mode

Can you try:

minimap2 -ax map-ont contaminant.fasta nanopore.fastq.gz | samtools fastq -n -f 4 - > nano_clean.fastq.gz

Command is from this thread. contaminant.fasta would be your human genome.

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Entering edit mode

The -F 0x900 is for excluding secondary and supplementary alignments from the unmapped reads. I don´t see why secondary alignments would filter fungi unless human reads map to fungi ?

ADD REPLY • link 4.1 years ago by David ▴ 230

0

Entering edit mode

Do you have secondary alignments? You have not said anything about the length of your reads so far.

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Entering edit mode

Yes i do have secondary aligments:

samtools flagstat mapping.human.bam
141346 + 0 in total (QC-passed reads + QC-failed reads)
12798 + 0 secondary
44548 + 0 supplementary
0 + 0 duplicates
89377 + 0 mapped (63.23% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

As for the reads length: Average read length is 2553 bp

So i´m going to investigate secondary aligmments as i guess that´s the problem

ADD REPLY • link 4.1 years ago by David ▴ 230