Question: Remove fungi from human genome assemblies
0
gravatar for David
8 months ago by
David180
David180 wrote:

Hi,

I´m mapping a set of ont long reads to the human genome:

I´m using this genome version from gencode: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh38.p13.genome.fa.gz

See description here: https://www.gencodegenes.org/human/

I´m using a skin sample so i expect to find bacteria ... and fungi from Malazessia genus for exemple. But i get no hits since they are filtered when using minimap against the HG reference above.

I´m wondering how are you delaing with these false positives sequences (in such case fung) in human assemblies ??

Thanks

minimap dnaseq • 303 views
ADD COMMENTlink modified 6 months ago by Biostar ♦♦ 20 • written 8 months ago by David180

But i get no hits since they are filtered when using minimap against the HG reference above.

Are you certain? Why should they be filtered?

I am just speculating but I assume you prepped this sample to get fragments that were as long as possible. Perhaps non-human DNA was removed in that process? Fungi have tough cell walls so they could have been excluded?

Have you tried to use a software package like this to see if you are able to find other sequences in your entire data set?

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax92k

The fungi sequences are filtered together with the human genome, the reason why is because the human assemblies contain sequences that unfortunatly belong to contaminates (such as bacterial and fungi). These are hard to filter so was wondering.

The following post from Brian bushnell (bbmap package) discuses in 3) the fungi contamination. His suggestion is to map fungi databases against human genome and mask these sequences from the Human assembly.

http://seqanswers.com/forums/archive/index.php/t-42552.html

Just wondering if others have used a similar approach.

ADD REPLYlink written 8 months ago by David180
1

Brian is referring to short reads which are much more likely to be multi-map/false map.

What is the average read size in your case? At nanopore sized read length you should see something (if you expect non-human sequences) in your data. If you are seeing nothing then my hunch is still for sample prep eliminating non-human sequences. I guess you are not interested in human data for this experiment?

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax92k

Yes that´s a metagenomics experiment on skin, i´m not interested in human , but bacteria and fungi. For some reasons fungi are filtered out:

This is the command line:

# Map ONT  reads to human genome
minimap2  -t10 -ax map-ont human_p38.3.fa   ONT.reads.fq | samtools view --threads 10 -Sb - > mapping.human.bam

# Get nanopore reads unmapped to human genome
samtools view --threads 10 -b -f 4 -F 0x900  mapping.human.bam >  human.unmapped.bam
samtools sort...
samtools fastq...

#Get ONT reads mapped to the human genome
samtools view --threads 10 -b -F4 -F0x900 human.mapping.bam > human.mapped.bam
samtools sort...
samtools fastq...

Is that ok ?

ADD REPLYlink modified 8 months ago • written 8 months ago by David180

Can you try:

minimap2 -ax map-ont contaminant.fasta nanopore.fastq.gz | samtools fastq -n -f 4 - > nano_clean.fastq.gz

Command is from this thread. contaminant.fasta would be your human genome.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax92k

The -F 0x900 is for excluding secondary and supplementary alignments from the unmapped reads. I don´t see why secondary alignments would filter fungi unless human reads map to fungi ?

ADD REPLYlink written 8 months ago by David180

Do you have secondary alignments? You have not said anything about the length of your reads so far.

ADD REPLYlink written 8 months ago by genomax92k

Yes i do have secondary aligments:

samtools flagstat mapping.human.bam
141346 + 0 in total (QC-passed reads + QC-failed reads)
12798 + 0 secondary
44548 + 0 supplementary
0 + 0 duplicates
89377 + 0 mapped (63.23% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

As for the reads length: Average read length is 2553 bp

So i´m going to investigate secondary aligmments as i guess that´s the problem

ADD REPLYlink modified 8 months ago • written 8 months ago by David180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1333 users visited in the last hour