Question: Removing Human Contigs From Metagenomic Shotgun Assembly (FASTA)
isu2017 wrote, 13 months ago:

Hi there,

I used SPAdes to assemble my metagenomic shotgun dataset into contigs. I just realized, however, that there is human contamination in this assembly. Because the assembly took so long, I'd like to find a way to remove those human contigs directly from the FASTA assembly. Any suggestions? And if I do need to go back a step and remove them from the FASTQ files instead, how should I proceed? (I'd rather not use something like KneadData for removing the human contamination, by the way.)

thanks!

metagenome metagenomics spade

You could simply align the contigs to the human genome (using BLAT, LAST or LASTZ) and remove the sequences that align.
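Once the aligner has told you which contigs hit the human genome, the removal itself is a small filtering job. Here is a minimal sketch, assuming you have already extracted the IDs of the aligned contigs into a plain text file with one ID per line (how you get that list depends on the aligner's output format and on whatever identity or coverage cutoff you choose). The names `drop_contigs` and `filter_assembly` are made up for this example.

```python
from typing import Iterable, Iterator, Set


def drop_contigs(fasta_lines: Iterable[str], human_ids: Set[str]) -> Iterator[str]:
    """Yield FASTA lines, skipping every record whose ID is in human_ids."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # The record ID is the header text up to the first whitespace.
            fields = line[1:].split()
            keep = not fields or fields[0] not in human_ids
        if keep:
            yield line


def filter_assembly(ids_path: str, fasta_in: str, fasta_out: str) -> None:
    """Write fasta_in to fasta_out without the contigs listed in ids_path."""
    # Assumption: ids_path holds one contig ID per line.
    with open(ids_path) as fh:
        human_ids = {line.strip() for line in fh if line.strip()}
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        dst.writelines(drop_contigs(src, human_ids))
```

Calling `filter_assembly("human_ids.txt", "contigs.fasta", "contigs.nonhuman.fasta")` (file names are placeholders) would leave the original assembly untouched and write the filtered copy next to it.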

If you are willing to go back to the original data then try: http://seqanswers.com/forums/showthread.php?t=42552

written 13 months ago by genomax

BlobTools is great for this, although if you have too many contigs (hundreds of thousands or millions) the BLAST step may be too slow.

written 13 months ago by h.mon

In addition to good suggestions that are already part of this thread, I think you should look at all similar posts on the far right side of this page. This is a fairly common problem and has been debated already.

You may also want to consider binning your sequences with t-SNE or UMAP. Human contigs longer than 5 kb should separate easily from the other sequences.
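To make the binning idea concrete: t-SNE and UMAP need a numeric feature vector per contig. The comment does not say which features to use; tetranucleotide frequencies are a common choice for composition-based binning, so that is what this sketch assumes. It only builds the feature matrix, and the function names and the `min_len` default (taken from the 5 kb figure above) are illustrative.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

K = 4  # tetranucleotides (assumed feature choice)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 256 columns


def kmer_profile(seq: str) -> List[float]:
    """Return the relative frequency of each 4-mer in seq (256 values)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Only count k-mers made of A/C/G/T; windows containing N are ignored.
    total = sum(counts[kmer] for kmer in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


def build_matrix(contigs: Dict[str, str], min_len: int = 5000) -> Tuple[List[str], List[List[float]]]:
    """Profile every contig of at least min_len bases; return names and rows."""
    names = [name for name, seq in contigs.items() if len(seq) >= min_len]
    rows = [kmer_profile(contigs[name]) for name in names]
    return names, rows
```

The resulting rows can then be handed to whichever embedding implementation you prefer, for example scikit-learn's `TSNE(n_components=2).fit_transform(rows)`, and the 2-D points plotted to look for a cluster that stands apart.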

written 13 months ago by Mensur Dlakic
theglobetrotter78 wrote, 13 months ago:

Removing the host genome should be part of the quality control step of your metagenomic pipeline. You can do this right after you quality-trim your sequences. There are several ways to remove the host genome, but I personally used BWA (Bowtie2 is another option) to align the reads to the human genome. You then separate the alignments into two SAM or BAM files (aligned and unaligned), take the unaligned SAM/BAM file, and convert it to FASTA or FASTQ (I used Picard Tools here, but you can also use SAMtools or BamTools) to obtain the non-human reads with which you will perform the assembly. Regardless of whether you remove host reads or not, SPAdes can be a memory hog and take a while to run, depending on the size of your data set.
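The "keep the unaligned reads" step comes down to checking the FLAG column of each SAM record, which is what tools such as `samtools view -f 4` do for you. The sketch below shows that logic in plain Python on uncompressed SAM text, purely as an illustration; on real data sets the dedicated tools are the practical choice. One assumption to note: for paired reads it keeps a read only when its mate is unmapped too, which is a conservative choice and not something the answer above prescribes. The function names are made up.

```python
from typing import Iterable, Iterator, Tuple

# Bits of the SAM FLAG field (from the SAM specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
FIRST_IN_PAIR = 0x40
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def non_human_reads(sam_lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for reads that did not align."""
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if not flag & UNMAPPED:
            continue  # read aligned to the human reference
        if flag & PAIRED:
            if not flag & MATE_UNMAPPED:
                continue  # assumption: drop the pair if either mate aligned
            name += "/1" if flag & FIRST_IN_PAIR else "/2"
        yield name, seq, qual


def to_fastq(name: str, seq: str, qual: str) -> str:
    """Format one read as a four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Writing `to_fastq(*read)` for every `read` from `non_human_reads(open("aligned_to_human.sam"))` (placeholder file name) would give a FASTQ file of candidate non-human reads for the assembler.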
