Question: Trimming sequences based on NCBI contamination screen report
1
gravatar for T_18
6 months ago by
T_1840
T_1840 wrote:

Hi all!

I am sure I am not the first person with this problem and do not want to reinvent the wheel here, therefor I would like to know if someone has a solution for the following problem:

Based on the contamination report I got from the NCBI contamination screen after trying to upload my transcriptome assembly there are some sequences that need to be trimmed. One example from the report looks like this:

"sequence name" 350 1..109 Pseudomonas putida W619

Meaning that bp 1 - 109 is of suspicious origin and needs to be trimmed from this sequence that is 350 bp in total. Is there a program or script known to you that reads in both the contamination report (can be edited so it only includes lines as the example) and the fasta file and spits out a fasta with some of the secuences trimmed?

Or do I need to write one myself (not so experienced)..?

Thanks in advance!

transcriptome assembly ncbi • 388 views
ADD COMMENTlink modified 11 weeks ago by hjafar0 • written 6 months ago by T_1840

Meaning that bp 1 - 109 is of suspicious origin and needs to be trimmed from this sequence that is 350 bp in total.

NCBI's contamination screen report is probably done using some kind of automated pipeline. Do you have reason to believe that observation in it are true/real? It is possible that your organism may have the same sequence in its genome.

ADD REPLYlink modified 6 months ago • written 6 months ago by genomax69k

I do not have a reference genome so I can never be certain about this, only very careful. My organism is an insect so if a certain region is hitting a Pseudomonas I am quit critical and I would trim this down. Also, when I ran some of these reported sequences randomly through blast I see that region 110-350 is nicely aligned to an insect gene, but the rest Pseudomonas. And this is for all the suspicious reads. Regions that were marked as "need to be deleted completely" are aligning to something different than an insect region when blasting, can be E. coli, mouse, human. I took the safe side here and deleted these completely.

ADD REPLYlink written 6 months ago by T_1840

Also, when I ran some of these reported sequences randomly through blast I see that region 110-350 is nicely aligned to an insect gene, but the rest Pseudomonas.

  • That is suggestive that your insert may have an endogenous Pseudomonas species closely associated with it (if you are able to reproduce the observation with other libraries made from independent insects). e.g. Drosophila contain Wolbachia.
  • If you know that simply not to be the case then you may have contamination in your data that would make the entire assembly suspicious.
ADD REPLYlink written 6 months ago by genomax69k

Just for clarity, this is transcriptome data (RNAseq), meaning that some of the assembled transcripts are not necessarily full assembled genes.

Now I just picked out a sequence with a region hitting Pseudomonas. Fact is that the rest of the region does hit the insect's species gene. Furthermore, I also see that especially the regions that are advised to be deleted completely are really short transcripts (~200 bp) and not close to a full gene. Probably they are hitting another species, vector or even strange (not used) adapters simply due to chance..

I always like to stay on the safe side and rather delete suspicious reads (that are apparently not picked up as insect's own anyway, hence useless for the transcriptome analyses)..

But what would you advice..? And my original question still stands, is there a program out there where I can trim a predefined region from a batch of sequences? (with the region being different for each sequence..)

Thanks!

ADD REPLYlink modified 6 months ago • written 6 months ago by T_1840

To answer your question you can create a bed intervals file that contains information in this format (by cutting out relevant info from your report, example below from your original)

id_of_transcript     end_of_bad_hit+1    length_of_transcript
transcr_1                110                             350
transcr_2                324                             550

Plus

 id_of_transcript     start of transcript    start_of_bad_hit-1
 transcr_1                1                             350
 transcr_2                34                            120

Once you have these you can use samtools faidx, bedtools getfasta or pyfaidx with the intervals and your original dataset to extract just the sequences you want to keep.

ADD REPLYlink modified 6 months ago • written 6 months ago by genomax69k

Can you tell me how to solve this feedback from NCBI as shown below?

I have trimmed adapters and removed contaminants and then mapped/splited fasta file as flowing command line .

#Trim adapters
bbduk.sh in=~/r.fa out=trimmed.fa ktrim=r k=23 mink=11 hdist=1 ref=/bbmap/resources/adapters.fa 

#Remove contaminants
bbduk.sh in=trimmed.fa out=filtered.fa k=31 ref=/bbmap/resources/sequencing_artifacts.fa.gz,/bbmap/resources/phix174_ill.ref.fa.gz

#Quantify coverage
bbmap.sh ref=~/reference.fa in=trimmed.fa out=mapped.fa covstats=covstats.txt
OR
bbsplit.sh in=trimmed.fa ref=~/reference.fa basename=out_%.fasta outu=clean.fa int=t

Thank you in advance.

SUBID BioProject BioSample Organism

[] We ran your sequences through our Contamination Screen. The screen found contigs that need to be trimmed and/or excluded. Please adjust the sequences appropriately and then resubmit your sequences. After you remove the contamination, trim any Ns at the ends of the sequence and remove any sequences that are shorter than 200 nt and not part of a multi-component scaffold.

Note that hits in eukaryotic genomes to mitochondrial sequences can be ignored when specific criteria are met. Those criteria are explained below.

Note that mismatches between the name of the adaptor/primer identified in the screen and the sequencing technology used to generate the sequencing data should not be used to discount the validity of the screen results as the adaptors/primers of many different sequencing platforms share sequence similarity.

[] Some of the sequences hit primers or adaptors used in Illumina or 454 or other sequencing strategies or platforms. Adaptor at the end of a sequence should be removed. However, if adaptors are present within sequences then you should strongly consider splitting the sequences at the adaptors because the adaptor sequence could have been the region of overlap, causing a misassembly.

Skipped 2 same as before; no new sequences to screen. 2 sequences with locations to mask/trim

Trim: Sequence name, length, span(s), apparent source

ADD REPLYlink modified 11 weeks ago by genomax69k • written 11 weeks ago by hjafar0

There is not enough information in this post for us to be able to help you. It looks like you seem to have mostly done the right thing with various bbmap commands but we have no idea what exactly you are doing. There is one exception.

bbmap.sh ref=~/reference.fa in=trimmed.fa out=mapped.fa

That command does not make sense. Why would you use .fa format for alignments, if that is what you are doing.

Are you trying to submit just plain fasta data through the genome submission?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by genomax69k

Yes, I am trying to submit plain fasta data through the genome submission to NCBI.

I submitted the genome using the following tree step yesterday and I am waiting for the feedback.

#Trim adapters
bbduk.sh in=~/r.fasta out=trimmed.fasta ktrim=r k=23 mink=11 hdist=1 ref=/bbmap/resources/adapters.fa 

#Remove contaminants
bbduk.sh in=trimmed.fasta out=filtered.fasta k=31 ref=/bbmap/resources/sequencing_artifacts.fa.gz,/bbmap/resources/phix174_ill.ref.fa.gz

#Quantify coverage
bbmap.sh ref=~/reference.fasta in=trimmed.fasta outm=mapped.fasta outu=unmapped.fasta
ADD REPLYlink modified 11 weeks ago by genomax69k • written 11 weeks ago by hjafar0

Where did this genome come from? i.e. r.fasta. Did you assemble it yourself? If you did assemble the genome yourself, the trimming adapters/removing contaminants is the first step that needs to be done on raw data before one does an assembly.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by genomax69k

I received the following feedback from NCBI as shown below.

Can you tell me your suggestions ?

Thank you for your GenBank submission.

SUBID           BioProject      BioSample       Localid         Accession       Organism
----------------------------------------------------------------------------------------

Before we continue to process the genome(s), we would like to know why this genome was marked as not circular.

Most bacterial genomes have a circular chromosome. Did you mark this as not circular because:

[1] there is a gap at the end between the last and first base

[2] the genome assembly was not circularized

[3] any possible overlap at the ends was not trimmed?

This information will help us design appropriate questions for the submission portal.

ADD REPLYlink modified 11 weeks ago by genomax69k • written 11 weeks ago by hjafar0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLYlink written 11 weeks ago by genomax69k

You have not answered questions I had asked in my responses. So there is not else to say.

You need to answer the questions NCBI is asking. We have no idea what exactly you did and what this data is from/about.

ADD REPLYlink written 11 weeks ago by genomax69k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 555 users visited in the last hour