Question: Trimming sequences based on NCBI contamination screen report
0
gravatar for T_18
28 days ago by
T_1820
T_1820 wrote:

Hi all!

I am sure I am not the first person with this problem and do not want to reinvent the wheel here, therefor I would like to know if someone has a solution for the following problem:

Based on the contamination report I got from the NCBI contamination screen after trying to upload my transcriptome assembly there are some sequences that need to be trimmed. One example from the report looks like this:

"sequence name" 350 1..109 Pseudomonas putida W619

Meaning that bp 1 - 109 is of suspicious origin and needs to be trimmed from this sequence that is 350 bp in total. Is there a program or script known to you that reads in both the contamination report (can be edited so it only includes lines as the example) and the fasta file and spits out a fasta with some of the secuences trimmed?

Or do I need to write one myself (not so experienced)..?

Thanks in advance!

transcriptome assembly ncbi • 151 views
ADD COMMENTlink written 28 days ago by T_1820

Meaning that bp 1 - 109 is of suspicious origin and needs to be trimmed from this sequence that is 350 bp in total.

NCBI's contamination screen report is probably done using some kind of automated pipeline. Do you have reason to believe that observation in it are true/real? It is possible that your organism may have the same sequence in its genome.

ADD REPLYlink modified 28 days ago • written 28 days ago by genomax62k

I do not have a reference genome so I can never be certain about this, only very careful. My organism is an insect so if a certain region is hitting a Pseudomonas I am quit critical and I would trim this down. Also, when I ran some of these reported sequences randomly through blast I see that region 110-350 is nicely aligned to an insect gene, but the rest Pseudomonas. And this is for all the suspicious reads. Regions that were marked as "need to be deleted completely" are aligning to something different than an insect region when blasting, can be E. coli, mouse, human. I took the safe side here and deleted these completely.

ADD REPLYlink written 26 days ago by T_1820

Also, when I ran some of these reported sequences randomly through blast I see that region 110-350 is nicely aligned to an insect gene, but the rest Pseudomonas.

  • That is suggestive that your insert may have an endogenous Pseudomonas species closely associated with it (if you are able to reproduce the observation with other libraries made from independent insects). e.g. Drosophila contain Wolbachia.
  • If you know that simply not to be the case then you may have contamination in your data that would make the entire assembly suspicious.
ADD REPLYlink written 26 days ago by genomax62k

Just for clarity, this is transcriptome data (RNAseq), meaning that some of the assembled transcripts are not necessarily full assembled genes.

Now I just picked out a sequence with a region hitting Pseudomonas. Fact is that the rest of the region does hit the insect's species gene. Furthermore, I also see that especially the regions that are advised to be deleted completely are really short transcripts (~200 bp) and not close to a full gene. Probably they are hitting another species, vector or even strange (not used) adapters simply due to chance..

I always like to stay on the safe side and rather delete suspicious reads (that are apparently not picked up as insect's own anyway, hence useless for the transcriptome analyses)..

But what would you advice..? And my original question still stands, is there a program out there where I can trim a predefined region from a batch of sequences? (with the region being different for each sequence..)

Thanks!

ADD REPLYlink modified 26 days ago • written 26 days ago by T_1820

To answer your question you can create a bed intervals file that contains information in this format (by cutting out relevant info from your report, example below from your original)

id_of_transcript     end_of_bad_hit+1    length_of_transcript
transcr_1                110                             350
transcr_2                324                             550

Plus

 id_of_transcript     start of transcript    start_of_bad_hit-1
 transcr_1                1                             350
 transcr_2                34                            120

Once you have these you can use samtools faidx, bedtools getfasta or pyfaidx with the intervals and your original dataset to extract just the sequences you want to keep.

ADD REPLYlink modified 25 days ago • written 25 days ago by genomax62k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1591 users visited in the last hour