Question: Subsetting fasta files based on blast results
0
gravatar for dpearton
4.0 years ago by
dpearton0
South Africa
dpearton0 wrote:

Hi,

 

I have assembled a de novo transcriptome from my species of interest.  The species has an internal symbiont (algae) and those sequences obviously we sequenced along with it.  I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont.  I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately.  It there an easy way to do this?

The assembly contains ~350,000 contigs.

Thanks.

blast rna-seq • 1.0k views
ADD COMMENTlink modified 4.0 years ago by michael.ante3.4k • written 4.0 years ago by dpearton0
0
gravatar for michael.ante
4.0 years ago by
michael.ante3.4k
Austria/Vienna
michael.ante3.4k wrote:

Hi,

you can store the IDs of interest in a file and loop over them using fasgrep (from FAST):

for i in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; fasgrep -v $i my_contig.fa >> host.fasta ; done

Cheers,

Michael

ADD COMMENTlink written 4.0 years ago by michael.ante3.4k

Thanks for the advice,

 

Unfortunately I'm having issues trying to install FAST on my system (issues with lib paths, etc).  Is there another way to do this using shell scripts, basic perl or python?

ADD REPLYlink written 4.0 years ago by dpearton0

With the fastx_toolkit, you can format your fasta-files in such a way, that you have a tab-separated format:

fasta_formatter -t -i my_contig.fa -o my_contig.tsv

Input Example:
   >MY-ID
   AAAAAGGGGG
   CCCCCTTTTT
   AGCTN

Output example with tabular output [-t]:
   MY-ID        AAAAAGGGGGCCCCCTTTTAGCTN

From this, you can easily use the standard grep and reformat it with awk/perl afterwards.

ADD REPLYlink written 4.0 years ago by michael.ante3.4k

Hi,

I am having an issue with this.  I finally managed to get the FAST module installed and working so I tried your script.

I recovered the IDs from the BLAST output using cut to give the following (I'm using your file names for simplicity's sake)

$ head -5 id-list.txt
TR75917|c0_g1_i3
TR31996|c0_g1_i1
TR18920|c0_g1_i2
TR90390|c0_g1_i1
TR42315|c0_g1_i1

There are  (cat id-list.txt |wc -l) = 17274 lines

I then used this as my id-list.txt in the code you suggested above

for i in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; fasgrep -v $i my_contig.fa >> host.fasta ; done

 This script runs and runs and eventually I killed it and it gave the following files as output:

-rw-r--r-- 1  707115619 Sep 30 15:06 host.fasta
-rw-r--r-- 1     291823 Sep 30 15:01 id-list.txt
-rw-r--r-- 1  292214587 Sep 30 15:00 my_contig.fa
-rw-r--r-- 1  161001520 Sep 30 15:05 symbiont.fasta

As you can see the output files are considerably larger than the input file and contain many more (obviously repeated) sequences!

I'm assuming that the problem is with the IDs as contained in the id-list.txt file.  What format do they need to be in? -I have them as a single column.  They are unique for each contig in the my_contig.fa file. 

Help!

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by dpearton0

To further clarify, my contigs are the output of a trinity run and so the fasta files have the following format:

>TR1|c0_g1_i1 len=581 path=[584:0-374 585:375-398 586:399-580] [-1, 584, 585, 586, -2]
AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC
AGCAAGGTGCTCCCCGAATCGCTTCCGGCGAGTCCTAGGAGTCCCTGCAGTCAGCGCAAG
>TR2|c0_g1_i1 len=324 path=[327:0-99 328:100-123 329:124-323] [-1, 327, 328, 329, -2]
TTTCAAAGTTTTGCGGAAAACCGCAAAGTTTCCCTAGAATTTTCGGTGCTCACTTTTATC

(Sequence removed for the sake of clarity).

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by dpearton0

I'm sorry, the second command will duplicate the my_contig.fa in each step (except the entry of the ID in the loop). Thus, this will not work.

The problem with your IDs might be that the '|' could be interpreted as a logical or. Searching in case of TR31996|c0_g1_i1 for TR31996 or c0_g1_i1 . Try to replace the '|' with a non-regex character, like '_'.

If this works, the next step would be to generate a second ID list with all the IDs, which were not part of the first ID list.

ADD REPLYlink written 4.0 years ago by michael.ante3.4k

I'm not sure what you are saying here?  Are you saying that the script as written:

for i in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; fasgrep -v $i my_contig.fa >> host.fasta ; done

will not work.  Or that it will work if I replace the '|' in my IDs.  I've already removed all the extraneous info other than the unique ID using sed so it should be simple to replace it with another character. I would prefer to use '-'  rather than '_' given that '_' is already being used in the name.  If I do this should the script work?

 

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by dpearton0

Looping over the id-list and using fasgrep -v each time will generate a very huge file. Each iteration will add the whole file but one entry. Thus, the resulting file would be 17274 times 278MB minus once the entries from the id list. Therefore:

for i in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; done

will work.

 

Replacing the | with a - worked in your given example (with the first fasgrep).

ADD REPLYlink written 4.0 years ago by michael.ante3.4k

Hi,

Yep, I figured out the problem with the -v option.  I managed to great an id list with just the host by using a combination of cat, sort and uniq commands and also replaced the | with - (and simplified the ID line down by removing everything except the unique ID.  If nothing else this excersize has made me much better with sed...

The issue I'm finding is the length of time this process is taking.  I know it is not multi-threaded but each command has been running for over 24hrs on a separate core and has only sorted 6-7,000 items each.  At this rate the full file will take 27 days to run!  I can split the files and run them concurrently but there must be a more efficient way to run this?

ADD REPLYlink written 4.0 years ago by dpearton0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 971 users visited in the last hour