Find barcodes in fasta file
1
0
Entering edit mode
23 months ago
pablo ▴ 300

Hello,

I have a list of forward and reverse complement barcodes (I only show the first rows, there are 30,269 barcodes) :

kas077-1 AAACCCAAGAAGATCT AGATCTTCTTGGGTTT
kas077-2 AAACCCAAGACACACG CGTGTGTCTTGGGTTT
kas077-3 AAACCCAAGCGAGGAG CTCCTCGCTTGGGTTT
kas077-4 AAACCCAAGCGGACAT ATGTCCGCTTGGGTTT
kas077-5 AAACCCAAGCTTTGTG CACAAAGCTTGGGTTT
kas077-6 AAACCCAAGGATACAT ATGTATCCTTGGGTTT
kas077-7 AAACCCAAGGGCCCTT AAGGGCCCTTGGGTTT

And a multi-fasta file (883,993 sequences with a mean size of 1006pb). I need to split this file according to the barcodes. The specifity is :

  • I need to find the forward barcodes into the first 50 pb
  • I need to find the reverse barcodes into the last 50 pb

All the fasta sequences are not necessary found. I guess the half will be found.

from Bio import SeqIO

     barcodes = open("my_barcodes_file.txt", 'r')
        barcodes_fwd = {}
        barcodes_rev = {}

        for line in lines :
           barcodes_fwd[line.split(' ')[0]]=line.split(' ')[1]
           barcodes_rev[line.split(' ')[0]]=line.split(' ')[2]

        with open("my_fasta_file.fasta") as handle:
           for record in SeqIO.parse(handle, "fasta"):
              for (k1,v1),(k2,v2) in zip(barcodes_fwd.items(),barcodes_rev.items()):
                 if v1 in record.seq[0:50] or v2 in record.seq[-50:]:
                    print(record.id,k1+"\n",record.seq)

For the moment, I have something like that which returns :

>read_ID barcode_ID
fasta_sequence

I will split later the output in multi-files by barcode ID.

The problem is that it is very slow. Do you have an idea to update the code?

Best

fasta biopython • 1.5k views
ADD COMMENT
1
Entering edit mode

This is a perfect use case for bbduk.sh from BBmap suite in filter mode (it can do other things to). There is a guide available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/

While this is not a direct answer for your question if you are worried about performance then this should help.

ADD REPLY
0
Entering edit mode

Thanks, I'll have a look on this tool. Anyway, do you think my code could be improved to get what I expect?

ADD REPLY
0
Entering edit mode

Biopython libraries, as I understand, are slow. Instead of loop on lists, try list comprehension.

ADD REPLY
0
Entering edit mode

Shoud I use : bbduk.sh -Xmx2g in=my_file.fasta outm=matched.fa ref=barcodes.fasta k=16 stats=stats.txt ?

ADD REPLY
0
Entering edit mode

There are restrictleft= and restrictright= options that you will need to add. You can pipe things from one search into other.

An example would be (untested) for first index combo in your example

bbduk.sh -Xmx2g in=your.fa out=stdout.fa literal=AAACCCAAGAAGATCT restrictleft=50 | bbduk.sh -Xmx2g in=stdin.fa out=kas077-1.fa literal=AGATCTTCTTGGGTTT restrictright=50

What this should do is it first finds all reads that have left adapter in first 50 bases and only send those on to the second command that looks for second adapter in right half of reads and then writes out ones that match both instances.

ADD REPLY
0
Entering edit mode

Thanks for your prompt reply. I see the literal option which is interesting , but I have 30,269 x2 barcodes sequences. Is there a way to loop over the full set of barcodes file ? Also, a read has either the right or the left instance, never both.

ADD REPLY
2
Entering edit mode
23 months ago

I think, you could have a look at the source code of UMI-tools for inspiration how to optimize your code.

The software is also written in Python, and the umi-tools extract command is very similar to what you are trying to achieve (albeit from FastQ and not Fasta). I think the relevant code bits should be in the extract_methods.py.

Another Python software that does barcode assignment (I believe, by using minimap on a set of artificial reference sequences?) is Anglerfish. This code could also serve as a template to solve your problem...

Best Matthias

ADD COMMENT

Login before adding your answer.

Traffic: 2573 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6