Question: Selecting fastq sequences
0
gravatar for NestorVB
9 days ago by
NestorVB20
Sweden/Stockholm/Karolinska Institutet
NestorVB20 wrote:

Hello everyone!

So, I am new in python (just did 1 week crash course), and a colleague asked for my help to create a little python3 program that would input a fastq file, a sequence query, a barcode file and an output file, and would find him all the individual barcodes (from the barcode file) that are found with the query sequence in the fastq file.

I wrote this little contraption of mine that almost does the job:

infile = open(sys.argv[1], "r")
vfile = sys.argv[2]
barcode = open(sys.argv[3], "r")
outfile = open(sys.argv[4], "w")
usedbarcodes = []
count = 0
seq = ""
for line in infile:
    if count == 0:
        Id = line.rstrip()
    elif count == 1:
        if line.find(vfile) > 0:
            for bar in barcode:
                print(bar)
                if line.find(bar) > 0:
                    if bar in usedbarcodes:
                        break
                    else:
                        print("Yeah")
                        seq = line.rstrip()
                        bar = bar.rstrip()
                        usedbarcodes.append(bar)
                        print("Used barcodes until now",usedbarcodes)
                        break
    elif count == 2:
        sign = line.rstrip()
    elif count == 3:
        q = line.rstrip()
        if len(seq) > 10:
            sequence = [Id,seq,sign,q]
            print("\n".join(sequence), file = outfile)
        count = 0
        seq = ""
    if count < 3:
        count += 1

infile.close()
barcode.close()
outfile.close()

The problem seems to be that the order of the barcodes matters when trying to find them, which would indicate that the loop does not search for all the barcodes everytime but just the ones that have not been searched for before. I expected it to be restarted everytime, and my question is if anyone could tell me why it does that and how to avoid it.

Sorry again for the probably extremely inefficient and complicated codding. I will gladly accept any contructive criticism on that too =D. Thanks a lot in advance!

Néstor

for loop fastq python • 170 views
ADD COMMENTlink modified 9 days ago by Corentin100 • written 9 days ago by NestorVB20
3

Have you looked into biopython?

ADD REPLYlink written 9 days ago by KVC_bioinfo200
3

While nestorvazquezb will no doubt benefit from biopython, learning how to do something with just basic syntax in a programming language is an important skill to get under one's belt. This is the right way to learn a new language and I commend nestorvazquezb for that.

ADD REPLYlink modified 9 days ago • written 9 days ago by genomax37k

I dont really understand how it works =\

ADD REPLYlink written 9 days ago by NestorVB20
5

The tutorial and cookbook is a great resource.

In a nutshell:

for record in SeqIO.parse("yourfile.fastq", "fastq"):
    record.seq.find("AGACAG")
ADD REPLYlink written 9 days ago by WouterDeCoster23k
5
gravatar for Corentin
9 days ago by
Corentin100
Corentin100 wrote:

Hi,

Your barcode variable is a file object and so an iterator (source: https://docs.python.org/2/library/stdtypes.html#bltin-file-objects A file object is its own iterator and https://stackoverflow.com/questions/16994552/is-file-object-in-python-an-iterable), when you loop through a file object it invokes the next() method, which can only read the file once. So what you can do is create a list object and add all your barcodes to this list, indeed you can go through a list as often as you want,

Regarding the coding I would suggest adding more comments, even if this is only for you, it will be useful if you need to modify this script later.

Moreover, instead of reading the whole fastq line by line (which can be quite big) have you considered to use grep or regular expression to only get the sequences of interest (to create a subset of your fastq file) and then check for barcodes ?

Regards, Corentin

ADD COMMENTlink written 9 days ago by Corentin100
2

I think it's worth highlighting that you can re-loop through a file by using seek() to start iterating back at the beginning of the file (as noted in the link you posted).

Also, strongly agree about commenting. I've come back to a number of my quick, un-commented scripts only to spend time trying to figure what the hell I was doing...

ADD REPLYlink written 9 days ago by Russ150
2

Thank you very much! This was really useful, it still did not solve my problem directly, but I realized I was making it unnecessarily complicated and decided to rewrite the code creating first list/sets for the files of sequences and barcodes and then look through them in a more simple loop. I find now 99% of what I am looking for. About the grep, I am quite new and I feel I would not know how to put it back together into the full fastq after I have found my thing (my final goal is a fastq file). For anyone looking to do something similar, this is my code:

infile = open(sys.argv[1], "r")
vfile = sys.argv[2]
barcodefile = open(sys.argv[3], "r")
outfile = open(sys.argv[4], "w")
# Make a set of sequences (4 lines) out of the input file
infilelist = []
count = 0
for line in infile:
    if count == 0:      # This is the Id line, store
        Id = line.rstrip()
        count += 1
    elif count == 1:    # Sequence line
        seq = line.rstrip()
        count += 1
    elif count == 2:
        sign = line.rstrip()
        count += 1
    elif count == 3:
        q = line.rstrip()
        sequence = [Id,seq,sign,q]
        line = "\n".join(sequence)
        infilelist.append(line)
        count = 0
print("\nThere are {0} sequences in the input file.\n".format(len(infilelist)))

# Make a set out of the barcodes
barset = set()
for bar in barcodefile:
    bar = bar.rstrip()
    barset.add(bar)
print("\nThere are {0} barcodes for this sequence.\n".format(len(barset)))
# Make a set for the used barcodes.
usedbarcodes = set()
x = 0
# Loop through the fastq set and check against the querry sequence and the barcode
for seq in infilelist:
    if seq.count(vfile) > 0:    # If you find the querry sequence in the sequence
        for bar in barset:     # Look for the barcode
            if seq.find(bar) > 0:
                if bar in usedbarcodes  # Make sure the barcode has not been found before
                    break
                else:
                    bar = bar.rstrip()  # If not print
                    usedbarcodes.add(bar)
                    print(seq, file = outfile)
                    break

print("\nWe found {0} barcodes associated to the querry sequence\n".format(len(usedbarcodes)))
infile.close()
barcodefile.close()
outfile.close()
ADD REPLYlink written 8 days ago by NestorVB20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1359 users visited in the last hour