Question: For Eland _Export.Txt From Paired Reads To Fastq, Are Unequal Number Of Reads A Problem?
0
gravatar for grosscol
6.7 years ago by
grosscol40
Bethesda, MD
grosscol40 wrote:

A company performed paired end sequencing of exomic libraries on for us. When I requested the read data in fastq format from them, they mailed me a hard drive with _export.txt files. The alignments were done to UCSC hg18. I want to realign the samples myself. The files from a sample are:

s61_export.txt

s62_export.txt

Interestingly, these files do not have the same number of lines s61 has 67,003,423 while s62 has 144 fewer lines. They also are not sorted such that the paired reads are on the same lines in the files.

My first move was to sort both by the cluster coordinates (columns 5 and 6) so that the paired reads would be on the same line in each file.

sort -t $'\t' -k 5n,5 -k 6n,6 -S 40G s_6_1_export.txt > s_6_1_sorted.txt &

sort -t $'\t' -k 5n,5 -k 6n,6 -S 40G s_6_2_export.txt > s_6_2_sorted.txt  &

However, I do not think this will work out well unless all of the extra reads in the s61 file sort to the end of the resulting file. I was subsequently planning on using casava to convert each of the sorted export.txt files to fastq with the command:

CASAVA -a Export2Fastq -e s_6_1_sorted.txt -o s_6_1.fq --purityFilter=YES 

CASAVA -a Export2Fastq -e s_6_2_sorted.txt -o s_6_2.fq --purityFilter=YES

From the resulting fq files I was going to use BWA to do paired end alignment to UCSC hg19 and more downstream analysis with the resulting bam file.

My issue is that I'm baffled as to how to get from the _export.txt files given to me, back to fastq such that the paired nature of the read data is preserved. Could anyone offer guidance or suggestions here?

--Colin

illumina fastq paired • 2.8k views
ADD COMMENTlink modified 3.1 years ago by ajguerras0 • written 6.7 years ago by grosscol40
1
gravatar for grosscol
6.6 years ago by
grosscol40
Bethesda, MD
grosscol40 wrote:

I answered my own question using sort and python script (2.7.1). I will outline the basic logic I used.

Initially, sort the _export.txt files by cluster coordinate using the sort utility (GNU coreutils 8.4)

Then passed the sorted export files to a python script.

For each file:

 #Buffered file reader/line generator
    mygen = linesfromfile(infile)

    #Pre loop scan to first good (PF Y) and store as best line
    #This whole approach assumes a cluster coord sorted _export.txt file
    for ln in mygen:
        spl = ln.split('\t')
        rows = rows + 1
        if spl[21] =='Y\n':
            bst = spl
            break;

    #Loop to iterate through the lines
    for ln in mygen:
    #split line by tab delimiter
        spl = ln.split('\t')

        #Check if Pass Filter is Y
        if spl[21] == 'Y\n':

            #Compare cluster coords to previous coords.
            if (spl[4] == bst[4]) & (spl[5] == bst[5]):

                #same coords. Compare sum of base call scores
                if sumascii(spl[9]) > sumascii(bst[9]):

                    #replace best with current
                    bst = spl
            else:
                #Different coords. Indicates new set. 
                #Write input line to file, pass coordinate tuples to main thread.
                mainpipe.send( (bst[4],bst[5]) )
                outpf.write(ln)

                #Replace best line with current line.
                bst = spl
        rows = rows + 1

    #write final stored best line coord tuple
    mainpipe.send( (bst[4],bst[5]) )

The result of this is to get a set of unique coordinates from each file. Additionally, the logic removes optical duplicates by only writing the input lines that have the highest sum of base scores for each set of coordinates, and this is why the files have to be sorted by cluster coordinate first.

The next bit is to simply calculate the symmetric differences between the two sets of coordinate tuples.

#Get differences between coordinates of inputfile1 and inputfile2
symdiffs = tups1.symmetric_difference(tups2)

By getting the intersection of the coordinate tuple sets with the set of symmetric differences, we can find the unmated coordinate clusters in each file:

#Reduce tuple sets to those that are the difference entries
tups1.intersection_update(symdiffs)
tups2.intersection_update(symdiffs)

The length of the resulting sets will indicate the number of lines in the file that need to be removed into an unmatedpairsexport.txt file. This was accomplished with a second pass through the files.

The resulting files are then run though the maq script, fq_all2std.pl export2std, in order to get fastq files.

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by grosscol40
0
gravatar for Istvan Albert
6.7 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

I don't have a file in this format so I can't recall the specifics.

One possible straightforward but memory hungry operation would be to load up both files into memory into two dictionaries that are keyed by the fields that you believe to be indicative of paired-ness. Then iterate over this dictionary keys and output the paired sequences ignoring the missing ones. I would guess that your memory requirements will be around 2-3 times the resulting fastq file size, but since you only have 67 million reads you might get by with say 4GB of ram.

If you don't know how to write such a script just edit your answer and paste in the first few lines of the export file and we'll give you a script that does the work. Please also comment on the edit if you do so, that way it shows up on my notifications.

ADD COMMENTlink written 6.7 years ago by Istvan Albert ♦♦ 80k

Istvan, thank you for your reply.

The specification for the file format can be found in the CASAVA user guide (http://biowulf.nih.gov/apps/CASAVAUG15011196B.pdf)

The relevant first few columns of an ELAND _export.txt file are as follows: Machine, Run Number, Lane, Tile, X Coordinate of cluster, Y Coordinate of cluster, Index sequence, Read number (1 for single reads; 1 or 2 for paired ends or multiplexed single reads)

A couple (paired) example lines are:

from file s61:

HWI-ST776 151 6 2309 1083 28970 .TGTCA 1

from file s62:

HWI-ST776 151 6 2309 1083 28970 .TGTCA 2

Paired reads should have the same cluster coordinates as I understand it.

I wrote a quick little python script to parse the files, make a set of tuples from all the (x,y) coordinates and then compare the two sets. It's not an elegant solution as it eats 21gb ram, but it works in less than half an hour, so I have time for coffee.

The log is as follows:

File 1 rows                    67003423 
File 2 rows                    67003279 
Length orginal set 1           65903883 
Length orginal set 2           65903741 
Length symetric Diff                142 
pickled left overs from file 1. 
pickled left overs from file 2.
Program End               prog duration   1630.02

I'm going to investigate the differences and see about handling the coordinate duplicates in each file and the symmetric differences between the files. From what I've read, the _export.txt was meant as an internal format. I'm less than pleased that I've gotten it returned to me as a copy of the "source data" for an analysis we contracted to a company.

--Colin

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by grosscol40

I guess I grossly misunderestimated the memory requirements ;-) glad that it seems to work out though

ADD REPLYlink written 6.7 years ago by Istvan Albert ♦♦ 80k
0
gravatar for ajguerras
3.1 years ago by
ajguerras0
ajguerras0 wrote:

Hello everyone, I need 2 or 3 files in Illumina Export Format to test a software function, (size does not matter). I've been looking through all over the web and have not been able to find them. (I understand SRA don't use it). I tried to find CASAVA old versions in order to build my own files, but the software is not available anymore. Could anyone provide a couple of files, a link or suggest a repository ? Thank you very much, ....

ADD COMMENTlink written 3.1 years ago by ajguerras0

If you don't mind my asking .. why are you interested in a format that was short lived (and was probably not used very widely at all) and is now ancient in terms of NGS technology lifespan.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by genomax67k
0
gravatar for ajguerras
3.1 years ago by
ajguerras0
ajguerras0 wrote:

I am just trying to test an old software that worked very well over this format, I may try do something to make it work for current formats. Thanks

ADD COMMENTlink written 3.1 years ago by ajguerras0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 910 users visited in the last hour