Question: Multifasta Headers To Tab
0
gravatar for pedroxantos
7.6 years ago by
pedroxantos0 wrote:

Hi,

I need to convert my pair-ended libraries into tab (only headers) with "insert size" information in a separate column. Basically, something like this: "read1 read2 interread-distance". Are you aware of a perl/python script to do such a job? I am starting in python world and it is a simple and it will be a good exercise however some help would be appreciate.

Basically, I will have two datasets (multifasta files format txt-like, *.fasta) like this:

DATASET #1

>header_1/1 (this can have different txt) 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>header_2/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>header_3/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>header_4/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG

DATASET #2

>header_1/2 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>header_2/2
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>header_3/2
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>header_4/2
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG

The only conserved notation is /1 or /2 depending on being dataset1 or 2 (mate-pair reads). The size of the datasets are aroung 5,000,000 header entries but can be more

Additionally, I will have a numeric insert-value input

So, in total I have 3 inputs:

dataset 1 
dataset 2 
insert value (number)

e.g. perl/python headerextract dataset1 dataset2 [#insert value] output

I want to:

1.extract ">" description from each DATASET like:

  header_1/1
  header_2/1
  header_3/1
  header_4/1

and

header_1/2 
header_2/2 
header_3/2 
header_4/2

[to easily extract the headers it can be used grep -e ">" my.fasta since ">" is only in the header line]

2.

print a file (format can be .csv or .txt) with both list of headers (each row with identical header designation only varying in /1 or /2) plus an additional column with insert-value (eg. 300):

header_1/1 header_1/2 300 
header_2/2 header_2/2 300 
header_3/2 header_3/2 300 
header_4/2 header_4/2 300

A real dataset may looks like this:

>ILLUMINA-52179E_0009:5:1:1059:13539#CTTGTA/1 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>ILLUMINA-52179E_0009:5:1:1094:12872#CTTGTA/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>ILLUMINA-52179E_0009:5:1:1106:7745#CTTGTA/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>ILLUMINA-52179E_0009:5:1:1108:16460#CTTGTA/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG
genome fasta biopython sequencing • 2.0k views
ADD COMMENTlink modified 7.6 years ago by Biomonika (Noolean)3.1k • written 7.6 years ago by pedroxantos0

What do the headers look like?

ADD REPLYlink modified 7.6 years ago • written 7.6 years ago by Niek De Klein2.5k

Hi, my read libraries are from illumina, e.g.:

>ILLUMINA-52179E_0009:5:1:1059:13539#CTTGTA/1 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>ILLUMINA-52179E_0009:5:1:1094:12872#CTTGTA/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>ILLUMINA-52179E_0009:5:1:1106:7745#CTTGTA/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>ILLUMINA-52179E_0009:5:1:1108:16460#CTTGTA/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG

ideally, such script should be something like: perl/python "fa2csv" reads1 reads2 insertsize output. Currently I am looking on how to convert multifasta to csv in a way that in the same row I do have both pairs...

ADD REPLYlink modified 7.6 years ago by Neilfws48k • written 7.6 years ago by pedroxantos0

how do you compute/extract the "interread-distance"? If you mean the distance between both pairs, the distance is expected to be some value close to the library preparation (i.e. Illumina protocol selects fragments ~400b, if your are sequencing 100b in both ends, your insert size is ~200). The real distance can be known only after you map your reads to some reference.

ADD REPLYlink written 7.6 years ago by JC9.1k

Actually, I do not intend to compute the insert value using such script. That I already know, by, as you referred, mapping the reads against a reference. The issue here is that one of the tools I would like to use to analyze my data requires a tab-like format with each pair header in a row and with an approx. insert value.

ADD REPLYlink written 7.6 years ago by pedroxantos0

You'll have to get the insert length from the mapping data (not raw reads) - so what file format is that in?

ADD REPLYlink written 7.6 years ago by Peter5.8k

Please give me an example of files and of a desired output so I could write the script for you.

ADD REPLYlink written 7.6 years ago by Biomonika (Noolean)3.1k

So you are interested in headers, right? Doesnt need any sequence information?

ADD REPLYlink written 7.6 years ago by Biomonika (Noolean)3.1k

No, I do not need sequence information in the output. The software I want to use requires: - dataset1.fasta - dataset2.fasta - file with header lines plus insert size [the file described in this thread]

ADD REPLYlink written 7.6 years ago by pedroxantos0
1
gravatar for Biomonika (Noolean)
7.6 years ago by
State College, PA, USA
Biomonika (Noolean)3.1k wrote:

May not be the best solution, but it works:

  1. create files only with headers by using

    grep -e ">" file1 >out1.txt
    grep -e ">" file2 >out2.txt

2.create join.sh function with this content

#!/bin/bash
while read f1 <&7
do
    read f2 <&8
    echo $f1 $f2 300
done \
    7<$1 \
    8<$2

3.run it by writing

 sh join.sh out1.txt out2.txt

4.THATS IT :-)

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Biomonika (Noolean)3.1k

Thanx a lot. It worked nicely. The only thing was that actually I needed to remove ">" but I add " | sed 's/>//' " to 1. : grep -e ">" file1 | sed 's/>//' >out1.txt

cheers :)

ADD REPLYlink written 7.6 years ago by pedroxantos0

You are welcome. If you are satisfied with my answer, then pls check it as answer:)

ADD REPLYlink written 7.6 years ago by Biomonika (Noolean)3.1k
0
gravatar for JC
7.6 years ago by
JC9.1k
Mexico
JC9.1k wrote:

well, in that case, if you don't have any missing pair and both fasta files are ordered (otherwise you will need to check both files), see this example: http://stackoverflow.com/questions/5617075/perl-read-from-two-files-and-write-to-a-third

ADD COMMENTlink written 7.6 years ago by JC9.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1225 users visited in the last hour