Multifasta Headers To Tab
2
0
Entering edit mode
12.0 years ago

Hi,

I need to convert my pair-ended libraries into tab (only headers) with "insert size" information in a separate column. Basically, something like this: "read1 read2 interread-distance". Are you aware of a perl/python script to do such a job? I am starting in python world and it is a simple and it will be a good exercise however some help would be appreciate.

Basically, I will have two datasets (multifasta files format txt-like, *.fasta) like this:

DATASET #1

>header_1/1 (this can have different txt) 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>header_2/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>header_3/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>header_4/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG

DATASET #2

>header_1/2 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>header_2/2
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>header_3/2
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>header_4/2
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG

The only conserved notation is /1 or /2 depending on being dataset1 or 2 (mate-pair reads). The size of the datasets are aroung 5,000,000 header entries but can be more

Additionally, I will have a numeric insert-value input

So, in total I have 3 inputs:

dataset 1 
dataset 2 
insert value (number)

e.g. perl/python headerextract dataset1 dataset2 [#insert value] output

I want to:

1.extract ">" description from each DATASET like:

  header_1/1
  header_2/1
  header_3/1
  header_4/1

and

header_1/2 
header_2/2 
header_3/2 
header_4/2

[to easily extract the headers it can be used grep -e ">" my.fasta since ">" is only in the header line]

2.

print a file (format can be .csv or .txt) with both list of headers (each row with identical header designation only varying in /1 or /2) plus an additional column with insert-value (eg. 300):

header_1/1 header_1/2 300 
header_2/2 header_2/2 300 
header_3/2 header_3/2 300 
header_4/2 header_4/2 300

A real dataset may looks like this:

>ILLUMINA-52179E_0009:5:1:1059:13539#CTTGTA/1 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>ILLUMINA-52179E_0009:5:1:1094:12872#CTTGTA/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>ILLUMINA-52179E_0009:5:1:1106:7745#CTTGTA/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>ILLUMINA-52179E_0009:5:1:1108:16460#CTTGTA/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG
biopython genome sequencing fasta • 3.2k views
ADD COMMENT
0
Entering edit mode

What do the headers look like?

ADD REPLY
0
Entering edit mode

Hi, my read libraries are from illumina, e.g.:

>ILLUMINA-52179E_0009:5:1:1059:13539#CTTGTA/1 
CACGGTGGGATCGACCTTTTGCCGCTCGAGGGATTTCTGCACCCTGCCGA
>ILLUMINA-52179E_0009:5:1:1094:12872#CTTGTA/1
TGGAACAAGGCCGGCCTGCGCGGCTAGGTGATCAACAAGGGGTTCAACGG
>ILLUMINA-52179E_0009:5:1:1106:7745#CTTGTA/1
GCGAAAGCGATACCTGGACTCGCAGAACGAAGACCTGGCCGAAGCGGTGG
>ILLUMINA-52179E_0009:5:1:1108:16460#CTTGTA/1
GGTTTAGAACGTCGTGAGACAGTTCGGTCCCTATCTGCCGTGGACGTTTG

ideally, such script should be something like: perl/python "fa2csv" reads1 reads2 insertsize output. Currently I am looking on how to convert multifasta to csv in a way that in the same row I do have both pairs...

ADD REPLY
0
Entering edit mode

how do you compute/extract the "interread-distance"? If you mean the distance between both pairs, the distance is expected to be some value close to the library preparation (i.e. Illumina protocol selects fragments ~400b, if your are sequencing 100b in both ends, your insert size is ~200). The real distance can be known only after you map your reads to some reference.

ADD REPLY
0
Entering edit mode

Actually, I do not intend to compute the insert value using such script. That I already know, by, as you referred, mapping the reads against a reference. The issue here is that one of the tools I would like to use to analyze my data requires a tab-like format with each pair header in a row and with an approx. insert value.

ADD REPLY
0
Entering edit mode

You'll have to get the insert length from the mapping data (not raw reads) - so what file format is that in?

ADD REPLY
0
Entering edit mode

Please give me an example of files and of a desired output so I could write the script for you.

ADD REPLY
0
Entering edit mode

So you are interested in headers, right? Doesnt need any sequence information?

ADD REPLY
0
Entering edit mode

No, I do not need sequence information in the output. The software I want to use requires: - dataset1.fasta - dataset2.fasta - file with header lines plus insert size [the file described in this thread]

ADD REPLY
1
Entering edit mode
12.0 years ago

May not be the best solution, but it works:

  1. create files only with headers by using

    grep -e ">" file1 >out1.txt
    grep -e ">" file2 >out2.txt

2.create join.sh function with this content

#!/bin/bash
while read f1 <&7
do
    read f2 <&8
    echo $f1 $f2 300
done \
    7<$1 \
    8<$2

3.run it by writing

 sh join.sh out1.txt out2.txt

4.THATS IT :-)

ADD COMMENT
0
Entering edit mode

Thanx a lot. It worked nicely. The only thing was that actually I needed to remove ">" but I add " | sed 's/>//' " to 1. : grep -e ">" file1 | sed 's/>//' >out1.txt

cheers :)

ADD REPLY
0
Entering edit mode

You are welcome. If you are satisfied with my answer, then pls check it as answer:)

ADD REPLY
0
Entering edit mode
12.0 years ago
JC 13k

well, in that case, if you don't have any missing pair and both fasta files are ordered (otherwise you will need to check both files), see this example: http://stackoverflow.com/questions/5617075/perl-read-from-two-files-and-write-to-a-third

ADD COMMENT

Login before adding your answer.

Traffic: 2238 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6