Question

merging sequences from bedtools options -name, -tab into one file

0

Entering edit mode

8.7 years ago

ggman ▴ 80

Hello all, I wrote a script below to analyze two file formats from bedtools (-tab option, -name option) so it could combine the headers if the sequences match. It works to an extent. The problem I ran into was that if the sequences matched to multiple names it only printed one of the names that correspond to it. I was wondering if anyone had a suggestion to how to approach this. As I want both the position of the sequence and name. Is there an option through bedtools?

My script stores both files into their own hashes and then loops through then if they are equal it is suppose to print out matches in sequences with the appropriate names

sample data: -name output bedtools

>sequence1
AGGT
>sequence2
AAAA
>sequence3
CCCC

sample data: -tab output bedtools

>Chr3
AAAA
>Chr4
ACCT
>Chr2
CCCC

output from script

>sequence2|Chr3
AAAA
>sequence3|Chr2
CCCC

my %sequence;

open(NAMES_FILE, $ARGV[0]) or die "Cannot open the file: $!";

my $hash_key_name;
my $hash_value_name;

while (my $line = <NAMES_FILE>) {
    if ($line =~ /^>(\S+)/) {
        $hash_key_name = $1;
    }
    elsif ($line =~ /\S/) {
        chomp $line;
        $hash_value_name = $line;
        $sequence{$hash_key_name} = $hash_value_name;
    }
}

my %sequence_2;

open (POSITIONS_FILE, $ARGV[1]) or die "Cannot open the file: $!";

my $hash_key_pos;
my $hash_value_pos;

while (my $line2 = <POSITIONS_FILE>) {
    if ($line2 =~ /^>(\S+)/) {
        $hash_key_pos = $1;
    }
    elsif ($line2 =~ /\S/) {
        chomp $line2;
        $hash_value_pos = $line2;
        $sequence_2{$hash_key_pos} = $hash_value_pos;
    }
}

foreach $hash_key_pos (keys %sequence_2) {
    foreach $hash_key_name (keys %sequence) {
        if ($sequence{$hash_key_name} eq $sequence_2{$hash_key_pos})
        {
            print ">$hash_key_name|$hash_key_pos\n$sequence{$hash_key_name}\n"
        }
    }
}

Bedtools Perl Sequences • 2.0k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.7 years ago by ggman ▴ 80

0

Entering edit mode

if i could also get help on how to format code when posting to biostars - that would be grand. Cheers.

ADD REPLY • link 8.7 years ago by ggman ▴ 80

0

Entering edit mode

8.6 years ago

James Ashmore ★ 3.4k

Are you starting from a BED file of regions and do you need the output to be in FASTA format? Also the -tab output from BEDtools outputs tab-delimited files, not the FASTA format shown in your output. To format code on Biostars there should be a drop-down box where you write your question which you can change from 'Text' to 'Code'.

Below is a Python3 script I wrote, although I'm not entirely convinced there isn't a better way of doing this, probably using the BEDtools nuc output.

#!/usr/bin/env python3
import csv
from Bio import SeqIO

def main():

    # Open both the TSV and FASTA output files
    tsv = open('test.tsv')
    fasta = open('test.fasta')

    # Parse tab-delimited and FASTA formats
    f_tsv = csv.reader(tsv, delimiter="\t")
    f_fasta = SeqIO.parse(fasta, "fasta")

    # Dictionary to group headers with the same sequence
    groups = {}

    # Iterate through fasta and tsv records in tandem and group using dictionary
    for record_tsv, record_fasta in zip(f_tsv, f_fasta):
       header_tsv = record_tsv[0]
       sequence_tsv = record_tsv[1]
       header_fasta = record_fasta.id
       sequence_fasta = str(record_fasta.seq)
       full_header = header_tsv + "_" + header_fasta
       if sequence_fasta not in groups:
           groups[sequence_fasta] = []
       groups[sequence_fasta].append(full_header)

    # Print results out in FASTA format, maybe better to use BioPython SeqRecord?
    for sequence, headers in groups.items():
        print('>' + '|'.join(headers), sequence, sep="\n")

if __name__ == '__main__':
    main()

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by James Ashmore ★ 3.4k

Ram · Accepted Answer · 2015-09-02

1

Entering edit mode

8.6 years ago

Alternative ▴ 270

This should do your job:

cat a.txt b.txt | awk '{if($0 ~ /^>/ ){gsub(">",""); header=$0} else{seq[$0]=seq[$0]"|"header}} END {for(i in seq){print seq[i]"\n"i}}' | sed s/\|/">"/

a.txt and b.txt are bedtools output with -name and -tab

P

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by Alternative ▴ 270