merging sequences from bedtools options -name, -tab into one file
2
0
Entering edit mode
8.6 years ago
ggman ▴ 80

Hello all, I wrote a script below to analyze two file formats from bedtools (-tab option, -name option) so it could combine the headers if the sequences match. It works to an extent. The problem I ran into was that if the sequences matched to multiple names it only printed one of the names that correspond to it. I was wondering if anyone had a suggestion to how to approach this. As I want both the position of the sequence and name. Is there an option through bedtools?

My script stores both files into their own hashes and then loops through then if they are equal it is suppose to print out matches in sequences with the appropriate names

sample data: -name output bedtools

>sequence1
AGGT
>sequence2
AAAA
>sequence3
CCCC

sample data: -tab output bedtools

>Chr3
AAAA
>Chr4
ACCT
>Chr2
CCCC

output from script

>sequence2|Chr3
AAAA
>sequence3|Chr2
CCCC
my %sequence;

open(NAMES_FILE, $ARGV[0]) or die "Cannot open the file: $!";

my $hash_key_name;
my $hash_value_name;

while (my $line = <NAMES_FILE>) {
    if ($line =~ /^>(\S+)/) {
        $hash_key_name = $1;
    }
    elsif ($line =~ /\S/) {
        chomp $line;
        $hash_value_name = $line;
        $sequence{$hash_key_name} = $hash_value_name;
    }
}

my %sequence_2;

open (POSITIONS_FILE, $ARGV[1]) or die "Cannot open the file: $!";

my $hash_key_pos;
my $hash_value_pos;

while (my $line2 = <POSITIONS_FILE>) {
    if ($line2 =~ /^>(\S+)/) {
        $hash_key_pos = $1;
    }
    elsif ($line2 =~ /\S/) {
        chomp $line2;
        $hash_value_pos = $line2;
        $sequence_2{$hash_key_pos} = $hash_value_pos;
    }
}

foreach $hash_key_pos (keys %sequence_2) {
    foreach $hash_key_name (keys %sequence) {
        if ($sequence{$hash_key_name} eq $sequence_2{$hash_key_pos})
        {
            print ">$hash_key_name|$hash_key_pos\n$sequence{$hash_key_name}\n"
        }
    }
}
Bedtools Perl Sequences • 2.0k views
ADD COMMENT
0
Entering edit mode

if i could also get help on how to format code when posting to biostars - that would be grand. Cheers.

ADD REPLY
1
Entering edit mode
8.6 years ago
Alternative ▴ 270

This should do your job:

cat a.txt b.txt | awk '{if($0 ~ /^>/ ){gsub(">",""); header=$0} else{seq[$0]=seq[$0]"|"header}} END {for(i in seq){print seq[i]"\n"i}}' | sed s/\|/">"/

a.txt and b.txt are bedtools output with -name and -tab

P

ADD COMMENT
0
Entering edit mode
8.6 years ago
James Ashmore ★ 3.4k

Are you starting from a BED file of regions and do you need the output to be in FASTA format? Also the -tab output from BEDtools outputs tab-delimited files, not the FASTA format shown in your output. To format code on Biostars there should be a drop-down box where you write your question which you can change from 'Text' to 'Code'.

Below is a Python3 script I wrote, although I'm not entirely convinced there isn't a better way of doing this, probably using the BEDtools nuc output.

#!/usr/bin/env python3
import csv
from Bio import SeqIO

def main():

    # Open both the TSV and FASTA output files
    tsv = open('test.tsv')
    fasta = open('test.fasta')

    # Parse tab-delimited and FASTA formats
    f_tsv = csv.reader(tsv, delimiter="\t")
    f_fasta = SeqIO.parse(fasta, "fasta")

    # Dictionary to group headers with the same sequence
    groups = {}

    # Iterate through fasta and tsv records in tandem and group using dictionary
    for record_tsv, record_fasta in zip(f_tsv, f_fasta):
       header_tsv = record_tsv[0]
       sequence_tsv = record_tsv[1]
       header_fasta = record_fasta.id
       sequence_fasta = str(record_fasta.seq)
       full_header = header_tsv + "_" + header_fasta
       if sequence_fasta not in groups:
           groups[sequence_fasta] = []
       groups[sequence_fasta].append(full_header)

    # Print results out in FASTA format, maybe better to use BioPython SeqRecord?
    for sequence, headers in groups.items():
        print('>' + '|'.join(headers), sequence, sep="\n")

if __name__ == '__main__':
    main()
ADD COMMENT

Login before adding your answer.

Traffic: 1466 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6