Question: merging sequences from bedtools options -name, -tab into one file
0
gravatar for ggman
5.4 years ago by
ggman80
United States
ggman80 wrote:

Hello all, I wrote a script below to analyze two file formats from bedtools (-tab option, -name option) so it could combine the headers if the sequences match. It works to an extent. The problem I ran into was that if the sequences matched to multiple names it only printed one of the names that correspond to it. I was wondering if anyone had a suggestion to how to approach this. As I want both the position of the sequence and name. Is there an option through bedtools?

 

My script stores both files into their own hashes and then loops through then if they are equal it is suppose to print out matches in sequences with the appropriate names

sample data: -name output bedtools

>sequence1
AGGT
>sequence2
AAAA
>sequence3
CCCC

sample data: -tab output bedtools

>Chr3
AAAA
>Chr4
ACCT
>Chr2
CCCC

output from script

>sequence2|Chr3
AAAA
>sequence3|Chr2
CCCC
 

   
  my %sequence;
   
  open(NAMES_FILE, $ARGV[0]) or die "Cannot open the file: $!";
  my $hash_key_name;
  my $hash_value_name;
  while (my $line = <NAMES_FILE>) {
  if ($line =~ /^>(\S+)/) {
  $hash_key_name = $1;
  }
  elsif ($line =~ /\S/) {
  chomp $line;
  $hash_value_name = $line;
  $sequence{$hash_key_name} = $hash_value_name;
  }
  }
   
   
  my %sequence_2;
  open (POSITIONS_FILE, $ARGV[1]) or die "Cannot open the file: $!";
  my $hash_key_pos;
  my $hash_value_pos;
  while (my $line2 = <POSITIONS_FILE>) {
  if ($line2 =~ /^>(\S+)/) {
  $hash_key_pos = $1;
  }
  elsif ($line2 =~ /\S/) {
  chomp $line2;
  $hash_value_pos = $line2;
  $sequence_2{$hash_key_pos} = $hash_value_pos;
  }
  }
   
   
  foreach $hash_key_pos (keys %sequence_2) {
  foreach $hash_key_name (keys %sequence) {
  if ($sequence{$hash_key_name} eq $sequence_2{$hash_key_pos})
  {
  print ">$hash_key_name|$hash_key_pos\n$sequence{$hash_key_name}\n"
   
  }
  }
 

 

sequences bedtools perl • 1.4k views
ADD COMMENTlink modified 5.4 years ago by Alternative240 • written 5.4 years ago by ggman80

if i could also get help on how to format code when posting to biostars - that would be grand. Cheers.

ADD REPLYlink written 5.4 years ago by ggman80
1
gravatar for Alternative
5.4 years ago by
Alternative240
Alternative240 wrote:

This should do your job :

cat a.txt b.txt | awk '{if($0 ~ /^>/ ){gsub(">",""); header=$0} else{seq[$0]=seq[$0]"|"header}} END {for(i in seq){print seq[i]"\n"i}}' | sed s/\|/">"/

a.txt and b.txt are bedtools output with -name and -tab

P

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Alternative240
0
gravatar for James Ashmore
5.4 years ago by
James Ashmore3.0k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore3.0k wrote:

Are you starting from a BED file of regions and do you need the output to be in FASTA format? Also the -tab output from BEDtools outputs tab-delimited files, not the FASTA format shown in your output. To format code on Biostars there should be a drop-down box where you write your question which you can change from 'Text' to 'Code'.

Below is a Python3 script I wrote, although I'm not entirely convinced there isn't a better way of doing this, probably using the BEDtools nuc output.

#!/usr/bin/env python3
import csv
from Bio import SeqIO

def main():

    # Open both the TSV and FASTA output files
    tsv = open('test.tsv')
    fasta = open('test.fasta')

    # Parse tab-delimited and FASTA formats
    f_tsv = csv.reader(tsv, delimiter="\t")
    f_fasta = SeqIO.parse(fasta, "fasta")

    # Dictionary to group headers with the same sequence
    groups = {}

    # Iterate through fasta and tsv records in tandem and group using dictionary
    for record_tsv, record_fasta in zip(f_tsv, f_fasta):
       header_tsv = record_tsv[0]
       sequence_tsv = record_tsv[1]
       header_fasta = record_fasta.id
       sequence_fasta = str(record_fasta.seq)
       full_header = header_tsv + "_" + header_fasta
       if sequence_fasta not in groups:
           groups[sequence_fasta] = []
       groups[sequence_fasta].append(full_header)

    # Print results out in FASTA format, maybe better to use BioPython SeqRecord?
    for sequence, headers in groups.items():
        print('>' + '|'.join(headers), sequence, sep="\n")

if __name__ == '__main__':
    main()
ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by James Ashmore3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1288 users visited in the last hour
_