Question

Redirect Stdout From twoBitToFa in Bash

0

Entering edit mode

16 months ago

Laura ▴ 50

Hi there,

I've been going round and round trying to figure out a way to redirect stdout to use with seqtk. I've read many posts that are similar, so I am confident the answer is out there, but since I'm having such a hard time I figured I might as well add my question and hopefully get some help!

I'm working with a bash script that extracts DNA sequences with a bed file and twoBitToFa . Using twoBitToFa to extract a sequence specified in a bed file and outputting the results to a fasta file is straightforward. I'd like to switch it up and instead redirect the sequence output back into the original bed file in a new column.

So, input bed file would look like this:

chr1    1125000 1128000
chr1    1130000 1131000
chr1    1137000 1138000
chr1    1142000 1143000
chr1    1144000 1145000

and the output would look like this:

chr1    1125000 1128000 GCGATCGATAGCTAGCTA
chr1    1130000 1131000 ATCTATATCCTAGATAGAT
chr1    1137000 1138000 GCGATCGATAGCTAGCTA
chr1    1142000 1143000 ATCTATATCCTAGATAGAT
chr1    1144000 1145000 GCGATCGATAGCTAGCTA

I am trying to skip the intermediate step of making a fasta file. So far, I have been able to output the sequence data in my terminal by using stdout like so:

awk '{print $0" "$1":"$2"-"$3}' "input.bed" | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \
`

Stdout in terminal looks like:

>chr1:1125000_1128000 
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT
>chr1:1130000_1131000 
ATCTATATCCTAGATAGAT
GCGATCGATAGCTAGCTA
>chr1:1137000_1138000 
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT
>chr1:1142000_1143000 
ATCTATATCCTAGATAGAT
GCGATCGATAGCTAGCTA
>chr1:1144000_1145000 
GCGATCGATAGCTAGCTA
ATCTATATCCTAGATAGAT

But I can't figure out how to redirect that stdout to a 4th column in the original input bed file without creating an intermediate fasta file. I know that seqkit accepts input data from standard input, but I can't find an example of it, nor can I get any of the similar solutions to work. Here are some things I've tried:

# saving sequence ids to a variable and matching them with twobit stdout 

seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed) | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
for i in seq; do grep $i stdin; done


# trying to use stdin with seqkit seq to join bed data with sequence data

awk '{print $0" "$1":"$2"-"$3}' "input.bed" | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
seqkit seq awk '{print $0" "$1":"$2"-"$3}' "input.bed" stdin


# Another variation I found using seqtk with subseq

seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed") | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout \ |
seqtk subseq stdin $seq > testbed.bed


# saving both things to variables to try and match

seq=$(awk '{print $0" "$1":"$2"-"$3}' "input.bed") |
myvar=$(twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout) | \
seqkit grep -r -f $seq $myvar

This is quite a mess! How can I redirect output from twoBitToFa and add it to the original bed file without creating an intermediate fasta file?

Thanks in advance!

stdout bed stdin twobittofa bash • 857 views

ADD COMMENT • link updated 16 months ago by iraun 6.2k • written 16 months ago by Laura ▴ 50

0

Entering edit mode

I would exploit the fact that twoBitToFa is giving you the intervals in the fasta name, and use tr to swap a bunch of delimiters (newline -> tab; > -> newline, etc.):

awk '{print $0" "$1":"$2"-"$3}' "input.bed" | 
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout |
tr '\n' '\t' | tr '>' '\n' | tr ':' '\t' | tr '_' '\t'

ADD REPLY • link 16 months ago by LChart 3.9k

0

Entering edit mode

Hi! This maybe "too simple" solution, but, if you are getting a sequence for each interval of the bed file, and the order of the output sequences is the same as in the input bed file, could you:

twoBitToFa -bed=input.bed -udcDir=. "$twobit" stdout | grep -v '>' | paste input.bed - > output.bed

I haven't tried myself, so leaving this as a comment.

ADD REPLY • link 16 months ago by iraun 6.2k

0

Entering edit mode

This is so close! If the sequences were all on the same line, I think this would be good enough, but I failed to show that the sequence stdout is actually on multiple lines. I've updated the question to show this now. If I could delete the new line characters first, then this might work?

ADD REPLY • link 16 months ago by Laura ▴ 50

score 0 · Answer 1 · 2022-12-15

Alright, so based on the new information, you have to linearize the fasta (from multi line to single line).

twoBitToFa -bed=input.bed -udcDir=. "$twobit" stdout | awk '/^>/ {printf("%s",(NR>1?"\n":""),$0);next;} {printf("%s",$0);} END {printf("\n")}' | paste input.bed - > output.bed

I don't have the option to test the command at the moment. I hope it works :).