Question: Extracting Positions From A Fasta For A Particular Mask
0
gravatar for diviya.smith
5.9 years ago by
diviya.smith50
United States
diviya.smith50 wrote:

Does anyone know of a quick way to extract the positions for a particular mask from a fasta file? So for example if I wanted to know the position for all missing sites for chr 1 and this is coded as "." in my fasta file - how can I generate a bed file with the list of these positions? I can code something in perl where I check each line separately but I was wondering if there are any programs out there like bedtools which have already implemented this?

Input:

>chr1

AAAAA.NNNNCCCCTTTT..A

output: (positions start at 0)

<chr> <start_pos> <end_pos>

chr1: 5 5

chr1: 18 19

Thank you in advance.

fasta • 2.4k views
ADD COMMENTlink modified 5.9 years ago by Alex Reynolds29k • written 5.9 years ago by diviya.smith50
0
gravatar for Alex Reynolds
5.9 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

I don't have any advice on using bedtools to do this, but these two outputs are not correct BED elements:

output: (positions start at 0)
chr1: 5 5
chr1: 18 19

The correct, 0-indexed set would look like:

chr1   5    6
chr1   18   20

You might use something like awk to process FASTA data. Here's some untested code:

$ awk ' \
{ \
    if ($0 ~ /^>/) { \
        chr = substr($1, 2); \
        totalLength = 0; \
        previousChar = 'Z'; \
    } \
    else { \
        split($0, chars, ""); \
        for (idx = 1; idx <= length(chars); idx++) { \
             currentChar = chars[idx]; \
             if ((currentChar == ".") && (previousChar != currentChar)) { \
                 start = idx - 1; \
                 stop = start + 1; \
             } \
             else if ((currentChar == ".") && (previousChar == currentChar)) { \
                 stop++; \
             } \
             else if ((currentChar != ".") && (previousChar == ".")) { \
                 print chr"\t"start"\t"stop; \
             } \
             previousChar = currentChar; \
        } \
        totalLength += length(chars); \
    } \
}' input.fasta
ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Alex Reynolds29k

Yes, thats right. The output should be as you pointed out. Thank you for the correction. However, I tried your command as an awk script but I get the wrong answer. chr1 19 21. Also any suggestions no how to expand this script for multiline fasta files? For the example I just used a single line but the file of course has 1000s of lines of data.

ADD REPLYlink written 5.9 years ago by diviya.smith50

Actually adding an additional condition, if ((previousChar == '.') && (currentChar != '.')) { print chr"\t"start"\t"stop; } will print the output as needed. Still trying to figure out the multiple line issue though - suggestions would be very welcome.

ADD REPLYlink written 5.9 years ago by diviya.smith50

Perils of running untested code. I modified the awk script that I think will deal with the bugs you noted.

The modified script appears to run okay on your test FASTA input:

$ awk '...' foo.fa
chr1    5       6
chr1    18      20

I modified the FASTA data to include a trailing period and multiline input support:

$ more bar.fa
>chr1
AAAAA.NNNNCCCCTTTT...
.ACGT
>chr2
GATTICA...GATTICA...
.CAT..T

Running the script on this:

$ awk '...' bar.fa
chr1    5       6
chr1    18      21
chr1    22      23
chr2    7       10
chr2    17      20
chr2    21      22
chr2    25      27
ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Alex Reynolds29k

Thanks so much for your help. I got it to work up to this point too. I added an end condition instead of the if (newElementFlag == 1) { \ print chr"\t"start"\t"stop; \ } Any suggestion for merging cases where "..." continues on multiple line. Ideally, I would like to merge the cases, chr1 18 21 chr1 22 23

as chr1 18 23

ADD REPLYlink written 5.9 years ago by diviya.smith50

Given input:

$ more bar.fa
>chr1
AAAAA.NNNNCCCCTTTT...
.ACGT
>chr2
GATTICA...GATTICA...
.CAT..T

Here is output from the modified script:

$ awk '...' bar.fa  
chr1    5    6          
chr1    18    22
chr2    7    10
chr2    17    21
chr2    4    6

I seemed to have forgotten that awk arrays are 1-based.

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Alex Reynolds29k

Thanks. This is very helpful. Just a couple of changes (positions were not correct as the "totalLength" was not included in updating start and including the case if the last character is "."

BEGIN { \
totalLength = 0; \
newElementFlag = 0; \
char ="-";
previousChar = "Z"; \
 } \
{   if ($0 ~ /^>/) { \
    chr = substr($1, 2); \
} \
else { \
    split($0, chars, ""); \
    for (idx = 1; idx <= length(chars); idx++) { \
         currentChar = chars[idx]; \
         if ((currentChar == char) && (previousChar != currentChar)) { \
             start = idx + totalLength - 1; \
             stop = start + 1; \
         } \
         else if ((currentChar == char) && (previousChar == currentChar)) { \
             stop++; \
         } \
         else if ((currentChar != char) && (previousChar == char)) { \
             print chr"\t"start"\t"stop; \
         } \
         previousChar = currentChar; \
    } \
    totalLength += length(chars); \
} \
}END {
    if (currentChar == char) {
            print chr"\t"start"\t"stop;
    }
}
ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by diviya.smith50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2116 users visited in the last hour