Question: Merge contigs in fasta file
0
gravatar for Rubal
11 days ago by
Rubal220
Germany
Rubal220 wrote:

Hello All,

I am running variant calling on some species whose reference genomes have a very high number of contigs (sometimes >400,000). The variant caller I am using splits the job by the number of chromosomes, and is overwhelmed when this number is too high. Therefore I would like to concatenate the contigs for a given species reference fasta file into ~30 contigs.

I believe I could use code such as below to merge all the contigs into one:

> grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print
> ">Sequence_name\n" } { print }' > new​.fasta

However I would like to merge them into ~30 contigs so the process can still be parallelised. I would also like to insert 1000 'N' characters between each of the merged contigs within these merged contigs, to avoid mapping issues that could be caused by merging contig sequences from different parts of the genome.

Does anyone have any advice for how to do this or know of any application that could do something similar?

Thanks in advance for your help.

contigs merge fasta • 113 views
ADD COMMENTlink modified 11 days ago by finswimmer11k • written 11 days ago by Rubal220
2
gravatar for finswimmer
11 days ago by
finswimmer11k
Germany
finswimmer11k wrote:

You can try this awk solution. But first you have to calculate how many contigs should be merge into a single sequence.

Save this script as merge_contigs.awk

function print_n(n)
 {
    text = ""
    for(j=0;j<n;j++) text=text"N"
    return text
 }

BEGIN {n=1} 
$0 ~ "^>" { 
    if(n<=i) {
        header = header""substr($0,2)" ";
        if(n > 1) seq=seq""print_n(1000)
        n++;
    } 
    else {
        print ">"header"\n"seq; 
        n=1; 
        header=substr($0,2); 
        seq=""
    } 

    next;
} 
{seq=seq$0} 
END {print ">"header"\n"seq;}

Run it like this:

$ awk -v i=140 -f merge_contigs.awk input.fa > output.fa

Set the value for -v i to the number of contigs that should get merged into a single sequence.

fin swimmer

ADD COMMENTlink modified 10 days ago • written 11 days ago by finswimmer11k

Thanks for this. Unfortunately when I save the file and try to run the script as suggested it only shows the awk help manual. It doesn't seem to recognise the -i option? The first lines of what is printed:

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:      GNU long options:
    -f progfile     --file=progfile
    -F fs           --field-separator=fs
    -v var=val      --assign=var=val
ADD REPLYlink modified 10 days ago by finswimmer11k • written 10 days ago by Rubal220

Ooops, my mistake. It has to be -v i=140.

ADD REPLYlink written 10 days ago by finswimmer11k

brilliant thank you

ADD REPLYlink written 10 days ago by Rubal220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1476 users visited in the last hour