Question: Merge contigs in fasta file
0
gravatar for Rubal
5 months ago by
Rubal250
Germany
Rubal250 wrote:

Hello All,

I am running variant calling on some species whose reference genomes have a very high number of contigs (sometimes >400,000). The variant caller I am using splits the job by the number of chromosomes, and is overwhelmed when this number is too high. Therefore I would like to concatenate the contigs for a given species reference fasta file into ~30 contigs.

I believe I could use code such as below to merge all the contigs into one:

> grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print
> ">Sequence_name\n" } { print }' > new​.fasta

However I would like to merge them into ~30 contigs so the process can still be parallelised. I would also like to insert 1000 'N' characters between each of the merged contigs within these merged contigs, to avoid mapping issues that could be caused by merging contig sequences from different parts of the genome.

Does anyone have any advice for how to do this or know of any application that could do something similar?

Thanks in advance for your help.

contigs merge fasta • 308 views
ADD COMMENTlink modified 5 months ago by finswimmer12k • written 5 months ago by Rubal250
2
gravatar for finswimmer
5 months ago by
finswimmer12k
Germany
finswimmer12k wrote:

You can try this awk solution. But first you have to calculate how many contigs should be merge into a single sequence.

Save this script as merge_contigs.awk

function print_n(n)
 {
    text = ""
    for(j=0;j<n;j++) text=text"N"
    return text
 }

BEGIN {n=1} 
$0 ~ "^>" { 
    if(n<=i) {
        header = header""substr($0,2)" ";
        if(n > 1) seq=seq""print_n(1000)
        n++;
    } 
    else {
        print ">"header"\n"seq; 
        n=1; 
        header=substr($0,2); 
        seq=""
    } 

    next;
} 
{seq=seq$0} 
END {print ">"header"\n"seq;}

Run it like this:

$ awk -v i=140 -f merge_contigs.awk input.fa > output.fa

Set the value for -v i to the number of contigs that should get merged into a single sequence.

fin swimmer

ADD COMMENTlink modified 5 months ago • written 5 months ago by finswimmer12k

Thanks for this. Unfortunately when I save the file and try to run the script as suggested it only shows the awk help manual. It doesn't seem to recognise the -i option? The first lines of what is printed:

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:      GNU long options:
    -f progfile     --file=progfile
    -F fs           --field-separator=fs
    -v var=val      --assign=var=val
ADD REPLYlink modified 5 months ago by finswimmer12k • written 5 months ago by Rubal250

Ooops, my mistake. It has to be -v i=140.

ADD REPLYlink written 5 months ago by finswimmer12k

brilliant thank you

ADD REPLYlink written 5 months ago by Rubal250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1442 users visited in the last hour