Merge contigs in fasta file
1
2
Entering edit mode
2.0 years ago
Rubal ▴ 340

Hello All,

I am running variant calling on some species whose reference genomes have a very high number of contigs (sometimes >400,000). The variant caller I am using splits the job by the number of chromosomes, and is overwhelmed when this number is too high. Therefore I would like to concatenate the contigs for a given species reference fasta file into ~30 contigs.

I believe I could use code such as below to merge all the contigs into one:

> grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print
> ">Sequence_name\n" } { print }' > new​.fasta

However I would like to merge them into ~30 contigs so the process can still be parallelised. I would also like to insert 1000 'N' characters between each of the merged contigs within these merged contigs, to avoid mapping issues that could be caused by merging contig sequences from different parts of the genome.

Does anyone have any advice for how to do this or know of any application that could do something similar?

Thanks in advance for your help.

fasta merge contigs • 1.8k views
ADD COMMENT
4
Entering edit mode
2.0 years ago

You can try this awk solution. But first you have to calculate how many contigs should be merge into a single sequence.

Save this script as merge_contigs.awk

function print_n(n)
 {
    text = ""
    for(j=0;j<n;j++) text=text"N"
    return text
 }

BEGIN {n=1} 
$0 ~ "^>" { 
    if(n<=i) {
        header = header""substr($0,2)" ";
        if(n > 1) seq=seq""print_n(1000)
        n++;
    } 
    else {
        print ">"header"\n"seq; 
        n=1; 
        header=substr($0,2); 
        seq=""
    } 

    next;
} 
{seq=seq$0} 
END {print ">"header"\n"seq;}

Run it like this:

$ awk -v i=140 -f merge_contigs.awk input.fa > output.fa

Set the value for -v i to the number of contigs that should get merged into a single sequence.

fin swimmer

ADD COMMENT
0
Entering edit mode

Thanks for this. Unfortunately when I save the file and try to run the script as suggested it only shows the awk help manual. It doesn't seem to recognise the -i option? The first lines of what is printed:

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:      GNU long options:
    -f progfile     --file=progfile
    -F fs           --field-separator=fs
    -v var=val      --assign=var=val
ADD REPLY
0
Entering edit mode

Ooops, my mistake. It has to be -v i=140.

ADD REPLY
0
Entering edit mode

brilliant thank you

ADD REPLY

Login before adding your answer.

Traffic: 2076 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6