Question: Using the ABySS pipeline to scaffold a DISCOVAR de novo assembly
0
gravatar for dbrowne.up
4.0 years ago by
dbrowne.up60
United States
dbrowne.up60 wrote:

Hi all,

I have a DISCOVAR de novo assembly that was made with 2x250 bp paired end reads with a fragment size of ~800. Since I also have a couple 2x150 bp mate pair libraries (insert sizes 1.5 kb and 4 kb), I now want to scaffold the DDN assembly. To do this I am trying SSPACE, but I also want to try the ABySS pipeline (everything after the ABYSS/ABYSS-P command). Everything works up until the first mapping step:

AdjList -v   -k200 -m50 --dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.dot
abyss-filtergraph -v --dot   -k200 -g SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.path
MergeContigs -v  -k200 -g SXPX_DDN_BROKEN-2.dot -o SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-1.fa SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.path
PopBubbles -v --dot -j20 -k200  -p0.9  -g SXPX_DDN_BROKEN-3.dot SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot >SXPX_DDN_BROKEN-2.path
MergeContigs -v  -k200 -o SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot SXPX_DDN_BROKEN-2.path
awk '!/^>/ {x[">" $1]=1; next} {getline s} $1 in x {print $0 "\n" s}' \
                SXPX_DDN_BROKEN-2.path SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-indel.fa
ln -sf SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-unitigs.fa
abyss-map -v  -j20 -l200    ../02_Separated_Pairs/Library.SXPX.L.fq.gz ../02_Separated_Pairs/Library.SXPX.R.fq.gz SXPX_DDN_BROKEN-3.fa \
                |abyss-fixmate -v  -l200  -h pe1-3.hist \
                |sort -snk3 -k4 \
                |DistanceEst -v  -j20 -k200 -l200 -s500 -n10   -o pe1-3.dist pe1-3.hist

Here is the corresponding output:

Reading `SXPX_DDN_BROKEN-1.fa'...
Finding overlaps of exactly k-1 bp...
V=353130 E=130474 E/V=0.369
Degree: █▂
        01234
0: 70% 1: 24% 2-4: 6.2% 5+: 0% max: 4
Finding overlaps of fewer than k-1 bp...
V=353130 E=168079956 E/V=476
Degree: ▅█▁
        01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Loading graph from file: SXPX_DDN_BROKEN-1.dot
Graph stats before:
V=353130 E=168079956 E/V=476
Degree: ▅█▁
        01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Removing shim contigs from the graph...
Pass 1: Checking 9670 contigs.
Pass 2: Checking 17 contigs.
Shim removal stats:
Removed: 4498 Too Complex: 55937 Tails: 55781 Too Long: 59860 Self Adjacent: 488 Parallel Edges: 72
Reading `SXPX_DDN_BROKEN-1.fa'...
Edge removal stats:
Removed: 0
Graph stats after:
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
        01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot1'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-1.fa'...
Read 172066 sequences. Using 2.18 GB of memory.
Reading `SXPX_DDN_BROKEN-1.path'...
Read 0 paths. Using 2.18 GB of memory.
Writing `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
        01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
        01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.fa'...
Bubbles: 534 Popped: 388 Scaffolds: 0 Complex: 85 Too long: 0 Too many: 45 Dissimilar: 16
V=294910 E=168020663 E/V=570
Degree: █▆▃__
        01234
0: 23% 1: 17% 2-4: 16% 5+: 44% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-2.fa'...
Read 172066 sequences. Using 2.08 GB of memory. Reading `SXPX_DDN_BROKEN-2.path'...
Read 20613 paths. Using 2.09 GB of memory.
The minimum coverage of single-end contigs is inf.
The minimum coverage of merged contigs is inf.
n       n:200   L50     min     N80     N50     N20     E-size  max     sum     name
147455  147455  12083   300     751     3925    10323   34989   1317114 207.5e6 SXPX_DDN_BROKEN-3.fa
Reading from standard input...
Reading `SXPX_DDN_BROKEN-3.fa'...
Using 18.6 MB of memory and 126 B/sequence.
Reading `SXPX_DDN_BROKEN-3.fa'...
Building the suffix array...
Building the Burrows-Wheeler transform...
Building the character occurrence table...
Read 212 MB in 147455 contigs.
Using 1.88 GB of memory and 8.87 B/bp.
Read 15 alignments. Hash load: 5 / 8 = 0.625 using 975 kB.
Read 131 alignments. Hash load: 9 / 16 = 0.5625 using 975 kB.
Read 1000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
Read 2000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
[............]
Read 499000000 alignments. Hash load: 4 / 32 = 0.125 using 1.22 MB.
Mapped 302150492 of 499073402 reads (60.5%)
Mapped 299438060 of 499073402 reads uniquely (60%)
Read 499073402 alignments
Mateless 0
Unaligned 56417991 22.6%
Singleton 84086928 33.7%
FR 92547076 37.1%
RF 7349 0.00295%
FF 23043 0.00923%
Different 16454314 6.59%
Total 249536701
FR Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096 ignored: 14680
_____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Mate orientation FR: 92547076 (100%) RF: 7349 (0.00794%)
The library pe1-3.hist is oriented forward-reverse (FR).
Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096
_____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Minimum and maximum distance are set to -199 and 1096 bp.
error: input must be sorted: saw `flattened_line_137693_1' before `flattened_line_86968_1'

Everything after this point does not work properly because the file "pe1-3.dist" is empty. Clearly, I think the input for DistanceEst is not properly sorted by the "sort" command, which likely has to do with the difference between the fasta sequence names in the DDN output (e.g. >flattened_line_0_1) versus ABYSS output (e.g. >0 100 315).

Can I modify the sort command some how to work properly with the DDN fasta names? Or should I rename all the fasta sequences in the DDN output to something more similar to the ABYSS format? I'm not sure how exactly the "sort" command is working, so I'm not sure how the input needs to be formatted. Any suggestions are appreciated.

Best,

Dan

 

abyss assembly • 2.3k views
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by dbrowne.up60

I suppose it's worth mentioning that the version of ABySS that I am using to do this was compiled with --enable-maxk=256 so using k=200 isn't a problem.

ADD REPLYlink written 4.0 years ago by dbrowne.up60
2
gravatar for dbrowne.up
4.0 years ago by
dbrowne.up60
United States
dbrowne.up60 wrote:

Answer from Ben Vandervalk, for anyone who is interested:

Hi Dan,

Sorry for the slow response.

If you just want to ABySS for scaffolding with MPET then what you will have to do "trick" ABySS pipeline into starting at the scaffolding stage with your assembly FASTA file.  With your current setup, you are actually running a couple stages of the assembly pipeline prior to scaffolding (the unitig and contig stages), in addition to doing the scaffolding stage itself.  

So how do you "trick" ABySS to starting at the scaffolding stage?  It is not as user-friendly as we would like.  Since ABySS is a Makefile, what you need to do is setup the prerequisite files for the scaffold stage with exactly the right filenames and then run your abyss-pe as normal.

Scaffolding starts after stage 6 (contig stage), so I think the prerequisite files would be:

SXPX_DDN_BROKEN-6.fa
SXPX_DDN_BROKEN-6.dot

To set those up, rename the FASTA IDs in your assembly FASTA file to numeric IDS, change the filename to SXPX_DDN_BROKEN-6.fa, and then build the corresponding dot file with the `AdjList` program. (The 'dot' file is a graph describing end-to-end overlaps between the sequences).  You are correct that ABySS expects integers for the FASTA IDs.  Only the first number in the FASTA headers (the ID) should be important.  Don't worry about replicating the other numbers, as they shouldn't have any effect. 

In addition, I think you will also need to tell `make` not to build any of the prerequisite files for SXPX_DDN_BROKEN-6.fa / SXPX_DDN_BROKEN-6.dot
by adding the following options to the end of your `abyss-pe` command:

-o SXPX_DDN_BROKEN-6.fa -o SXPX_DDN_BROKEN-6.dot

As you may be aware, you can add the `-n` (--dry-run) to see what commands would be run by abyss-pe before actually running the ABySS pipeline for real.  I highly recommend that. If everything goes as planned, the first commands should be alignments of the MPET files against the contigs (-6 files) with `abyss-map`.   Then following that, the commands to generate the -7 and -8 files.  The -8.fa is the scaffolds file.

Phew.  HTH,

- Ben

With that established, it's super easy to change the fasta headers to meet the spec, as follows (source: https://www.biostars.org/p/53212/#53219):

awk '/^>/{print ">" ++i;next}{print}' < SXPX_DDN_BROKEN-1.fa > SXPX_DDN_BROKEN_RN-1.fa 
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by dbrowne.up60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1184 users visited in the last hour