DNA Sequencing Using abyss-pe on Paired-End Reads
10.2 years ago
nocgod ▴ 10

Hello! I need to run abyss-pe on two files and find the parameters (k and so on) that give the best contig coverage. (Sorry if I'm mixing things up; I'm a programmer who took a short bioinformatics crash course.) The files are paired-end reads of the Chlorella variabilis chloroplast, and each is 304 MB.

This is my run command:

    abyss-pe k=25 n=10 v=-v c=118 e=51 name=test in='reads1.fastq reads2.fastq'

Since I run the program in verbose mode and can see the memory load, I don't think this is a RAM issue (the system has 16 GB of RAM). I also ran the fastQValidator tool, which reported that the files seem to be invalid; however, my lecturer assures me they are valid and are in fact paired-end reads. The files are in FASTQ format (I guess created on an Illumina/Sanger machine), so they contain some strange characters, like scopes and non-ACGT characters, in the quality strings and in the sequence itself.

This is where the progress stops:

    Mapped 1432486 of 1481614 reads (96.7%)
    Mapped 1430166 of 1481614 reads uniquely (96.5%)
    Read 1481614 alignments
    Mateless   1481614  100%
    Unaligned        0
    Singleton        0
    FR               0
    RF               0
    FF               0
    Different        0
    Total      1481614
    abyss-fixmate: error: All reads are mateless. This can happen when first and second read IDs do not match.
    error: `test-3.hist': No such file or directory

The link I'm submitting contains the whole output during the runtime of abyss-pe in verbose mode.

I've tried everything I could (or know of, for that matter, since this was a crash course for software engineers that I took as a diversity course during my BSc). I'd appreciate some help. Thanks in advance!

dna paired-end • 6.0k views
10.2 years ago

It could be that your reads are valid but not named the way ABySS expects. From the manual:

The suffix of the read identifier for a pair of reads must be one of '1' and '2', or 'A' and 'B', or 'F' and 'R', or 'F3' and 'R3', or 'forward' and 'reverse'

If you run head -1 reads1.fastq and head -1 reads2.fastq, you might see that these suffixes are missing. They are optional and often not included, so your lecturer is technically right in saying that the files are valid. The easiest fix is to add "1" to all identifiers in reads1.fastq and "2" to all identifiers in reads2.fastq. I assume the two files are sorted so that paired reads appear in the same order, right?
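For anyone who prefers a script to head -1, the check can be sketched in Python. This is only an illustration: the helper names `first_header` and `split_at_first_difference` are made up here, and the two example identifiers are the ones the asker quotes later in this thread.

```python
def first_header(path):
    """Return the first line of a FASTQ file, like `head -1`."""
    with open(path) as handle:
        return handle.readline().rstrip("\n")


def split_at_first_difference(id1, id2):
    """Return (common_prefix, tail1, tail2) for two read identifiers."""
    n = 0
    for a, b in zip(id1, id2):
        if a != b:
            break
        n += 1
    return id1[:n], id1[n:], id2[n:]


# Identifiers as quoted later in the thread. With real files one would use
# first_header("reads1.fastq") and first_header("reads2.fastq") instead.
id1 = "@MISEQ:6:000000000-A0REL:1:1101:15372:1440_1:N:0:NTTTCG"
id2 = "@MISEQ:6:000000000-A0REL:1:1101:15372:1440_2:N:0:NTTTCG"

prefix, tail1, tail2 = split_at_first_difference(id1, id2)
print(tail1, tail2)  # prints: 1:N:0:NTTTCG 2:N:0:NTTTCG
```

If the tails are longer than the bare mate marker, as here, the mate number is not the last thing in the identifier, which is what the manual excerpt asks for.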


My files are already named as you suggested: reads1.fastq and reads2.fastq.

Edit: I might have misunderstood you. You mean that all identifiers in the file should have those suffixes.

As far as I can tell from the files, they already have the read identifiers, though I'm not sure that's correct.

These are the first 4 lines of the reads1.fastq file (as I understand it, they are the identifier line, the sequence, the identifier again, and the quality string):

@MISEQ:6:000000000-A0REL:1:1101:15372:1440_1:N:0:NTTTCG
TGGTTGTGCTACACCGGGAGTAGTAAGATATTCACCTCGATCAAAAGGAGCAACTTCAACATCAAGAAATTGAAGATTACTTTCCGGGCTTGACAATAAGCCAATTCCACCAATATCTTTGACACGGGTTTGGTCAAAAATAGTTGTTTTA
+MISEQ:6:000000000-A0REL:1:1101:15372:1440_1:N:0:NTTTCG
?EEAGGGGGECEBGEDDBGGGGEBGGFFFDF?<@BFEHFFEDFFHHHHGHHGGFHHHHIIIIIIIIIIIIIIIIIHFHHHBHFHHIHHHIIIHIIIIIIIIGIHHHFE?IIIIIIIHIHHHHHEHHIIIGGGGGGDDDDDDDDAAB?A???

These are the corresponding lines in reads2.fastq:

@MISEQ:6:000000000-A0REL:1:1101:15372:1440_2:N:0:NTTTCG
GTGGTTTTTTTAGTCTTTCAAAACTTTTTAAAAAACAATTAAAATCTCAAGACCAAATAGATTCTCGTCAGCTTGATCGAGGTGATATCAACACATTCCTTTCTAAAAATGATCAGACTGCTGCGTTTGATCTGTTAGACACTAGTTCTCN
+MISEQ:6:000000000-A0REL:1:1101:15372:1440_2:N:0:NTTTCG
ECEEEEB===B=CDDDDD@DD=DEFFFFFEFFHFFHHFHHHHHFDCCGD=D?FHHHHHHHHHEEFHHHHHHHHHHHHHHHHGHHHHHHHHHHGHHFFHHHHHHHHHHHHFHHHHHHFHHHHHHHHHHHHFFFFFFDDDBB5DBBB???55#

Does this seem correct?


The read identifiers in your files are the lines starting with @, so in your case,

@MISEQ:6:000000000-A0REL:1:1101:15372:1440_1:N:0:NTTTCG

and

@MISEQ:6:000000000-A0REL:1:1101:15372:1440_2:N:0:NTTTCG

As you can see, the suffixes of these are not 1, 2, or R1, R2, etc., so you either have to cut off the :N:0:NTTTCG or add "1" to all lines starting with @ in reads1.fastq and "2" to all lines starting with @ in reads2.fastq, using Python, Perl, sed, or a similar tool.
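A rough Python sketch of the "cut off" option follows. It assumes identifiers shaped like the ones above (a `_1` or `_2` mate marker followed by a colon and more text) and strictly four lines per record; the function name `trim_fastq_lines` is made up for this example. It edits lines by their position in the record rather than by a leading @, because a quality string can also start with @.

```python
import re

# Assumption: the mate marker is "_1" or "_2" and everything from the
# following colon to the end of the line can be dropped.
TAIL = re.compile(r"(_[12]):\S*$")


def trim_fastq_lines(lines):
    """Trim the tail from the identifier and '+' lines of 4-line FASTQ records."""
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (0, 2):  # identifier line and '+' line
            line = TAIL.sub(r"\1", line)
        out.append(line)  # sequence and quality lines are left untouched
    return out


record = [
    "@MISEQ:6:000000000-A0REL:1:1101:15372:1440_1:N:0:NTTTCG",
    "ACGT",
    "+MISEQ:6:000000000-A0REL:1:1101:15372:1440_1:N:0:NTTTCG",
    "@EEA",  # a quality string that happens to start with '@'
]
print(trim_fastq_lines(record)[0])
# prints: @MISEQ:6:000000000-A0REL:1:1101:15372:1440_1
```

To rewrite a whole file, one could pass an open file handle as `lines` and write the returned list to a new file, one line each.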


Unfortunately, that didn't help either.

Just to be on the safe side of things:

reads1.fastq

@MISEQ:6:000000000-A0REL:1:1101:15372:1440_1
TGGTTGTGCTACACCGGGAGTAGTAAGATATTCACCTCGATCAAAAGGAGCAACTTCAACATCAAGAAATTGAAGATTACTTTCCGGGCTTGACAATAAGCCAATTCCACCAATATCTTTGACACGGGTTTGGTCAAAAATAGTTGTTTTA
+MISEQ:6:000000000-A0REL:1:1101:15372:1440_1
?EEAGGGGGECEBGEDDBGGGGEBGGFFFDF?<@BFEHFFEDFFHHHHGHHGGFHHHHIIIIIIIIIIIIIIIIIHFHHHBHFHHIHHHIIIHIIIIIIIIGIHHHFE?IIIIIIIHIHHHHHEHHIIIGGGGGGDDDDDDDDAAB?A???

reads2.fastq

@MISEQ:6:000000000-A0REL:1:1101:15372:1440_2
GTGGTTTTTTTAGTCTTTCAAAACTTTTTAAAAAACAATTAAAATCTCAAGACCAAATAGATTCTCGTCAGCTTGATCGAGGTGATATCAACACATTCCTTTCTAAAAATGATCAGACTGCTGCGTTTGATCTGTTAGACACTAGTTCTCN
+MISEQ:6:000000000-A0REL:1:1101:15372:1440_2
ECEEEEB===B=CDDDDD@DD=DEFFFFFEFFHFFHHFHHHHHFDCCGD=D?FHHHHHHHHHEEFHHHHHHHHHHHHHHHHGHHHHHHHHHHGHHFFHHHHHHHHHHHHFHHHHHHFHHHHHHHHHHHHFFFFFFDDDBB5DBBB???55#

I've removed all the suffixes you've mentioned.


Hm, that looks correct. Could you please try one more thing? The ABySS homepage has some test input data (direct link). Could you compare the test data with yours and check whether your ABySS installation can handle the test data?

Some differences I see that might not be documented:

  • in ABySS's data the number suffix has a slash (i.e., "/1", "/2") - maybe it needs the slash?
  • in ABySS's data the + line has no identifier (that one's optional and I rarely see it)
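
Both differences can be tried at once with a short Python sketch. It assumes the already-trimmed identifiers shown above (ending in `_1` or `_2`) and four lines per record. Replacing the trailing `_1` with `/1`, rather than appending to it, is a choice made here so that both mates share the same base name; the function name `to_slash_style` is made up for this example.

```python
def to_slash_style(lines, mate):
    """Rewrite 4-line FASTQ records so identifiers end in /1 or /2.

    mate is "1" for the first read file and "2" for the second.
    """
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            # Drop a trailing "_1" / "_2" if present, then add "/1" / "/2".
            if line.endswith("_" + mate):
                line = line[:-2]
            out.append(line + "/" + mate)
        elif i % 4 == 2:
            out.append("+")  # the repeated identifier on the '+' line is optional
        else:
            out.append(line)  # sequence and quality lines stay as they are
    return out


record = [
    "@MISEQ:6:000000000-A0REL:1:1101:15372:1440_1",
    "ACGT",
    "+MISEQ:6:000000000-A0REL:1:1101:15372:1440_1",
    "IIII",
]
print(to_slash_style(record, "1")[0])
# prints: @MISEQ:6:000000000-A0REL:1:1101:15372:1440/1
```

Running it with "1" on the first file and "2" on the second yields identifiers that differ only in the final /1 or /2.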

Please post this as an answer, since it solved my problem (adding /1 and /2 to the read identifiers in the corresponding read files).


The first thing I did after compiling ABySS on my machine was to get the test data and run it. It runs to the end and presents the table with contig coverage etc., so I presume that ABySS itself is working fine. I have a strong feeling that the data provided to me is not in a standard format and that ABySS cannot handle it. I'll try putting /1 and /2 in the suffix instead of only 1 and 2. As for the + identifier line, I'll research how to remove those; I hope a regex can do the trick.
