Adding suffix /1 & /2 to PE data - ABySS input data
5.9 years ago

Hi Abyssers,

I have run Trimmomatic on my PE data, which generates R1 & R2 files whose read names do not have any suffixes. It is not obvious to me whether I must add /1 & /2 to each read name, or whether simply telling ABySS that the reads are pairs using pe='r1.fastq r2.fastq' is enough for it to recognise the pairs and get on with the assembly correctly.

/SB

5.9 years ago

Alternatively, with BBMap:

reformat.sh in=r1.fq in2=r2.fq out1=renamed1.fq out2=renamed2.fq addslash

This did not work for me because the names were of the type

@HISEQ:149:C76YNACXX:3:1101:1159:2191 1:N:0:GCCAAT

and BBMap added the /1 after the comment, so they looked like this

@HISEQ:149:C76YNACXX:3:1101:1159:2191 1:N:0:GCCAAT /1

rather than like this

@HISEQ:149:C76YNACXX:3:1101:1159:2191/1

For me, the solution (thanks @Salim!) was to use sed (the spaces below are intentional):

cat reads_1.fq | sed -e 's, 1:N:0:ATCACG,/1,g' > corrected1.fq

cat reads_2.fq | sed -e 's, 2:N:0:ATCACG,/2,g' > corrected2.fq
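
The two sed commands above hard-code the index sequence (ATCACG), so they need editing for every sample. A minimal sketch of a more general variant is below. The function name add_pair_suffix is made up for illustration, and the sketch assumes plain 4-line FASTQ records whose header comment starts with the read number (1:N:0:... or 2:N:0:...):

```shell
# Hypothetical helper (not part of ABySS or BBMap): rewrite FASTQ headers so the
# read number from the " 1:N:0:INDEX" comment becomes a "/1" or "/2" suffix.
# Assumes 4-line FASTQ records; reads a file argument or standard input.
add_pair_suffix() {
    awk '
        # Header lines are lines 1, 5, 9, ... of the file.
        # Names that already end in /1 or /2 are left alone.
        NR % 4 == 1 && $2 ~ /^[12]:/ && $1 !~ /\/[12]$/ {
            print $1 "/" substr($2, 1, 1)   # drop the comment, keep the read number
            next
        }
        { print }                           # everything else passes through unchanged
    ' "$@"
}

# Demo on a single record with a header taken from this thread
printf '%s\n' '@HISEQ:149:C76YNACXX:3:1101:1159:2191 2:N:0:GCCAAT' 'ACGT' '+' 'IIII' |
    add_pair_suffix
```

On real data this would be run once per file, e.g. add_pair_suffix reads_1.fq > corrected1.fq. Unlike the sed version, it discards the rest of the comment (filter flag and index), which may or may not matter for later steps.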

5.9 years ago
Yes, I believe that ABySS requires the pair information to be present (either 1/2, forward/reverse or A/B), and the files may be separate or interleaved. You can add the pair information back with Pairfq. Here is an example (requires curl and perl):

curl -sL git.io/pairfq_lite | perl - addinfo -i R1.fq -o R1_info.fq -p 1
curl -sL git.io/pairfq_lite | perl - addinfo -i R2.fq -o R2_info.fq -p 2


That should go pretty fast, and the input can be FASTA or FASTQ (compressed is fine too, I believe).

5.9 years ago

The solutions posted so far would work great. I just remembered an old blog post with a one-liner to convert the new Illumina naming scheme to the old one:

cat new-style_.fastq | awk '{if (NR % 4 == 1) {split($1, arr, ":"); printf "%s_%s:%s:%s:%s:%s#0/%s (%s)\n", arr[1], arr[3], arr[4], arr[5], arr[6], arr[7], substr($2, 1, 1), $0} else if (NR % 4 == 3){print "+"} else {print$0} }' > old-style.fastq


It's kinda nice since you are not relying on any other tools, just bash and good ol' awk. I think this will only work if you have the new-style header (something like 1:N:0 and 2:N:0); it may not if your header has no info about pairs.
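
Since the one-liner depends on which header style the file uses, it may help to check that first. The following is only a rough sketch: header_style is a hypothetical name, the three labels are my own, and it looks at the first line only, assuming the rest of the file follows the same convention.

```shell
# Hypothetical helper: report which pairing convention the first FASTQ header uses.
# Reads a file argument or standard input and inspects line 1 only.
header_style() {
    awk 'NR == 1 {
        if ($2 ~ /^[12]:[YN]:[0-9]+/)   print "casava-1.8"     # e.g. "1:N:0:GCCAAT" comment
        else if ($1 ~ /\/[12]$/)        print "slash-suffix"   # name ends in /1 or /2
        else                            print "no-pair-info"
        exit
    }' "$@"
}

# Demo with a header from this thread
printf '%s\n' '@HISEQ:149:C76YNACXX:3:1101:1159:2191 1:N:0:GCCAAT' | header_style
```

If this reports no-pair-info, the awk conversion above has nothing to work with and the read number has to come from somewhere else, such as the file name.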

5.9 years ago

Thank you lads. I went ahead and assumed that the suffixes are important. Thank you for confirming that.

I used BBMap's reformatter script for this.

I have a related question though. I ran abyss-2fastq on some data I was analysing a week ago, and it added /1 and /2 not at the very end of the header but before the last few characters, e.g. below. Is this recognisable by ABySS?

@HISEQ:149:C76YNACXX:3:1101:1159:2191/1 1:N:0:GCCAAT
--------------------------------------^
ATAATTAAAGCAGGAATAGTAAAAAAACGTCCCTTAAAACGTATCAAGAAATCCGACCCAGACTGGGATTACGCAACCTGCGACGGCCCGTTGTGCCTGCG
+
BBBFFFFFFFFFFIBFIFIIIIIIIIIIIIIIIFIFFIIIFFFIBFIIIIIIFFFIFFFFFFFFFFFBBFFFFBFFFFFFBBFFFFFFF<BFFBFBBFFFF
@HISEQ:149:C76YNACXX:3:1101:1159:2191/2 2:N:0:GCCAAT
--------------------------------------^
AACCTTGCGACGACCTGAAGGACGGACCGTCGCAGGCACAACGGGCCGTCGCAGGTTGCGTAATCCCAGTCTGGGTCGGATTTCTTGATACGTTTTAAGGG
+
BBBFFFFFFFFFFFIIIIFFIFIIIFFFFIFFFFFFFFFFFFFBFBBFFF7<77B<BB<BBBBFFFFBBBFBFFF<BBF7BBBBFFB<BBFBBFFF<<BBF

Hard to tell, but these can be easily removed:

sed -i 's, 1:N:0:GCCAAT,,g' file_r1.fastq

Hi Salim,

ABySS treats the first whitespace-separated word in the line as the read ID, so there is no need to remove the 1:N:0:GCCAAT or 2:N:0:GCCAAT. Everything after the first space is considered to be a comment/description.
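
To see the names the way this rule describes them, the first whitespace-separated word of each header can be printed. This is a small sketch that only mimics the rule as stated above, not ABySS's actual parser. read_ids is a hypothetical name, and 4-line FASTQ records are assumed:

```shell
# Hypothetical helper: print the read ID of every record, i.e. the first
# whitespace-separated word of each header line, without the leading "@".
# Reads a file argument or standard input; assumes 4-line FASTQ records.
read_ids() {
    awk 'NR % 4 == 1 { print substr($1, 2) }' "$@"
}

# Demo with a header from the question above: the /1 sits before the comment,
# so it ends up inside the read ID.
printf '%s\n' '@HISEQ:149:C76YNACXX:3:1101:1159:2191/1 1:N:0:GCCAAT' 'ACGT' '+' 'IIII' |
    read_ids
```

For that record the ID comes out as HISEQ:149:C76YNACXX:3:1101:1159:2191/1, so the suffix is kept and the trailing 1:N:0:GCCAAT is ignored.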

5.9 years ago

Hello again,

I received a reply from Ben Vandervalk, who is one of the authors of ABySS, and it goes as follows:

pe="r1.fastq r2.fastq" should suffice.

ABySS requires that either:

(i) the read names for both reads are identical, OR (ii) the read names have an identical prefix, followed by "/1" and "/2", respectively.

- Ben
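
The two conditions in Ben's reply can be checked before starting an assembly. Below is a rough sketch, not an ABySS tool: check_pairs is a hypothetical name, and it assumes uncompressed 4-line FASTQ files with reads in the same order and no tab characters in the headers. Following the earlier comment in this thread, it compares only the first whitespace-separated word of each header.

```shell
# Hypothetical helper: verify that two FASTQ files satisfy either
#   (i)  identical read names, or
#   (ii) an identical prefix followed by "/1" and "/2".
# Returns a non-zero exit status if any pair fails both conditions.
check_pairs() {
    paste "$1" "$2" | awk -F '\t' '
        NR % 4 == 1 {
            split($1, a, " "); split($2, b, " ")    # keep only the first word
            id1 = a[1]; id2 = b[1]
            n1 = length(id1); n2 = length(id2)
            same  = (id1 == id2)
            slash = (substr(id1, n1 - 1) == "/1" && substr(id2, n2 - 1) == "/2" &&
                     substr(id1, 1, n1 - 2) == substr(id2, 1, n2 - 2))
            if (!same && !slash) { bad++; print "mismatch: " id1 " vs " id2 }
        }
        END { printf "%d mismatched pair(s)\n", bad; exit (bad > 0) }'
}

# Demo on two one-record files in a temporary directory
tmp=$(mktemp -d)
printf '%s\n' '@read1/1 1:N:0:GCCAAT' 'ACGT' '+' 'IIII' > "$tmp/r1.fq"
printf '%s\n' '@read1/2 2:N:0:GCCAAT' 'TTGA' '+' 'IIII' > "$tmp/r2.fq"
check_pairs "$tmp/r1.fq" "$tmp/r2.fq"
rm -r "$tmp"
```

Trimmomatic output with untouched names should pass under condition (i), and files with /1 and /2 suffixes should pass under condition (ii).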