How To Check If Illumina Fastq Is Single Or Paired End With Minimal Sequence Id
8.6 years ago
fbrundu ▴ 330

Hi all, I am trying to check whether a FASTQ file is single-end or paired-end. According to Wikipedia, the default sequence ID format looks like this:

@HWUSI-EAS100R:6:73:941:1973#0/1

but in my case the sequence id is like

@HWUSI-EAS100R:6:73:941:1973

with the # part missing. Can I assume that it is single-end? I could not find a good source to learn about this. Can you also point me to one? Thanks

fastq paired-end illumina • 42k views

Also check for duplicated read ids, to verify that the two files were not interleaved. This is as simple as grepping for a few read ids:

grep HWUSI-EAS100R:6:73:941:1973 readfile.fq

or just cut out the read ids and count them this way:

grep '^@HWUSI-EAS100R' readfile.fq | head -100000 | sort | uniq -c | sort -rgk 1,1 | head
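The same duplicate count can be sketched in Python, reading the file as 4-line records so that quality lines that happen to start with @ are never mistaken for headers. This is a minimal sketch, not a definitive tool: the helper names `base_id` and `count_ids` are made up for illustration, and it assumes an uncompressed FASTQ with exactly four lines per record.

```python
from collections import Counter
from itertools import islice


def base_id(header):
    """Return the read ID without '@', without the comment after the first
    space, and without an old-style /1 or /2 mate suffix."""
    name = header.strip()
    if name.startswith("@"):
        name = name[1:]
    name = name.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def count_ids(lines, max_records=100000):
    """Count how often each read ID occurs in the first max_records records.

    Assumes strict 4-line FASTQ records, so every 4th line is a header.
    """
    headers = islice(lines, 0, None, 4)
    return Counter(
        base_id(h) for h in islice(headers, max_records) if h.strip()
    )
```

With an open file handle `fh` for readfile.fq, `count_ids(fh).most_common(5)` lists the most frequent IDs. If every ID has a count of 1 the file looks single-end; counts of 2 suggest interleaved pairs.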

I am using the second one and I'll let you know later. (Why the sort flag -k? And why the first sort after head?) Moreover, if only one file is available, can I assume that it is single-end? Thanks


There was a missing |, now edited. The -k flag is perhaps not needed in this case; it is just a habit of mine to restrict the sort to the field I am interested in rather than all fields.

Having one file suggests single-end data. But if you have identical read ids, it might be interleaved paired-end data.


Ah ok, I misread your first comment. Perfect, so files can be interleaved. So I have to check whether the seq id before the # has duplicates, because paired-end reads share the same seq id. Am I correct?


Yes, as Istvan suggests, just grep for one id. You should get one result if it's single-end. If you get two results, it's an interleaved paired-end file.


Ah ok, I got it. So if the data are paired-end, every read has a mate with the same seq id (or conversely, no seq id occurs only once). Please add a new answer or update the existing one with all the info if you can, so I can accept it for future reference.


Yes, with paired-end data you will have the same id for both reads of a pair.

8.6 years ago

If it's paired-end you should have two files (R1.fastq and R2.fastq). The read id should carry information about the pair. Here is an example of one pair of reads. The information is at the end of the header line, in 1:N:0:28, where the leading 1 indicates a read from R1.fastq. FYI, it's from a HiSeq 2000 run.

R1.fastq :

@M00991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
NGCTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAA
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF

R2.fastq :

@M00991:61:000000000-A7EML:1:1101:14011:1001 2:N:0:28
TTGCTACTCTCTCATTTCTTCCCATGCCTTCCTTCCCCCATCATGCCGACCTAGGAGCC
+
CCCCC,;FF,EA9CEE<6CFAFGGGD@,,6CC<FA@FG:FF8@F9EE7@FGCFGFFFFG

So in your example, it should be single-end data.
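The two header layouts that come up in this thread (the older `#0/1` suffix and the newer `1:N:0:28` comment) can be told apart with a couple of regular expressions. The following is only a sketch: the function name `mate_number` and both patterns are assumptions written for these example headers, not an official parser. Note that a header without a mate number does not prove single-end data, since some tools strip that suffix.

```python
import re

# Older style, e.g. @HWUSI-EAS100R:6:73:941:1973#0/1 (mate number after '/')
OLD_STYLE = re.compile(r"^@(?P<id>\S+?)/(?P<mate>[12])$")
# Newer style, e.g. @M00991:...:1001 1:N:0:28 (mate number starts the comment)
NEW_STYLE = re.compile(r"^@(?P<id>\S+)\s+(?P<mate>[12]):[YN]:\d+:\S*$")


def mate_number(header):
    """Return (read_id, mate), where mate is 1, 2, or None if the header
    carries no pairing information in either layout."""
    header = header.strip()
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(header)
        if match:
            return match.group("id"), int(match.group("mate"))
    # No recognizable pairing info: return the bare ID only
    return header.lstrip("@").split()[0], None
```

For the header from the question, `mate_number("@HWUSI-EAS100R:6:73:941:1973")` returns the ID with `None` as the mate, which is consistent with single-end data but still worth confirming by checking for duplicated IDs.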


Two things: the seq id I posted is only an example; my seq id is different but formatted in the same way. The seq id you provided as an example is clear to me, but it seems to be the new format, not the old one (see Wikipedia under "Illumina sequence identifiers"). Moreover, I have only one FASTQ file (and only one is available): can I assume that it is single-end because there is only one file?

8.6 years ago
Chris Fields ★ 2.2k

As @NicoBxl indicated, there are normally two files, one for each read of the pair, so the best bet is that this is single-end. However, you'll note that the ID in that example (the part before the space, not including the pairing info) matches in both cases. This ID contains the lane and the coordinates of the cluster the read belongs to; both reads of a pair originate from the same cluster and so have the same coordinates.

So, if you really want to be sure, you could parse through the data to make sure all the IDs are unique; if you have two sequences per ID, this may mean your data are paired but interleaved.
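One way to do that parsing is to walk the file record by record and test whether consecutive records share an ID, which is the usual layout of an interleaved file. This is a rough sketch under stated assumptions: `looks_interleaved` and `base_id` are hypothetical helper names, records are assumed to be exactly four lines each, and mates are assumed to sit next to each other.

```python
from itertools import islice


def base_id(header):
    """Strip '@', the comment after the first space, and a trailing /1 or /2."""
    name = header.strip()
    if name.startswith("@"):
        name = name[1:]
    name = name.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def looks_interleaved(lines, max_pairs=10000):
    """Return True if the first records come in adjacent pairs sharing an ID.

    Only the first 2 * max_pairs records are inspected, so this is a
    heuristic check rather than a full validation of the file.
    """
    headers = islice(islice(lines, 0, None, 4), 2 * max_pairs)
    ids = [base_id(h) for h in headers if h.strip()]
    if len(ids) < 2 or len(ids) % 2 != 0:
        return False
    return all(ids[i] == ids[i + 1] for i in range(0, len(ids), 2))
```

A result of `False` on a single file whose IDs are all unique supports the single-end interpretation; `True` suggests paired-end data stored interleaved.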
