Question: How To Check If Illumina Fastq Is Single Or Paired End With Minimal Sequence Id
8
6.5 years ago by
fbrundu300
European Union
fbrundu300 wrote:

Hi all, I am trying to check whether a FASTQ file is single-end or paired-end. According to Wikipedia, the default sequence identifier looks like this:

@HWUSI-EAS100R:6:73:941:1973#0/1

but in my case the sequence ID looks like this:

@HWUSI-EAS100R:6:73:941:1973

with the # part missing. Can I assume that it is single-end? I could not find a good source to learn about this; can you also point me to one? Thanks

illumina fastq paired-end • 32k views
ADD COMMENTlink modified 6.5 years ago by Chris Fields2.1k • written 6.5 years ago by fbrundu300
2

Also check for duplicated read ids, to verify that the two files were not interleaved. This is as simple as grepping for a few read ids:

grep HWUSI-EAS100R:6:73:941:1973 readfile.fq

or just cut out the read ids and count them this way:

grep '^@HWUSI-EAS100R' readfile.fq | head -n 100000 | sort | uniq -c | sort -rgk 1,1 | head
ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Istvan Albert ♦♦ 84k

I am using the second one and I'll let you know later. (Why the sort flag -k? And why the first sort after head?) Moreover, if only one file is available, can I assume that it is single-end? Thanks

ADD REPLYlink written 6.5 years ago by fbrundu300
1

There was a missing |, now edited. The flag is perhaps not needed in this case; it is just a habit to restrict the sort to the field I am interested in rather than all fields.

Having one file seems to indicate single-end data. But if you have identical read IDs, it might be interleaved paired-end data.

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Istvan Albert ♦♦ 84k

Ah OK, I misread your first comment. So files can be interleaved. I have to check whether the part of the seq ID before the # has duplicates, because paired-end reads share the same seq ID. Am I correct?

ADD REPLYlink written 6.5 years ago by fbrundu300
2

Yes, as Istvan suggests, just grep for one ID: you should get one result if it's single-end. If you get two results, it's an interleaved paired-end file.

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Nicolas Rosewick9.0k

Ah OK, I got it. So if the reads are paired-end, every seq ID always appears twice (in other words, no seq ID is unique). Please add a new answer or update the existing one with all this info if you can, so I can accept it for future reference.

ADD REPLYlink written 6.5 years ago by fbrundu300
1

Yes, with paired-end data the two reads of a pair have the same ID.
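To make the "same ID for one pair" check concrete, here is a minimal sketch that tests whether consecutive records in a single file pair up by ID, which is what an interleaved file would look like. It assumes plain 4-line FASTQ records (no wrapped sequences), and the function name looks_interleaved is made up for illustration.

```shell
# Report whether consecutive FASTQ records share a read ID (interleaved layout).
# Assumption: every record is exactly 4 lines. looks_interleaved is a hypothetical name.
looks_interleaved() {
    awk 'NR % 4 == 1 {
             id = $1                    # keep only the part before the first space
             sub(/\/[12]$/, "", id)     # drop an old-style /1 or /2 suffix
             n++
             if (n % 2 == 1) { first = id }
             else if (id != first) { bad = 1; exit }   # END still runs after exit
         }
         END {
             if (n == 0 || n % 2 == 1 || bad) print "not interleaved"
             else print "interleaved"
         }' "$1"
}
```

Calling `looks_interleaved readfile.fq` prints one of the two strings. "not interleaved" only means that adjacent records do not pair up; it does not by itself prove the run was single-end.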

ADD REPLYlink written 6.5 years ago by Nicolas Rosewick9.0k
4
6.5 years ago by
Belgium, Brussels
Nicolas Rosewick9.0k wrote:

If it's paired-end, you should have two files (R1.fastq and R2.fastq). The read ID should contain information about the pair. Here is an example of one pair of reads; the information is at the end of the header line, in 1:N:0:28. The leading 1 marks a read from R1.fastq. FYI, it's from a HiSeq 2000 run.

R1.fastq :

@M00991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
NGCTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAA
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF

R2.fastq :

@M00991:61:000000000-A7EML:1:1101:14011:1001 2:N:0:28
TTGCTACTCTCTCATTTCTTCCCATGCCTTCCTTCCCCCATCATGCCGACCTAGGAGCC
+
CCCCC,;FF,EA9CEE<6CFAFGGGD@,,6CC<FA@FG:FF8@F9EE7@FGCFGFFFFG

So in your example, it should be single-end data.
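As a rough way to look for this pairing information, the sketch below classifies the first header line of a file as old style (ending in /1 or /2), new style (a second field such as 1:N:0:28), or neither. The function name fastq_header_style is made up for illustration, and only the first record is inspected. Note that a read number of 1 also shows up in single-end data, so only a read number of 2 is a real hint that the file holds second mates.

```shell
# Classify the first FASTQ header by identifier style.
# fastq_header_style is a hypothetical helper name; only line 1 is examined.
fastq_header_style() {
    head -n 1 "$1" | awk '{
        if ($1 ~ /\/[12]$/)            # e.g. @HWUSI-EAS100R:6:73:941:1973#0/1
            print "old style, read " substr($1, length($1))
        else if ($2 ~ /^[12]:[YN]:/)   # e.g. "... 1:N:0:28" after a space
            print "new style, read " substr($2, 1, 1)
        else
            print "no read number in header"
    }'
}
```

For a header like `@HWUSI-EAS100R:6:73:941:1973`, this prints "no read number in header", in which case the ID alone cannot settle the question and counting duplicate IDs is the better check.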

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Nicolas Rosewick9.0k

Two things. First, the seq ID I posted is only an example; my real seq ID is different but formatted in the same way. Second, the seq ID you provided as an example is clear to me, but it seems to be the new format, not the old one (see Wikipedia under "Illumina sequence identifiers"). Moreover, I have only one FASTQ file (and only one is available): can I assume that it is single-end because there is only one file?

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by fbrundu300
1
6.5 years ago by
Chris Fields2.1k
University of Illinois Urbana-Champaign
Chris Fields2.1k wrote:

As @NicoBxl indicated, there are normally two files, one for each read of the pair, so the best bet is that this is single-end. However, you'll note that the ID in that example (the part before the space, not including the pairing info) matches in both cases. This ID contains the lane coordinates of the cluster the read belongs to; both reads of a pair originate from the same cluster and therefore have the same coordinates.

So, if you really want to be sure, you could parse through the data to make sure all the IDs are unique; if you have two sequences per ID, this may mean your data are paired but interleaved.
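One possible way to do that parse over the whole file is sketched below: it tallies how many records share each ID and prints a small histogram (occurrences, then number of IDs). It assumes 4-line records, keeps every ID in memory, and the function name count_ids_per_read is made up for illustration.

```shell
# Histogram of how many records share each read ID.
# Assumption: every record is exactly 4 lines. count_ids_per_read is a hypothetical name.
count_ids_per_read() {
    awk 'NR % 4 == 1 {
             id = $1                    # keep only the part before the first space
             sub(/\/[12]$/, "", id)     # drop an old-style /1 or /2 suffix
             seen[id]++
         }
         END {
             for (id in seen) hist[seen[id]]++
             for (n in hist) print n, hist[n]   # "occurrences number_of_ids"
         }' "$1" | sort -n
}
```

If every ID is unique, the output is a single line starting with 1; a single line starting with 2 would suggest interleaved pairs. For very large files the memory use may be a concern, in which case running it on the first few hundred thousand lines (a multiple of 4) is a reasonable compromise.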

ADD COMMENTlink written 6.5 years ago by Chris Fields2.1k