Question: How To Cluster Mate Pair, Paired End And Single End Reads From Single File?
5.9 years ago by
HG1.1k wrote:

Hi, I did Illumina mate-pair sequencing. When I checked the fastq file, I found that mate-pair, paired-end and single-end reads are all present. Can anyone tell me how to separate the mate-pair, paired-end and single-end reads from that single fastq file?

Thank you in advance.

illumina fastq • 2.8k views
ADD COMMENTlink modified 5.9 years ago by cts1.6k • written 5.9 years ago by HG1.1k

Is this mix of reads a result of trimming, or did something go amiss during library prep or sequencing? Mate-pairs have a different orientation than paired-end reads, so you could just exploit that during mapping. If you have single-end reads in there, you'll need to explicitly remove them. There are threads elsewhere on this forum about syncing fastq files (see "How to sort two mate pair (fastq) files so that the order of the identifiers is the same?" and "Combining the paired reads from Illumina run").
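To make the orientation idea concrete, here is a rough sketch that labels each record of an aligned SAM file by the relative orientation of the pair. It assumes the aligner sets the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped, 0x10 read reverse, 0x20 mate reverse) and that both mates are on the same reference. The function name `classify_orientation` and the labels are made up for illustration. "inward" is what you would usually expect from paired-end fragments and "outward" from Illumina mate-pairs, but treat that as a first-pass hint rather than a definitive call.

```shell
# Hypothetical helper: label each SAM record read from stdin by pair orientation.
classify_orientation() {
    awk -F'\t' '
    /^@/ { next }                       # skip header lines
    {
        flag = $2
        paired        = flag % 2            # bit 0x1
        read_unmapped = int(flag / 4) % 2   # bit 0x4
        mate_unmapped = int(flag / 8) % 2   # bit 0x8
        read_reverse  = int(flag / 16) % 2  # bit 0x10
        mate_reverse  = int(flag / 32) % 2  # bit 0x20

        if (!paired || read_unmapped || mate_unmapped || $7 != "=") {
            label = "other"             # not a pair mapped to one reference
        } else if (read_reverse == mate_reverse) {
            label = "same_strand"       # both mates on the same strand
        } else {
            # leftmost position of the forward-strand mate and of the reverse-strand mate
            fwd_pos = read_reverse ? $8 : $4
            rev_pos = read_reverse ? $4 : $8
            label = (fwd_pos + 0 <= rev_pos + 0) ? "inward" : "outward"
        }
        print $1 "\t" label
    }'
}
```

It could be run as `classify_orientation < file.sam | cut -f2 | sort | uniq -c` to get a quick count of each class.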

ADD REPLYlink written 5.9 years ago by Devon Ryan92k

It may have happened during sequencing or during library prep. I don't have much prior information, but while mapping I can see that some reads are paired-end, some are mate-pair and some are single-end. Now I want to separate all three types into three files from the single fastq file. Is that possible? Please also let me know how I can remove the single-end reads from the mixture.

ADD REPLYlink written 5.9 years ago by HG1.1k

What you're actually observing is that some of the reads align better in one orientation than what you expect and, for still others, one of the reads simply won't align so the aligner just goes ahead and aligns the mate as a singleton. This doesn't mean that you have a mix of reads in the fastq file.
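One way to check this on your own alignment is to tally the records by mapping status. The sketch below assumes a SAM file with standard FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name `tally_pair_status` and the category names are invented for the example. A "singleton" here is a mapped read whose mate did not align, which is the situation described above.

```shell
# Hypothetical helper: count SAM records from stdin by mapping status.
tally_pair_status() {
    awk -F'\t' '
    /^@/ { next }                       # skip header lines
    {
        flag = $2
        paired        = flag % 2            # bit 0x1
        read_unmapped = int(flag / 4) % 2   # bit 0x4
        mate_unmapped = int(flag / 8) % 2   # bit 0x8

        if (!paired)             status = "unpaired"
        else if (read_unmapped)  status = "unmapped"
        else if (mate_unmapped)  status = "singleton"
        else                     status = "both_mapped"
        count[status]++
    }
    END {
        printf "both_mapped\t%d\n", count["both_mapped"]
        printf "singleton\t%d\n",   count["singleton"]
        printf "unmapped\t%d\n",    count["unmapped"]
        printf "unpaired\t%d\n",    count["unpaired"]
    }'
}
```

Usage would be `tally_pair_status < file.sam`. If the "unpaired" count is zero, every read in the input had a mate, and the singletons are alignment outcomes rather than single-end reads in the fastq file.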

ADD REPLYlink written 5.9 years ago by Devon Ryan92k
5.9 years ago by
cts1.6k wrote:

Unfortunately, with Illumina mate-pairs there is always a mix of paired-end and mate-pair reads. You can perform some preprocessing steps to try to segregate them based on the presence of the adaptor sequence. I've used NextClip for this, but it is specific to the Nextera mate-pair protocol. After that, to remove the final contaminants, you will need to align against a reference genome and then look at the observed insert sizes. You can do this by looking at the TLEN column in a SAM file (the 9th column): if the reads are paired this should equal the insert size, and if it's a single mapped read the column will be 0.

Below is a sample awk script that will separate a SAM file into two based on the insert size of paired reads. It's not quite complete; you may need to add some extra logic to get it to work properly. I also haven't tested it, so I apologise in advance if there is a bug in it.

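awk -F'\t' '
function abs(x) {
    return ((x < 0.0) ? -x : x)
}
/^@/ {  # copy header lines to both output files
    print $0 >> "pe.sam"
    print $0 >> "mp.sam"
    next
}
{
    # note: singletons (TLEN 0) also fall below the cutoff and land in pe.sam
    if (abs($9) < 500) {  # <-- change number here to be the cutoff between paired-end and matepair
        print $0 >> "pe.sam"
    } else {
        print $0 >> "mp.sam"
    }
}' file.sam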
ADD COMMENTlink written 5.9 years ago by cts1.6k

Illumina's new Nextera mate-pair protocol actually is much better and has a biotin-labeled stuffer adaptor, so you can fish for those with the adaptor (and thus have a much higher confidence set of mate-pairs). Not to mention the protocol does a very good job of removing most of the PE (even up to 15-20kb inserts). We've been pretty happy with it.

ADD REPLYlink written 5.9 years ago by Chris Fields2.1k