How To Cluster Mate Pair, Paired End And Single End Reads From Single File?
8.6 years ago
HG ★ 1.2k

Hi, I did mate-pair Illumina sequencing. When I checked the FASTQ file, mate-pair, paired-end and single-end reads all appear to be present. Can anyone tell me how to separate the mate-pair, paired-end and single-end reads from that single FASTQ file?

illumina fastq • 3.7k views

Is this mix of reads a result of trimming, or did something go amiss during library prep or sequencing? Mate pairs have a different orientation than paired-end reads, so you could just exploit that during mapping. If you have single-end reads in there, you'll need to remove them explicitly. There are threads elsewhere on this forum about syncing FASTQ files (see "How to sort two mate pair (fastq) files so that the order of the identifiers is the same?" and "Combining the paired reads from Illumina run").
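To make the orientation point concrete, here is a minimal Python sketch that labels one aligned SAM record as FR (inward-facing, what a paired-end library normally gives) or RF (outward-facing, what an Illumina mate-pair library normally gives). The function name `pair_orientation` is hypothetical, and the sketch assumes both mates map to the same reference sequence and compares only leftmost positions, so overlapping or clipped pairs may be mislabelled.

```python
def pair_orientation(flag, pos, pnext):
    """Return 'FR', 'RF', 'tandem', or None for one SAM record.

    flag, pos and pnext are SAM columns 2, 4 and 8 as integers.
    Assumption: both mates map to the same reference sequence.
    """
    PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
    REVERSE, MATE_REVERSE = 0x10, 0x20

    # Anything that is not a fully mapped pair has no orientation.
    if not flag & PAIRED or flag & (UNMAPPED | MATE_UNMAPPED):
        return None

    read_rev = bool(flag & REVERSE)
    mate_rev = bool(flag & MATE_REVERSE)
    if read_rev == mate_rev:
        return "tandem"  # both mates on the same strand

    # Work out where the forward-strand and reverse-strand mates start.
    fwd_pos, rev_pos = (pnext, pos) if read_rev else (pos, pnext)
    # Forward mate to the left means the reads point at each other (FR).
    return "FR" if fwd_pos <= rev_pos else "RF"
```

For example, `pair_orientation(99, 100, 300)` describes a forward read at position 100 whose reverse mate starts at 300, which this sketch labels as FR.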


It may have happened during sequencing or during library prep. I don't have much prior information, but while mapping I can see that some reads are paired-end, some are mate-pair and some are single-end. Now I want to separate all three types into three files from the single FASTQ file. Is that possible? Please also let me know how I can remove the single-end reads from the mixture.


What you're actually observing is that some of the reads align better in an orientation other than the one you expect and, for still others, one of the reads simply won't align, so the aligner just goes ahead and aligns its mate as a singleton. This doesn't mean that you have a mix of reads in the FASTQ file.

8.6 years ago
cts ★ 1.7k

Unfortunately, with Illumina mate-pair libraries there is always a mix of paired-end and mate-pair reads. You can perform some preprocessing steps to try to segregate them based on the presence of the adaptor sequence. I've used NextClip for this, but it is specific to the Nextera mate-pair protocol. After that, to remove the remaining contaminants you will need to align against a reference genome and then look at the observed insert sizes. You can do this with the TLEN column of a SAM file (the 9th column): if both reads of a pair are mapped, its absolute value should equal the observed insert size (the sign only indicates which mate is leftmost); if only one read is mapped, the column will be 0.
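Before picking a cutoff it can help to see how the observed insert sizes are distributed. The following is a rough Python sketch, not a tested pipeline: `insert_size_histogram` and its `bin_width` parameter are made-up names, and it assumes a plain-text SAM input in which TLEN is column 9. It counts each pair once by looking only at first-in-pair records (flag bit 0x40) and skips records whose TLEN is 0.

```python
from collections import Counter


def insert_size_histogram(sam_lines, bin_width=500):
    """Bin |TLEN| (SAM column 9) for first-in-pair records.

    sam_lines is any iterable of SAM text lines, e.g. an open file.
    Returns a dict mapping the lower edge of each bin to a count.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete alignment record
        flag, tlen = int(fields[1]), int(fields[8])
        # Count each pair once (first in pair) and skip TLEN == 0,
        # which carries no insert-size information.
        if not flag & 0x40 or tlen == 0:
            continue
        counts[(abs(tlen) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))
```

Calling it on an open file handle for `file.sam` and printing the result should show whether there are two separable populations, a short-insert paired-end one and a long-insert mate-pair one, which is what a cutoff needs.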

Below is a sample awk script that separates a SAM file into two based on the insert size of paired reads. It's not quite complete; you may need to add some extra logic to get it to work properly. I also haven't tested it, so I apologise in advance if there is a bug in it.

awk -F'\t' '
function abs(x) {
    return ((x < 0) ? -x : x)
}
# Header lines are copied to both output files.
/^@/ {
    print $0 >> "pe.sam"
    print $0 >> "mp.sam"
    next
}
{
    # Change 500 to the cutoff between paired-end and mate-pair insert sizes.
    # Note: records with TLEN 0 (e.g. singletons) also end up in pe.sam.
    if (abs($9) < 500) {
        print $0 >> "pe.sam"
    } else {
        print $0 >> "mp.sam"
    }
}' file.sam
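Since the question asks for three files rather than two, the same idea can be extended with a third bucket for singletons. This is only a sketch in Python under a few assumptions: the names `classify_sam_line` and `split_sam`, the output file naming, and the default cutoff of 500 are all placeholders, and any record with TLEN 0 is treated as "single", which also catches pairs whose mates map to different reference sequences.

```python
def classify_sam_line(line, cutoff=500):
    """Return 'header', 'single', 'pe' or 'mp' for one SAM text line."""
    if line.startswith("@"):
        return "header"
    fields = line.rstrip("\n").split("\t")
    flag, tlen = int(fields[1]), int(fields[8])
    # Unpaired, read unmapped, mate unmapped, or no template length.
    if not flag & 0x1 or flag & 0x4 or flag & 0x8 or tlen == 0:
        return "single"
    # Same rule as the awk script: short inserts are paired-end.
    return "pe" if abs(tlen) < cutoff else "mp"


def split_sam(in_path, out_prefix, cutoff=500):
    """Write <out_prefix>.single.sam, .pe.sam and .mp.sam from in_path."""
    names = ("single", "pe", "mp")
    outs = {n: open(f"{out_prefix}.{n}.sam", "w") for n in names}
    try:
        with open(in_path) as fh:
            for line in fh:
                kind = classify_sam_line(line, cutoff)
                # Header lines are copied into every output file.
                targets = names if kind == "header" else (kind,)
                for n in targets:
                    outs[n].write(line)
    finally:
        for handle in outs.values():
            handle.close()
```

Something like `split_sam("file.sam", "split", cutoff=500)` would then produce the three files. The cutoff still has to be chosen from the insert sizes actually observed in the library.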


Illumina's new Nextera mate-pair protocol is actually much better and has a biotin-labeled stuffer adaptor, so you can fish for the reads containing the adaptor (and thus have a much higher-confidence set of mate pairs). Not to mention that the protocol does a very good job of removing most of the paired-end reads (even with inserts of up to 15-20 kb). We've been pretty happy with it.