Searching for the common sequences in multiple (>2) fastq files
0
1
Entering edit mode
4.1 years ago
BPors ▴ 40

Hi,

I am trying to see that if my fastq files ( 15 of them) coming from different sources are sharing the same or very similar reads (allowing 1-2 mismatches), and then I would like to grab each whole sequence that they were sharing ( Even though they are sharing 50 bp, I want to grab the whole read, which is 150 bp).

I checked Biopieces and Bowtie2, but I could not find multiple comparisons.

I would appreciate if you can share your thoughts about it.

Thank you!

fastq align sequence common • 1.6k views
0
Entering edit mode

Hi,

To do that I would align the reads from all the fastq you have on your reference genome.

Then I would detect regions where a minimum coverage is present on all of 15 bams, then extract the reads from those regions.

If they align all to the same regions, they should be similar.

0
Entering edit mode

Thank you for your reply! I would like to try this approach, however can you explain why should I look for minimum coverage regions? Because they(the common sites) would be more rare than the others? I couldnt understand, sorry

0
Entering edit mode

The regions where you will have a minimum coverage on all bams will have matching reads from all the fastq.

For example if you take a threshold of 5X, for a given position on the genome, at least 5 reads on each fastq is mapped on this position.

As all those reads will be matching the same regions, you can deduct that those reads are similar between each others.

0
Entering edit mode

True..Many thanks for the explanation. Would you suggest the usage of samtools ?

0
Entering edit mode

To get the coverage I personally use bedtools :

bedtools genomecov -d -ibam your_bam_1.bam -g your_genome.fa > coverage_1.txt &

0
Entering edit mode

I doubt it'd be fast, but I don't know how big your fastqs are, but you could parse all the reads in through BioPython. You can then just use python's string operons such as:

if mystring in fastq.seq:
print(fastq.seq)


Not sure off the top of my head how to deal with allowing a couple of mismatches, but I'm sure some sort of module would allow it (maybe pairwise2), but that would be a LOT of aligning which would probably make it even slower.

I definitely think you should consider assembling the reads first though and then just find common contigs or something, since you may get a huge, huge dataset from comparing mutiple fastqs.

0
Entering edit mode

I am trying to see that if my fastq files ( 15 of them) coming from different sources are sharing the same or very similar reads (allowing 1-2 mismatches),

and

and then I would like to grab each whole sequence that they were sharing ( Even though they are sharing 50 bp, I want to grab the whole read, which is 150 bp).

Those seem to contradict each other, as if there is only a 50bp overlap, the reads are not shared. Do you actually want reads that share kmers? Or reads that come from the same organism? Perhaps you could explain exactly what you are trying to do, and why; otherwise "similar reads" is a pretty vague requirement.

0
Entering edit mode

Sure thing. What I meant is, my reads are 150 bp each in all fastq files, and I want to get the reads which are shared by other fastq files. The ultimate goal would be to grab the ones which are matching in more than 140 bps. However, I can put this threshold to 50 bp. So, if the read1 from fileA.fastq is sharing 50 bp in read12 of fileB.fastq and read56 and read78 of fileC.fastq, I want to retrieve the read1.

0
Entering edit mode

You could do this approximately with BBDuk:

bbduk.sh in=fileA.fq ref=fileB.fq,fileC.fq outm=matched.fq k=50 mm=f


That will give you all of the reads in fileA that contain at least 50bp contiguous regions totally covered by 31-mers in fileV or fileC. It's not exactly the same as sharing 50-mers, but very similar.

0
Entering edit mode

Just to throw some fresh perspective:

I would produce a consensus FASTA separately for both and then align one to the other to check the alignment percent. I believe that SAMtools allows you to produce a consensus FASTA via pileup or mpileup.

You could also go back to the first bioinformatics tools like BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) and compare them there. BLAST allows multiple comparisons. Just go to Nucleotide BLAST and check the 'Align two or more sequences' checkbox - it allows you to upload FASTA.

Kevin