Question

How can I find the position of a sequence in a fastq file

0

Entering edit mode

3.5 years ago

pomodoro_sinensis ▴ 110

Hello, I have a fastq file with 4 lines per sequence. This is an example of one of the entries:

@m54284U_200831_121835/3/ccs
ATGGGTGGGCTGGCAGTAGCCAGGGACGATGGGCTCTTCTCTGGGGATCCCAACTGGTTTCCTAAGAAATCCAAGGAGAACCCTCGGAACTTCTCGGACAACCAGTTGCAAGAGGGCAAGAACGTGATTGGGTTGCAGATGGGCACCAACCGTGGAGCATCTCAGGCCGGCATGACCGGCTATGGGATGCCACGGCAGATCCTCTGATCATACTCTCTCTCCTTCCCCTGCCCTCCATGAATGGTTAATATATATGTATATATATGTTTTAGCAGACATTCCCTGAGAGCCCCTGGATTGCTGAACCCCCCTCTGCCAGGGTCCAGGCCAGCCTATCTTGTCACCACTGGCAGGGCCTGATAATTGCCTCTCTCTCTCTCTCTCTTTCTCTCTCTCTCTCTCTCTGGGCTTACTAATGCATTCCTCCCCCCACATTATTCCCACAGTCTCAAGCACGTGGATTCTGCTGTAGTCGTACGCCGATGCGAAACATCGGCCACGTCGCTATTGCAGCGAGTAGATCGGAAGAGCACACGTCTGAACTCCA
+
5555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555

I have the following sequence: CAGACGTGTGCTCTTCCGATCT. I would like to know:

1) In which entries this sequence is present (both the number of entries and which entries).

2) Where in the sequence does this string match (which nucleotide position within the entry).

For example, I would like something like: The sequence is present in all entries. In entry 1, the sequence starts at nucleotide 10. etc.

I have tried with grep and awk but I certainly lack the knowledge to do this. Any help would be so appreciated!!!

fastq sequence position find • 1.2k views

ADD COMMENT • link updated 3.5 years ago by GenoMax 141k • written 3.5 years ago by pomodoro_sinensis ▴ 110

score 0 · Answer 1 · 2020-11-05

0

Entering edit mode

3.5 years ago

GenoMax 141k

You can use bbduk.sh in filter mode for question 1.

bbduk.sh -Xmx2g in=file.fq literal=CAGACGTGTGCTCTTCCGATCT k=10 outm=file.fq

Capture STDERR (where logs are written), they will show you how many sequences had that sequence. A guide for bbduk is available.

ADD COMMENT • link 3.5 years ago by GenoMax 141k

0

Entering edit mode

Thank you! I used your command but I am getting the following error:

!/content/bbmap/bbduk.sh -Xmx2g in=small.fastq literal=CAGACGTGTGCTCTTCCGATCT k=10 outm=small.fastq

java -ea -Xmx2g -Xms2g -cp /content/bbmap/current/ jgi.BBDuk -Xmx2g in=small.fastq literal=CAGACGTGTGCTCTTCCGATCT k=10 outm=small.fastq
Error: Unable to initialize main class jgi.BBDuk
Caused by: java.lang.NoClassDefFoundError: stream/ConcurrentReadStreamInterface

ADD REPLY • link 3.5 years ago by pomodoro_sinensis ▴ 110

0

Entering edit mode

Do you have java (1.8 and above) installed? You just unzipped the bbmap software bundle and did not move any files/folders inside it anywhere. You can use conda to install bbmap otherwise.

ADD REPLY • link 3.5 years ago by GenoMax 141k

0

Entering edit mode

That worked thank you!! I am getting the following result in a reduced version with only 250 reads:

0.028 seconds.
Initial:
Memory: max=2058m, total=2058m, free=2014m, used=44m

Added 13 kmers; time:   0.005 seconds.
Memory: max=2058m, total=2058m, free=2003m, used=55m

Input is being processed as unpaired
Started output streams: 0.015 seconds.
Processing time:        0.148 seconds.

Input:                      250 reads       250791 bases.
Contaminants:               134 reads (53.60%)  127677 bases (50.91%)
Total Removed:              134 reads (53.60%)  127677 bases (50.91%)
Result:                     116 reads (46.40%)  123114 bases (49.09%)

Time:                           0.170 seconds.
Reads Processed:         250    1.47k reads/sec
Bases Processed:        250k    1.47m bases/sec

I am assuming the "contaminants" are my sequence of interest?

ADD REPLY • link 3.5 years ago by pomodoro_sinensis ▴ 110

0

Entering edit mode

You can easily check that. Add rcomp=f if you only want to search in the reads in forward direction.

ADD REPLY • link 3.5 years ago by GenoMax 141k

0

Entering edit mode

This is great thanks so much!! Do you know if this tool allows for checking where in the read this sequence was found? Or do you know of any other tool? I played around with grep but couldn't make it work...

ADD REPLY • link 3.5 years ago by pomodoro_sinensis ▴ 110

0

Entering edit mode

Not easily I would think. You will need to parse alignments (CIGAR strings) or simply convert the sequence you find to fasta and do a multiple-sequence alignment.

ADD REPLY • link 3.5 years ago by GenoMax 141k