Question

Split a sequence in a fastq file

5.7 years ago

ste.lu ▴ 80

Hi All,

Could you suggest a way to split a read in a fastq file (on a particular motif) and keep the 2 resulting sequences as 2 independent reads?

I'll give an example of what I want to do:

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG

TGTGACCTTCAGGACAGTCCTAAGGCTGTGGGAAAAACACTNAAAACATGAGTTCAAAAATATATATATATTTTCCCAACTATGCAAAAATATAAGGATGCAATATGGATTGTATAATGAGCTTCACAGATATAAAGGAACAGNGGCAT

+

AAAAJJ77<7JJJ7FAJJJJJJJFFFJF< FFF7AFJJJJFA#JFJJFJJJJ< AA-F-< JJFJAJFAAJ< JJJJJ--<<< -FFFF7AJJJJFFJJAFFFFA<<-7< FFJA< JJJJAJF< AAFF7-F< AF-A7A-< -< J-FFJ< F#AJAA<

Then search for a motif, e.g. TATATATATA, cut the read at that motif, and keep the two resulting pieces as two reads:

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG

TGTGACCTTCAGGACAGTCCTAAGGCTGTGGGAAAAACACTNAAAACATGAGTTCAAAAATATATATAT

+

AAAAJJ77<7JJJ7FAJJJJJJJFFFJF< FFF7AFJJJJFA#JFJJFJJJJ< AA-F-< JJFJAJFAAJ< JJJJJ

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG

TTTTCCCAACTATGCAAAAATATAAGGATGCAATATGGATTGTATAATGAGCTTCACAGATATAAAGGAACAGNGGCAT

+

--<<< -FFFF7AJJJJFFJJAFFFFA<<-7< FFJA< JJJJAJF< AAFF7-F< AF-A7A-< -< J-FFJ< F#AJAA<

Thank you

fastq sequencing sequence next-gen • 2.4k views

ADD COMMENT • link updated 5.7 years ago by Pierre Lindenbaum 163k • written 5.7 years ago by ste.lu ▴ 80

I'd suggest writing a biopython script for something like that. Do you have any programming experience?

ADD REPLY • link 5.7 years ago by WouterDeCoster 47k

Thanks for your answer. I've coded a bit, but my background is different. What would you suggest? A link to put me on the right track would be more than enough.

ADD REPLY • link 5.7 years ago by ste.lu ▴ 80

I'd recommend going through some sections of the Biopython cookbook and tutorial. That would put you on track on how to solve this and further questions about handling common file formats.

While one-liners like Pierre's are pretty (and efficient), it would probably take me less time to write it in Python, especially if I have scripts saved from earlier, similar applications that I only have to adapt a bit.
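As a rough idea of what such a script could look like, here is a minimal sketch that uses only the standard library rather than Biopython. It assumes plain four-line FASTQ records with no blank lines, splits only at the first occurrence of the motif, and reuses the original header for both pieces as in the question. The function names and the `keep_motif` switch are illustrative choices, and attaching the motif to the end of the first piece is an assumption about what is wanted.

```python
import io
import sys


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (a blank line also stops reading)
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def split_on_motif(header, seq, qual, motif, keep_motif=True):
    """Split one read at the first occurrence of motif.

    Reads without the motif are returned unchanged. With keep_motif=True the
    motif stays at the end of the first piece (an assumption); otherwise it is
    dropped. Empty pieces are discarded.
    """
    pos = seq.find(motif)
    if pos == -1:
        return [(header, seq, qual)]
    right_start = pos + len(motif)
    left_end = right_start if keep_motif else pos
    pieces = [
        (seq[:left_end], qual[:left_end]),
        (seq[right_start:], qual[right_start:]),
    ]
    return [(header, s, q) for s, q in pieces if s]


def write_fastq(records, handle):
    """Write (header, sequence, quality) tuples as four-line FASTQ records."""
    for header, seq, qual in records:
        handle.write(f"{header}\n{seq}\n+\n{qual}\n")


if __name__ == "__main__":
    # Toy record: ACGG + motif + GGCC, with a made-up quality string.
    demo = io.StringIO("@r1\nACGGTATATATATAGGCC\n+\nABCDEFGHIJKLMNOPQR\n")
    for record in read_fastq(demo):
        write_fastq(split_on_motif(*record, motif="TATATATATA"), sys.stdout)
```

For a real file, `demo` would be replaced by `open("input.fastq")`. Some downstream tools may object to two reads sharing one name, in which case a suffix could be appended to each header.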

ADD REPLY • link 5.7 years ago by WouterDeCoster 47k

score 3 · Answer 1 · 2018-11-17

5.7 years ago

Pierre Lindenbaum 163k

Linearize the records, use awk to detect the position of the pattern, print the two sequences, then convert back to FASTQ:

cat input.fastq |\
paste - - - - |\
awk -F '\t' 'BEGIN{S="TATATATATA";N=length(S);}{i=index($2,S);if(i==0) {print} else {printf("%s\t%s\t+\t%s\n%s\t%s\t+\t%s\n",$1,substr($2,1,i),substr($4,1,i),$1,substr($2,i+N),substr($4,i+N));}}' |\
tr "\t" "\n"

ADD COMMENT • link 5.7 years ago by Pierre Lindenbaum 163k

Hi Pierre,

Thanks for your script! This way I keep all the reads, the original one and the two derived ones, don't I?

ADD REPLY • link 5.7 years ago by ste.lu ▴ 80

No, you will only get the two substrings as output. But that's what you asked for, no?

ADD REPLY • link 5.7 years ago by lieven.sterck 15k

Yeah, definitely. Thanks!

ADD REPLY • link 5.7 years ago by ste.lu ▴ 80

Lovely one-liner, Pierre Lindenbaum!

Some remarks though: I think the motif is missing from your output (at least that's what I understood from the OP's example, where the motif is still included), and there might be an off-by-one mistake in it as well?
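To make the suspected off-by-one concrete: awk's index() is 1-based and substr(s, m, n) returns n characters starting at position m, so substr($2,1,i) also takes the first base of the motif. The small Python emulation below is only a sketch. `awk_index` and `awk_substr` are hypothetical helpers that mimic the awk semantics for positions of 1 or more, and the read is a toy example.

```python
def awk_index(s, t):
    """Mimic awk's index(): 1-based position of t in s, or 0 if absent."""
    return s.find(t) + 1


def awk_substr(s, m, n=None):
    """Mimic awk's substr() for m >= 1: n characters from 1-based position m."""
    start = m - 1
    return s[start:] if n is None else s[start:start + n]


seq = "ACGGTATATATATAGGCC"  # toy read: ACGG + motif + GGCC
motif = "TATATATATA"
i = awk_index(seq, motif)   # 1-based start of the motif
n = len(motif)

as_posted = awk_substr(seq, 1, i)           # first read as in the one-liner
without_motif = awk_substr(seq, 1, i - 1)   # first read, motif dropped
with_motif = awk_substr(seq, 1, i + n - 1)  # first read, motif kept
second_read = awk_substr(seq, i + n)        # second read (same in all cases)

print(as_posted, without_motif, with_motif, second_read)
```

In awk terms, that would mean using i-1 as the length to drop the motif cleanly, or i+N-1 to keep it on the first read, for both the sequence and the quality fields.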

ADD REPLY • link 5.7 years ago by lieven.sterck 15k

"an off-by-one mistake in it as well?"

Maybe :-D

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 163k