I have reads in R1 and R2 files, plus an I1 file containing the 12 nt indexes (also FASTQ). Before dividing the reads into separate files, I merged the paired ends.
I used the join_paired_ends.py script with the following command:
join_paired_ends.py -f <R1> -r <R2> -o demultiplexed/ -m fastq-join -b <I1> -p 15
About 94% of the reads were joined. Next, I ran the split_libraries_fastq.py script. When I looked at the histogram, the joined read lengths varied widely:
Length    Count
249.0     14545440
. . .
489.0     1467
The amplification targeted the V4 region, whose length is around 291 nt, and the sequencing was 2x250 bp. So my question is: how come I have 1467 reads that are almost 500 nt long? Is this contamination?
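To make the length arithmetic concrete, here is a minimal sketch. It assumes both reads are the full 250 nt and untrimmed, so that joined length = 2 x read length - overlap; the function name is only illustrative and is not part of QIIME.

```python
READ_LEN = 250  # assumption: full-length, untrimmed 2x250 reads


def implied_overlap(joined_len, read_len=READ_LEN):
    """Return the overlap (nt) implied by a joined read of joined_len nt."""
    return 2 * read_len - joined_len


# Expected V4 amplicon (~291 nt) versus the longest joined reads (489 nt)
for length in (291, 489):
    print(length, implied_overlap(length))
# prints:
# 291 209
# 489 11
```

Under that assumption, a 489 nt joined read corresponds to an overlap of only 11 nt, so these reads may be pairs joined on a very short overlap rather than true V4 amplicons.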
Should I discard all reads longer than 300 bp before further analysis? What do you think?
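If a length cutoff turns out to be the way to go, this is roughly what I have in mind: a minimal sketch in plain Python, assuming standard 4-line FASTQ records. The function name and the 300 nt threshold are my own choices for illustration, not a QIIME option.

```python
import io
from itertools import islice


def filter_fastq_by_length(handle, max_len=300):
    """Yield 4-line FASTQ records whose sequence is at most max_len nt."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:  # end of file (or a truncated last record)
            break
        if len(record[1].strip()) <= max_len:
            yield record


# Tiny in-memory example: one 5 nt read and one 400 nt read
demo = io.StringIO(
    "@short\nACGTA\n+\nIIIII\n"
    "@long\n" + "A" * 400 + "\n+\n" + "I" * 400 + "\n"
)
kept = list(filter_fastq_by_length(demo, max_len=300))
print([rec[0].strip() for rec in kept])  # prints ['@short']
```

For real data the same function would take an open file handle on the joined FASTQ instead of the in-memory demo.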
Thanks in advance! Best, Agata