Question: My FASTQ file has two reads with the same name
23 months ago, jiangzhiyong12 wrote:

Hi, I have paired-end (PE) resequencing data, but in some of the FASTQ files the same read name appears twice. I want to delete one of the two copies, so I would appreciate any suggestions. Thank you.


That is not possible unless the files were messed with in some way/mistreated. How did you determine (which program/error message) that you have this condition?

modified 23 months ago • written 23 months ago by genomax

When I run MarkDuplicates in the GATK workflow, I get this error:

Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: HWI-ST1307:159:C48TVACXX:7:1109:1787:63474
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at htsjdk.samtools.SamFileValidator$CoordinateSortedPairEndInfoMap.remove(SamFileValidator.java:765)
at htsjdk.samtools.SamFileValidator.validateMateFields(SamFileValidator.java:499)
at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:297)
at htsjdk.samtools.SamFileValidator.validateSamFile(SamFileValidator.java:215)
at htsjdk.samtools.SamFileValidator.validateSamFileSummary(SamFileValidator.java:143)
at picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:196)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

So I searched for the read name HWI-ST1307:159:C48TVACXX:7:1109:1787:63474 in my FASTQ file and found two records with that name. I also searched my .sam file and found four records with that name. Weird. I think it is the sequencing company's fault; maybe they copied some of the data within the same file and put it together.

modified 23 months ago by genomax • written 23 months ago by jiangzhiyong12

Can you use grep -A and tell us if the content of the two reads with identical names is the same in terms of sequence and quality scores?

modified 23 months ago • written 23 months ago by Dan D
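
The comparison asked for above could be sketched as follows. This is only an illustration, not something posted in the thread: `count_copies` is a hypothetical helper name, and the sketch assumes plain 4-line FASTQ records (no wrapped sequence lines) and that the read name is the first whitespace-delimited token of the header, optionally followed by `/1` or `/2`.

```shell
# Hypothetical helper: count_copies <read_name> <fastq>
# Reports how many records carry the given read name and how many
# distinct (sequence, quality) pairs those records contain.
# Assumes plain 4-line FASTQ records (no wrapped sequence lines).
count_copies() {
    awk -v name="@$1" '
        # Header line: compare the first token (minus any /1 or /2 suffix).
        NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); hit = (id == name) }
        # Sequence line of a matching record.
        NR % 4 == 2 && hit { seq = $0 }
        # Quality line of a matching record: count it and remember the pair.
        NR % 4 == 0 && hit { copies++; seen[seq "\t" $0] = 1 }
        END {
            n = 0
            for (k in seen) n++
            printf "%d copies, %d distinct\n", copies, n
        }
    ' "$2"
}
```

Called as `count_copies HWI-ST1307:159:C48TVACXX:7:1109:1787:63474 reads_1.fastq` (where `reads_1.fastq` is a placeholder file name), a result with several copies but only one distinct pair would mean the records are exact duplicates. For a quick look by eye, `grep -A 3 -F '@HWI-ST1307:159:C48TVACXX:7:1109:1787:63474' reads_1.fastq` prints each matching header plus the three lines after it, although it can also match longer names that merely start with the same text.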

I searched my SAM file:

$ grep HWI-ST1307:159:C48TVACXX:7:1116:16586:11978 11.sam

Result:

HWI-ST1307:159:C48TVACXX:7:1116:16586:11978 65  gi|539359185|ref|NW_005087554.1|    565080  1   52M gi|539359184|ref|NW_005087555.1|    40387192    0   AGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGA    BCCFFFFFHHHDDIJGBGGHDGGGGHGIGGGFDEGG;DHA?FEB=F=@@FGE    AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:52 YT:Z:UP
HWI-ST1307:159:C48TVACXX:7:1116:16586:11978 129 gi|539359184|ref|NW_005087555.1|    40387192    0   10M4I12M4I31M4D23M  gi|539359185|ref|NW_005087554.1|    565080  0   CCTTCCGCCACTTCCTTCCTTCCGCCACTTACTTCCTTCCTCCACTTCCTTCCTTCCGCCACTTCCTTCCGCCACTTCCTTCCG    @@@FFFFFHGHHHJJIJJJJJJIIIJAHHI>DHIJJIIJGIGGHEHGIGGBFAGGHIBHIA:CHECDE?;9>ABBACD;@CC?B    AS:i:-51    XS:i:-51    XN:i:0  XM:i:0  XO:i:3  XG:i:12 NM:i:12 MD:Z:53^CTTC23  YT:Z:UP
HWI-ST1307:159:C48TVACXX:7:1116:16586:11978 65  gi|539359185|ref|NW_005087554.1|    565080  1   52M gi|539359184|ref|NW_005087555.1|    40387192    0   AGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGA    BCCFFFFFHHHDDIJGBGGHDGGGGHGIGGGFDEGG;DHA?FEB=F=@@FGE    AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:52 YT:Z:UP
HWI-ST1307:159:C48TVACXX:7:1116:16586:11978 129 gi|539359184|ref|NW_005087555.1|    40387192    0   10M4I12M4I31M4D23M  gi|539359185|ref|NW_005087554.1|    565080  0   CCTTCCGCCACTTCCTTCCTTCCGCCACTTACTTCCTTCCTCCACTTCCTTCCTTCCGCCACTTCCTTCCGCCACTTCCTTCCG    @@@FFFFFHGHHHJJIJJJJJJIIIJAHHI>DHIJJIIJGIGGHEHGIGGBFAGGHIBHIA:CHECDE?;9>ABBACD;@CC?B    AS:i:-51    XS:i:-51    XN:i:0  XM:i:0  XO:i:3  XG:i:12 NM:i:12 MD:Z:53^CTTC23  YT:Z:UP

I really don't know the reason. Any help would be appreciated.

modified 23 months ago by genomax • written 23 months ago by jiangzhiyong12

That is odd. If you have not done anything to your SAM file then it is likely that your original fastq file has that read in there two times. Can you check that next?

written 23 months ago by genomax

It's true, I do have two records with the same read name in my original FASTQ file. I extracted the read names from the FASTQ file and reduced them to the unique names. Even weirder: 43063238 (total read names) - 24218735 (unique read names) = 18844503 (duplicated read names). I don't understand.

written 23 months ago by jiangzhiyong12
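
The count described above (total read names minus unique read names) could be reproduced with something like the sketch below. It is not the command the poster used, which the thread does not show; `name_stats` is a hypothetical helper name, and the sketch assumes plain 4-line FASTQ records with the read name as the first whitespace-delimited token of each header.

```shell
# Hypothetical helper: name_stats <fastq>
# Prints the number of records, the number of distinct read names,
# and the difference (records whose name was already seen).
# Assumes plain 4-line FASTQ records (no wrapped sequence lines).
name_stats() {
    awk '
        NR % 4 == 1 {
            total++
            if (!($1 in seen)) { seen[$1] = 1; unique++ }
        }
        END {
            printf "total=%d unique=%d extra=%d\n", total, unique, total - unique
        }
    ' "$1"
}
```

This holds every distinct name in memory, which may be heavy for tens of millions of reads. A lower-memory alternative that lists the repeated names is `awk 'NR % 4 == 1 { print $1 }' reads_1.fastq | sort | uniq -d`, where `reads_1.fastq` is a placeholder file name.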

So the problem is much bigger than you expected. If the sequence is identical for the duplicate reads then you will have to deduplicate them or get a new copy of the original data.

written 23 months ago by genomax
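
One way the deduplication suggested above might look is sketched below: keep the first record for each read name and drop later ones. This is an assumption about how to do it, not a method given in the thread. `dedup_by_name` is a hypothetical helper name, the sketch assumes plain 4-line FASTQ records, and it does not check that the dropped copies are identical to the kept one, so that should be verified first.

```shell
# Hypothetical helper: dedup_by_name <in.fastq> > <out.fastq>
# Keeps the first record seen for each read name and skips later
# records with the same name.
# Assumes plain 4-line FASTQ records (no wrapped sequence lines).
dedup_by_name() {
    awk '
        # On each header line, decide whether this record is new.
        NR % 4 == 1 { keep = !($1 in seen); seen[$1] = 1 }
        # Print all four lines of a record that is being kept.
        keep
    ' "$1"
}
```

For paired-end data, both mate files would need the same treatment, and the outputs should still have the same number of reads in the same order afterwards. If the duplication differs between the two files, getting a fresh copy of the original data is likely the safer route.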

Yes, you are right. I deduplicated them, keeping only unique reads, with the following command:

$ seqtk subseq /disk5/jiangzy/bowtie2/trimmomatic/1_1_clean.fastq remaining_1.list > 1_1.remain.fastq

This produced a 2.71G FASTQ file, compared to 10.22G for the original data. Then I ran FastQC on the result, and here is the weirdest thing:

https://ibb.co/fDd3e6
https://ibb.co/jvrOe6
https://ibb.co/nvdQsR
https://ibb.co/bWGdCR
https://ibb.co/hCKbz6
https://ibb.co/jYWpK6

Still, thank you.

written 23 months ago by jiangzhiyong12