for paired-end unmapped reads after star mapping
19 months ago
tvibhaps • 0

I tried to assemble the unmapped paired-end reads from a STAR alignment (mate1/R1 and mate2/R2, written with --outReadsUnmapped Fastx) using Trinity.

With --left unmapped_R1.fastq and --right unmapped_R2.fastq, I get the error: Error, found read_type 1 but expecting read_type 2. From a discussion thread, it looks like this error means R1 and R2 are switched for left and right in Trinity. But I am certain that I used mate1 as R1 and mate2 as R2.

However, if I switch them, i.e. --left unmapped_R2.fastq and --right unmapped_R1.fastq in the Trinity assembly command, then it works perfectly. What is the reason behind this? Thank you

STAR-mapping Trinity assembly unmapped-reads • 2.2k views

Can you show us the output of head -n 1 unmapped_R1.fastq and head -n 1 unmapped_R2.fastq?

It is possible that the file names were switched somewhere. But if that is the case then you will need to redo your STAR alignment since it would be incorrect.
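
The same check can be scripted. Below is a minimal Python sketch that pulls the read designator (the text before the first colon of the header comment) from the first record of a FASTQ stream. The function names are hypothetical, and the sketch assumes the Illumina-style layout of read name, whitespace, then a comment such as 1:N:0:...

```python
def first_header(lines):
    """Return the first non-empty line of a FASTQ stream, stripped."""
    for line in lines:
        line = line.strip()
        if line:
            return line
    return ""


def read_designator(header):
    """Return the text before the first ':' of the header comment, or ''."""
    if not header.startswith("@"):
        # A first line without '@' suggests a missing or damaged header.
        raise ValueError(f"not a FASTQ header line: {header!r}")
    parts = header.split(maxsplit=1)
    if len(parts) < 2:
        return ""  # no comment field, e.g. older '/1' style naming
    return parts[1].split(":", 1)[0]


# Usage (file name taken from the question):
# with open("unmapped_R1.fastq") as fh:
#     print(read_designator(first_header(fh)))
```

A designator of 1 in the first file and 2 in the second is what the Illumina convention leads one to expect; anything else is worth a closer look before assembly.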


Sure! The first 10 lines of both unmapped mate files are:

NC-RL-7Unmapped_R1.fastq

GGGGGGTGTGGTGTATGGCCAATGTGCCGCGGCTGAGGGCTGTTATTAGGCACGATGCAATCCGGCCGTATATCACCAGCCCCCAAGGTGCCTTACTGCAGTTGTAGGCTGTTTACCAATGTGATTAGACCAGTGAAAGTAAGTGTTTTG
+
FFFFFFF,FF,:FF::F,FF,F:F,FFFFFFFFFFFF:FFFFFF,,FFFFF:F:F:FFFFFFFF:FFF::,FFFFFF,FFFFFFF:FF::FF,FFFFFFFF:F,FF:FFFFFFFFFFFFFF:FFF:FFF,FFF:,FFFFFFFF:,FFF,F
@A00774R:130:GW2104111111th:4:1101:4110:1188 0:N:  00
CTGGCCCCGTTTACGCGGGGTCTCCGGCGGGTTGCCTCGGCCGGCGCCTAGCAGCTGACTTAGAACTGGTGCGGACCAGGGGAATCCGACTGTTTAATTAAAACAAAGCATCGCGAAGGCCCAAGGTGGGTGATGACGCGATGTGATTTC
+
,F,FFFFFFFFFFFFFFFFFFFFF::FFFFFFFFF,FFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFF,FFFFFFFFFFFF
@A00774R:130:GW2104111111th:4:1101:18285:1188 0:N:  00
CTGATGTGTGTGTAAGTATACGTGTGTTATGTTCTTGAGAAGTTCCTCTTCAGCTCCTTCTCTGATGTTACAGAAGAGCAGAGGTCTCTCAAATCTTCTGGATACCTTTCCCAGAGTTTTCAGAGTCCGTGTCGGAAGTGCTGGCCGGAA

NC-RL-7Unmapped_R2.fastq

@A00774R:130:GW2104111111th:4:1101:26205:1000 1:N:  00
GCCATCACAAACAGGTTACCAACTCAACCAGAGCAGCAAAAATACACGTTCCGTCATACCCAGGGTATACGGCATGATATACCACAGCTTTCAGCCAATCAGCATTCAGGGCTCGAAACACCCAGTTGATTATAGACCGTATACCACAAG
+
F,:F,FF:FFF:,,FF:FF,FFF,FFF:FF,F,FFF:FFFFFFFF,FF,FFF:FFFFFFFFFF:F,,FFFF,F,F::,FFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFF,FFFFFF,FFF:FFF,F,,F,FFFF:FFFFFFFFFFFFFF
@A00774R:130:GW2104111111th:4:1101:4110:1188 1:N:  00
GCCGTTTACCCGCGCTTCATTGAATTTCTTCACTTTGACATTCAGAGCACTGGGCAGAAATCACATCGCGTCATCACCCACCTTGGGCCTTCGCGATGCTTTGTTTTAATTAAACAGTCGGATTCCCCTGGTCCGCACCAGTTCTAAGTC
+
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00774R:130:GW2104111111th:4:1101:18285:1188 1:N:  00
ATTTGGAAGAAGAAAAACAAGAAAGGCTTTGTTCCGGCCAGCACTTCCGACACGGACTCTGAAAACTCTGGGAAAGGTATCCAGAAGATTTGAGAGACCTCTGCTCTTCTGTAACATCAGAGAAGGAGCTGAAGAGGAACTTCTCAAGAA

Thank you


I see a couple of problems. If the output is complete, then you seem to be missing a header line at the beginning of the R1 file (unless it was a copy/paste error).

Also, your read headers have 1:N: in the R2 file. Normally the R1 file has that header and the R2 file has 2:N:. Your options are to either rename that part of the header or remove it completely (you will need to check whether Trinity cares about that).

Ref: https://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers


Thanks for pointing it out. Taking the concerns one by one:

In my STAR mapping command I added the option --outReadsUnmapped Fastx, and the run produced Unmapped.out.mate1 and Unmapped.out.mate2. I did not see any error in my Slurm stdout or stderr logs. I renamed mate1 and mate2 as the R1 and R2 FASTQ files. The headers (line 1) of Unmapped.out.mate1 and Unmapped.out.mate2 are shown below; the previous post was probably a copy/paste error.

NC-RL-7Unmapped_R1.fastq
@A00774R:130:GW2104111111th:4:1101:26205:1000 0:N: 00 mate2

NC-RL-7Unmapped_R2.fastq
@A00774R:130:GW2104111111th:4:1101:26205:1000 1:N: 00 mate2

The format of my unmapped reads looks almost the same (for mate 1 and mate 2) as what STAR's author discusses in this thread: https://groups.google.com/g/rna-star/c/tVmkXrYbb2k.

I assumed that mate1 and mate2 correspond to R1.fastq and R2.fastq.

Please point out any error or its cause so I can troubleshoot it.

Regarding Trinity:

Page 11 of this Trinity workshop document, https://biohpc.cornell.edu/lab/doc/trinity_workshop_part1.pdf, shows Trinity taking paired-end files in the format described at the link you shared (R1 headers carry 1 and R2 headers carry 2).

However, if I switch R1 and R2, i.e. --left unmapped_R2.fastq (mate2) and --right unmapped_R1.fastq (mate1) in the Trinity assembly command, the assembler runs without throwing any error and I get an assembly of 30 contigs. But I suspect this may distort or spoil the downstream analysis.

I am not sure where I went wrong. Is there an alternative? Should I run the STAR mapping again with the --outSAMunmapped Within option to get an unmapped.bam, convert it into FASTA or paired-end FASTQ, and then do the de novo assembly? Or something else? Any suggestions are most welcome. Thank you

@A00774R:130:GW2104111111th:4:1101:4110:1188 **0:N:**  00 - mate1
@A00774R:130:GW2104111111th:4:1101:26205:1000 **1:N:**  00 - mate2

This is non-standard nomenclature. Normal Illumina data should have 1:N in the R1 file and 2:N in the R2 file. Programs depend on this header convention, as you discovered with Trinity (it appears to treat the file with 1:N as the first file even though that file contains read 2, and it does not seem to check the second file).

Is this your own data? If so, why is this discrepancy there? If this is data you obtained online, then you will need to fix the FASTQ headers if you want to get logical results from Trinity.

"if i switch R1 and R2 i.e. --left unmapped_R2.fastq (mate2) and --right unmapped_R1.fastq (mate1) in Trinity assembly command, Assembler works fine at least without throwing any error"

Just because something worked does not mean the results are usable or logical, especially if the inputs were not what the program expects.

STAR appears not to check FASTQ headers and has probably produced acceptable results based on the order of the input files. You should check the alignments in a genome viewer to make sure they look OK.


Thanks for the explanation and suggestion.

This is my own data. I always check that read 1 and read 2 are in the correct positions as input 1 and input 2 for trimming and when collecting the output, but I can recheck them. Since STAR writes the unmapped reads (according to the STAR manual) in the same mate 1 and mate 2 format as the headers I showed, I was convinced that I was moving in the right direction.

It is possible that Trinity accepts paired-end FASTQs with 1 in the R1 headers and 2 in the R2 headers, while STAR writes mate 1 with headers containing 0:N and mate 2 with 1:N, so Trinity gets confused when reading them.

About switching the read files, you are right. I also suspect the results would be incorrect, which is why I paused here to explore more. Since I fed R2.fastq as --left in Trinity and R2 has 1:N, Trinity treats it as read 1 and runs. But that is not correct, I understand.

I will check it in an alignment viewer. Thanks again for your time.


Can you check what the headers look like in your original data files? I am hesitant to think that STAR made this change (but anything is possible).

You could also extract unmapped reads from your BAM file by using samtools and not depend on STAR. If your original reads have correct headers then this should avoid that header problem.
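
For reference, selecting unmapped pairs from a BAM file comes down to testing bits of the SAM FLAG field. The Python sketch below shows that bit logic only; it does not read BAM files, and the helper names are hypothetical. The bit values come from the SAM specification, and the combined mask 12 is the kind of value one would pass to a samtools flag filter. It also assumes the unmapped reads are actually present in the BAM, which (as mentioned earlier in the thread) would require running STAR with --outSAMunmapped Within.

```python
# SAM FLAG bits, as defined in the SAM specification.
READ_PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80

# Both the read and its mate are unmapped: 4 + 8 = 12.
BOTH_UNMAPPED = READ_UNMAPPED | MATE_UNMAPPED


def both_mates_unmapped(flag):
    """True if the record and its mate are both flagged as unmapped."""
    return (flag & BOTH_UNMAPPED) == BOTH_UNMAPPED


def mate_number(flag):
    """Return 1 or 2 for first/second in pair, or 0 if neither bit is set."""
    if flag & FIRST_IN_PAIR:
        return 1
    if flag & SECOND_IN_PAIR:
        return 2
    return 0
```

Because the mate number comes from the FLAG field rather than from the FASTQ header comment, reads extracted this way do not depend on the 0:N / 1:N text in the Unmapped.out.mate files.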


Certainly! They seem to look correct. Here are the headers (line 1) of the raw FASTQ files for this sample:

Sample7_Clean_Data1.fq @A00583:290:H3MGWDSXY:4:1101:11288:1000 1:N:0:TCCTACCT+TTGCAGAC

Sample7_Clean_Data2.fq @A00583:290:H3MGWDSXY:4:1101:11288:1000 2:N:0:TCCTACCT+TTGCAGAC

Actually, I saw one of the discussion threads, https://groups.google.com/g/rna-star/c/tVmkXrYbb2k, where STAR's author describes (if I understood it correctly) the unmapped FASTQ format in the same way, i.e. 0:N for R1 and 1:N for R2. So I was convinced. The STAR manual (section 5.4) reads:

00: mates were not mapped
10: 1st mate mapped, 2nd unmapped
01: 1st unmapped, 2nd mapped

So 0:N 00 means R1, with both reads unmapped, and 1:N 00 means R2, with both reads unmapped.
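
To make that reading concrete, here is a small Python sketch that decodes one of these header lines. It is an illustration under assumptions: the function name is hypothetical, the mapping of 0:N to mate 1 and 1:N to mate 2 is my interpretation from the thread above, and the status strings paraphrase the manual text quoted here.

```python
# Two-character mate status codes as quoted from the STAR manual above.
MATE_STATUS = {
    "00": "neither mate mapped",
    "10": "mate 1 mapped, mate 2 unmapped",
    "01": "mate 1 unmapped, mate 2 mapped",
}


def parse_star_unmapped_header(header):
    """Split '@ID 0:N:  00' into (read_id, mate_number, status text)."""
    fields = header.strip().split()
    if len(fields) < 3 or not fields[0].startswith("@"):
        raise ValueError(f"unexpected header layout: {header!r}")
    read_id = fields[0][1:]
    # Assumption: the leading digit is a 0-based mate index (0 -> mate 1).
    mate_number = int(fields[1].split(":", 1)[0]) + 1
    status = MATE_STATUS.get(fields[2], "unknown code")
    return read_id, mate_number, status
```

For the header `@A00774R:130:GW2104111111th:4:1101:4110:1188 1:N:  00` this returns the read ID, mate number 2, and "neither mate mapped".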

Yes, I am also thinking of getting an unmapped.bam. Alternatively, I could replace 0:N 00 with /1 and 1:N 00 with /2 in unmapped_R1.fastq and unmapped_R2.fastq respectively, and then feed them to Trinity as input.
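
A minimal Python sketch of that renaming, assuming plain four-line FASTQ records (no wrapped sequences). The function names are hypothetical, and whether Trinity accepts the resulting /1 and /2 names is something I would still verify on a small subset first.

```python
def rename_header(header, mate):
    """Turn '@ID <comment>' into '@ID/<mate>', dropping the comment."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header line: {header!r}")
    fields = header[1:].split()
    if not fields:
        raise ValueError("header has no read ID")
    return f"@{fields[0]}/{mate}"


def rewrite_records(lines, mate):
    """Yield FASTQ lines with every header (lines 0, 4, 8, ...) renamed.

    Headers are found by position, not by a leading '@', because a
    quality line can also start with '@'.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield rename_header(line.rstrip("\n"), mate) + "\n"
        else:
            yield line


# Usage (file names from the thread; the output name is hypothetical):
# with open("unmapped_R1.fastq") as src, open("unmapped_R1.renamed.fastq", "w") as dst:
#     dst.writelines(rewrite_records(src, 1))
```

The same call with mate=2 would handle unmapped_R2.fastq. Since only header lines change, the read order in both files stays in sync.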

Thanks for your feedback and time.


Assuming the read IDs are correct, you can also filter the relevant reads out of the original data files using the filterbyname.sh tool from the BBMap suite.
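
The idea behind name-based filtering can be sketched in a few lines of Python. This is not how filterbyname.sh is implemented; it is only an illustration with hypothetical function names, assuming four-line FASTQ records and that the read ID is the first whitespace-delimited token of each header.

```python
def read_ids(fastq_lines):
    """Collect read IDs (header text after '@', up to the first space)."""
    ids = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0 and line.strip():
            ids.add(line[1:].split()[0])
    return ids


def filter_fastq(fastq_lines, wanted):
    """Yield the lines of every complete 4-line record whose ID is in `wanted`."""
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:
            if record[0][1:].split()[0] in wanted:
                yield from record
            record = []


# Usage (file names from the thread; the output name is hypothetical):
# with open("NC-RL-7Unmapped_R1.fastq") as fh:
#     wanted = read_ids(fh)
# with open("Sample7_Clean_Data1.fq") as src, open("unmapped_from_raw_R1.fq", "w") as dst:
#     dst.writelines(filter_fastq(src, wanted))
```

Running the filter on both original files with the same ID set keeps the original 1:N and 2:N headers and the pairing order intact.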
