We want to identify lncRNAs in rainbow trout (2 treatments, 6 samples; paired-end, Illumina HiSeq 2500). After running FastQC and HISAT2, we assembled transcripts with StringTie, combined them with stringtie --merge, and then ran gffcompare to get our final annotated GTF file. My commands are below (shown for one sample).
java -jar trimmomatic-0.32.jar PE -threads 8 -phred33 /home/user/MahmoodPhd/Trimmomatic-master/R_19191_1.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_2.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_1_paired.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_2_unpaired.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_1_unpaired.fq.gz /home/user/MahmoodPhd/Trimmomatic-master/R_19191_2_paired.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
gzip -d Oncorhynchus_mykiss.Omyk_1.0.dna.toplevel.fa.gz
mv GCF_002163495.1_Omyk_1.0_genomic.fna GCF_002163495.1_Omyk_1.0_genomic.fa
hisat2-build GCF_002163495.1_Omyk_1.0_genomic.fa Omykindex
./hisat2 -t --known-splicesite-infile splicesites.txt --dta --summary-file -x Omykindex -1 R_19191_1.fq.gz, -2 R_19191_2.fq.gz -p 8 R_19191.sam
samtools sort -@ 8 -o R_19340_sort.bam R_19340.sam
./stringtie /home/user/MahmoodPhd/hisat2-2.1.0-Linux_x86_64/hisat2-2.1.0/R_19340_sort.bam -l R_19340 -p 8 -G GCF_002163495.1_Omyk_1.0_genomic.gff -o R_19340.gtf
./stringtie --merge -p 8 -G /home/user/MahmoodPhd/Stringtie/GCF_002163495.1_Omyk_1.0_genomic.gtf -o stringtie_merged1.gtf list.txt
# ./stringtie --merge -p 8 -G /home/user/MahmoodPhd/Stringtie/GCF_002163495.1_Omyk_1.0_genomic.gtf -o stringtie_merged1.gtf list.txt
# StringTie version 2.1.4
stringtie_merged1.gtf:
NC_001717.1 StringTie transcript 1004 16642 1000 + . gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; ref_gene_id "";
NC_001717.1 StringTie exon 1004 4890 1000 + . gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "1"; ref_gene_id "";
NC_001717.1 StringTie exon 4958 6147 1000 + . gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "2"; ref_gene_id "";
NC_001717.1 StringTie exon 6468 8018 1000 + . gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "3"; ref_gene_id "";
NC_001717.1 StringTie exon 8094 8166 1000 + . gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "4"; ref_gene_id "";
NC_001717.1 StringTie exon 8181 14770 1000 + . gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "5"; ref_gene_id "";
NC_001717.1 StringTie exon 15361 16642 1000 + . gene_id "MSTRG.1"; transcript_id "unknown_transcript_1"; exon_number "6"; ref_gene_id "";
NC_001717.1 StringTie transcript 4888 16642 1000 - . gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; ref_gene_id "";
NC_001717.1 StringTie exon 4888 4958 1000 - . gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "1"; ref_gene_id "";
NC_001717.1 StringTie exon 6149 6291 1000 - . gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "2"; ref_gene_id "";
NC_001717.1 StringTie exon 6329 6466 1000 - . gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "3"; ref_gene_id "";
NC_001717.1 StringTie exon 8019 8089 1000 - . gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "4"; ref_gene_id "";
NC_001717.1 StringTie exon 14767 15357 1000 - . gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "5"; ref_gene_id "";
NC_001717.1 StringTie exon 16573 16642 1000 - . gene_id "MSTRG.2"; transcript_id "unknown_transcript_1"; exon_number "6"; ref_gene_id "";
NC_035077.1 StringTie transcript 26931 50257 1000 + . gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; gene_name "LOC110523613"; ref_gene_id "LOC110523613";
NC_035077.1 StringTie exon 26931 27080 1000 + . gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "1"; gene_name "LOC110523613"; ref_gene_id "LOC110523613";
NC_035077.1 StringTie exon 33577 33719 1000 + . gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "2"; gene_name "LOC110523613"; ref_gene_id "LOC110523613";
NC_035077.1 StringTie exon 34556 34671 1000 + . gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "3"; gene_name "LOC110523613"; ref_gene_id "LOC110523613";
NC_035077.1 StringTie exon 40440 40620 1000 + . gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "4"; gene_name "LOC110523613"; ref_gene_id "LOC110523613";
NC_035077.1 StringTie exon 47972 50257 1000 + . gene_id "MSTRG.3"; transcript_id "XM_021602508.1"; exon_number "5"; gene_name "LOC110523613"; ref_gene_id "LOC110523613";
…..
./gffcompare -r /home/user/MahmoodPhd/Stringtie/GCF_002163495.1_Omyk_1.0_genomic.gtf -G -o finalgffestimation.gtf /home/user/MahmoodPhd/Stringtie/stringtie_merged1.gtf
Our problem is with the final GTF file (finalgffestimation.gtf), shown below: after I filter by length (200 nt) and by class code (u, i, o, j, x) and then merge the results (with cat), I cannot convert the merged file to FASTA for downstream analysis.
finalgffestimation.gtf (the exon records are on separate rows from the transcript records, and I can filter them out correctly):
NC_001717.1 StringTie transcript 1004 16642 . + . transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; xloc "XLOC_000001"; ref_gene_id ""; cmp_ref "unknown_transcript_1"; class_code "j"; tss_id "TSS1";
NC_001717.1 StringTie exon 1004 4890 . + . transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "1";
NC_001717.1 StringTie exon 4958 6147 . + . transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "2";
NC_001717.1 StringTie exon 6468 8018 . + . transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "3";
NC_001717.1 StringTie exon 8094 8166 . + . transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "4";
NC_001717.1 StringTie exon 8181 14770 . + . transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "5";
NC_001717.1 StringTie exon 15361 16642 . + . transcript_id "unknown_transcript_1"; gene_id "MSTRG.1"; exon_number "6";
NC_001717.1 StringTie transcript 4888 16642 . - . transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; xloc "XLOC_000002"; ref_gene_id ""; cmp_ref "unknown_transcript_1"; class_code "j"; tss_id "TSS2";
NC_001717.1 StringTie exon 4888 4958 . - . transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "1";
NC_001717.1 StringTie exon 6149 6291 . - . transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "2";
NC_001717.1 StringTie exon 6329 6466 . - . transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "3";
NC_001717.1 StringTie exon 8019 8089 . - . transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "4";
NC_001717.1 StringTie exon 14767 15357 . - . transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "5";
NC_001717.1 StringTie exon 16573 16642 . - . transcript_id "unknown_transcript_1"; gene_id "MSTRG.2"; exon_number "6";
NC_035077.1 StringTie transcript 26931 50257 . + . transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; gene_name "LOC110523613"; xloc "XLOC_000003"; ref_gene_id "LOC110523613"; cmp_ref "XM_021602508.1"; class_code "="; tss_id "TSS3";
NC_035077.1 StringTie exon 26931 27080 . + . transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "1";
NC_035077.1 StringTie exon 33577 33719 . + . transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "2";
NC_035077.1 StringTie exon 34556 34671 . + . transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "3";
NC_035077.1 StringTie exon 40440 40620 . + . transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "4";
NC_035077.1 StringTie exon 47972 50257 . + . transcript_id "XM_021602508.1"; gene_id "MSTRG.3"; exon_number "5";
NC_035077.1 StringTie transcript 32704 50257 . + . transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; gene_name "LOC110523613"; xloc "XLOC_000003"; cmp_ref "XM_021602508.1"; class_code "j"; tss_id "TSS4";
NC_035077.1 StringTie exon 32704 33070 . + . transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; exon_number "1";
NC_035077.1 StringTie exon 40440 40620 . + . transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; exon_number "2";
NC_035077.1 StringTie exon 47972 50257 . + . transcript_id "MSTRG.3.2"; gene_id "MSTRG.3"; exon_number "3";
NC_035077.1 StringTie transcript 145370 152927 . + . transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; gene_name "LOC110523749"; xloc "XLOC_000004"; ref_gene_id "LOC110523749"; cmp_ref "XM_021602762.1"; class_code "="; tss_id "TSS5";
NC_035077.1 StringTie exon 145370 145969 . + . transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; exon_number "1";
NC_035077.1 StringTie exon 146059 146626 . + . transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; exon_number "2";
NC_035077.1 StringTie exon 146738 152927 . + . transcript_id "XM_021602762.1"; gene_id "MSTRG.5"; exon_number "3";
NC_035077.1 Gnomon transcript 177504 190839 . + . transcript_id "XM_021603017.1"; gene_id "LOC110523873"; gene_name "LOC110523873"; xloc "XLOC_000005"; ref_gene_id "LOC110523873"; cmp_ref "XM_021603017.1"; class_code "="; tss_id "TSS6";
NC_035077.1 Gnomon exon 177504 177974 . + . transcript_id "XM_021603017.1"; gene_id "LOC110523873"; exon_number "1";
NC_035077.1 Gnomon exon 190591 190839 . + . transcript_id "XM_021603017.1"; gene_id "LOC110523873"; exon_number "2";
NC_035077.1 Gnomon transcript 201613 226609 . + . transcript_id "XM_021602890.1"; gene_id "LOC110523811"; gene_name "LOC110523811"; xloc "XLOC_000006"; ref_gene_id "LOC110523811"; cmp_ref "XM_021602890.1"; class_code "="; tss_id "TSS7";
NC_035077.1 Gnomon exon 201613 202200 . + . transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "1";
NC_035077.1 Gnomon exon 206366 206502 . + . transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "2";
NC_035077.1 Gnomon exon 209520 209564 . + . transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "3";
NC_035077.1 Gnomon exon 209917 209985 . + . transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "4";
NC_035077.1 Gnomon exon 217023 217200 . + . transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "5";
NC_035077.1 Gnomon exon 226478 226609 . + . transcript_id "XM_021602890.1"; gene_id "LOC110523811"; exon_number "6";
NC_035077.1 BestRefSeq transcript 355926 382310 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; gene_name "LOC100135976"; xloc "XLOC_000007"; ref_gene_id "LOC100135976"; cmp_ref "NM_001124315.1"; class_code "="; tss_id "TSS8";
NC_035077.1 BestRefSeq exon 355926 356160 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "1";
NC_035077.1 BestRefSeq exon 365943 366101 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "2";
NC_035077.1 BestRefSeq exon 367207 367323 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "3";
NC_035077.1 BestRefSeq exon 368982 369110 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "4";
NC_035077.1 BestRefSeq exon 370615 370674 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "5";
NC_035077.1 BestRefSeq exon 374855 375049 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "6";
NC_035077.1 BestRefSeq exon 375557 375658 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "7";
NC_035077.1 BestRefSeq exon 375795 375901 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "8";
NC_035077.1 BestRefSeq exon 379788 382310 . + . transcript_id "NM_001124315.1"; gene_id "LOC100135976"; exon_number "9";
NC_035077.1 Gnomon transcript 422138 465051 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; gene_name "LOC110523159"; xloc "XLOC_000008"; ref_gene_id "LOC110523159"; cmp_ref "XM_021601631.1"; class_code "="; tss_id "TSS9";
NC_035077.1 Gnomon exon 422138 422304 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "1";
NC_035077.1 Gnomon exon 430842 430903 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "2";
NC_035077.1 Gnomon exon 431903 431943 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "3";
NC_035077.1 Gnomon exon 452095 452215 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "4";
NC_035077.1 Gnomon exon 456772 456888 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "5";
NC_035077.1 Gnomon exon 457131 457323 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "6";
NC_035077.1 Gnomon exon 465044 465051 . + . transcript_id "XM_021601631.1"; gene_id "LOC110523159"; exon_number "7";
….
Commands for filtering:
awk '{ if ($5-$4>200) print $0 }' /home/user/MahmoodPhd/FEELnc/merged.annotated.gtf > merged.annotated_200.gtf
awk '$20 ~ /"x"/ { print }' '/home/user/MahmoodPhd/merged.annotated_200.gtf' > x20.gtf
cat u16.gtf j20.gtf i20.gtf o20.gtf x20.gtf > ujiox.gtf
gffread /home/user/MahmoodPhd/cuffcom/ujiox.gtf -g /home/user/MahmoodPhd/files/GCF_002163495.1_Omyk_1.0_genomic.fa -w /home/user/MahmoodPhd/ujiox.fasta
ujiox.fasta is empty!!
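One thing that may be worth checking (a guess from the listing above, not a confirmed diagnosis): class_code appears only on the transcript rows, so an awk filter on that field keeps transcript rows and drops every exon row, and the $5-$4>200 test is applied to each row separately, so short exons are removed as well. gffread -w writes spliced exon sequences, so a GTF that has lost its exon rows can plausibly end in an empty FASTA. Below is a minimal sketch of a filter that keeps whole transcripts (the transcript row plus all of its exon rows). The demo file, the T1..T4 IDs, and the mk helper are made up for illustration; the sketch assumes a tab-separated GTF and unique transcript_id values (in the listing above, unknown_transcript_1 is used by both MSTRG.1 and MSTRG.2, so that would need attention first).

```shell
# Hypothetical scratch directory and demo GTF standing in for the gffcompare output.
work=$(mktemp -d)

# Helper (illustration only): print one tab-separated GTF row.
# Arguments: seqname source feature start end strand attributes
mk() { printf '%s\t%s\t%s\t%s\t%s\t.\t%s\t.\t%s\n' "$@"; }

{
  mk chrA StringTie transcript 100 900 + 'transcript_id "T1"; gene_id "G1"; class_code "u";'
  mk chrA StringTie exon 100 250 + 'transcript_id "T1"; gene_id "G1"; exon_number "1";'
  mk chrA StringTie exon 800 900 + 'transcript_id "T1"; gene_id "G1"; exon_number "2";'
  mk chrA StringTie transcript 1000 1500 + 'transcript_id "T2"; gene_id "G2"; class_code "=";'
  mk chrA StringTie exon 1000 1500 + 'transcript_id "T2"; gene_id "G2"; exon_number "1";'
  mk chrA StringTie transcript 2000 2100 - 'transcript_id "T3"; gene_id "G3"; class_code "x";'
  mk chrA StringTie exon 2000 2100 - 'transcript_id "T3"; gene_id "G3"; exon_number "1";'
  mk chrA StringTie transcript 3000 3400 + 'transcript_id "T4"; gene_id "G4"; class_code "j";'
  mk chrA StringTie exon 3000 3050 + 'transcript_id "T4"; gene_id "G4"; exon_number "1";'
  mk chrA StringTie exon 3200 3400 + 'transcript_id "T4"; gene_id "G4"; exon_number "2";'
} > "$work/demo.annotated.gtf"

# Two passes over the same file:
#   pass 1 records each transcript class code and its summed exon length,
#   pass 2 prints every row (transcript and exon) of the transcripts that pass.
awk -F'\t' '
  # Return the quoted value of one GTF attribute from column 9.
  function attr(key,   m) {
    if (match($9, key " \"[^\"]*\"")) {
      m = substr($9, RSTART, RLENGTH)
      gsub(key " \"|\"", "", m)
      return m
    }
    return ""
  }
  NR == FNR {
    id = attr("transcript_id")
    if ($3 == "transcript") code[id] = attr("class_code")
    if ($3 == "exon") len[id] += $5 - $4 + 1   # spliced length, not genomic span
    next
  }
  {
    id = attr("transcript_id")
    if (code[id] ~ /^[uiojx]$/ && len[id] > 200) print
  }
' "$work/demo.annotated.gtf" "$work/demo.annotated.gtf" > "$work/demo.ujiox.gtf"

# Show which features survived (T1 and T4 here, each with its exon rows).
cut -f3,9 "$work/demo.ujiox.gtf"
```

With the real data, both file arguments would be the gffcompare annotated GTF, and the filtered GTF could then be passed to gffread with the same -g and -w options used above. Whether 200 nt should apply to the spliced length (as in this sketch) or to the genomic span is a choice the sketch makes for you, so adjust it if your criterion differs.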
Thanks, my friend.
You're welcome, friend. Don't mention it.