I obtained the output file from PopoolationTE2 for my sample, which lists TE insertion sites. It looks like this (column 2 is the chromosome number, column 3 the position, column 5 the TE family):
1 1 4254339 . hAT|9 hAT R - 0,954
1 1 34804000 . Stowaway|41 Stowaway R - 1,000
1 1 12839440 . Tourist|15 Tourist F - 1,000
1 1 11521962 . Tourist|10 Tourist R - 1,000
1 1 28197852 . Tourist|11 Tourist F - 1,000
1 1 7367886 . Stowaway|36 Stowaway R - 1,000
1 1 13130538 . Stowaway|36 Stowaway R - 1,000
1 1 6177708 . hAT|4 hAT F - 1,000
1 1 3783728 . hAT|20 hAT F - 1,000
1 1 10332288 . uc|12 uc R - 1,000
1 1 15780052 . uc|5 uc R - 1,000
1 1 28309928 . uc|5 uc R - 1,000
1 1 31010266 . uc|33 uc R - 0,967
1 1 4758653 . uc|10 uc F - 1,000
1 1 3815830 . uc|31 uc R - 0,879
1 1 5037968 . Mutator|4 Mutator F - 1,000
I want to compare it with the BED file of TE sites in the reference genome. It looks like this:
1 12005 12348 RefBeet_TSD_Len:3_Tourist|7
1 56229 56700 RefBeet_TSD_Len:8_hAT|9
1 66241 66528 RefBeet_TSD_Len:9_Mutator|21
1 81966 82251 RefBeet_TSD_Len:2_Stowaway|39
1 84155 84402 RefBeet_TSD_Len:2_uc|1
1 84714 84841 RefBeet_Unknow_un_uc|28
1 98136 98349 RefBeet_TSD_Len:2_Stowaway|3
1 102325 102582 RefBeet_TSD_Len:2_Stowaway|12
1 103132 103267 RefBeet_Unknow_un_uc|33
1 108250 108580 RefBeet_TSD_Len:3_Tourist|17
1 115434 115695 RefBeet_Unknow_Len:8_uc|9
I want to check whether the TE insertions found in my sample also occur in the reference. For example, for the first TE (hAT|9 on chromosome 1 at position 4254339), I want to know whether the BED file has an entry on the same chromosome whose range, with column 2 as the start and column 3 as the end, contains that position.
I tried to do it with pandas, but I'm pretty confused.
Thanks for any suggestions!
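A minimal sketch of the check described above, assuming both files are whitespace-separated and have no header line. The column names are hypothetical labels chosen for this example, the third insertion row (position 56300) is made up so that the example produces one hit, and both ends of the range are treated as inclusive (BED is formally 0-based and half-open, so the comparison may need adjusting).

```python
import io
import pandas as pd

# Two rows from the sample output plus one MADE-UP row (position 56300)
# so that the example contains at least one hit.
insertions_txt = """\
1 1 4254339 . hAT|9 hAT R - 0,954
1 1 34804000 . Stowaway|41 Stowaway R - 1,000
1 1 56300 . hAT|9 hAT F - 1,000
"""

# Three rows from the reference BED file.
reference_txt = """\
1 12005 12348 RefBeet_TSD_Len:3_Tourist|7
1 56229 56700 RefBeet_TSD_Len:8_hAT|9
1 66241 66528 RefBeet_TSD_Len:9_Mutator|21
"""

# Column names are hypothetical; replace io.StringIO(...) with real file paths.
ins = pd.read_csv(
    io.StringIO(insertions_txt),
    sep=r"\s+",
    header=None,
    names=["sample", "chrom", "pos", "dot", "te_id",
           "family", "strand", "comment", "freq"],
    decimal=",",  # the last column uses a decimal comma (0,954)
)
ref = pd.read_csv(
    io.StringIO(reference_txt),
    sep=r"\s+",
    header=None,
    names=["chrom", "start", "end", "name"],
)

# Give each insertion an id so hits can be mapped back to it.
ins["ins_id"] = range(len(ins))

# Pair every insertion with every reference TE on the same chromosome,
# then keep the pairs where the position lies inside [start, end].
pairs = ins.merge(ref, on="chrom", how="inner")
hits = pairs[(pairs["start"] <= pairs["pos"]) & (pairs["pos"] <= pairs["end"])]

# Optional: does the reference entry name end with the same TE id (e.g. hAT|9)?
hits = hits.assign(
    same_te=[n.endswith("_" + t) for n, t in zip(hits["name"], hits["te_id"])]
)

# Flag each sample insertion that falls inside any reference TE.
ins["in_reference"] = ins["ins_id"].isin(hits["ins_id"])

print(ins[["chrom", "pos", "te_id", "in_reference"]])
print(hits[["chrom", "pos", "te_id", "start", "end", "name", "same_te"]])
```

The merge builds every insertion/reference pair per chromosome, which is fine for small files but can get large for whole-genome annotations.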