-
Notifications
You must be signed in to change notification settings - Fork 2
Algorithm to find the tree for a point on the genome of a given sample #35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you a 8000 gree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
You mean a local tree, right? |
By 'local tree sequence' I meant a tree sequence in the coordinate space of the given GIG node; since it seems from #2 that each distinct coordinate space has a distinct tree sequence associated with it. We have a lot to still figure out there. Do you think that's a sensible term? |
Ah, I see, sorry. You mean that you have an algorithm for generating a single tree for a single sample at position x, and by running it along that genome for all values of x you will generate a "local tree sequence"? Two problems I see with this: a) we normally take "local" to mean relative to a position, not to a sample, so maybe a different word is better For the moment, I would focus on the simply extraction of a "local, sample-focussed" tree. |
One of the other main uses of this algorithm will be to identify "equivalent" places for recombination to take place. In a forward simulation, that would involve taking a fixed position X in one genome, and finding the equivalent position in one other specific genome (chosen at random). I imagine we might want to do that fast, many times, for different arbitrarily chosen values of X. |
Thanks, I agree that the term 'local' is confusing here - I'm inclined to kick the terminology ball down the road, until after we figure out how to construct such a sequence in the first place! Keeping track of node identities will be tricky indeed.
Agreed. My current implementation at least touches each node as few times as possible, but it's still very clunky and certainly not fast. I hope it will be good enough for small simulations once I've refined it a bit. |
It would be very useful to be able to calculate a local tree sequence for an arbitrary genome of a GIG, as I did by hand in #2. As a starting point, we should at least be able to choose a sample node and a point x along its genome, then traverse the GIG from there, keeping track of x as we go through coordinate changes due to SVs. The resulting tree can then be visualised with tskit. I think I have a working prototype for the algorithm:
The
intervals
object needs to be a Pandas dataframe for it to work, as discussed in #34. Thetranslate_coordinate_up
function looks like this:The function for translating down is identical, except the direction of shifted coordinates is reversed. To run the algorithm, you call
traverse_up
. It moves up the tree from the sample node to the root, traversing subtrees down as it goes.The algorithm seems to generate trees which match what I calculated in #2, but there are a lot of refinements it needs to make it work more generally. For example, I haven't accounted for inversions where the parent interval is in a different part of the genome than the child. Also, there are going to be endless headaches in figuring out which inequalities should be open or closed! At the moment it just prints out what edges should be added: the next step is to actually construct a tskit tree instead.
My hope is that we can use this algorithm to translate recombination breakpoints from one sample genome to another, which will be crucial to doing the forward simulator with recombination as per #16.
The text was updated successfully, but these errors were encountered: