[Gary] Thanks for the conversation; see my comments inline below
[Grant] Thank you for your answer—it will take me some time before I can fully appreciate the finer points of what you said, but I can say this.
- [Grant] Data drift, as far as I’m aware, refers to new data not being distributed identically to the original data. Thus, if we were fortunate enough to have all of the data that could ever be collected, past, present, and future, it would consist of mixed populations, each with its own data distribution. When I say data distribution, nothing is tied to statistics or likelihood per se; rather, relative to all data that could ever be collected, past, present, and future, and to any relevant variable that could possibly occur to us to measure, there is a true data distribution. When I say mixed populations, I am referring either to parametric distributions over which a discrete mixture is taken, or to somehow incomplete non-parametric distributions over which a discrete mixture is taken. However, this point is just a big detour, because the point I was trying to elucidate concerns selection in the case of sparse data, which I take up below.
[Gary] Yes, I understand. My point would be that if there is a common mechanism by which the data is produced/generated/comes to be in each different context, and such a mechanistic model could be used to generate the entire set of possible data configurations constrained to some degree by what data points are present, then an ANN could learn that function and be applied to different contexts without data drift. Data drift occurs because, due to sparsity in the sample, the true function cannot be found: the sampled distribution is not representative of the true, comprehensive possibility space.
- [Grant] My main point is that selection in the setting of sparse data will always yield random-in-the-limit estimands. To illustrate this point, imagine that the outcome variable and all state variables are 0/1 valued. The true distribution of the outcome conditional upon the state variables is a mapping from state variables in \{0, 1\}^p to the probability that the outcome is 1, which is between 0 and 1. We can visualize the dataset as a 2^(p+1) contingency table. Many if not all cells in this ridiculously humongous table have a nonzero probability that Y=1. Imagine a grayscale from white (probability 0) to black (probability 1). The point with sparse data is that many of the gray-ish cells will not have any counts in them. If we have a total of n observations, then many gray-ish cells will be empty. Next time we draw n observations, a different set of cells will be empty. This is the nature of the problem upon which my India/US example is based.
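A minimal numerical sketch of this picture, assuming small illustrative values of p and n (none of these numbers are from the original example): with n much smaller than 2^p, most cells are empty in any single draw, and two independent draws of size n mostly occupy different cells.

```python
# Sketch of the sparse contingency-table example; all values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

p = 12                    # number of binary state variables
n_cells = 2 ** p          # cells of the {0,1}^p contingency table
n = 500                   # sparse sample size, much smaller than n_cells

# "True" grayscale: P(Y=1) for every cell, drawn arbitrarily for illustration.
true_gray = rng.uniform(0.0, 1.0, size=n_cells)

def occupied_cells(rng, n_obs):
    """Draw n_obs observations (uniformly over cells, for simplicity) and
    return the set of cells that received at least one count."""
    return set(rng.integers(0, n_cells, size=n_obs).tolist())

draw_1 = occupied_cells(rng, n)
draw_2 = occupied_cells(rng, n)

print(f"cells in the table:     {n_cells}")
print(f"occupied in draw 1:     {len(draw_1)}")
print(f"occupied in draw 2:     {len(draw_2)}")
print(f"occupied in both draws: {len(draw_1 & draw_2)}")
# Most gray-ish cells are empty in either draw, and the two draws largely
# miss each other: the setup behind the India/US example.
```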
[Gary] Hmm. I don’t know how to cast this example into the problem I am trying to address. Is there a mechanism/algorithm that determines the “grayness” of the cell? Is the grayness a measurement? Is there a time evolution/trajectory of the grayness that would allow you to try to predict the forthcoming grayness? My point is that there is some hypothesized process/algorithm/mechanism by which the grayness of each element comes about, that the configuration of the grayness of the table overall evolves, and that this pattern is itself of interest (system phenotype). If such an algorithm/mechanism could be simulated, and used to supplement the n observations (which would be practically constrained) with s_n simulated/synthetic observations, then there would be an improved ability for an ANN to learn (and predict) what the configuration of the table would be next.
To link this to the specific target of the paper/talk: imagine that the grayness of the table cells corresponds to measured mediators; there is a mechanistic process by which the grayness is generated and propagated. There are configurations of the table/system-level phenotypes that correspond to disease states. You only get to sample a sparse section of the overall table at each time point; how then can you forecast what the next configuration of the table will be? Let’s say that you have samples at different time points, but the n observations moving forward may cover a different set of cells (this is not exactly what would happen in reality, because across a population you would have different configurations of the table clustered together to represent the same phenotype). Can you figure out what the synthetic n would need to be in order for the ANN not to arrive at an approximated solution biased by the sparsity (and variation and non-representativeness) of the available observations? I posit that it is impossible to do this by just looking at the configurations of the table under repeated sparse samples and trying to “learn” the true distribution of trajectories (i.e., a statistical approach); rather, you would hypothesize a generative algorithm, generate a bunch of synthetic observations, and see if the performance of a system trained on such augmented observations would be better than one trained without them. Practically, this is the task of trying to personalize forecasts of disease trajectories using time series molecular mediator data.
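To make the comparison concrete, here is a minimal sketch, assuming a toy stand-in "hypothesized generator" and made-up sample sizes (none of this comes from the paper/talk): train the same ANN once on the sparse observations alone and once on the observations supplemented by synthetic ones, then compare on held-out data.

```python
# Sketch of the augmentation idea; the generator below is a stand-in
# placeholder for the hypothesized mechanistic/generative model.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
p = 10  # number of binary state variables

def true_mechanism(x, rng):
    # Hidden ground truth, used only to label data and score the comparison.
    logits = x.sum(axis=1) - p / 2
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

def hypothesized_generator(n_synth, rng):
    # Stand-in for the hypothesized generative algorithm; here it is assumed
    # to be approximately right, which is the premise of the argument.
    x = rng.integers(0, 2, size=(n_synth, p))
    return x, true_mechanism(x, rng)

# Sparse observed data (small relative to the 2^p possible configurations).
x_obs = rng.integers(0, 2, size=(200, p))
y_obs = true_mechanism(x_obs, rng)

# Synthetic supplement and held-out evaluation data.
x_syn, y_syn = hypothesized_generator(5000, rng)
x_test = rng.integers(0, 2, size=(5000, p))
y_test = true_mechanism(x_test, rng)

for name, (x_tr, y_tr) in {
    "observed only": (x_obs, y_obs),
    "observed + synthetic": (np.vstack([x_obs, x_syn]),
                             np.concatenate([y_obs, y_syn])),
}.items():
    ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    ann.fit(x_tr, y_tr)
    print(name, round(accuracy_score(y_test, ann.predict(x_test)), 3))
```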
If nothing else, I think my attempt to convert your example into something that I understand (and which is the problem I am trying to address) demonstrates that I am not a statistician! My suspicion is that we are not necessarily disagreeing so much as talking about somewhat different things 😊. But I always want to learn more, so if it isn’t too tiresome for you, this interaction is very useful to me.
Best,
Gary
-------------------------------------------------------------------------------------
Hi Grant,
Thanks for your questions and follow-up. WRT your example, I completely agree! This is exactly the problem I see with statistical methods, and it is represented by data drift in the ML/AI world. What I would propose (and give me some liberties here, since I am not an epidemiology/health-care system modeler) is that someone would come up with a dynamic, knowledge-based simulation model that describes the steps by which a potential patient develops COVID, is identified to have COVID, and is subsequently treated for COVID. This “health-care process” model would include a whole bunch of parameters that affect the variables in the model (so, for instance, percent resistance based on immunization, likelihood of going to the hospital, etc.); this model is constructed in a fashion that allows for unrepresented secondary/tertiary effects (so it allows representation of the MRM). We would then use our GA/AL pipeline to find sets of configurations that encompass a data set (be it India or US), with the emphasis here that we are NOT worried about the distribution of the time series measurements but rather the outlier values (since the system was able to generate those data points). After this process you have a set of MRMs on a mechanistic model that cannot be falsified by the available data (a rough sketch of this filtering idea appears after the notes below).
The key here is that while the distribution/mean or whatever statistical metric affected by the sparsity of the sample may differ between the two data sets, the max/min values in each data set are “relatively” close. This means that the possibility spaces of the time series trajectories of both countries will overlap, and therefore synthetic data generated with the mechanistic model (operating over the space of non-falsifiable MRMs) would be generalizable to both data sets, irrespective of the sparsely identified distributions. This would thereby 1) address the potential non-representativeness of each data set due to its sparsity and 2) overcome data drift, because the synthetic data covers the possible range of trajectories. In this case you could then conceivably train an ANN to try to forecast the trajectory of an individual patient over time as they move through the health care system. There are several points that also need to be noted:
- This approach is important if the goal is to try to project the trajectory of an individual person, based on some updatable measurement of where they are in the multidimensional state space (this is the link to digital twins and updatable forecasting cones). You don’t need to do this if you are only interested in identifying “true” differences at a population level (since the synthetic data is actually intended to discount the effect of statistical likelihood in the trained ANN and instead to find the generative function).
- The second point goes to my comments about not using ODEs or simple models for this, because the ANN will find that representation and just recapitulate it, and if you have that then you don’t need the AI (which is perhaps sometimes okay). This is why I said that the generative mechanism-based model must be sufficiently complex (essentially multi-hierarchical) and stochastic. These features, plus the fact that you are operating over a set of non-falsifiable MRMs, will (theoretically) obscure the exact form of the generative model (NOTE: this last bit is actually obviated by the set of MRMs, since that set represents a constrained set of candidate models).
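A rough sketch of the non-falsifiability filtering step mentioned above, in which the toy model, parameter grid, and data are all illustrative assumptions standing in for the GA/AL pipeline and the knowledge-based model: keep the parameterizations whose simulated min/max trajectory envelope covers every observed point in both data sets, then generate synthetic trajectories only from that surviving set.

```python
# Sketch only; everything here is a stand-in for the GA/AL pipeline over a
# knowledge-based "health-care process" model.
import numpy as np

rng = np.random.default_rng(2)
T = 20  # number of time points

def toy_mechanistic_model(params, rng, n_reps=50):
    """Stand-in stochastic model: returns n_reps simulated trajectories."""
    growth, noise_sd = params
    base = growth * np.arange(T)
    return base + rng.normal(0.0, noise_sd, size=(n_reps, T))

def not_falsified(params, observed, rng):
    """True if every observed value lies inside the simulated min/max envelope."""
    sims = toy_mechanistic_model(params, rng)
    return np.all((observed >= sims.min(axis=0)) & (observed <= sims.max(axis=0)))

# Two sparse observed trajectories (think "India" and "US").
obs_a = 0.9 * np.arange(T) + rng.normal(0.0, 1.0, size=T)
obs_b = 1.1 * np.arange(T) + rng.normal(0.0, 1.0, size=T)

# Candidate parameterizations (in the real pipeline these are proposed by the GA).
candidates = [(g, s) for g in np.linspace(0.5, 1.5, 11) for s in (1.0, 2.0, 4.0)]

surviving = [c for c in candidates
             if not_falsified(c, obs_a, rng) and not_falsified(c, obs_b, rng)]
print(f"{len(surviving)} of {len(candidates)} candidates not falsified by either dataset")
# Synthetic training trajectories would then be generated only from 'surviving'.
```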
The reason I introduced my talk (and the paper) as focusing on multiplexed molecular time series data is exactly because those systems have a supposition of connected mechanisms that are 1) too complex to represent at a high degree of detail and 2) necessarily incomplete because of necessary abstractions. The mechanistically causal chain of interactions in a health-care process model may not have those features, and it may be that this example does not map cleanly onto the one for which the approach was constructed.
Sorry for the long answer, but as you can see it does represent a very unconventional way of thinking about things, and I appreciate the chance to engage on it, so thanks again for your questions! Let me know what you think!
Best,
Gary
-----------------------------------------------------------------
Hi Professor Glazier:
Thank you for organizing such an interesting seminar – and I thank Professor An for such an insightful presentation.
I came up with a concrete example that summarizes my concerns about fitting NNs to sparse data, i.e., that selection in the case of sparse data estimates a random variable rather than a constant, to which he replied, from what I can understand, that one dataset determines one realization of the limiting random variable.
To this I offer the following. What if we have two countries, say, each with its own sparse time series data, and each has applied the method? In the end we see that the probability that a group 2 person will suffer respiratory distress is close to 15% in India, while in the US it is close to 30%. We conclude, whether based upon genetics, bad care, or anti-vaxxers, that people in the US are much worse off than people in India, because our training-conditional estimates of error for our model suggest that this difference is too large to be due to random error alone.
My concern is that even if the two datasets were drawn from the same true distribution, differences which extend beyond what we would expect from a training-conditional error analysis are possible, because the target of each group’s estimate is in fact a random variable, with error arising from selection in the case of a sparse training set.
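A rough numerical sketch of this point, with all numbers made up for illustration: treat "group 2" as a mixture of many heterogeneous cells, so that a sparse dataset only occupies a few of them and the quantity the fit converges to is itself random, even before within-cell sampling noise is considered.

```python
# Illustrative assumption: "group 2" is a mixture of heterogeneous cells with
# different true risks; a sparse dataset only lands in a handful of them.
import numpy as np

rng = np.random.default_rng(3)

n_cells = 500                              # heterogeneous cells within group 2
cell_risk = rng.beta(1.0, 4.0, n_cells)    # true P(distress | cell), mean ~0.20

def sparse_limit_estimand(rng, n_occupied=6):
    """Average true risk over the few cells a sparse dataset happens to occupy;
    this is what a fit to that dataset converges to as within-cell counts grow."""
    cells = rng.choice(n_cells, size=n_occupied, replace=False)
    return cell_risk[cells].mean()

draws = np.array([sparse_limit_estimand(rng) for _ in range(10_000)])
gaps = np.abs(draws[:5_000] - draws[5_000:])

print(f"overall true risk:                 {cell_risk.mean():.3f}")
print(f"sparse-data estimands: mean {draws.mean():.3f}, sd {draws.std():.3f}")
print(f"P(two datasets differ by >= 0.15): {np.mean(gaps >= 0.15):.3f}")
# Two "countries" drawn from the same true distribution can plausibly report
# conditional risks as far apart as 15% vs 30%, purely from sparse selection.
```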
Thank you!
Grant Izmirlian (NIH/NCI)