BaseVarSelector
BaseVarSelector handles variable selection for constructing the base of the feedback loop that models the variables'
growth in the simulation process. It uses two implementations of the Granger Causality Test (GCT) to determine the variables
with the strongest causality over the whole feature set. The two tests are similar: one is a Bivariate GCT
and the other a Multivariate GCT. Both are structured to check the statistical significance of the null hypothesis
\(H_0: x \underset{\text{Granger-cause}}{\not\to} z\) against the alternative \(H_a: x\underset{\text{Granger-cause}}{\to} z\). As in any classical
hypothesis test, we check the p-value under \(H_0\). Lower p-values indicate stronger statistical evidence against \(H_0\),
suggesting that x Granger-causes z. Importantly, GCT checks for linear causality, so the significance we're computing
concerns the existence of a linear function of the lagged values of x that improves the prediction of z beyond z's own past values.
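For reference, the textbook linear formulation of the bivariate test can be written as below. This is an assumption about the intended setup rather than the library's documented model: the lag order \(p\), the coefficients \(a_i, b_j\) and the error term \(\varepsilon_t\) are standard notation introduced here for illustration.

```latex
\begin{aligned}
z_t &= a_0 + \sum_{i=1}^{p} a_i z_{t-i} + \sum_{j=1}^{p} b_j x_{t-j} + \varepsilon_t, \\
H_0 &: b_1 = b_2 = \dots = b_p = 0 .
\end{aligned}
```

Rejecting \(H_0\) means the lagged values of x jointly help predict z; it is evidence of predictive usefulness, not of a causal mechanism.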
The multivariate case follows the same pattern and tests whether \((x, y) \underset{\text{Granger-cause}}{\to}z\). Unfortunately, we're
only testing for the existence of a function that satisfies the conditions defined above, so a multivariate test
does not indicate which of the variables contributes more to the causality. BaseVarSelector ranks
variables (or pairs of variables) by their p-values instead of using a significance threshold to assert causality.
This ensures that the best-performing variables are always selected, as opposed to the tests returning no eligible variables
for a given feature set.
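The difference between thresholding and ranking can be illustrated with a minimal sketch. The p-values, the variable names and the two helper functions below are hypothetical and are not part of macrosim; they only show why ranking always yields a selection.

```python
# Hypothetical average p-values per variable (illustrative numbers, not real results).
p_values = {"gdp": 0.21, "cpi": 0.08, "unemployment": 0.34, "rates": 0.12}


def select_by_threshold(pvals, alpha=0.05):
    """Keep only variables whose p-value falls below the significance level."""
    return sorted(v for v, p in pvals.items() if p < alpha)


def select_by_rank(pvals, n=2):
    """Keep the n variables with the lowest p-values, whatever their magnitude."""
    return sorted(pvals, key=pvals.get)[:n]


by_threshold = select_by_threshold(p_values)
by_rank = select_by_rank(p_values)

print(by_threshold)  # [] (nothing clears alpha = 0.05)
print(by_rank)       # ['cpi', 'rates']
```

With these numbers the threshold rule returns nothing, while the ranking rule still returns the two strongest candidates.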
Example Usage
from macrosim import BaseVarSelector
import pandas as pd

df = pd.DataFrame(...)
bvs = BaseVarSelector(df=df)

# Compute a matrix of p-values for all combinations of Bivariate GCTs;
# record variable ranks by average p-value if score=True.
bvs.granger_matrix(score=True)

# Compute p-values for all possible combinations of 2-predictor GCTs;
# record variable ranks by the average p-value of all pairs each variable has been part of.
# (No matrix is returned, as the raw data is often too large to visualise.)
bvs.multivar_granger_matrix()

# Print a dict listing the ranks of variables in terms of their performance (separate for both tests).
print(bvs.score_dict)

# Weighted sum of the scores from both tests for each variable; lower is better.
overall_score = {
    k: (2/3)*v['Granger'] + (1/3)*v['Multivar_Granger']
    for k, v in bvs.score_dict.items()
}
Methods
BaseVarSelector.granger_matrix
Computes a matrix of p-values for every possible Bivariate GCT of the inputted data.
Params:
- score: bool. Records the ranks of variables into BaseVarSelector.score_dict if True.
Returns:
- pd.DataFrame: Matrix of p-values for all GCTs computed.
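To make the pairwise computation concrete, here is a dependency-free sketch of the statistic behind each cell of such a matrix. It is not the library's implementation: the helper names (`ols_rss`, `granger_f_stat`, `f_stat_matrix`), the fixed lag of 1 and the synthetic data are all assumptions, and it reports F statistics rather than p-values (turning an F statistic into a p-value needs the F distribution, for example `scipy.stats.f.sf`).

```python
import random


def ols_rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the rows of X."""
    k = len(X[0])
    # Normal equations (X'X) beta = X'y as an augmented k x (k + 1) matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(X, y))


def granger_f_stat(x, z, lag=1):
    """F statistic for H0: lagged x adds nothing to predicting z beyond lagged z."""
    rows = range(lag, len(z))
    y = [z[t] for t in rows]
    restricted = [[1.0] + [z[t - i] for i in range(1, lag + 1)] for t in rows]
    unrestricted = [r + [x[t - i] for i in range(1, lag + 1)]
                    for r, t in zip(restricted, rows)]
    rss_r, rss_u = ols_rss(restricted, y), ols_rss(unrestricted, y)
    n, k = len(y), len(unrestricted[0])
    return ((rss_r - rss_u) / lag) / (rss_u / (n - k))


def f_stat_matrix(series, lag=1):
    """Nested dict m[cause][effect] of F statistics for every ordered pair."""
    return {c: {e: granger_f_stat(series[c], series[e], lag)
                for e in series if e != c} for c in series}


# Synthetic data in which x drives z with a one-step delay.
random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.0] + [0.8 * x[t - 1] + random.gauss(0, 0.1) for t in range(1, n)]

matrix = f_stat_matrix({"x": x, "z": z})
print(matrix["x"]["z"], matrix["z"]["x"])  # the first value should be far larger
```

A larger F statistic corresponds to a smaller p-value, so the x-to-z cell would rank well ahead of the z-to-x cell here.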
BaseVarSelector.multivar_granger_matrix
Computes all Multivariate GCTs with two predictors (\((x,y) \underset{\text{Granger-cause}}{\to}z\)) and records the
variable ranks to BaseVarSelector.score_dict.
Params:
- None
Returns:
- None
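The bookkeeping described for the two-predictor case (average the p-values of every pair a variable took part in, then rank) might look like the sketch below. Everything in it is hypothetical: `toy_p_value` is a stand-in for a real Multivariate GCT, the `strength` numbers are made up, and ranks starting at 1 is an assumption about how score_dict is populated.

```python
from collections import defaultdict
from itertools import combinations

variables = ["a", "b", "c", "d"]

# Made-up "predictive strength" per variable; lower means a stronger predictor.
strength = {"a": 0.1, "b": 0.5, "c": 0.9, "d": 0.7}


def toy_p_value(pair, target):
    """Stand-in for a Multivariate GCT p-value of (x, y) -> target."""
    x, y = pair
    return strength[x] * strength[y]


# Enumerate every target with every unordered pair of the remaining variables.
p_values = {}
for target in variables:
    others = [v for v in variables if v != target]
    for pair in combinations(others, 2):
        p_values[(pair, target)] = toy_p_value(pair, target)

# Collect, for each variable, the p-values of all pairs it belonged to.
per_variable = defaultdict(list)
for (pair, _target), p in p_values.items():
    for v in pair:
        per_variable[v].append(p)

averages = {v: sum(ps) / len(ps) for v, ps in per_variable.items()}

# Rank 1 is the variable with the lowest average p-value.
ranks = {v: i + 1 for i, v in enumerate(sorted(averages, key=averages.get))}
print(ranks)  # {'a': 1, 'b': 2, 'd': 3, 'c': 4}
```

Because each p-value is shared by both members of a pair, a weak variable can inherit a good score from a strong partner; averaging over all pairs dampens that effect but does not remove it.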