-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed forecast period generation function for multiseries #4320
Conversation
Codecov ReportAll modified lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #4320 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 357 357
Lines 39767 39867 +100
=======================================
+ Hits 39647 39747 +100
Misses 120 120
☔ View full report in Codecov by Sentry. |
7fa8ec0
to
4135939
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions for simplification, let me know if they actually work though!
pred_intervals = self.estimator.get_prediction_intervals( | ||
X=estimator_input, | ||
y=y, | ||
coverage=coverage, | ||
) | ||
trans_pred_intervals = {} | ||
intervals_labels = list(list(pred_intervals.values())[0].keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to go into the debugger and play with the code myself to figure out what this line was doing 😅 A much simpler way:
intervals_labels = pred_intervals[0].keys()
That may need to be cast to a list for later on, I didn't test fully, but either way it's more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this doesn't work since pred_intervals
is a dict and this would pull the value at key 0 rather than the first value of the dictionary. Is there a better way to pull the first value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah oops, I see what I missed. I don't know of a better way to manipulate the dictionaries, but we could also just do intervals_labels = pd.DataFrame(pred_intervals).index
😂
for key, orig_pi_values in intervals.items(): | ||
series_id_target_name = ( | ||
self.input_target_name + "_" + str(series_id) | ||
) | ||
interval_series_pred_intervals[key][ | ||
series_id_target_name | ||
] = pd.Series( | ||
(orig_pi_values.values - residuals[series_id].values) | ||
+ trend_pred_intervals[series_id_target_name][key].values | ||
+ y[series_id_target_name].values, | ||
index=orig_pi_values.index, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of repeated code with the other logical branch, which is going to make life very hard for us if we ever need to update this code. Could you abstract it out into a local helper function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like
def _get_series_intervals(intervals, residuals, trend_pred_intervals, y):
return_intervals = {}
for key, orig_pi_values in intervals.items():
return_intervals[key] = pd.Series(
(orig_pi_values.values - residuals.values)
+ trend_pred_intervals[key].values
+ y.values,
index=origin_pi_values.index
)
return return_intervals
if is_multiseries(problem_type):
for series_id, series_intervals in pred_intervals.items():
series_id_target_name = self.input_target_name + "_" + str(series_id)
interval_series_pred_intervals[series_id_target_name] = _get_series_intervals(
series_intervals,
residuals[series_id],
trend_pred_intervals[series_id_target_name],
y[series_id_target_name]
)
else:
trans_pred_intervals = _get_series_intervals(pred_intervals, residuals, trend_pred_intervals, y)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code I suggested does make a change to the dictionary structure for the multiseries case, which you'll have to let me know if it works or not - I swapped the intervals with series ids, to give us {series_1: {0.75_lower: <>, 0.75_upper: <>, ...}, series_2: {...}...}
instead of {0.75_lower: {series_1: <>, series_2: <>, ...}, ...}
Personally, I think this would make it easier to get per-series prediction intervals, but you'll have to let me know if it's too much effort to swap things around at this point. We could also completely overhaul the data structure for this to be something actually 2D like a dataframe instead of nested dictionaries, but that might just be tech debt for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up using your implementation but I tweaked it slightly. I still kept the original dictionary structure since it makes stacking each prediction interval in the end slightly easier. Let me know what you think!
trans_pred_intervals = {} | ||
trend_pred_intervals = self.get_component( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got so confused here for a second, these names are so similar 😅 can one of them be renamed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trans_pred_intervals
-> transformed_pred_intervals
for interval, interval_data in series_id_interval_result.items(): | ||
interval_series_pred_intervals[interval][ | ||
series_id_target_name | ||
] = interval_data | ||
for interval in intervals_labels: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the word interval
has no meaning to me any more 😂
for interval, interval_data in series_id_interval_result.items(): | ||
interval_series_pred_intervals[interval][ | ||
series_id_target_name | ||
] = interval_data | ||
for interval in intervals_labels: | ||
series_id_df = pd.DataFrame( | ||
interval_series_pred_intervals[interval], | ||
) | ||
stacked_pred_interval = stack_data( | ||
data=series_id_df, | ||
series_id_name=self.series_id, | ||
) | ||
trans_pred_intervals[interval] = stacked_pred_interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read over this like 5 times in a row and I can't figure out what the point of all of it is. It seems to be a lot of rearranging the same data in different ways? I'd love if we can make this clearer, even if it's just through comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave variable renaming + additional comments adding a shot. Lmk if you think there's a way to additionally clarify it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - nothing to add on my end
Resolves #4323