Thursday, April 12, 2018

Prelude: I have a Master's from Carnegie Mellon in logic & computation. Essentially, a degree in scientific methodology. I know the difference between a p-value and Bayesian inference. My scholarly work has been cited a few times.

I am far from an expert. If you think you know more than me about predictive algorithms, you probably do. If you've ever touched a machine-learned algorithm, you could teach me things. But I know enough to spot straightforward bullshit.

And this set off my bullshit odometer: predictive financial modeling with 35 data points, a third-order polynomial with two coefficients at essentially zero (< 0.00001, naturally on the square and the cube), no listed p-value, and an R-squared of around 0.994 (pretty decent, actually).

Does this sound like absolute garbage to anyone else?
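For the curious, here's why that R-squared proves nothing on its own. A toy sketch with made-up numbers (not the actual data): fit a third-order polynomial to 35 points of what is basically a noisy straight line, and you get near-zero higher-order coefficients plus an R-squared above 0.99, exactly the pattern described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the data in question: 35 observations
# that are essentially a straight-line trend plus noise.
x = np.arange(35, dtype=float)
y = 100.0 + 3.0 * x + rng.normal(0.0, 1.5, size=35)

# Fit a third-order polynomial, as described above.
coeffs = np.polyfit(x, y, 3)      # [cube, square, linear, intercept]
fitted = np.polyval(coeffs, x)

# In-sample R-squared.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# The cube and square coefficients come out tiny relative to the
# data, and r2 comes out very high -- because the trend dominates
# the noise, not because the cubic model is any good.
```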

26 comments:

  1. I took some stats in college and forgot most of it.

    That looks like Dunning-Kruger levels of overconfidence.

    ReplyDelete
  2. The higher the order of the polynomial, the better the fit. In the absence of any justification for why a polynomial function is the right choice to model the relationship, that in and of itself seems suspect to me.

    Also, if the financial data is time series data, any kind of polynomial regression fitting seems suspect. Time series data should probably be modeled using time series methods that account for autocorrelation.
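    A quick sketch of that first point, with made-up numbers: even on pure noise, where there is no relationship at all, in-sample R-squared can only go up as the polynomial degree rises.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 35)
y = rng.normal(size=35)   # pure noise: no real relationship at all

def r_squared(degree):
    fitted = np.polyval(np.polyfit(x, y, degree), x)
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

r2s = [r_squared(d) for d in range(1, 8)]

# r2s is non-decreasing: each added polynomial term can only shrink
# the residuals, even though there is nothing real here to fit.
```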

    ReplyDelete
  3. This is exactly as I suspected. It's a little too close to what I suspected, which makes me concerned we're all wrong.

    ReplyDelete
  4. Don't keep me in suspense, William Nichols, what are these data? I'm intrigued. Unless it is personal, in which case I can live with my curiosity.

    ReplyDelete
  5. Hans Messersmith Time-based financial data. ;-)

    ReplyDelete
  6. Sounds like someone found the add trend function in Excel and played a bit around until the graph looks nice.

    ReplyDelete
  7. Gerrit Reininghaus You are closer than I want to admit.

    ReplyDelete
  8. William Nichols 😋 I have a Master's in math and philosophy of science. But I worked for too long as a strategy and risk consultant in the financial industry not to be guilty of this for real.

    ReplyDelete
  9. Predictive modelling with 35 data points?

    Everything else was just Peanuts adult-talk "mwa-ma-mwaaaa" to me after that. My instinct for anything predictive is that the number of data points in the training set needs at least two generous handfuls more zeroes. I don't have anything but humor and moral support to contribute.

    ReplyDelete
  10. Yes, Tony Lower-Basch: a third order polynomial used to figure out year-end revenue.

    I'll allow it for a while, but under no circumstances will we do it this way next year. Not so long as that "we" includes "me".

    ReplyDelete
  11. It is possible that having some understanding of the "velocity" and "acceleration" of the time series at different points in time is worthwhile (which a higher order polynomial fit of the curve might imply, I guess? Maybe?).

    But if my memory of a distant past of doing some time series analysis is correct, that is better handled through looking at the first and second difference of the series.

    robjhyndman.com - robjhyndman.com/talks/RevolutionR/8-Differencing.pdf
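    A minimal sketch of that differencing idea, with toy numbers (not the real series): the first difference plays the role of "velocity", the second difference of "acceleration".

```python
import numpy as np

# Toy series standing in for the financial data (made-up numbers).
series = np.array([100., 103., 107., 112., 118., 125., 133., 142.])

velocity = np.diff(series)           # first difference: period-to-period change
acceleration = np.diff(series, n=2)  # second difference: change in the change

print(velocity)      # -> [3. 4. 5. 6. 7. 8. 9.]
print(acceleration)  # -> [1. 1. 1. 1. 1. 1.]
```

    Here the constant second difference says the toy series is growing quadratically — no cubic fit required to see it.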

    ReplyDelete
  12. So I graphed it, which apparently no one had ever done.
    I graphed and looked at the following:
    1. Actual
    2. Third order
    3. Just the first-order term from (2), ignoring the 2nd- and 3rd-order terms
    4. A quick linear regression.

    (1) showed a small decline at end of year, (2) showed a large decline. (3) and (4), of course, showed no decline. End of year, (1) was half as far off (call it 10%) as (4), which was nearly 20% too high.

    All are nonsense, of course. My initial hypothesis was that the coefficients on the 2nd- and 3rd-order terms were so low as to not affect the outcome, and I wasn't correct.

    That is, I rejected the null hypothesis.
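    A sketch of that comparison with made-up numbers: in-sample, the cubic is guaranteed to fit at least as well as the line (it has the line as a special case), but extrapolated a few periods past the data, the two can tell very different stories about year-end.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(30, dtype=float)
y = 50.0 + 2.0 * x + rng.normal(0.0, 2.0, size=30)  # steady growth + noise

cubic = np.polyfit(x, y, 3)
linear = np.polyfit(x, y, 1)

# In-sample, the cubic can never fit worse than the line...
cubic_sse = np.sum((y - np.polyval(cubic, x)) ** 2)
linear_sse = np.sum((y - np.polyval(linear, x)) ** 2)

# ...but pushed a few periods past the data, the noise-driven
# higher-order terms start to dominate the forecast.
future = np.arange(30, 36, dtype=float)
cubic_forecast = np.polyval(cubic, future)
linear_forecast = np.polyval(linear, future)
```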

    ReplyDelete
  13. It does make sense to have a "key period" where the slope is increased. I need to consider how to do that.

    ReplyDelete
  14. Mayyyyyyyyyyyyyybe I calculate two different slopes, with the second one only being in use during specific months. I.e., do bad MLR instead of bad third-order polynomials.

    Is that any better?
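    That two-slope idea can be sketched as a single regression with an extra slope term that only switches on during the key months. Toy numbers, and the "key months" (here, Q4) are a hypothetical choice, not from the actual data:

```python
import numpy as np

rng = np.random.default_rng(3)
months = np.arange(24, dtype=float)

# Indicator that is 1 during the hypothetical key months (Oct-Dec).
key = ((months % 12) >= 9).astype(float)

# Simulated revenue: baseline slope 1, extra slope 4 in key months.
revenue = 10.0 + 1.0 * months + 4.0 * key * months + rng.normal(0.0, 1.0, 24)

# Two-slope linear model: intercept, baseline slope, and an extra
# slope that only applies when the indicator is on.
X = np.column_stack([np.ones_like(months), months, key * months])
beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
# beta recovers roughly [10, 1, 4]: intercept, base slope, extra slope.
```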

    ReplyDelete
  15. William Nichols Are you trying to detect whether something changed during the time series? Or to detect a change in the future? Simple CUSUM methods on time series data can be used for that purpose fairly easily.

    en.wikipedia.org - CUSUM - Wikipedia
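    A minimal one-sided CUSUM sketch with toy numbers: accumulate deviations above a target level (less a slack allowance), flooring at zero. The score hugs zero while the series is stable and climbs steadily once a sustained shift begins.

```python
def cusum(series, target, slack):
    """One-sided upward CUSUM: accumulate deviations above target + slack."""
    s = 0.0
    scores = []
    for value in series:
        s = max(0.0, s + (value - target - slack))
        scores.append(s)
    return scores

# Toy data: stable around 10, then a sustained shift upward.
data = [10.1, 9.8, 10.0, 10.2, 9.9, 12.5, 12.8, 13.1, 12.9]
scores = cusum(data, target=10.0, slack=0.5)
# Scores stay at zero through the stable stretch, then climb
# once the shift starts; crossing a chosen threshold signals a change.
```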

    ReplyDelete
  16. I'll run it by the data scientist spouse when he gets home, but only 35 data points already automatically looks like bullshit to me, and I only ever seat-of-pants this stuff. 35 data points: trollolol.

    ReplyDelete
  17. (He actually does predictive analysis for a living, if you want serious advice. But based on the kibitzing I've done with him over the quality of data, that's not really a data set.)

    ReplyDelete
  18. My sense is that from the perspective of serious data science, many industry concerns are so trivial as to be insulting, and simultaneously so questionably measured and scoped as to be unanswerable.

    It's like asking a theoretical mathematician "Could you invent a system of arithmetic in which ninety degrees isn't a right angle?"

    ReplyDelete
  19. Hans Messersmith the business question is: how much money will we make?

    ReplyDelete
  20. I will admit here that I have already built parameters for business risk with fewer than 30 data points of quarterly earnings. For a bank calling itself the largest in its country. You need to come up with a figure in the end - what else can you do?

    ReplyDelete
  21. His comment: "That R-squared is suspiciously high. Amazingly." With the eyebrow raise of "I'm being polite in my phrasing."

    Though he points out that the data he gets tends to be messy real world data, and often there are gaps in the domain knowledge.

    Also: "This is why reproducible research is a thing: you share the data set and the code so other people can see how you achieved that result."

    He's had to use that magnitude of predictors in the worst case (I think from someone who had retained only monthly summary reports for historical data) but his models were in the 70s-80s.... Not intended to be precise but better than a WAG. (Wild-Ass Guess.)

    ReplyDelete
  22. Tony Lower-Basch His comment on that: "Yup." One of the things he does is improve measurement if it's worth it, but it's not always worth the additional cost of measurement.

    ReplyDelete