Thursday, April 12, 2018

Prelude: I have a Master's from Carnegie Mellon in logic & computation. Essentially, a degree in scientific methodology. I know the difference between a p-value and Bayesian inference. My scholarly work has been cited a few times.

I am far from an expert. If you think you know more than me about predictive algorithms, you probably do. If you've ever touched a machine-learned algorithm, you could teach me things. But I know enough to spot straightforward bullshit.

And this set off my bullshit odometer: predictive financial modeling with 35 data points, a third-order polynomial with two coefficients at essentially zero (< 0.00001, naturally on the square and the cube), no listed p-value, and an R-squared of around 0.994 (pretty decent, actually).

Does this sound like absolute garbage to anyone else?
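For the curious, here's why that R-squared proves nothing on its own. A toy sketch with made-up numbers (not the actual data): fit a third-order polynomial to 35 points of what is basically a noisy straight line, and you get near-zero higher-order coefficients plus an R-squared above 0.99, exactly the pattern described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the data in question: 35 observations
# that are essentially a straight-line trend plus noise.
x = np.arange(35, dtype=float)
y = 100.0 + 3.0 * x + rng.normal(0.0, 1.5, size=35)

# Fit a third-order polynomial, as described above.
coeffs = np.polyfit(x, y, 3)      # [cube, square, linear, intercept]
fitted = np.polyval(coeffs, x)

# In-sample R-squared.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# The cube and square coefficients come out tiny relative to the
# data, and r2 comes out very high -- because the trend dominates
# the noise, not because the cubic model is any good.
```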

26 comments:

  1. I took some stats in college and forgot most of it.

    That looks like Dunning-Kruger levels of overconfidence.

    ReplyDelete
  2. The higher the order of the polynomial, the better the fit. In the absence of any justification for why a polynomial function is the right choice to model the relationship, that in and of itself seems suspect to me.

    Also, if the financial data is time series data, any kind of polynomial regression fitting seems suspect. Time series data should probably be modeled using time series methods that account for autocorrelation.
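    A quick sketch of that first point, with made-up numbers: even on pure noise, where there is no relationship at all, in-sample R-squared can only go up as the polynomial degree rises.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 35)
y = rng.normal(size=35)   # pure noise: no real relationship at all

def r_squared(degree):
    fitted = np.polyval(np.polyfit(x, y, degree), x)
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

r2s = [r_squared(d) for d in range(1, 8)]

# r2s is non-decreasing: each added polynomial term can only shrink
# the residuals, even though there is nothing real here to fit.
```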

    ReplyDelete
  3. This is exactly as I suspected. It's a little too close to what I suspected, which makes me concerned we're all wrong.

    ReplyDelete
  4. Don't keep me in suspense, William Nichols, what are these data? I'm intrigued. Unless it is personal, in which case I can live with my curiosity.

    ReplyDelete
  5. Hans Messersmith Time-based financial data. ;-)

    ReplyDelete
  6. Sounds like someone found the add trend function in Excel and played a bit around until the graph looks nice.

    ReplyDelete
  7. Gerrit Reininghaus You are closer than I want to admit.

    ReplyDelete
  8. William Nichols 😋 I have a Master's in math and philosophy of science. But I worked for too long as a strategy and risk consultant in the financial industry not to be guilty of this for real.

    ReplyDelete
  9. Predictive modelling with 35 data points?

    Everything else was just Peanuts adult-talk "mwa-ma-mwaaaa" to me after that. My instinct for anything predictive is that the number of data points in the training set needs at least two generous handfuls more zeroes. I don't have anything but humor and moral support to contribute.

    ReplyDelete
  10. Yes, Tony Lower-Basch: a third order polynomial used to figure out year-end revenue.

    I'll allow it for a while, but under no circumstances will we do it this way next year. Not so long as that "we" includes "me".

    ReplyDelete
  11. It is possible that having some understanding of the "velocity" and "acceleration" of the time series at different points in time is worthwhile (which a higher order polynomial fit of the curve might imply, I guess? Maybe?).

    But if my memory of a distant past of doing some time series analysis is correct, that is better handled through looking at the first and second difference of the series.

    robjhyndman.com - robjhyndman.com/talks/RevolutionR/8-Differencing.pdf
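    A minimal sketch of that differencing idea, with toy numbers (not the real series): the first difference plays the role of "velocity", the second difference of "acceleration".

```python
import numpy as np

# Toy series standing in for the financial data (made-up numbers).
series = np.array([100., 103., 107., 112., 118., 125., 133., 142.])

velocity = np.diff(series)           # first difference: period-to-period change
acceleration = np.diff(series, n=2)  # second difference: change in the change

print(velocity)      # -> [3. 4. 5. 6. 7. 8. 9.]
print(acceleration)  # -> [1. 1. 1. 1. 1. 1.]
```

    Here the constant second difference says the toy series is growing quadratically — no cubic fit required to see it.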

    ReplyDelete
  12. So I graphed it, which apparently no one had ever done.
    I graphed and looked at the following:
    1. Actual
    2. Third order
    3. Just the first-order term from (2), ignoring the 2nd- and 3rd-order terms
    4. A quick linear regression.

    (1) showed a small decline at end of year, (2) showed a large decline. (3) and (4), of course, showed no decline. End of year, (1) was half as far off (call it 10%) as (4), which was nearly 20% too high.

    All are nonsense, of course. My initial hypothesis was that the coefficients on the 2nd- and 3rd-order terms were so low as to not affect the outcome, and I wasn't correct.

    That is, I rejected the null hypothesis.
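    A sketch of that comparison with made-up numbers: in-sample, the cubic is guaranteed to fit at least as well as the line (it has the line as a special case), but extrapolated a few periods past the data, the two can tell very different stories about year-end.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(30, dtype=float)
y = 50.0 + 2.0 * x + rng.normal(0.0, 2.0, size=30)  # steady growth + noise

cubic = np.polyfit(x, y, 3)
linear = np.polyfit(x, y, 1)

# In-sample, the cubic can never fit worse than the line...
cubic_sse = np.sum((y - np.polyval(cubic, x)) ** 2)
linear_sse = np.sum((y - np.polyval(linear, x)) ** 2)

# ...but pushed a few periods past the data, the noise-driven
# higher-order terms start to dominate the forecast.
future = np.arange(30, 36, dtype=float)
cubic_forecast = np.polyval(cubic, future)
linear_forecast = np.polyval(linear, future)
```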

    ReplyDelete
  13. It does make sense to have a "key period" where the slope is increased. I need to consider how to do that.

    ReplyDelete
  14. Mayyyyyyyyyyyyyybe I calculate two different slopes, with the second one only being in use during specific months. I.e., do bad MLR instead of bad third-order polynomials.

    Is that any better?
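    That two-slope idea can be sketched as a single regression with an extra slope term that only switches on during the key months. Toy numbers, and the "key months" (here, Q4) are a hypothetical choice, not from the actual data:

```python
import numpy as np

rng = np.random.default_rng(3)
months = np.arange(24, dtype=float)

# Indicator that is 1 during the hypothetical key months (Oct-Dec).
key = ((months % 12) >= 9).astype(float)

# Simulated revenue: baseline slope 1, extra slope 4 in key months.
revenue = 10.0 + 1.0 * months + 4.0 * key * months + rng.normal(0.0, 1.0, 24)

# Two-slope linear model: intercept, baseline slope, and an extra
# slope that only applies when the indicator is on.
X = np.column_stack([np.ones_like(months), months, key * months])
beta, *_ = np.linalg.lstsq(X, revenue, rcond=None)
# beta recovers roughly [10, 1, 4]: intercept, base slope, extra slope.
```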

    ReplyDelete
  15. William Nichols Are you trying to detect whether something changed during the time series? Or to detect a change in the future? Simple CUSUM methods on time series data can be used for that purpose fairly easily.

    en.wikipedia.org - CUSUM - Wikipedia
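    A minimal one-sided CUSUM sketch with toy numbers: accumulate deviations above a target level (less a slack allowance), flooring at zero. The score hugs zero while the series is stable and climbs steadily once a sustained shift begins.

```python
def cusum(series, target, slack):
    """One-sided upward CUSUM: accumulate deviations above target + slack."""
    s = 0.0
    scores = []
    for value in series:
        s = max(0.0, s + (value - target - slack))
        scores.append(s)
    return scores

# Toy data: stable around 10, then a sustained shift upward.
data = [10.1, 9.8, 10.0, 10.2, 9.9, 12.5, 12.8, 13.1, 12.9]
scores = cusum(data, target=10.0, slack=0.5)
# Scores stay at zero through the stable stretch, then climb
# once the shift starts; crossing a chosen threshold signals a change.
```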

    ReplyDelete
  16. I'll run it by the data scientist spouse when he gets home, but only 35 data points already automatically looks like bullshit to me, and I only ever seat-of-pants this stuff. 35 data points: trollolol.

    ReplyDelete
  17. (He actually does predictive analysis for a living, if you want serious advice. But based on the kibitzing I've done with him over the quality of data, that's not really a data set.)

    ReplyDelete
  18. My sense is that from the perspective of serious data science, many industry concerns are so trivial as to be insulting, and simultaneously so questionably measured and scoped as to be unanswerable.

    It's like asking a theoretical mathematician "Could you invent a system of arithmetic in which ninety degrees isn't a right angle?"

    ReplyDelete
  19. Hans Messersmith the business question is: how much money will we make?

    ReplyDelete
  20. I will admit here that I have already built parameters for business risk with fewer than 30 data points of quarterly earnings. For a bank calling itself the largest in its country. You need to come up with a figure in the end - what else can you do?

    ReplyDelete
  21. His comment: "That R-squared is suspiciously high. Amazingly." With the eyebrow raise of "I'm being polite in my phrasing."

    Though he points out that the data he gets tends to be messy real world data, and often there are gaps in the domain knowledge.

    Also: "This is why reproducible research is a thing: you share the data set and the code so other people can see how you achieved that result."

    He's had to use that magnitude of predictors in the worst case (I think from someone who had retained only monthly summary reports for historical data) but his models were in the 70s-80s.... Not intended to be precise but better than a WAG. (Wild-Ass Guess.)

    ReplyDelete
  22. Tony Lower-Basch His comment on that: "Yup." One of the things he does is improve measurement if it's worth it, but it's not always worth the additional cost of measurement.

    ReplyDelete