Is it meaningful to examine plots of the residuals with respect to the dependent variable?


I would like to know whether it makes sense to examine plots of the residuals with respect to the dependent variable when I have a univariate regression. If it does make sense, what does a strong, linear, increasing correlation between the residuals (on the y-axis) and the estimated values of the dependent variable (on the x-axis) mean?

(plot: residuals against the dependent variable, showing a strong linear upward trend)

Luigi
I'm not sure what you mean by a "strong, linear, increasing correlation". Can you show the plot? It is quite reasonable to plot residuals against fitted values. In general, you want there to be no relationship: a flat horizontal line running through the middle. In addition, the vertical spread of the residuals should be constant from the left side of your plot to the right.
Gung - Reinstate Monica
Hi. Thank you for your answer. This is the plot: img100.imageshack.us/img100/7414/bwages.png
Luigi
That's puzzling. Let me make sure I understand: you fitted a regression model and then plotted the residuals against the fitted values, and that is what you got, correct? It shouldn't look like that. Can you edit your question and include the code you used for the model and the plot?
Gung - Reinstate Monica
You understood correctly. I'm sorry, but I don't know how to retrieve the code. I ran the regression and plotted the residuals with the program Gretl.
Luigi
I didn't see @mark999's comment at first when I wrote my answer below. I think his suspicion is right that these are residuals against y values. Luigi, redo your plot; don't try to interpret it if you are mistaken about which variables are involved.
Michael Bishop

Answers:


Suppose you have the regression $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\beta_1 \approx 0$. Then $y_i - \beta_0 \approx \epsilon_i$: the higher the value of $y$, the bigger the residual. By contrast, a plot of the residuals against $x$ should show no systematic relationship. Also, the predicted value $\hat{y}_i$ should be approximately $\hat{\beta}_0$, the same for every observation. If all the predicted values are roughly equal, they should be uncorrelated with the errors.

The plot tells me that $x$ and $y$ are essentially unrelated (there are, of course, better ways to show this). Let us know if your coefficient $\hat{\beta}_1$ is not close to 0.

For a better diagnostic, plot the residuals against the predicted wage or against $x$. You should not see any discernible pattern in these plots.

If you would like a small R demonstration, here you go:

y      <- rnorm(100, 0, 5)          # outcome, unrelated to x by construction
x      <- rnorm(100, 0, 2)          # predictor
fit    <- lm(y ~ x)                 # fit the regression once
res    <- fit$residuals
fitted <- fit$fitted.values
plot(y, res)       # strong upward trend: residuals track y
plot(x, res)       # no systematic pattern
plot(fitted, res)  # no systematic pattern
Charlie

Assuming the estimated model is correctly specified...

$P_X = X(X'X)^{-1}X'$ is a projection matrix, so $P_X^2 = P_X$ and $P_X' = P_X$.

$$\operatorname{Cov}(\hat{Y}, \hat{e}) = \operatorname{Cov}(P_X Y, (I - P_X)Y) = P_X \operatorname{Cov}(Y, Y)(I - P_X)' = \sigma^2 P_X (I - P_X) = 0.$$

So a scatter-plot of the residuals against the predicted dependent variable should show no correlation.

But!

$$\operatorname{Cov}(Y, \hat{e}) = \operatorname{Cov}(Y, (I - P_X)Y) = \operatorname{Cov}(Y, Y)(I - P_X)' = \sigma^2 (I - P_X).$$

The matrix $I - P_X$ is itself a projection matrix, so its eigenvalues are 0 or 1 and $\sigma^2(I - P_X)$ is positive semidefinite; in particular, it has non-negative values on the diagonal. So a scatter-plot of the residuals against the original dependent variable should show positive correlation.

As far as I know, Gretl by default produces the graph of residuals against the original dependent variable (not the predicted one!).
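As a quick numerical check of the algebra above, here is a minimal R sketch (simulated data, arbitrary parameters):

set.seed(1)
x   <- rnorm(200)
y   <- 1 + 0.5*x + rnorm(200, sd=2)   # weak signal relative to the noise
fit <- lm(y ~ x)
cor(fitted(fit), resid(fit))          # numerically zero: residuals vs. fitted
cor(y, resid(fit))                    # clearly positive: residuals vs. observed y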

Roah
I appreciate the different possibility. This is where some knowledge of Gretl is helpful. I wonder, however, how plausible it is that this is the real answer. Using my simulated data, I correlated and plotted residuals vs. the original DV; r = .22 and the plot looks a lot like my 3rd plot, not the question plot. Of course, I worked up those data to check the plausibility of my story; they may not be appropriate to check yours.
gung - Reinstate Monica
@gung what do you mean you used your simulated data?
Michael Bishop
@MichaelBishop if you look at my answer, you see that I simulated data to try out my story to see if it would look like the posted plot. My code and plots are presented. Since I specified the seed, it is reproducible by anyone with access to R.
gung - Reinstate Monica

Is it possible you are confusing fitted/predicted values with the actual values?

As @gung and @biostat have said, you hope there is no relationship between fitted values and residuals. On the other hand, finding a linear relationship between the actual values of the dependent/outcome variable and the residuals is to be expected and is not particularly informative.

Added to clarify the previous sentence: it is not just any linear relationship between residuals and actual values of the outcome that is to be expected. For low measured values of Y, the predicted values of Y from a useful model will tend to be higher than the actual measured values, and vice versa.
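To illustrate, a minimal R sketch with simulated data (for an OLS fit with an intercept, the in-sample correlation between the residuals and the observed Y is sqrt(1 - R^2), so it is large exactly when the model explains little):

set.seed(42)
x <- rnorm(500)
y <- 0.3*x + rnorm(500)            # x explains only a small share of the variance
m <- lm(y ~ x)
plot(y, resid(m))                  # clear upward trend, as described above
plot(fitted(m), resid(m))          # no trend
cor(y, resid(m))                   # matches sqrt(1 - summary(m)$r.squared)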

Michael Bishop
The implication of what you're saying is that, if values are consistently underpredicted at low values of Y, and consistently overpredicted at high values of Y, that's OK. That's a problem, right?
rolando2
@rolando2, I have not implied what you say I've implied, though perhaps I should clarify my answer. As you said, consistently underpredicting at low values of Y and overpredicting at high values of Y would be a sign of a very bad model. I imagined the opposite: overpredicting at low values of Y and underpredicting at high values of Y. This phenomenon is common, and is to be expected roughly in proportion to how much of the variance in the dependent variable you are able to explain. Imagine you lack any variables which predict Y, so you always use the mean as your prediction.
Michael Bishop
What you've said makes sense to me, except for one thing: I'm having trouble imagining that a trend as strong as the one Luigi has shown would ever show up in a sound or desirable solution, even if the trend went from upper left to lower right.
rolando2
@rolando2, residuals are typically defined as observed minus fitted, so negative residuals are over-predictions. In a properly specified model with little explanatory power (I'm a social scientist, so I see these all the time) there will be a strong positive relationship between residuals and the observed outcome values. If this is a residuals vs. actual plot, then a trend from upper left to lower right would be the signal of the badly mis-specified model you initially worried about.
Michael Bishop
Ok, my fault. As Michael Bishop and Roah wrote, Gretl plots residuals with respect to the observed y, not the predicted one. I'm very sorry for all this mess; I really didn't expect all these answers. I'm a beginner and I made this error, so I hope you can "forgive" me. Anyway, I think this should tell me that I should have used more explanatory variables. Thanks to all!
Luigi

The answers offered are giving me some ideas about what's going on here. I do believe there may have been some mistakes made by accident. See if the following story makes sense: To start, I think there is probably a strong relationship between X & Y in the data (here's some code and a plot):

set.seed(5)
wage <- rlnorm(1000, meanlog=2.3, sdlog=.5)              # X: log-normal "wage"
something_else <- .7*wage + rnorm(1000, mean=0, sd=1)    # Y: strongly related to X
plot(wage, something_else, pch=3, col="red", main="Plot X vs. Y")

(plot: "Plot X vs. Y")

But by mistake, Y was predicted just from the mean. Compounding this, the residuals from the mean-only model are plotted against X, even though what was intended was to plot against the fitted values (code & plot):

meanModel <- lm(something_else~1)   # intercept-only ("mean only") model
windows()                           # open a new plot device (on Windows)
plot(wage, meanModel$residuals, pch=3, col="red",
    main="Plot of residuals from Mean only Model against X")
abline(h=0, lty="dotted")

(plot: "Plot of residuals from Mean only Model against X")

We can fix this by fitting the appropriate model and plotting the residuals from that (code & plot):

appropriateModel <- lm(something_else~wage)
windows()
plot(appropriateModel$fitted.values, appropriateModel$residuals, pch=3, col="red",
    main="Plot of residuals from the appropriate\nmodel against fitted values")
lines(lowess(appropriateModel$residuals~appropriateModel$fitted.values))  # add a lowess smooth

(plot: "Plot of residuals from the appropriate model against fitted values")

This seems like just the kinds of goof-ups I made when I was starting.

gung - Reinstate Monica

This graph indicates that the model you fitted is not good. As @gung said in the first comment on the main question, there should be no relationship between the predicted response and the residuals.

"An analyst should expect a regression model to err in predicting a response in a random fashion; the model should predict values higher than actual and lower than actual with equal probability."

I would recommend first plotting the response vs. the independent variable to see the relationship between them. It might be reasonable to add polynomial terms to the model.
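For example, a minimal R sketch along these lines (simulated data with a hypothetical curved relationship):

set.seed(7)
x <- runif(200, 0, 10)
y <- 2 + 0.5*x - 0.04*x^2 + rnorm(200)   # mildly curved relationship
plot(x, y)                               # inspect the raw relationship first
fit_lin  <- lm(y ~ x)
fit_poly <- lm(y ~ poly(x, 2))           # add a quadratic term
anova(fit_lin, fit_poly)                 # is the quadratic term worthwhile?
plot(fitted(fit_poly), resid(fit_poly))  # re-check the residuals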

Biostat

Isn't this what happens if there is no relationship between the X and Y variables? From looking at this graph, it appears you are essentially predicting Y with its mean.
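A minimal R sketch of that situation (simulated data, so the numbers are illustrative only):

set.seed(3)
y <- rnorm(200, mean=10, sd=4)
x <- rnorm(200)                   # unrelated to y
m <- lm(y ~ x)
range(fitted(m))                  # a narrow band around mean(y)
plot(y, resid(m))                 # near-perfect diagonal, like the plot in the question
abline(-mean(y), 1, lty=2)        # residuals are approximately y - mean(y)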

Adam

I think OP plotted residuals vs. the original response variable (not the fitted response variable from the model). I see plots like this all the time, with nearly the same exact pattern. Make sure you plot residuals vs. fitted values, as I'm not sure what meaningful inference you could gather from residuals vs. original Y. But I could certainly be wrong.

Todai