Discussion:
NaN on linear regression with many categorical variables
Elisa Pieri
2018-03-15 15:13:11 UTC
Hello,

I'm using PSPP (psppire 0.8.5) on Linux Mint 18.3.

Premise: I'm a big newbie in statistical analysis, so please be patient :)

I have a data set with 23 categorical variables (binary values 0/1) and a
continuous variable. I would like to calculate linear regression, using the
continuous variable as the dependent one, to understand which ones have the
strongest impact.

The syntax that I'm using is:

REGRESSION
/VARIABLES= GLU4 HIS8 HIS21 GLU36 ASP57 LYS60 GLU62 HIS69 ASP75
LYS96 LYS97 ASP98 ASP120 GLU123 ASP125 LYS153 GLU160 ASP166 LYS167 ASP198
ASP217 HIS219 ASP226
/DEPENDENT= Energy
/STATISTICS=COEFF R ANOVA.

When I use fewer than 10 variables, the analysis works, but when I
use all of them I get a lot of NaN values:

Model Summary (Energy)
#====#========#=================#==========================#
#  R #R Square|Adjusted R Square|Std. Error of the Estimate#
##===#========#=================#==========================#
#|NaN#     NaN|              NaN|                       NaN#
##===#========#=================#==========================#

ANOVA (Energy)
#===========#==============#=====#===========#===#====#
#           #Sum of Squares|  df |Mean Square| F |Sig.#
##==========#==============#=====#===========#===#====#
#|Regression#           NaN|   23|        NaN|NaN| NaN#
#|Residual  #           NaN|39976|        NaN|   |    #
#|Total     #        499,89|39999|           |   |    #
##==========#==============#=====#===========#===#====#

Coefficients (Energy)
#===========#============================#=========================#===#====#
#           # Unstandardized Coefficients|Standardized Coefficients|   |    #
#|          #-----------+----------------+-------------------------+   |    #
#|          #     B     |   Std. Error   |           Beta          | t |Sig.#
##==========#===========#================#=========================#===#====#
#|(Constant)#        NaN|             NaN|                      ,00|NaN| NaN#
#|GLU4      #        NaN|             NaN|                      NaN|NaN| NaN#
#|HIS8      #        NaN|             NaN|                      NaN|NaN| NaN#
#|HIS21     #        NaN|             NaN|                      NaN|NaN| NaN#
#|GLU36     #        NaN|             NaN|                      NaN|NaN| NaN#
#|ASP57     #        NaN|             NaN|                      NaN|NaN| NaN#
#|LYS60     #        NaN|             NaN|                      NaN|NaN| NaN#
#|GLU62     #        NaN|             NaN|                      NaN|NaN| NaN#
#|HIS69     #        NaN|             NaN|                      NaN|NaN| NaN#
#|ASP75     #        NaN|             NaN|                      NaN|NaN| NaN#
#|LYS96     #       -,01|             NaN|                     -,01|NaN| NaN#
#|LYS97     #        ,50|             NaN|                      ,40|NaN| NaN#
#|ASP98     #        ,00|             NaN|                     -,01|NaN| NaN#
#|ASP120    #        ,12|             NaN|                      ,01|NaN| NaN#
#|GLU123    #        ,02|             NaN|                      ,04|NaN| NaN#
#|ASP125    #        ,00|             NaN|                     -,01|NaN| NaN#
#|LYS153    #        ,00|             NaN|                      ,01|NaN| NaN#
#|GLU160    #       -,02|             NaN|                     -,01|NaN| NaN#
#|ASP166    #       -,02|             NaN|                      ,00|NaN| NaN#
#|LYS167    #        ,00|             NaN|                      ,00|NaN| NaN#
#|ASP198    #        ,00|             NaN|                      ,00|NaN| NaN#
#|ASP217    #       -,04|             NaN|                     -,11|NaN| NaN#
#|HIS219    #        ,02|             NaN|                      ,08|NaN| NaN#
#|ASP226    #        ,00|             NaN|                      ,00|NaN| NaN#
##==========#===========#================#=========================#===#====#

Is there a kind soul amongst you that would explain to me what is going on?
Thank you very much in advance.

Elisa
Dr. Walter Statistics
2018-03-15 15:36:13 UTC
Dear Ms Pieri,

without checking your data set, it is hard to say definitively why you got
these results in PSPP. My first guess is that the number of variables in
the analysis leads to multicollinearity (the set of variables is linearly
dependent, or almost linearly dependent) and/or to a low ratio of
cases to variables. At least, when I ran an analysis with a
multicollinear set of variables, PSPP printed NaN for the standard errors, t
values and significance levels of some variables. The problem
disappeared when I removed from the analysis the subset of variables
that was linearly dependent on the others.
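
To illustrate what linear dependence means here, the following is a minimal sketch in Python (not PSPP syntax, and not how PSPP works internally). It computes the rank of a small design matrix by exact Gaussian elimination. The toy data and the helper name matrix_rank are invented for illustration: the fourth column equals the intercept minus the second column, so the four columns only have rank 3 and the regression coefficients cannot be determined uniquely.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix, computed by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column adds no new information
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Invented toy design matrix: intercept, x1, x2, x3 where x3 = 1 - x1.
design = [
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]

rank_full = matrix_rank(design)                          # all four columns
rank_reduced = matrix_rank([row[:3] for row in design])  # x3 dropped

print("columns:", len(design[0]), "rank:", rank_full)
print("columns:", 3, "rank after dropping x3:", rank_reduced)
```

When the rank is smaller than the number of columns, at least one variable is redundant. Dropping it, as described above, restores a full-rank matrix.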

Kind regards from Germany,

Dr. Oliver Walter
_______________________________________________
Pspp-users mailing list
https://lists.gnu.org/mailman/listinfo/pspp-users
--
Dr. Walter Statistics
Gabelsberger Straße 27
24148 Kiel
Tel.: 0431/7802809
E-Mail: ***@walter-statistics.com
https://www.walter-statistics.com
John Darrington
2018-03-15 16:32:44 UTC
On Thu, Mar 15, 2018 at 04:13:11PM +0100, Elisa Pieri wrote:
> Hello,
>
> I'm using PSPP (psppire 0.8.5) on Linux Mint 18.3.

This version is very old and there have been many fixes to the REGRESSION procedure
in recent releases.

I suggest that you upgrade.

J'
--
Avoid eavesdropping. Send strong encrypted email.
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://sks-keyservers.net or any PGP keyserver for public key.
Elisa Pieri
2018-03-15 16:50:50 UTC
I updated to 0.10.2, but I still get the same results :(

John Darrington
2018-03-15 17:58:17 UTC
Well, at least that rules out any known problems with PSPP.

I suggest that your next step be to run DESCRIPTIVES on that same set of variables
(both the dependent and the independent ones) and see whether anything interesting shows up.
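
As a rough picture of what to look for in such a summary, here is a minimal sketch in Python rather than PSPP syntax. The helper names describe and constant_columns and the toy values are invented for illustration and are not taken from the thread. The point is that a variable whose minimum equals its maximum never varies, so it carries no information and duplicates the intercept.

```python
from statistics import mean, stdev


def describe(columns):
    """Map each column name to (mean, sample std. dev., min, max)."""
    return {
        name: (mean(values), stdev(values), min(values), max(values))
        for name, values in columns.items()
    }


def constant_columns(columns):
    """Names of columns that take the same value in every case."""
    return [name for name, values in columns.items() if min(values) == max(values)]


# Invented toy data: two 0/1 predictors and one continuous outcome.
toy = {
    "X1": [0, 1, 0, 1, 1],
    "X2": [0, 0, 0, 0, 0],  # never varies
    "Y": [0.3, 1.2, 0.1, 0.9, 1.1],
}

summary = describe(toy)
flagged = constant_columns(toy)

for name, stats in summary.items():
    print(name, stats)
print("constant columns:", flagged)
```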

J'

ftr public
2018-03-21 16:56:03 UTC
Hi,

as you are a newbie, a first question: have you done a standard multiple
regression analysis with a continuous dependent variable and several continuous
independent variables before, so that you know what you get in the output
window and how to interpret it?

A first step is data cleaning. You should at least do the following:
You say you have two values for each independent variable, 0 and 1. Are the
zeros valid, or are they declared missing? They must be valid.
Did you exclude cases that have nothing but 1 or nothing but 0 values
across all variables (i.e. respondents who did not play the game)?
Once you have excluded those respondents, you should compute the correlations for
all cases and variables to see whether any variables have a
correlation of 1, i.e. they are multicollinear: they explain exactly
the same thing. If so, exclude one of the two.
How many cases do you have, and how many independent variables?
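
The correlation check in the steps above can be sketched as follows. This is a minimal Python illustration, not PSPP syntax. The function names, the 0.999 threshold and the toy data are assumptions made for the example. It flags pairs of variables whose Pearson correlation is (almost) exactly +1 or -1.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation, or None if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    return sxy / sqrt(sxx * syy)


def collinear_pairs(columns, threshold=0.999):
    """Pairs of column names with |r| at or above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if r is not None and abs(r) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Invented toy data: Q is the exact complement of P (Q = 1 - P).
toy = {
    "P": [0, 1, 0, 1, 1, 0],
    "Q": [1, 0, 1, 0, 0, 1],
    "R": [0, 0, 1, 1, 0, 1],
}

redundant = collinear_pairs(toy)
print(redundant)
```

Note that a pairwise check like this only finds variables that duplicate one other variable. A variable that is a combination of three or more others can still make the set linearly dependent without any pair showing a correlation of 1.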

I recommend reading
Barbara G. Tabachnick and Linda S. Fidell: Using Multivariate
Statistics. Pearson.
Joseph Hair et al.: Multivariate Data Analysis.
I bought the 4th edition, with readings. You can get a second-hand copy
at Alibris for 2.14 € plus postage.
You may also buy the 7th and latest international edition for about
27 € plus postage.

HTH as a starter.

- ftr