Title: | Tools for an Introductory Class in Regression and Modeling |
---|---|
Description: | Contains basic tools for visualizing, interpreting, and building regression models. It has been designed for use with the book Introduction to Regression and Modeling with R by Adam Petrie, Cognella Publishers, ISBN: 978-1-63189-250-9 <https://titles.cognella.com/introduction-to-regression-and-modeling-with-r-9781631892509>. |
Authors: | Adam Petrie |
Maintainer: | Adam Petrie <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.6 |
Built: | 2025-01-20 05:25:58 UTC |
Source: | https://github.com/cran/regclass |
Customers of a bank were marketed a new type of account. The goal is to model which factors are associated with the probability of opening the account, in order to tune the marketing strategy.
data("ACCOUNT")
data("ACCOUNT")
A data frame with 24242 observations on the following 8 variables.
Purchase
a factor with levels No and Yes
Tenure
a numeric vector, the number of years the customer has been with the bank
CheckingBalance
a numeric vector, amount currently held in checking (may be negative if overdrafted)
SavingBalance
a numeric vector, amount currently held in savings (0 or larger)
Income
a numeric vector, yearly income in thousands of dollars
Homeowner
a factor with levels No and Yes
Age
a numeric vector
Area.Classification
a factor with levels R, S, and U for rural, suburban, or urban
Who is more likely to open the new type of account the bank wants to sell its customers? Try logistic regression or partition models to see if you can develop a model that accurately classifies purchasers vs. non-purchasers, or try to develop a model that does well in promoting to nearly all customers who would buy the account.
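As one illustrative starting point, here is a minimal sketch (not part of the original documentation; the predictors are chosen arbitrarily):

data(ACCOUNT)
#Logistic regression of Purchase on a few candidate predictors
M <- glm(Purchase~Tenure+SavingBalance+Income,data=ACCOUNT,family=binomial)
summary(M)
#Confusion matrix shows how well purchasers vs. non-purchasers are classified
confusion_matrix(M)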
This function gives a list of all pairwise correlations between quantitative variables in a dataframe. Alternatively, it can provide all pairwise correlations with just a particular variable.
all_correlations(X,type="pearson",interest=NA,sorted="none")
X |
A data frame |
type |
Either "pearson" (the default) or "spearman", specifying the type of correlation to compute |
interest |
If specified, returns only pairwise correlations with this variable. Argument should be in quotes and must give the exact name of the column of the variable of interest. |
sorted |
Either "none" (the default) or "significance"; with "significance", results are sorted by p-value |
This function filters out any non-numerical variables in the data frame and provides correlations only between quantitative variables. It is useful for quickly glancing at the size of the correlations between many pairs of variables or all correlations with a particular variable. Further analysis should be done on pairs of interest using associate.
Note: if Spearman's rank correlations are computed, warning messages will result indicating that the exact p-value cannot be computed with ties. Running associate will give you an approximate p-value using the permutation procedure.
Adam Petrie
Introduction to Regression and Modeling
#all pairwise (Pearson) correlations between all quantitative variables
data(STUDENT)
all_correlations(STUDENT)
#Spearman correlations between all quantitative variables and CollegeGPA,
#sorted by p-value. Gives warnings due to ties
all_correlations(STUDENT,interest="CollegeGPA",type="spearman",sorted="significance")
Appliance shipments from 1960 to 1985
data("APPLIANCE")
data("APPLIANCE")
A data frame with 26 observations on the following 7 variables.
Year
a numeric vector
Dishwasher
a numeric vector, Factory shipments (domestic) of dishwashers (thousands)
Disposal
a numeric vector, Factory shipments (domestic) of disposers (thousands)
Refrigerator
a numeric vector, Factory shipments (domestic) of refrigerators (thousands)
Washer
a numeric vector, Factory shipments (domestic) of washing machines (thousands)
DurableGoodsExp
a numeric vector, Durable goods expenditures (billions of 1972 dollars)
PrivateResInvest
a numeric vector, Private residential investment (billions of 1972 dollars)
From the (former) Data and Story library.
The file gives unit shipments of dishwashers, disposers, refrigerators, and washers in the United States from 1960 to 1985. This and other data are published currently in the Department of Commerce's Survey of Current Business, and are summarized from time to time in their publication, Business Statistics. Also included in the file are durable goods expenditures and private residential investment in the United States.
This function takes two variables and computes relevant numerical measures of association. The p-values of the associations are estimated via permutation tests. Diagnostic plots are provided as well, with optional arguments that allow for classic tests.
associate(formula, data, permutations = 500, seed=NA, plot = TRUE, classic = FALSE, cex.leg=0.7, n.levels=NA,prompt=TRUE,color=TRUE,...)
formula |
A standard R formula written as y~x, where y is the name of the variable playing the role of y and x is the name of the variable playing the role of x. |
data |
An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment. |
permutations |
The number of permutations for Monte Carlo estimation of the p-value. If 0, function defaults to reporting classic results. |
seed |
An optional argument specifying the random number seed for permutations. |
plot |
If TRUE (the default), relevant plots of the data and the permutation distribution are displayed |
classic |
If TRUE, classic tests of association and plots for checking their assumptions are provided |
cex.leg |
Scale factor for the size of legends in plots. Larger values make legends bigger. |
n.levels |
An optional argument of interest only when y is categorical and x is quantitative. It specifies the number of levels when converting x to a categorical variable during the analysis. Each level will have the same number of cases. If this does not work out evenly, some levels are randomly picked to have one more case than the others. If unspecified, the default is to pick the number of levels so that there are 10 cases per level or a maximum of 6 levels (whichever is smaller). |
prompt |
If TRUE (the default), the user is prompted before additional plots are displayed |
color |
If TRUE (the default), plots are drawn in color |
... |
Additional arguments related to plotting, e.g., pch, lty, lwd |
This function uses Monte Carlo simulation (permutation procedure) to approximate the p-value of an association. Only complete cases are considered in the analysis.
Valid formulas may include functions of the variable, e.g. y^2, log10(x), or more complicated functions like I(x1/(x2+x3)). In the latter case, I() must surround the function of interest to be computed correctly.
When both x and y are quantitative variables, an analysis of Pearson's correlation and Spearman's rank correlation is provided. Scatterplots and histograms of the variables are provided. If classic is TRUE, the QQ-plots of the variables are provided along with tests of assumptions.
When x is categorical and y is quantitative, the averages (as well as mean ranks and medians) of y are compared between levels of x. The "discrepancy" is the F statistic for averages, the Kruskal-Wallis statistic for mean ranks, and the chi-squared statistic for the median test. Side-by-side boxplots are also provided. If classic is TRUE, the QQ-plots of the distribution of y for each level of x are provided.
When x is quantitative and y is categorical, x is converted to a categorical variable with n.levels levels containing equal numbers of cases. A chi-squared test is performed for the association. The classic approach fits a multinomial logistic regression to check significance. A mosaic plot showing the distribution of y for each induced level of x is provided, as well as a probability "curve". If classic is TRUE, the multinomial logistic curves for each level are plotted versus x.
When both x and y are categorical, a chi-squared test is performed. The contingency table, table of expected counts, and conditional distributions are also reported along with a mosaic plot.
If the permutation procedure is used, the sampling distribution of the measure of association over the requested number of permutations is displayed along with the observed value on the actual data (except when y is categorical and x is quantitative).
If classic results are desired, then plots and tests to check assumptions are supplied. white.test from package bstats (version 1.1-11-5) and mshapiro.test from package mvnormtest (version 0.1-9) are built into the function to avoid directly referencing those libraries (which sometimes causes problems).
Adam Petrie
Introduction to Regression and Modeling
lm, glm, anova, cor, chisq.test, vglm
#Two quantitative variables
data(SALARY)
associate(Salary~Education,data=SALARY,permutations=1000)
#y is quantitative while x is categorical
data(SURVEY11)
associate(X07.GPA~X40.FavAlcohol,data=SURVEY11,permutations=0,classic=TRUE)
#y is categorical while x is quantitative
data(WINE)
associate(Quality~alcohol,data=WINE,classic=TRUE,n.levels=5)
#Two categorical variables (many cases, turns off prompt asking for user input)
data(ACCOUNT)
set.seed(320)
#Work with a smaller subset
SUBSET <- ACCOUNT[sample(nrow(ACCOUNT),1000),]
associate(Purchase~Area.Classification,data=SUBSET,classic=TRUE,prompt=FALSE)
The average attractiveness scores of 70 females along with physical attributes
data("ATTRACTF")
data("ATTRACTF")
A data frame with 70 observations on the following 21 variables.
Score
a numeric vector giving the average attractiveness score compiled after 100 student ratings
Actual.Sexuality
a factor with levels Gay and Straight indicating the self-reported sexuality of the person in the picture
ApparentRace
a factor with levels black, other, and white indicating the consensus regarding the apparent race of the person
Chin
a factor with levels pointed and rounded indicating the consensus regarding the shape of the person's chin
Cleavage
a factor with levels no and yes indicating the consensus regarding whether the pictured woman was prominently displaying cleavage
ClothingStyle
a factor with levels conservative and revealing indicating the consensus regarding how the woman was dressed
FaceSymmetryScore
a numeric vector indicating the number of people (out of 2) who agreed the woman's face was symmetric
FashionScore
a numeric vector indicating the number of people (out of 4) who agreed the woman was fashionable
FitnessScore
a numeric vector indicating the number of people (out of 4) who agreed the woman was physically fit
GayScore
a numeric vector indicating the number of people (out of 16) who agreed the woman was a lesbian
Glasses
a factor with levels Glasses and No Glasses
GroomedScore
a numeric vector indicating the number of people (out of 4) who agreed the woman made a noticeable effort to look nice
HairColor
a factor with levels dark and light indicating the consensus regarding the woman's hair color
HairstyleUniquess
a numeric vector indicating the number of people (out of 2) who agreed the woman had an unconventional haircut
HappinessRating
a numeric vector indicating the number of people (out of 2) who agreed the woman looked happy in her photo
LookingAtCamera
a factor with levels no and yes
MakeupScore
a numeric vector indicating the number of people (out of 5) who agreed the woman was wearing a noticeable amount of makeup
NoseOddScore
a numeric vector indicating the number of people (out of 3) who agreed the woman had an unusually shaped nose
Selfie
a factor with levels no and yes
SkinClearScore
a numeric vector indicating the number of people (out of 2) who agreed the woman's complexion was clear
Smile
a factor with levels no and yes
Students were asked to rate on a scale of 1 (very unattractive) to 5 (very attractive) the attractiveness of 70 college-aged women who had posted their photos on a dating website. Of the nearly 100 respondents, most were straight males. Score represents the average of these ratings.
In a separate survey, students (of both genders) were asked to rate characteristics of the woman by answering the questions: what is her race, is she displaying her cleavage prominently, is she a lesbian, is she physically fit, etc. The variables ending "Score" represent the number of students who answered Yes to the question. Other variables (such as Selfie, Smile) represent the consensus among the students. The only attribute taken from the woman's profile was Actual.Sexuality.
Students in BAS 320 at the University of Tennessee from 2013-2015.
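As a hedged illustration (not part of the original documentation), one might screen attributes with all_correlations and then fit a small model; the predictors chosen here are arbitrary:

data(ATTRACTF)
#Quantitative attributes most associated with Score, sorted by p-value
all_correlations(ATTRACTF,interest="Score",sorted="significance")
#A small illustrative model
M <- lm(Score~FitnessScore+GroomedScore+HappinessRating,data=ATTRACTF)
summary(M)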
The average attractiveness scores of 70 males along with physical attributes
data("ATTRACTM")
data("ATTRACTM")
A data frame with 70 observations on the following 23 variables.
Score
a numeric vector giving the average attractiveness score compiled after 60 student ratings
Actual.Sexuality
a factor with levels Gay and Straight indicating the self-reported sexuality of the person in the picture
ApparentRace
a factor with levels black, other, and white indicating the consensus regarding the apparent race of the person
Chin
a factor with levels pointed and rounded indicating the consensus regarding the shape of the person's chin
ClothingStyle
a factor with levels conservative and revealing indicating the consensus regarding how the man was dressed
FaceSymmetryScore
a numeric vector indicating the number of people (out of 7) who agreed the man's face was symmetric
FacialHair
a factor with levels no and yes indicating the consensus regarding whether the man appeared to maintain facial hair
FashionScore
a numeric vector indicating the number of people (out of 7) who agreed the man was fashionable
FitnessScore
a numeric vector indicating the number of people (out of 8) who agreed the man was physically fit
GayScore
a numeric vector indicating the number of people (out of 16) who agreed the man was gay
Glasses
a factor with levels no and yes
GroomedScore
a numeric vector indicating the number of people (out of 6) who agreed the man made a noticeable effort to look nice
HairColor
a factor with levels dark, light, and unseen indicating the consensus regarding the man's hair color
HairstyleUniquess
a numeric vector indicating the number of people (out of 4) who agreed the man had an unconventional haircut
HappinessRating
a numeric vector indicating the number of people (out of 6) who agreed the man looked happy in his photo
Hat
a factor with levels no and yes
LookingAtCamera
a factor with levels no and yes
NoseOddScore
a numeric vector indicating the number of people (out of 3) who agreed the man had an unusually shaped nose
Piercings
a factor with levels no and yes indicating whether the man had visible piercings
Selfie
a factor with levels no and yes
SkinClearScore
a numeric vector indicating the number of people (out of 2) who agreed the man's complexion was clear
Smile
a factor with levels no and yes
Tattoo
a factor with levels no and yes
Students were asked to rate on a scale of 1 (very unattractive) to 5 (very attractive) the attractiveness of 70 college-aged men who had posted their photos on a dating website. Of the nearly 60 respondents, most were straight females. Score represents the average of these ratings.
In a separate survey, students (of both genders) were asked to rate characteristics of the man by answering the questions: what is his race, how symmetric does his face look, is he gay, is he physically fit, etc. The variables ending "Score" represent the number of students who answered Yes to the question. Other variables (such as Hat, Smile) represent the consensus among the students. The only attribute taken from the man's profile was Actual.Sexuality.
Students in BAS 320 at the University of Tennessee from 2013-2015.
Characteristics of cars from 1991
data("AUTO")
data("AUTO")
A data frame with 82 observations on the following 5 variables.
CabVolume
a numeric vector, cubic feet of cab space
Horsepower
a numeric vector, engine horsepower
FuelEfficiency
a numeric vector, average miles per gallon
TopSpeed
a numeric vector, miles per hour
Weight
a numeric vector, in units of 100 lbs
Although this is a popular dataset, there is some question as to the units of the fuel efficiency. The source claims it to be in miles per gallon, but the numbers reported seem unrealistic. However, the units do not appear to be in km/gallon or km/L.
Data provided by the U.S. Environmental Protection Agency and obtained from the (former) Data and Story library
R.M. Heavenrich, J.D. Murrell, and K.H. Hellman, Light Duty Automotive Technology and Fuel Economy Trends Through 1991, U.S. Environmental Protection Agency, 1991 (EPA/AA/CTAB/91-02)
Popular Bodyfat dataset
data("BODYFAT")
data("BODYFAT")
A data frame with 252 observations on the following 14 variables.
BodyFat
a numeric vector indicating the percentage body fat 0-100
Age
a numeric vector, yrs
Weight
a numeric vector, lbs
Height
a numeric vector, inches
Neck
a numeric vector
Chest
a numeric vector
Abdomen
a numeric vector
Hip
a numeric vector
Thigh
a numeric vector
Knee
a numeric vector
Ankle
a numeric vector
Biceps
a numeric vector
Forearm
a numeric vector
Wrist
a numeric vector
Body fat can be accurately measured by the hydrostatic technique, where someone is submerged in a tank of water. It would be useful to be able to predict body fat from measurements that are simpler to obtain. Unless otherwise specified, all physical measurements are in centimeters.
This is a modified version of the data available in “Fitting Percentage of Body Fat to Simple Body Measurements" as appearing in Journal of Statistics Education v4 n1 (1996). http://www.amstat.org/publications/jse/v4n1/datasets.johnson.html
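A minimal sketch (not from the original documentation) of screening the measurements and fitting a simple model; Abdomen is used purely as an illustrative predictor:

data(BODYFAT)
#Correlations of each measurement with BodyFat, sorted by p-value
all_correlations(BODYFAT,interest="BodyFat",sorted="significance")
#A simple one-predictor model
M <- lm(BodyFat~Abdomen,data=BODYFAT)
summary(M)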
Bodyfat dataset illustrating quirks of statistical significance
data("BODYFAT2")
data("BODYFAT2")
A data frame with 20 observations on the following 4 variables.
Triceps
a numeric vector, cm
Thigh
a numeric vector, cm
Midarm
a numeric vector, cm
BodyFat
a numeric vector, 0-100 representing percent
The physical measurements are circumferences of body parts of 25-34 year-old healthy females.
This is a classic dataset found in many textbooks and in many places online. The original source may be Neter, Kutner, Nachtsheim, and Wasserman, Applied Linear Statistical Models (4th Edition), p. 261.
This function uses bestglm to consider an extensive array of models and makes recommendations on what set of variables is appropriate for the final model. Model hierarchy is not preserved. Interactions and multi-level categorical variables are allowed.
build_model(form,data,type="predictive",Kfold=5,repeats=10, prompt=TRUE,seed=NA,holdout=NA,...)
form |
A model formula giving the most complex model to consider (often predicting y from all variables) |
data |
Name of the data frame that contains all variables specified by form |
type |
Either "predictive" or "descriptive". If |
Kfold |
The number of folds for repeated K-fold cross-validation for predictive model building |
repeats |
The number of repeats for repeated K-fold cross-validation for predictive model building |
seed |
If specified, the random number seed used to initialize the repeated K-fold cross-validation procedure so that results can be reproduced. |
prompt |
If TRUE (the default), the user is asked to confirm before a potentially lengthy calculation begins |
holdout |
An optional dataframe to serve as a holdout sample. The generalization error on the holdout sample will be calculated and displayed for the best model at each number of predictors. |
... |
Additional arguments to bestglm |
This procedure takes the formula specified by form and the original dataframe and simply converts it into a form that bestglm (which normally cannot do cross-validation when categorical variables are involved) can use, by adding in columns to represent interactions and categorical variables.
Once the dataframe has been generated, a warning is given to the user if the procedure may take too long (many rows or many potential predictors), and then bestglm is run. A plot and table of models' performances are given, as well as a recommendation for a final set of variables (the model with the lowest AIC/estimated generalization error, or a simpler model that is more or less equivalent).
The command returns a list with bestformula (the formula of the model with the lowest AIC or the model chosen by the one standard deviation rule), bestmodel (the fitted model that had the lowest AIC or the one chosen by the one standard deviation rule), and predictors (a list giving the predictors that appeared in the best model with 1 predictor, with 2 predictors, etc.).
If a descriptive model is sought, the last component of the returned list is AICtable (a data frame containing the number of predictors and the AIC of the best model with that number of predictors; a * denotes the model with the lowest AIC while a + denotes the simplest model whose AIC is within 2 of the lowest).
If a predictive model is sought, the last component of the returned list is CVtable (a data frame containing the number of predictors and the estimated generalization error of the best model with that number of predictors along with the SD from repeated K-fold cross-validation; a * denotes the model with the lowest error while the + denotes the model selected with the one standard deviation rule). Note that the generalization error in the second column of this table is the squared error if the response is quantitative and is another measure of error (not the misclassification rate) if the response is categorical. Additional columns are provided to give the root mean squared error or misclassification rate.
Note: bestmodel is the one selected by the one standard deviation rule or the simplest one whose AIC is no more than 2 above the model with the lowest AIC. Because the procedure does not respect model hierarchy and can include interactions, the returned formula may not be immediately usable if it involves a categorical variable, since the selected predictors are named the way R names indicator variables. You may have to manually fit the model based on the selected predictors.
If holdout is given, a plot of the error on the holdout sample versus the number of predictors (for the best model at each number of predictors) is provided along with the estimated generalization error from the training set. This can be used to see whether the models generalize well, but in general it is not used to tune which model is selected.
Adam Petrie
Introduction to Regression and Modeling with R
bestglm, regsubsets, see.models, generalization.error.
#Descriptive model. Note: Tip and Bill should not be used simultaneously as
#predictors of TipPercentage, so leave Tip out since it's not known ahead of time
data(TIPS)
MODELS <- build_model(TipPercentage~.-Tip,data=TIPS,type="descriptive")
MODELS$AICtable
MODELS$predictors[[1]] #Variable in best model with a single predictor
MODELS$predictors[[2]] #Variables in best model with two predictors
summary(MODELS$bestmodel) #Summary of best model, in this case with two predictors

#Another descriptive model (large dataset so changing prompt=FALSE for documentation)
data(PURCHASE)
set.seed(320)
#Take a subset of full dataframe for quick illustration
SUBSET <- PURCHASE[sample(nrow(PURCHASE),500),]
MODELS <- build_model(Purchase~.,data=SUBSET,type="descriptive",prompt=FALSE)
MODELS$AICtable #Models with 1 or 2 variables look pretty good
#Predict whether a purchase is made by # of previous visits and distance to store
MODELS$predictors[[2]]

#Predictive model.
data(SALARY)
set.seed(2010)
train.rows <- sample(nrow(SALARY),0.7*nrow(SALARY),replace=TRUE)
TRAIN <- SALARY[train.rows,]
HOLDOUT <- SALARY[-train.rows,]
MODELS <- build_model(Salary~.^2,data=TRAIN,holdout=HOLDOUT)
summary(MODELS$bestmodel)
M <- lm(Salary~Gender+Education:Months,data=TRAIN)
generalization_error(M,HOLDOUT)

#Predictive model for WINE data, takes a while. Misclassification rate on holdout sample is 18%.
data(WINE)
set.seed(2010)
train.rows <- sample(nrow(WINE),0.7*nrow(WINE),replace=TRUE)
TRAIN <- WINE[train.rows,]
HOLDOUT <- WINE[-train.rows,]
## Not run: MODELS <- build_model(Quality~.,data=TRAIN,seed=1919,holdout=HOLDOUT)
## Not run: MODELS$CVtable
A tool to choose the "correct" complexity parameter of a tree
build_tree(form, data, minbucket = 5, seed=NA, holdout, mincp=0)
form |
A formula describing the tree to be built |
data |
Data frame containing the variables to build the tree |
minbucket |
The minimum number of cases allowed in any leaf in the tree |
seed |
If given, specifies the random number seed so the cross-validation error can be reproduced. |
holdout |
If given, the error on the holdout sample is calculated and given in the cp table. |
mincp |
The minimum value of the complexity parameter cp used when growing the tree; the default of 0 grows the tree to its maximum possible extent |
This command combines the action of building a tree to its maximum possible extent using rpart and looking at the results using getcp. A plot of the estimated relative generalization error (as determined by 10-fold cross-validation) versus the number of splits is provided. In addition, the complexity parameter table giving the cp of the tree with the lowest error (and of the simplest tree with an error within one standard deviation of the lowest error) is reported.
If holdout is given, the RMSE/misclassification rate on the training and holdout samples are provided in the cp table.
Adam Petrie
Introduction to Regression and Modeling
data(JUNK)
build_tree(Junk~.,data=JUNK,seed=1337)
data(CENSUS)
build_tree(ResponseRate~.,data=CENSUS,seed=2017,mincp=0.001)
data(OFFENSE)
build_tree(Win~.,data=OFFENSE[1:200,],seed=2029,holdout=OFFENSE[201:352,])
Predicting the sales price of a bulldozer at auction
data("BULLDOZER")
data("BULLDOZER")
A data frame with 924 observations on the following 6 variables.
SalePrice
a numeric vector
YearsAgo
a numeric vector, the number of years ago (before present) that the sale occurred
YearMade
a numeric vector, year of manufacture of machine
Usage
a numeric vector, hours of usage at time of sale
Blade
a numeric vector, width of the bulldozer blade (feet)
Tire
a numeric vector, size of primary tires
The goal is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is a heavily modified version of competition data found on kaggle.com. See the original source for the actual dataset:
https://www.kaggle.com/c/bluebook-for-bulldozers
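For instance (a sketch, not part of the original documentation), choose_order can suggest whether SalePrice warrants a polynomial in YearMade:

data(BULLDOZER)
M <- lm(SalePrice~YearMade,data=BULLDOZER)
#Compare polynomial orders by adjusted R^2 and AICc
choose_order(M,max.order=6)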
The BULLDOZER dataset but with the year the dozer was made as a categorical variable
data("BULLDOZER2")
data("BULLDOZER2")
A data frame with 924 observations on the following 6 variables.
Price
a numeric vector
YearsAgo
a numeric vector
Usage
a numeric vector
Tire
a numeric vector
Decade
a factor with levels "1960s and 1970s", "1980s", "1990s", and "2000s"
BladeSize
a numeric vector
This is the BULLDOZER data except here YearMade has been coded into a four-level categorical variable called Decade.
Summary of students' cell phone providers and relative frequency of dropped calls
data("CALLS")
data("CALLS")
A data frame with 579 observations on the following 2 variables.
Provider
a factor with levels ATT, Sprint, USCellular, and Verizon
DropCallFreq
a factor with levels Occasionally, Often, and Rarely
Data is self-reported by students. The dropped-call frequency is based on individuals' perceptions and not any independent quantitative measure. The data is a subset of SURVEY09.
Student survey from STAT 201, University of Tennessee Knoxville, Fall 2009
Information from the 2010 US Census
data("CENSUS")
data("CENSUS")
A data frame with 3534 observations on the following 39 variables.
ResponseRate
a numeric vector, 0-100 representing the percentage of households in a block group that mailed in the form
Area
a numeric vector, land area in square miles
Urban
a numeric vector, percentage of block group in Urbanized area (50000 or greater)
Suburban
a numeric vector, percentage of block group in an Urban Cluster area (2500 to 49999)
Rural
a numeric vector, percentage of block group in a rural area (outside urbanized areas and urban clusters)
Male
a numeric vector, percentage of males
AgeLess5
a numeric vector, percentage of individuals aged less than 5 years old
Age5to17
a numeric vector
Age18to24
a numeric vector
Age25to44
a numeric vector
Age45to64
a numeric vector
Age65plus
a numeric vector
Hispanics
a numeric vector, percentage of individuals who identify as Hispanic
Whites
a numeric vector, percentage of individuals who identify as white (alone)
Blacks
a numeric vector
NativeAmericans
a numeric vector
Asians
a numeric vector
Hawaiians
a numeric vector
Other
a numeric vector, percentage of individuals who identify as another ethnicity
RelatedHH
a numeric vector, percentage of households where at least 2 members are related by birth, marriage, or adoption; same-sex couple households with no relatives of the householder present are not included
MarriedHH
a numeric vector, percentage of households in which the householder and his or her spouse are listed as members of the same household; does not include same-sex married couples
NoSpouseHH
a numeric vector, percentage of households with no spousal relationship present
FemaleHH
a numeric vector, percentage of households with a female householder and no husband of householder present
AloneHH
a numeric vector, percentage of households where householder is living alone
WithKidHH
a numeric vector, percentage of households which have at least one person under the age of 18
MedianHHIncomeBlock
a numeric vector, median income of households in the block group (from American Community Survey)
MedianHHIncomeCity
a numeric vector, median income of households in the tract
OccupiedUnits
a numeric vector, percentage of housing units that are occupied
RentingHH
a numeric vector, percentage of housing units occupied by renters
HomeownerHH
a numeric vector, percentage of housing units occupied by the owner
MobileHomeUnits
a numeric vector, percentage of housing units that are mobile homes (from American Community Survey)
CrowdedUnits
a numeric vector, percentage of housing units with more than 1 person per room on average
NoPhoneUnits
a numeric vector, percentage of housing units without a landline
NoPlumbingUnits
a numeric vector, percentage of housing units without active plumbing
NewUnits
a numeric vector, percentage of housing units constructed in 2010 or later
Population
a numeric vector, number of people in the block group
NumHH
a numeric vector, number of households in the block group
NumUnits
a numeric vector, number of housing units in the block group
logMedianHouseValue
a numeric vector, the logarithm of the median home value in the block group
The goal is to predict ResponseRate from the other predictors. ResponseRate is the percentage of households in a block group that mailed in the census forms. A block group is on average about 40 blocks, each typically bounded by streets, roads, or water. The number of block groups per county in the US is typically between about 5 and 165 with a median of about 20.
See https://www2.census.gov/programs-surveys/research/guidance/planning-databases/2014/pdb-block-2014-11-20a.pdf for variable definitions.
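A quick illustrative first step (not from the original documentation):

data(CENSUS)
#Screen predictors by their correlation with ResponseRate, sorted by p-value
all_correlations(CENSUS,interest="ResponseRate",sorted="significance")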
A portion of the CENSUS dataset used for illustration
data("CENSUSMLR")
data("CENSUSMLR")
A data frame with 1000 observations on the following 7 variables.
Response
a numeric vector, percentage 0-100 of household that mailed in the census form
Population
a numeric vector, the number of people living in the census block based on 2010 census
ACSPopulation
a numeric vector, the number of people living in the census block based on the American Community Survey
Rural
a numeric vector, the number of people living in a rural area (in that census block)
Males
a numeric vector, the number of males living in the census block
Elderly
a numeric vector, the number of people aged 65+ living in the census block
Hispanic
a numeric vector, the number of people who self-identify as Hispanic in the census block
See the CENSUS data for more information.
Charity data (adapted from a small section of a charity's donor database)
data("CHARITY")
data("CHARITY")
A data frame with 15283 observations on the following 11 variables.
Donate
a factor with levels Donate and No
Homeowner
a factor with levels No and Yes
Gender
a factor with levels F and M
UnlistedPhone
a factor with levels No and Yes
ResponseProportion
a numeric vector giving the fraction of solicitations that resulted in a donation
NumResponses
a numeric vector giving the number of past donations
CardResponseCount
a numeric vector giving the number of past solicitations
MonthsSinceLastResponse
a numeric vector giving the number of months since last response to solicitation (which may have been declining to give)
LastGiftAmount
a numeric vector giving the amount of the last donation
MonthSinceLastGift
a numeric vector giving the number of months since last donation
LogIncome
a numeric vector giving the logarithm of a scaled and normalized yearly income
This dataset is adapted from a real-world database of donors to a charity.
Unknown
If the model is a linear regression, obtain tests of linearity, equal spread, and Normality as well as relevant plots (residuals vs. fitted values, histogram of residuals, QQ plot of residuals, and predictor vs. residuals plots). If the model is a logistic regression model, a goodness of fit test is given.
check_regression(M,extra=FALSE,tests=TRUE,simulations=500,n.cats=10,seed=NA,prompt=TRUE)
M |
A linear regression model produced by lm or a logistic regression model produced by glm |
extra |
If TRUE, additional predictor vs. residuals plots are displayed for linear regression models |
tests |
If TRUE (the default), the statistical tests of assumptions are performed in addition to the plots |
simulations |
The number of artificial samples to generate for estimating the p-value of the goodness of fit test for logistic regression models. These artificial samples are generated assuming the fitted logistic regression is correct. |
n.cats |
Number of (roughly) equal sized categories for the Hosmer-Lemeshow goodness of fit test for logistic regression models |
seed |
If specified, sets the random number seed before generation of artificial samples in the goodness of fit tests for logistic regression models. |
prompt |
For documentation only; if FALSE, the user is not prompted to press Enter between sets of plots |
This function provides standard visual and statistical diagnostics for regression models.
For linear regression, tests of linearity, equal spread, and Normality are performed and residuals plots are generated.
The test for linearity (a goodness of fit test) is an F-test. A simple linear regression model predicting y from x is fit and compared to a model treating each value of the predictor as some level of a categorical variable. If this more sophisticated model does not offer a significant improvement in the sum of squared errors, the linearity assumption in that predictor is reasonable. If the p-value is larger than 0.05, then statistically we can consider the relationship to be linear. If the p-value is smaller than 0.05, check the residuals plot and the predictor vs. residuals plots for signs of obvious curvature (the test can be overly sensitive to inconsequential violations for larger sample sizes). The test can only be run if two or more individuals have a common value of x. A test of the model as a whole is run similarly if at least two individuals have identical combinations of all predictor variables.
Note: if categorical variables, interactions, polynomial terms, etc., are present in the model, the test for linearity is conducted for each term even when it does not necessarily make sense to do so.
The test for equal spread is the Breusch-Pagan test. If the p-value is larger than 0.05, then statistically we can consider the residuals to have equal spread everywhere. If the p-value is smaller than 0.05, check the residuals plot for obvious signs of unequal spread (the test can be overly sensitive to inconsequential violations for larger sample sizes).
The test for Normality is the Shapiro-Wilk test when the sample size is smaller than 5000, or the KS-test for larger sample sizes. If the p-value is larger than 0.05, then statistically we can consider the residuals to be Normally distributed. If the p-value is smaller than 0.05, check the histogram and QQ plot of residuals to look for obvious signs of non-Normality (e.g., skewness or outliers). The test can be overly sensitive to inconsequential violations for larger sample sizes.
The first three plots displayed are the residuals plot (residuals vs. fitted values), histogram of residuals, and QQ plot of residuals. The function gives the option of pressing Enter to display additional predictor vs. residual plots if extra=TRUE, or of terminating by typing 'q' in the console and pressing Enter. If polynomial or interaction terms are present in the model, a plot is provided for each term. If categorical predictors are present, plots are provided for each indicator variable.
For logistic regression, two goodness of fit tests are offered.
Method 1 is a crude test that assumes the fitted logistic regression is correct, then generates an artificial sample according to the predicted probabilities. A chi-squared test is conducted that compares the observed levels to the predicted levels. The test is failed if the p-value is less than 0.05. The test is not sensitive to departures from the logistic curve unless the sample size is very large or the logistic curve is a really bad model.
Method 2 is a Hosmer-Lemeshow type goodness of fit test. The observations are put into 10 groups according to the probability predicted by the logistic regression model. For example, if there were 200 observations, the first group would have the cases with the 20 smallest predicted probabilities, the second group would have the cases with the 20 next smallest probabilities, etc. The number of cases with the level of interest is compared with the expected number given the fitted logistic regression model via a chi-squared test. The test is failed if the p-value is less than 0.05.
Note: for both methods, the p-values of the chi-squared tests are estimated via Monte Carlo simulation instead of any asymptotic results.
Adam Petrie
Introduction to Regression and Modeling
lm, glm, shapiro.test, ks.test, bptest (in package lmtest). The goodness of fit test for logistic regression is further detailed and implemented in package 'rms' using the commands lrm and residuals.
#Simple linear regression where everything looks good
data(FRIEND)
M <- lm(FriendshipPotential~Attractiveness,data=FRIEND)
check_regression(M)

#Multiple linear regression (prompt is FALSE only for documentation)
data(AUTO)
M <- lm(FuelEfficiency~.,data=AUTO)
check_regression(M,extra=TRUE,prompt=FALSE)

#Multiple linear regression with categorical predictors and an interaction
data(TIPS)
M <- lm(TipPercentage~Bill*PartySize*Weekday,data=TIPS)
check_regression(M)

#Multiple linear regression with polynomial term (prompt is FALSE only for documentation)
#Note: in this example only plots are provided
data(BULLDOZER)
M <- lm(SalePrice~.-YearMade+poly(YearMade,2),data=BULLDOZER)
check_regression(M,extra=TRUE,tests=FALSE,prompt=FALSE)

#Simple logistic regression. Use 8 categories since only 8 unique values of Dose
data(POISON)
M <- glm(Outcome~Dose,data=POISON,family=binomial)
check_regression(M,n.cats=8,seed=892)

#Multiple logistic regression
data(WINE)
M <- glm(Quality~.,data=WINE,family=binomial)
check_regression(M,seed=2010)
This function takes a simple linear regression model and displays the adjusted R^2 and AICc for the original model (order 1) and for polynomial models up to a specified maximum order and plots the fitted models.
choose_order(M,max.order=6,sort=FALSE,loc="topleft",...)
M |
A simple linear regression model fitted with lm() |
max.order |
The maximum order of the polynomial model to consider. |
sort |
How to sort the results. If TRUE, "R2", "r2", "r2adj", or "R2adj", results are sorted from highest to lowest adjusted R^2. If "AIC", "aic", "AICC", or "AICc", results are sorted by AICc. |
loc |
Location of the legend. Can also be "top", "topright", "bottomleft", "bottom", "bottomright", "left", "right", "center" |
... |
Additional arguments to plot(), e.g., pch |
The function outputs a table of the order of the polynomial and the corresponding adjusted R^2 and AICc. One strategy for picking the best order is to find the highest value of adjusted R^2, then to choose the smallest order (simplest model) whose adjusted R^2 is within 0.005 of it. Another strategy is to find the lowest value of AICc, then to choose the smallest order whose AICc is no more than 2 higher.
The scatterplot of the data is provided and the fitted models are displayed as well.
Adam Petrie
Introduction to Regression and Modeling
data(BULLDOZER)
M <- lm(SalePrice~YearMade,data=BULLDOZER)
#Unsorted list, messing with plot options to make it look alright
choose_order(M,pch=20,cex=.3)
#Sort by R2adj. A 10th order polynomial is highest, but this seems overly complex
choose_order(M,max.order=10,sort=TRUE)
#Sort by AICc. 4th order is lowest, but 2nd order is simpler and within 2 of lowest
choose_order(M,max.order=10,sort="aic")
Churn data (artificial based on claims similar to real world) from the UCI data repository
data("CHURN")
data("CHURN")
A data frame with 5000 observations on the following 18 variables.
churn
a factor with levels No and Yes
accountlength
a numeric vector
internationalplan
a factor with levels no and yes
voicemailplan
a factor with levels no and yes
numbervmailmessages
a numeric vector
totaldayminutes
a numeric vector
totaldaycalls
a numeric vector
totaldaycharge
a numeric vector
totaleveminutes
a numeric vector
totalevecalls
a numeric vector
totalevecharge
a numeric vector
totalnightminutes
a numeric vector
totalnightcalls
a numeric vector
totalnightcharge
a numeric vector
totalintlminutes
a numeric vector
totalintlcalls
a numeric vector
totalintlcharge
a numeric vector
numbercustomerservicecalls
a numeric vector
This dataset is modified from the one stored at the UCI data repository (namely, the area code and phone number have been deleted). This is artificial data similar to what is found in actual customer profiles. Charges are in dollars.
Though originally on the UCI data repository, actual data was obtained via https://www.sgi.com/tech/mlc/db/
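A minimal sketch (not part of the original documentation) of modeling churn; the predictors are illustrative only:

data(CHURN)
M <- glm(churn~internationalplan+voicemailplan+numbercustomerservicecalls,data=CHURN,family=binomial)
summary(M)
confusion_matrix(M)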
This function takes a categorical variable and combines all levels whose frequencies fall below a user-specified threshold into a single level (named Combined by default).
combine_rare_levels(x,threshold=20,newname="Combined")
x |
a vector of categorical values |
threshold |
levels that appear a total of threshold times or fewer are combined |
newname |
the name given to the new, combined level; defaults to "Combined" |
Returns a list of two objects:
values - The recoded values of the categorical variable. All levels which appeared threshold times or fewer are now known as Combined.
combined - The levels that have been combined together.
If, after being combined, the newname level has threshold or fewer instances, the remaining level that appears least often is combined as well.
Adam Petrie
Introduction to Regression and Modeling
data(EX6.CLICK)
x <- EX6.CLICK[,15]
table(x)
#Combine all levels which appear 700 or fewer times (AA, CC, DD)
y <- combine_rare_levels(x,700)
table( y$values )
#Combine all levels which appear 1350 or fewer times. This forces BB (which
#occurs 2422 times) into the Combined level since the three levels that appear
#fewer than 1350 times do not appear more than 1350 times combined
y <- combine_rare_levels(x,1350)
table( y$values )
This function takes the output of a logistic regression created with glm and returns the confusion matrix.
confusion_matrix(M,DATA=NA)
M |
A logistic regression model created with glm |
DATA |
A data frame on which the confusion matrix will be made. If omitted, the confusion matrix is computed on the data used to fit M |
This function makes classifications on the data used to build a logistic regression model by predicting the "level of interest" (last alphabetically) when the predicted probability exceeds 50%.
Adam Petrie
#On WINE data as a whole
data(WINE)
M <- glm(Quality~.,data=WINE,family=binomial)
confusion_matrix(M)
#Calculate generalization error using training/holdout
set.seed(1010)
train.rows <- sample(nrow(WINE),0.7*nrow(WINE),replace=TRUE)
TRAIN <- WINE[train.rows,]
HOLDOUT <- WINE[-train.rows,]
M <- glm(Quality~.,data=TRAIN,family=binomial)
confusion_matrix(M,HOLDOUT)
#Predicting donation
#Model predicting from recent average gift amount is significant, but its
#classifications are the same as the naive model (majority rules)
data(DONOR)
M.naive <- glm(Donate~1,data=DONOR,family=binomial)
confusion_matrix(M.naive)
M <- glm(Donate~RECENT_AVG_GIFT_AMT,data=DONOR,family=binomial)
confusion_matrix(M)
This function shows the correlation and coefficient of determination as user interactively adds datapoints. Useful for seeing what different values of correlation look like and seeing the effect of outliers.
cor_demo(cex.leg=0.8)
cex.leg |
A number specifying the magnification of legends inside the plot. Smaller numbers mean smaller font. |
This function allows the user to generate data by clicking on a plot. Once two points are added, the correlation (r) and coefficient of determination (r^2) are displayed. When an additional point is added, these values are updated in the upper left, with previous values displayed in the upper right. The effect of outliers on the correlation and coefficient of determination can easily be illustrated. Pressing the red UNDO button on the plot allows you to take away recently added points for further exploration.
Note: To end the demo, you MUST click on the red box labeled "End" (or press Escape, which will return an error)
Adam Petrie
This function produces the matrix of correlations between all quantitative variables in a dataframe.
cor_matrix(X,type="pearson")
X |
A data frame |
type |
Either "pearson" (the default) or "spearman", specifying the type of correlation to compute |
This function filters out any non-numerical variables and provides correlations only between quantitative variables. Best for datasets with only a few variables. The correlation matrix is returned (with class matrix).
Adam Petrie
Introduction to Regression and Modeling
data(TIPS)
cor_matrix(TIPS)
data(AUTO)
cor_matrix(AUTO,type="spearman")
Customer database describing customer churn (adapted from a former case study)
data("CUSTCHURN")
data("CUSTCHURN")
A data frame with 500 observations on the following 11 variables.
Duration
a numeric vector giving the days that the company was considered a customer. Note: censored at 730 days, which is the value for someone who is currently a customer (not churned)
Churn
a factor with levels N and Y giving whether the customer has churned or not
RetentionCost
a numeric vector giving the average amount of money spent per year to retain the individual or company as a customer
EBiz
a factor with levels No and Yes giving whether the customer was an e-business or not
CompanyRevenue
a numeric vector giving the company's revenue
CompanyEmployees
a numeric vector giving the number of employees working for the company
Categories
a numeric vector giving the number of product categories from which the customer made a purchase over their lifetime
NumPurchases
a numeric vector giving the total number of purchases over the customer's lifetime
Each row corresponds to a customer of a Fortune 500 company. These customers are businesses, which may or may not exclusively be an e-business. Whether a customer is still a customer (or has churned) after 730 days is recorded.
Unknown
Customer database describing customer value (adapted from a former case study) and whether they have a loyalty card
data("CUSTLOYALTY")
data("CUSTLOYALTY")
A data frame with 500 observations on the following 9 variables.
Gender
a factor with levels Female and Male giving the customer's gender
Married
a factor with levels Married and Single giving the customer's marital status
Income
a factor with levels f0t30, f30t45, f45t60, f60t75, f75t90, and f90toINF
FirstPurchase
a numeric vector giving the amount of the customer's first purchase amount
LoyaltyCard
a factor with levels No and Yes that gives whether the customer has a loyalty card for the store
WalletShare
a numeric vector giving the percentage from 0 to 100 of similar products that the customer makes at this store. A value of 100 means the customer uses this store exclusively for such purchases.
CustomerLV
a numeric vector giving the lifetime value of the customer and reflects the amount spent acquiring and retaining the customer along with the revenue brought in by the customer
TotTransactions
a numeric vector giving the total number of consecutive months the customer has made a transaction in the last year
LastTransaction
a numeric vector giving the number of months since the customer's last transaction
Each row corresponds to a customer of a local chain. Does having a loyalty card increase the customer's value?
Unknown
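One way to probe that question (a sketch, not part of the original documentation) is with the package's associate function:

data(CUSTLOYALTY)
#Compare customer lifetime value across loyalty-card status
associate(CustomerLV~LoyaltyCard,data=CUSTLOYALTY,permutations=500)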
Customer reacquisition
data("CUSTREACQUIRE")
data("CUSTREACQUIRE")
A data frame with 500 observations on the following 9 variables.
Reacquire
a factor with levels No and Yes indicating whether a customer who has previously churned was reacquired
Lifetime2
a numeric vector giving the days that the company was considered a customer
Value2
a numeric vector giving the lifetime value of the customer (related to the amount of money spent on reacquisition and the revenue brought in by the customer; can be negative)
Lifetime1
a numeric vector giving the days that the company was considered a customer before churning the first time
OfferAmount
a numeric vector giving the money equivalent of a special offer given to the former customer in an attempt to reacquire
Lapse
a numeric vector giving the number of days between the customer churning and the time of the offer
PriceChange
a numeric vector giving the percentage by which the typical product purchased by the customer has changed from the time they churned to the time the special offer was sent
Gender
a factor with levels Female and Male giving the gender of the customer
Age
a numeric vector giving the age of the customer
A company kept records of its success in reacquiring customers that had previously churned. Data is based on a previous case study.
Unknown
Customer database describing customer value (adapted from a former case study)
data("CUSTVALUE")
data("CUSTVALUE")
A data frame with 500 observations on the following 11 variables.
Acquired
a factor with levels No and Yes indicating whether a potential customer was acquired
Duration
a numeric vector giving the days that the company was considered a customer
LifetimeValue
a numeric vector giving the lifetime value of the customer (related to the amount of money spent on acquisition and the revenue brought in by the customer; can be negative)
AcquisitionCost
a numeric vector giving the amount of money spent attempting to acquire as a customer
RetentionCost
a numeric vector giving the average amount of money spent per year to retain the individual or company as a customer
NumPurchases
a numeric vector giving the total number of purchases over the customer's lifetime
Categories
a numeric vector giving the number of product categories from which the customer made a purchase over their lifetime
WalletShare
a numeric vector giving the percentage of purchases of similar products the customer makes with this company; a few values exceed 100 for some reason
EBiz
a factor with levels No and Yes giving whether the customer was an e-business or not
CompanyRevenue
a numeric vector giving the company's revenue
CompanyEmployees
a numeric vector giving the number of employees working for the company
Each row corresponds to a (potential) customer of a Fortune 500 company. These customers are businesses, which may or may not be exclusively e-businesses.
Unknown
The weight of a person over time who is dieting and exercising
data("DIET")
A data frame with 35 observations on the following 2 variables.
Weight
a numeric vector, lbs
Day
a numeric vector, the number of days after the diet started
This data was collected by the author and consists of his weight measured first thing in the morning over the course of about a month. The scale rounds to the nearest 0.2 lbs.
Adapted from the KDD-CUP-98 data set concerning data regarding donations made to a national veterans organization.
data("DONOR")
A data frame with 19372 observations on the following 50 variables.
Donate
a factor with levels No
Yes
Donation.Amount
a numeric vector
ID
a numeric vector
MONTHS_SINCE_ORIGIN
a numeric vector, number of months donor has been in the database
DONOR_AGE
a numeric vector
IN_HOUSE
a numeric vector, 1 if person has donated to the charity's "In House" program
URBANICITY
a factor with levels ?
C
R
S
T
U
SES
a factor with levels ?
1
2
3
4
, one of five possible codes indicating socioeconomic status
CLUSTER_CODE
a factor with levels .
01
02
... 53
, one of 54 possible cluster codes, which are unique in terms of socioeconomic status, urbanicity, ethnicity, and other demographic characteristics
HOME_OWNER
a factor with levels H
U
DONOR_GENDER
a factor with levels A
F
M
U
INCOME_GROUP
a numeric vector, but in reality one of 7 possible income groups inferred from demographics
PUBLISHED_PHONE
a numeric vector, listed (1) vs not listed (0)
OVERLAY_SOURCE
a factor with levels B
M
N
P
, source from which the donor was matched; B is both sources and N is neither
MOR_HIT_RATE
a numeric vector, number of known times donor has responded to a mailed solicitation from a group other than the charity
WEALTH_RATING
a numeric vector, but in reality one of 10 groups based on demographics
MEDIAN_HOME_VALUE
a numeric vector, inferred from other variables
MEDIAN_HOUSEHOLD_INCOME
a numeric vector, inferred from other variables
PCT_OWNER_OCCUPIED
a numeric vector, percent of owner-occupied housing near where person lives
PER_CAPITA_INCOME
a numeric vector, of neighborhood in which person lives
PCT_ATTRIBUTE1
a numeric vector, percent of residents in person's neighborhood that are male and active military
PCT_ATTRIBUTE2
a numeric vector, percent of residents in person's neighborhood that are male and veterans
PCT_ATTRIBUTE3
a numeric vector, percent of residents in person's neighborhood that are Vietnam veterans
PCT_ATTRIBUTE4
a numeric vector, percent of residents in person's neighborhood that are WW2 veterans
PEP_STAR
a numeric vector, 1 if has achieved STAR donor status and 0 otherwise
RECENT_STAR_STATUS
a numeric vector, 1 if achieved STAR within last 4 years
RECENCY_STATUS_96NK
a factor with levels A
(active) E
(inactive) F
(first time) L
(lapsing) N
(new) S
(star donor) as of 1996.
FREQUENCY_STATUS_97NK
a numeric vector indicating the number of times donated in the last period (the period is determined by RECENCY_STATUS_96NK)
RECENT_RESPONSE_PROP
a numeric vector, proportion of (card or other) solicitations from the charitable organization since four years ago to which the individual has responded
RECENT_AVG_GIFT_AMT
a numeric vector, average donation from the individual to the charitable organization since four years ago
RECENT_CARD_RESPONSE_PROP
a numeric vector, proportion of card solicitations from the charitable organization since four years ago to which the individual has responded
RECENT_AVG_CARD_GIFT_AMT
a numeric vector, average donation from the individual in response to a card solicitation from the charitable organization since four years ago
RECENT_RESPONSE_COUNT
a numeric vector, number of times the individual has responded to a promotion (card or other) from the charitable organization since four years ago
RECENT_CARD_RESPONSE_COUNT
a numeric vector, number of times the individual has responded to a card solicitation from the charitable organization since four years ago
MONTHS_SINCE_LAST_PROM_RESP
a numeric vector, number of months since the individual has responded to a promotion by the charitable organization
LIFETIME_CARD_PROM
a numeric vector, total number of card promotions sent to the individual by the charitable organization
LIFETIME_PROM
a numeric vector, total number of promotions sent to the individual by the charitable organization
LIFETIME_GIFT_AMOUNT
a numeric vector, total lifetime donation amount from the individual to the charitable organization
LIFETIME_GIFT_COUNT
a numeric vector, total number of donations from the individual to the charitable organization
LIFETIME_AVG_GIFT_AMT
a numeric vector, lifetime average donation from the individual to the charitable organization
LIFETIME_GIFT_RANGE
a numeric vector, difference between maximum and minimum donation amounts from the individual
LIFETIME_MAX_GIFT_AMT
a numeric vector
LIFETIME_MIN_GIFT_AMT
a numeric vector
LAST_GIFT_AMT
a numeric vector
CARD_PROM_12
a numeric vector, number of card promotions sent to the individual by the charitable organization in the last 12 months
NUMBER_PROM_12
a numeric vector, number of promotions (card or other) sent to the individual by the charitable organization in the last 12 months
MONTHS_SINCE_LAST_GIFT
a numeric vector
MONTHS_SINCE_FIRST_GIFT
a numeric vector
FILE_AVG_GIFT
a numeric vector, same as LIFETIME_AVG_GIFT_AMT
FILE_CARD_GIFT
a numeric vector, lifetime average donation from the individual in response to all card solicitations from the charitable organization
Originally, this data was used in the 1998 KDD competition (https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html). This particular version has been adapted from the version available in SAS Enterprise Miner (see Appendix 2 of http://support.sas.com/documentation/cdl/en/emgsj/61207/PDF/default/emgsj.pdf for descriptions of the variable names). One goal is to determine whether a past donor donated in response to the 97NK mail solicitation and, if so, how much, based on age, gender, most recent donation amount, total gift amount, etc.
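As a quick illustration of the classification part of this goal, a minimal logistic regression sketch might look like the following (the three predictors are arbitrary choices from the list above, not a recommended model):

data(DONOR)
M <- glm(Donate~DONOR_AGE+LAST_GIFT_AMT+LIFETIME_GIFT_AMOUNT,data=DONOR,family=binomial)
summary(M)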
Data on the College GPAs of students in an introductory statistics class
data("EDUCATION")
A data frame with 607 observations on the following 18 variables.
CollegeGPA
a numeric vector
Gender
a factor with levels Female
Male
HSGPA
a numeric vector, can range up to 5 if the high school allowed it
ACT
a numeric vector, ACT score
APHours
a numeric vector, number of AP hours student took in HS
JobHours
a numeric vector, number of hours student currently works on average
School
a factor with levels Private
Public
, type of HS
LanguagesSpoken
a numeric vector
HSHonorsClasses
a numeric vector, number of honors classes taken in HS
SmokeInHS
a factor with levels No
Yes
PayCollegeNoLoans
a factor with levels No
Yes
, can the student and his/her family pay for the University of Tennessee without taking out loans?
ClubsInHS
a numeric vector, number of clubs belonged to in HS
JobInHS
a factor with levels No
Yes
, whether the student maintained a job at some point while in HS
Churchgoer
a factor with levels No
Yes
, answer to the question Do you regularly attend church?
Height
a numeric vector (inches)
Weight
a numeric vector (lbs)
Family
the student's position in the family, a factor with levels Middle Child
Oldest Child
Only Child
Youngest Child
Pet
favorite pet, a factor with levels Both
Cat
Dog
Neither
Responses are from students in an introductory statistics class at the University of Tennessee in 2010. One goal is to try to predict someone's college GPA from some of the student's characteristics. What information about a high school student could a college admissions counselor use to anticipate that student's performance in college?
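A minimal sketch of such a prediction, using two natural predictors (illustrative only, not a recommended model):

data(EDUCATION)
M <- lm(CollegeGPA~HSGPA+ACT,data=EDUCATION)
summary(M)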
CENSUS data for Exercise 5 in Chapter 2
data("EX2.CENSUS")
A data frame with 3534 observations on the following 41 variables.
ResponseRate
a numeric vector
Area
a numeric vector
Urban
a numeric vector
Suburban
a numeric vector
Rural
a numeric vector
Male
a numeric vector
Female
a numeric vector
AgeLess5
a numeric vector
Age5to17
a numeric vector
Age18to24
a numeric vector
Age25to44
a numeric vector
Age45to64
a numeric vector
Age65plus
a numeric vector
Hispanics
a numeric vector
Whites
a numeric vector
Blacks
a numeric vector
NativeAmericans
a numeric vector
Asians
a numeric vector
Hawaiians
a numeric vector
Other
a numeric vector
RelatedHH
a numeric vector
MarriedHH
a numeric vector
NoSpouseHH
a numeric vector
FemaleHH
a numeric vector
AloneHH
a numeric vector
WithKidHH
a numeric vector
MedianHHIncomeBlock
a numeric vector
MedianHHIncomeCity
a numeric vector
OccupiedUnits
a numeric vector
VacantUnits
a numeric vector
RentingHH
a numeric vector
HomeownerHH
a numeric vector
MobileHomeUnits
a numeric vector
CrowdedUnits
a numeric vector
NoPhoneUnits
a numeric vector
NoPlumbingUnits
a numeric vector
NewUnits
a numeric vector
Population
a numeric vector
NumHH
a numeric vector
NumUnits
a numeric vector
logMedianHouseValue
a numeric vector
See CENSUS
for variable descriptions (this data is nearly identical). The goal is to predict ResponseRate
from the other predictors. ResponseRate
is the percentage of households in a block group that mailed in the census forms. A block group is on average about 40 blocks, each typically bounded by streets, roads, or water. The number of block groups per county in the US is typically between about 5 and 165 with a median of about 20.
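A minimal sketch of the stated goal (a kitchen-sink regression on all predictors, for illustration only):

data(EX2.CENSUS)
M <- lm(ResponseRate~.,data=EX2.CENSUS)
summary(M)$r.squared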
TIPS data for Exercise 6 in Chapter 2
data("EX2.TIPS")
A data frame with 244 observations on the following 8 variables.
Tip.Percentage
a numeric vector
Bill_in_USD
a numeric vector
Tip_in_USD
a numeric vector
Gender
a factor with levels Female
Male
Smoker
a factor with levels No
Yes
Weekday
a factor with levels Friday
Saturday
Sunday
Thursday
Day_Night
a factor with levels Day
Night
Size_of_Party
a numeric vector
See TIPS
for more details. This is the same dataset except that the names of the variables are different.
ABALONE dataset for Exercise D in Chapter 3
data("EX3.ABALONE")
A data frame with 1528 observations on the following 7 variables.
Length
a numeric vector
Diameter
a numeric vector
Height
a numeric vector
Whole.Weight
a numeric vector
Meat.Weight
a numeric vector
Shell.Weight
a numeric vector
Rings
a numeric vector
Abalone are sea creatures that are considered a delicacy and have very pretty iridescent shells. See https://en.wikipedia.org/wiki/Abalone. Predicting the age of the abalone from physical measurements could be useful for harvesting purposes. Dimensions are in mm and weights are in grams. Rings
is an indicator of the age of the abalone (Age is about 1.5 plus the number of rings).
Data is adapted from the abalone dataset on UCI Data Repository https://archive.ics.uci.edu/ml/datasets/Abalone. Only the male abalone are represented in this dataset.
See the UCI page for full details on the owner and donor of this data.
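A minimal sketch of estimating age from a single physical measurement (the choice of predictor and the value 0.3 are arbitrary illustrations):

data(EX3.ABALONE)
M <- lm(Rings~Shell.Weight,data=EX3.ABALONE)
#Age is about 1.5 plus the number of rings, so estimated age is 1.5 plus the prediction
1.5 + predict(M,newdata=data.frame(Shell.Weight=0.3))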
Bodyfat data for Exercise F in Chapter 3
data("EX3.BODYFAT")
A data frame with 20 observations on the following 4 variables.
Triceps
a numeric vector
Thigh
a numeric vector
Midarm
a numeric vector
Fat
a numeric vector
Same data as BODYFAT2
; see that entry for more details.
Housing data for Exercise E in Chapter 3
data("EX3.HOUSING")
A data frame with 522 observations on the following 2 variables.
AREA
a numeric vector, the area of the house
PRICE
a numeric vector, selling price
Selling prices of houses (perhaps in the Boston area in Massachusetts).
Original source unknown, but it appears in many places around the internet, e.g., public.iastate.edu/~pdixon/stat500/data/realestate.txt
NFL data for Exercise A in Chapter 3
data("EX3.NFL")
A data frame with 352 observations on the following 137 variables.
Year
a numeric vector
Team
a factor with levels Arizona
Atlanta
Baltimore
Buffalo
Carolina
Chicago
Cincinnati
Cleveland
Dallas
Denver
Detroit
GreenBay
Houston
Indianapolis
Jacksonville
KansasCity
Miami
Minnesota
NewEngland
NewOrleans
NYGiants
NYJets
Oakland
Philadelphia
Pittsburgh
SanDiego
SanFrancisco
Seattle
St.Louis
TampaBay
Tennessee
Washington
Next.Years.Wins
a numeric vector
Wins
a numeric vector
X1.Off.Tot.Yds
a numeric vector
X2.Off.Tot.Plays
a numeric vector
X3.Off.Tot.Yds.per.Ply
a numeric vector
X4.Off.Tot.1st.Dwns
a numeric vector
X5.Off.Pass.1st.Dwns
a numeric vector
X6.Off.Rush.1st.Dwns
a numeric vector
X7.Off.Tot.Turnovers
a numeric vector
X8.Off.Fumbles.Lost
a numeric vector
X9.Off.1st.Dwns.by.Penalty
a numeric vector
X10.Off.Pass.Comp
a numeric vector
X11.Off.Pass.Comp.
a numeric vector
X12.Off.Pass.Yds
a numeric vector
X13.Off.Pass.Tds
a numeric vector
X14.Off.Pass.INTs
a numeric vector
X15.Off.Pass.INT.
a numeric vector
X16.Off.Pass.Longest
a numeric vector
X17.Off.Pass.Yds.per.Att
a numeric vector
X18.Off.Pass.Adj.Yds.per.Att
a numeric vector
X19.Off.Pass.Yds.per.Comp
a numeric vector
X20.Off.Pass.Yds.per.Game
a numeric vector
X21.Off.Passer.Rating
a numeric vector
X22.Off.Pass.Sacks.Alwd
a numeric vector
X23.Off.Pass.Sack.Yds
a numeric vector
X24.Off.Pass.Net.Yds.per.Att
a numeric vector
X25.Off.Pass.Adj.Net.Yds.per.Att
a numeric vector
X26.Off.Pass.Sack.
a numeric vector
X27.Off.Game.Winning.Drives
a numeric vector
X28.Off.Rush.Yds
a numeric vector
X29.Off.Rush.Tds
a numeric vector
X30.Off.Rush.Longest
a numeric vector
X31.Off.Rush.Yds.per.Att
a numeric vector
X32.Off.Rush.Yds.per.Game
a numeric vector
X33.Off.Fumbles
a numeric vector
X34.Off.Punt.Returns
a numeric vector
X35.Off.PR.Yds
a numeric vector
X36.Off.PR.Tds
a numeric vector
X37.Off.PR.Longest
a numeric vector
X38.Off.PR.Yds.per.Att
a numeric vector
X39.Off.Kick.Returns
a numeric vector
X40.Off.KR.Yds
a numeric vector
X41.Off.KR.Tds
a numeric vector
X42.Off.KR.Longest
a numeric vector
X43.Off.KR.Yds.per.Att
a numeric vector
X44.Off.All.Purpose.Yds
a numeric vector
X45.X1.19.yd.FG.Att
a numeric vector
X46.X1.19.yd.FG.Made
a numeric vector
X47.X20.29.yd.FG.Att
a numeric vector
X48.X20.29.yd.FG.Made
a numeric vector
X49.X1.29.yd.FG.
a numeric vector
X50.X30.39.yd.FG.Att
a numeric vector
X51.X30.39.yd.FG.Made
a numeric vector
X52.X30.39.yd.FG.
a numeric vector
X53.X40.49.yd.FG.Att
a numeric vector
X54.X40.49.yd.FG.Made
a numeric vector
X55.X50yd.FG.Att
a numeric vector
X56.X50yd.FG.Made
a numeric vector
X57.X40yd.FG.
a numeric vector
X58.Total.FG.Att
a numeric vector
X59.Off.Tot.FG.Made
a numeric vector
X60.Off.Tot.FG.
a numeric vector
X61.Off.XP.Att
a numeric vector
X62.Off.XP.Made
a numeric vector
X63.Off.XP.
a numeric vector
X64.Off.Times.Punted
a numeric vector
X65.Off.Punt.Yards
a numeric vector
X66.Off.Longest.Punt
a numeric vector
X67.Off.Times.Had.Punt.Blocked
a numeric vector
X68.Off.Yards.Per.Punt
a numeric vector
X69.Fmbl.Tds
a numeric vector
X70.Def.INT.Tds.Scored
a numeric vector
X71.Blocked.Kick.or.Missed.FG.Ret.Tds
a numeric vector
X72.Total.Tds.Scored
a numeric vector
X73.Off.2pt.Conv.Made
a numeric vector
X74.Def.Safeties.Scored
a numeric vector
X75.Def.Tot.Yds.Alwd
a numeric vector
X76.Def.Tot.Plays.Alwd
a numeric vector
X77.Def.Tot.Yds.per.Play.Alwd
a numeric vector
X78.Def.Tot.1st.Dwns.Alwd
a numeric vector
X79.Def.Pass.1st.Dwns.Alwd
a numeric vector
X80.Def.Rush.1st.Dwns.Alwd
a numeric vector
X81.Def.Turnovers.Created
a numeric vector
X82.Def.Fumbles.Recovered
a numeric vector
X83.Def.1st.Dwns.Alwd.by.Penalty
a numeric vector
X84.Def.Pass.Comp.Alwd
a numeric vector
X85.Def.Pass.Att.Alwd
a numeric vector
X86.Def.Pass.Comp..Alwd
a numeric vector
X87.Def.Pass.Yds.Alwd
a numeric vector
X88.Def.Pass.Tds.Alwd
a numeric vector
X89.Def.Pass.TDAlwd
a numeric vector
X90.Def.Pass.INTs
a numeric vector
X91.Def.Pass.INT.
a numeric vector
X92.Def.Pass.Yds.per.Att.Alwd
a numeric vector
X93.Def.Pass.Adj.Yds.per.Att.Alwd
a numeric vector
X94.Def.Pass.Yds.per.Comp.Alwd
a numeric vector
X95.Def.Pass.Yds.per.Game.Alwd
a numeric vector
X96.Def.Passer.Rating.Alwd
a numeric vector
X97.Def.Pass.Sacks
a numeric vector
X98.Def.Pass.Sack.Yds
a numeric vector
X99.Def.Pass.Net.Yds.per.Att.Alwd
a numeric vector
X100.Def.Pass.Adj.Net.Yds.per.Att.Alwd
a numeric vector
X101.Def.Pass.Sack.
a numeric vector
X102.Def.Rush.Yds.Alwd
a numeric vector
X103.Def.Rush.Tds.Alwd
a numeric vector
X104.Def.Rush.Yds.per.Att.Alwd
a numeric vector
X105.Def.Rush.Yds.per.Game.Alwd
a numeric vector
X106.Def.Punt.Returns.Alwd
a numeric vector
X107.Def.PR.Tds.Alwd
a numeric vector
X108.Def.Kick.Returns.Alwd
a numeric vector
X109.Def.KR.Yds.Alwd
a numeric vector
X110.Def.KR.Tds.Alwd
a numeric vector
X111.Def.KR.Yds.per.Att.Alwd
a numeric vector
X112.Def.Tot.FG.Att.Alwd
a numeric vector
X113.Def.Tot.FG.Made.Alwd
a numeric vector
X114.Def.Tot.FG..Alwd
a numeric vector
X115.Def.XP.Att.Alwd
a numeric vector
X116.Def.XP.Made.Alwd
a numeric vector
X117.Def.XP..Alwd
a numeric vector
X118.Def.Punts.Alwd
a numeric vector
X119.Def.Punt.Yds.Alwd
a numeric vector
X120.Def.Punt.Yds.per.Att.Alwd
a numeric vector
X121.Def.2pt.Conv.Alwd
a numeric vector
X122.Off.Safeties
a numeric vector
X123.Off.Rush.Success.Rate
a numeric vector
X124.Head.Coach.Disturbance.
a factor with levels No
Yes
X125.QB.Disturbance
a factor with levels No
Yes
X126.RB.Disturbance
a factor with levels ?
No
Yes
X127.Off.Run.Pass.Ratio
a numeric vector
X128.Off.Pass.Ply.
a numeric vector
X129.Off.Run.Ply.
a numeric vector
X130.Off.Yds.Pt
a numeric vector
X131.Def.Yds.Pt
a numeric vector
X132.Off.Pass.Drop.rate
a numeric vector
X133.Def.Pass.Drop.Rate
a numeric vector
See NFL
for more details. This dataset is actually a more complete version of NFL
and contains additional variables such as the year, team, next year's wins of the team, etc., and could be used in place of the NFL
data.
Bike data for Exercise 1 in Chapter 4
data("EX4.BIKE")
A data frame with 414 observations on the following 5 variables.
Demand
a numeric vector, total number of rental bikes
AvgTemp
a numeric vector, average temperature of the day
EffectiveAvgTemp
a numeric vector, average temperature it feels like (taking into account dewpoint) for the day
AvgHumidity
a numeric vector, average humidity for the day
AvgWindspeed
a numeric vector, average wind speed for the day
Adapted from the bike sharing dataset on the UCI data repository http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. This concerns the demand for rental bikes in the DC area.
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has been automated. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. These systems attract great interest due to their important role in traffic, environmental, and health issues.
Apart from their interesting real-world applications, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as the bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in a city, so it is expected that many important events in the city could be detected by monitoring these data.
Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
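A minimal sketch relating demand to the weather variables (illustrative only):

data(EX4.BIKE)
M <- lm(Demand~AvgTemp+AvgHumidity+AvgWindspeed,data=EX4.BIKE)
summary(M)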
Stock data for Exercise 2 in Chapter 4 (prediction set)
data("EX4.STOCKPREDICT")
A data frame with 5 observations on the following 40 variables.
AAPLlag2
a numeric vector
AXPlag2
a numeric vector
BAlag2
a numeric vector
BAClag2
a numeric vector
CATlag2
a numeric vector
CSCOlag2
a numeric vector
CVXlag2
a numeric vector
DDlag2
a numeric vector
DISlag2
a numeric vector
GElag2
a numeric vector
HDlag2
a numeric vector
HPQlag2
a numeric vector
IBMlag2
a numeric vector
INTClag2
a numeric vector
JNJlag2
a numeric vector
JPMlag2
a numeric vector
KOlag2
a numeric vector
MCDlag2
a numeric vector
MMMlag2
a numeric vector
MRKlag2
a numeric vector
MSFTlag2
a numeric vector
PFElag2
a numeric vector
PGlag2
a numeric vector
Tlag2
a numeric vector
TRVlag2
a numeric vector
UNHlag2
a numeric vector
VZlag2
a numeric vector
WMTlag2
a numeric vector
XOMlag2
a numeric vector
Australialag2
a numeric vector
Copperlag2
a numeric vector
DollarIndexlag2
a numeric vector
Europelag2
a numeric vector
Exchangelag2
a numeric vector
GlobalDowlag2
a numeric vector
HongKonglag2
a numeric vector
Indialag2
a numeric vector
Japanlag2
a numeric vector
Oillag2
a numeric vector
Shanghailag2
a numeric vector
The data frame for which you are to predict the closing price of Alcoa stock based on the model built using EX4.STOCKS
. The actual closing prices are not given.
Stock data for Exercise 2 in Chapter 4
data("EX4.STOCKS")
A data frame with 216 observations on the following 41 variables.
AA
a numeric vector
AAPLlag2
a numeric vector
AXPlag2
a numeric vector
BAlag2
a numeric vector
BAClag2
a numeric vector
CATlag2
a numeric vector
CSCOlag2
a numeric vector
CVXlag2
a numeric vector
DDlag2
a numeric vector
DISlag2
a numeric vector
GElag2
a numeric vector
HDlag2
a numeric vector
HPQlag2
a numeric vector
IBMlag2
a numeric vector
INTClag2
a numeric vector
JNJlag2
a numeric vector
JPMlag2
a numeric vector
KOlag2
a numeric vector
MCDlag2
a numeric vector
MMMlag2
a numeric vector
MRKlag2
a numeric vector
MSFTlag2
a numeric vector
PFElag2
a numeric vector
PGlag2
a numeric vector
Tlag2
a numeric vector
TRVlag2
a numeric vector
UNHlag2
a numeric vector
VZlag2
a numeric vector
WMTlag2
a numeric vector
XOMlag2
a numeric vector
Australialag2
a numeric vector
Copperlag2
a numeric vector
DollarIndexlag2
a numeric vector
Europelag2
a numeric vector
Exchangelag2
a numeric vector
GlobalDowlag2
a numeric vector
HongKonglag2
a numeric vector
Indialag2
a numeric vector
Japanlag2
a numeric vector
Oillag2
a numeric vector
Shanghailag2
a numeric vector
The goal is to predict the closing price of Alcoa stock (AA
) from the closing prices of other stocks and commodities two days prior (IBMlag2
, HongKonglag2
, etc.). If this were possible, and if the association between the prices continued into the future, it would be possible to use this information to make smart trades.
Compiled from various sources on the internet, e.g., Yahoo historical prices.
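A minimal sketch of the stated goal, fitting on EX4.STOCKS and then predicting the five unknown closing prices in EX4.STOCKPREDICT:

data(EX4.STOCKS)
data(EX4.STOCKPREDICT)
M <- lm(AA~.,data=EX4.STOCKS)
predict(M,newdata=EX4.STOCKPREDICT)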
BIKE dataset for Exercise 4 Chapter 5
data("EX5.BIKE")
A data frame with 413 observations on the following 9 variables.
Demand
a numeric vector
Day
a factor with levels Friday
Monday
Saturday
Sunday
Thursday
Tuesday
Wednesday
Workingday
a factor with levels no
yes
Holiday
a factor with levels no
yes
Weather
a factor with levels No rain
Rain
AvgTemp
a numeric vector
EffectiveAvgTemp
a numeric vector
AvgHumidity
a numeric vector
AvgWindspeed
a numeric vector
Adapted from the bike sharing dataset on the UCI data repository http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. This concerns the demand for rental bikes in the DC area. This is an expanded version of EX4.BIKE
with more variables and without the row containing bad data.
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has been automated. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500,000 bicycles. These systems attract great interest due to their important role in traffic, environmental, and health issues.
Apart from their interesting real-world applications, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as the bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in a city, so it is expected that many important events in the city could be detected by monitoring these data.
Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
DONOR dataset for Exercise 4 in Chapter 5
data("EX5.DONOR")
A data frame with 8132 observations on the following 18 variables.
Donate
a factor with levels No
Yes
LastAmount
a numeric vector
AccountAge
a numeric vector
Age
a numeric vector
Setting
a factor with levels Rural
Suburban
Urban
Homeowner
a factor with levels No
Yes
Gender
a factor with levels Female
Male
Unknown
Phone
a factor with levels Listed
Unlisted
Source
a factor with levels B
M
N
P
, source from which the donor was matched; B is both sources and N is neither
MedianHomeValue
a numeric vector
MedianIncome
a numeric vector
PercentOwnerOccupied
a numeric vector, for the neighborhood in which the donor lives
Recent
a factor with levels No
Yes
RecentResponsePercent
a numeric vector
RecentAvgAmount
a numeric vector
MonthsSinceLastGift
a numeric vector
TotalAmount
a numeric vector
TotalDonations
a numeric vector
See DONOR
for details. This data is a subset of that dataset, and the attributes have been renamed.
CLICK data for Exercise 2 in Chapter 6
data("EX6.CLICK")
A data frame with 13594 observations on the following 15 variables.
Click
a factor with levels No
Yes
BannerPosition
a factor with levels Pos1
Pos2
, location of ad
SiteID
a factor with levels S1
S2
S3
S4
S5
S6
S7
S8
SiteDomain
a factor with levels SD1
SD2
SD3
SD4
SD5
SD6
SD7
SD8
SiteCategory
a factor with levels SCat1
SCat2
SCat3
SCat4
SCat5
AppDomain
a factor with levels AD1
AD2
AD3
AppCategory
a factor with levels AC1
AC2
DeviceModel
a factor with levels D1
D10
D11
D12
D13
D14
D15
D16
D17
D18
D2
D3
D4
D5
D6
D7
D8
D9
x1
a numeric vector
x2
a factor with levels A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
x3
a factor with levels a
b
c
d
e
f
x4
a factor with levels val1
val2
val3
x5
a factor with levels type1
type2
type3
type4
x6
a factor with levels class1
class2
class3
class4
x7
a factor with levels AA
BB
CC
DD
EE
Inspired by a competition to predict the click-thru rates of ads displayed on mobile devices (https://www.kaggle.com/c/avazu-ctr-prediction). Does the click-thru rate vary based on where the ad is placed, what kind of site and device is used to view the ad, or something else? All variables are anonymized.
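A minimal sketch of the first of these questions, comparing click-thru rates across the two banner positions:

data(EX6.CLICK)
#Row proportions: the click-thru rate for each banner position
prop.table(table(EX6.CLICK$BannerPosition,EX6.CLICK$Click),margin=1)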
DONOR dataset for Exercise 1 in Chapter 6
data("EX6.DONOR")
A data frame with 8132 observations on the following 18 variables.
Donate
a factor with levels No
Yes
LastAmount
a numeric vector
AccountAge
a numeric vector
Age
a numeric vector
Setting
a factor with levels Rural
Suburban
Urban
Homeowner
a factor with levels No
Yes
Gender
a factor with levels Female
Male
Unknown
Phone
a factor with levels Listed
Unlisted
Source
a factor with levels B
M
N
P
MedianHomeValue
a numeric vector
MedianIncome
a numeric vector
PercentOwnerOccupied
a numeric vector
Recent
a factor with levels No
Yes
RecentResponsePercent
a numeric vector
RecentAvgAmount
a numeric vector
MonthsSinceLastGift
a numeric vector
TotalAmount
a numeric vector
TotalDonations
a numeric vector
Identical to EX5.DONOR
, so see that entry for details.
WINE data for Exercise 3 Chapter 6
data("EX6.WINE")
A data frame with 2700 observations on the following 12 variables.
Quality
a factor with levels High
Low
fixed.acidity
a numeric vector
volatile.acidity
a numeric vector
citric.acid
a numeric vector
residual.sugar
a numeric vector
free.sulfur.dioxide
a numeric vector
total.sulfur.dioxide
a numeric vector
density
a numeric vector
pH
a numeric vector
sulphates
a numeric vector
alcohol
a numeric vector
chlorides
a factor with levels Little
Lots
Adapted from the wine quality dataset at the UCI data repository. In this case, the original quality metric has been recoded from a score between 0 and 10 to either High
or Low
, and chlorides
is treated here as a categorical variable instead of a quantitative one.
https://archive.ics.uci.edu/ml/datasets/Wine+Quality
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
BIKE dataset for Exercise 1 Chapters 7 and 8
data("EX7.BIKE")
A data frame with 410 observations on the following 9 variables.
Demand
a numeric vector
Day
a factor with levels Friday
Monday
Saturday
Sunday
Thursday
Tuesday
Wednesday
Workingday
a factor with levels no
yes
Holiday
a factor with levels no
yes
Weather
a factor with levels No rain
Rain
AvgTemp
a numeric vector
EffectiveAvgTemp
a numeric vector
AvgHumidity
a numeric vector
AvgWindspeed
a numeric vector
Identical to EX5.BIKE
except with three additional rows deleted. See that dataset for details.
CATALOG data for Exercise 2 in Chapters 7 and 8
data("EX7.CATALOG")
A data frame with 4000 observations on the following 7 variables.
Buy
a factor with levels No
Yes
, whether the customer made a purchase through the catalog in the next quarter
QuartersWithPurchase
a numeric vector, number of quarters in which the customer made a purchase through the catalog
PercentQuartersWithPurchase
a numeric vector, percentage of quarters in which the customer made a purchase through the catalog
CatalogsReceived
a numeric vector, total number of catalogs customer has received
DaysSinceLastPurchase
a numeric vector, number of days since customer placed his or her last order
AvgOrderSize
a numeric vector, the typical number of items per order when the customer buys through the catalog
LifetimeOrder
a numeric vector, the number of orders the customer has placed through the catalog
The original source of this data is lost, but it is likely adapted from real data.
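A minimal sketch of a purchase model for this data (illustrative only):

data(EX7.CATALOG)
M <- glm(Buy~.,data=EX7.CATALOG,family=binomial)
summary(M)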
Birthweight dataset for Exercise 1 in Chapter 9
data("EX9.BIRTHWEIGHT")
A data frame with 553 observations on the following 13 variables.
Birthweight
a numeric vector, grams
Gestation
a numeric vector, weeks
MotherRace
a factor with levels Asian
Black
Mexican
Mixed
White
, self-reported
MotherAge
a numeric vector, self-reported
MotherEducation
a factor with levels below HS
College
HS
, self-reported
MotherHeight
a numeric vector, inches
MotherWeight
a numeric vector, pounds
FatherRace
a factor with levels Asian
Black
Mexican
Mixed
White
, self-reported
FatherAge
a numeric vector, self-reported
Father_Education
a factor with levels below HS
College
HS
, self-reported
FatherHeight
a numeric vector, inches
FatherWeight
a numeric vector, pounds
Smoking
a factor with levels never
now
, self-reported
An examination of birthweights and their link to gestation, mother and father characteristics, and whether the mother smoked during pregnancy.
Adapted from a subset of a study from Nolan and Speed (2000) consisting of male, single births which survived for at least 28 days. Some rows that contained bad data have been omitted. http://had.co.nz/stat645/week-05/birthweight.txt
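A minimal sketch of the examination described above (the predictors are arbitrary choices from the list, not a recommended model):

data(EX9.BIRTHWEIGHT)
M <- lm(Birthweight~Gestation+MotherHeight+Smoking,data=EX9.BIRTHWEIGHT)
summary(M)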
NFL data for Exercise 2 Chapter 9
data("EX9.NFL")
A data frame with 352 observations on the following 26 variables.
Wins
a numeric vector
X1.OffTotPlays
a numeric vector
X2.OffTotYdsperPly
a numeric vector
X3.OffPass1stDwns
a numeric vector
X4.OffRush1stDwns
a numeric vector
X5.OffFumblesLost
a numeric vector
X6.OffPassComp
a numeric vector
X7.OffPassINT
a numeric vector
X8.OffPassLongest
a numeric vector
X9.OffPassYdsperAtt
a numeric vector
X10.OffPassYdsperComp
a numeric vector
X11.OffPassSackYds
a numeric vector
X12.OffPassSack
a numeric vector
X13.OffRushLongest
a numeric vector
X14.OffRushYdsperAtt
a numeric vector
X15.OffRushYdsperGame
a numeric vector
X16.OffFumbles
a numeric vector
X17.1to29ydFG
a numeric vector
X18.30to39ydFG
a numeric vector
X19.40.ydFG
a numeric vector
X20.TotalFGAtt
a numeric vector
X21.OffTimesPunted
a numeric vector
X22.OffTimesHadPuntBlocked
a numeric vector
X23.OffYardsPerPunt
a numeric vector
X24.Off2ptConvMade
a numeric vector
X25.OffSafeties
a numeric vector
A subset of the NFL
data (see entry for details) containing statistics on the offense.
Data for Exercise 3 Chapter 9
data("EX9.STORE")
A data frame with 1500 observations on the following 68 variables.
Store1
a factor with levels Buy
No
Store2
a factor with levels Buy
No
Store3
a factor with levels Buy
No
Store4
a factor with levels Buy
No
Store5
a factor with levels Buy
No
Store6
a factor with levels Buy
No
Store7
a factor with levels Buy
No
Store8
a factor with levels Buy
No
Store9
a factor with levels Buy
No
Store10
a factor with levels Buy
No
Store11
a factor with levels Buy
No
Store12
a factor with levels Buy
No
Store13
a factor with levels Buy
No
Store14
a factor with levels Buy
No
Store15
a factor with levels Buy
No
Store16
a factor with levels Buy
No
Store17
a factor with levels Buy
No
Store18
a factor with levels Buy
No
Store19
a factor with levels Buy
No
Store20
a factor with levels Buy
No
Store21
a factor with levels Buy
No
Store22
a factor with levels Buy
No
Store23
a factor with levels Buy
No
Store24
a factor with levels Buy
No
Store25
a factor with levels Buy
No
Store26
a factor with levels Buy
No
Store27
a factor with levels Buy
No
Store28
a factor with levels Buy
No
Store29
a factor with levels Buy
No
Store30
a factor with levels Buy
No
Store31
a factor with levels Buy
No
Store32
a factor with levels Buy
No
Store33
a factor with levels Buy
No
Store34
a factor with levels Buy
No
Store35
a factor with levels Buy
No
Store36
a factor with levels Buy
No
Store37
a factor with levels Buy
No
Store38
a factor with levels Buy
No
Store39
a factor with levels Buy
No
Store40
a factor with levels Buy
No
Store41
a factor with levels Buy
No
Store42
a factor with levels Buy
No
Store43
a factor with levels Buy
No
Store44
a factor with levels Buy
No
Store45
a factor with levels Buy
No
Store46
a factor with levels Buy
No
Store47
a factor with levels Buy
No
Store48
a factor with levels Buy
No
Store49
a factor with levels Buy
No
Store50
a factor with levels Buy
No
Store51
a factor with levels Buy
No
Store52
a factor with levels Buy
No
Store53
a factor with levels Buy
No
Store54
a factor with levels Buy
No
Store55
a factor with levels Buy
No
Store56
a factor with levels Buy
No
Store57
a factor with levels Buy
No
Store58
a factor with levels Buy
No
Store59
a factor with levels Buy
No
Store60
a factor with levels Buy
No
Store61
a factor with levels Buy
No
Store62
a factor with levels Buy
No
Store63
a factor with levels Buy
No
Store64
a factor with levels Buy
No
Store65
a factor with levels Buy
No
Store66
a factor with levels Buy
No
Store67
a factor with levels Buy
No
Store68
a factor with levels Buy
No
The data consists of a random sample of 1500 credit card customers and their shopping habits regarding 68 different stores (whether they did or did not make a purchase in the last 90 days). Shoppers don't pick and choose places to shop at random, so it is interesting to study which stores appear together in a customer's history.
Consultation with an anonymous client. Stores have been anonymized to protect the source.
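A minimal sketch of studying whether two stores appear together in customers' histories (any pair of stores could be substituted):

data(EX9.STORE)
table(EX9.STORE$Store1,EX9.STORE$Store2)
chisq.test(table(EX9.STORE$Store1,EX9.STORE$Store2))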
This function computes the Mahalanobis distance of points as a check for potential extrapolation.
extrapolation_check(M,newdata)
M |
A fitted model that uses only quantitative variables |
newdata |
A data frame that has exactly the same columns as the predictors used to fit the model |
This function computes the shape of the predictor data cloud and calculates the distances of points from the center (with respect to the shape of the data cloud). Extrapolation occurs at a combination of predictors that is far from combinations used to build the model. An observation with a large Mahalanobis distance MAY be far from the observations used to build the model and thus MAY require extrapolation.
Note: the analysis assumes the predictor data cloud is roughly elliptical (this may not be a good assumption).
The function reports the percentiles of the Mahalanobis distances of the points in newdata
. Percentiles are the fraction of observations used in the model that are CLOSER to
the center than the point(s) in question. Large values of these percentages indicate a greater risk of extrapolation. If Percentile
is about 99 or higher, you may be extrapolating.
The method is sensitive to outliers and clusters of outliers, and gives only a crude idea of the potential for extrapolation.
Adam Petrie
Introduction to Regression and Modeling
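The percentile calculation can be sketched directly with mahalanobis from the stats package. The following is a minimal sketch using the SALARY predictors from the example below (the new point is hypothetical); it is not the package's exact implementation:

data(SALARY)
X <- SALARY[,c("Education","Experience","Months")]
center <- colMeans(X)
covmat <- cov(X)
#Distance of one hypothetical new point from the center of the predictor data cloud
d.new <- mahalanobis(c(Education=5,Experience=15,Months=0),center,covmat)
#Percentile: fraction of observations used in the model that are CLOSER to the center
mean(mahalanobis(X,center,covmat) < d.new)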
data(SALARY)
M <- lm(Salary~Education*Experience+Months,data=SALARY)
newdata <- data.frame(Education=c(0,5,10),Experience=c(15,15,15),Months=c(0,0,0))
extrapolation_check(M,newdata)
#Individuals 1 and 3 are rather unusual (though not terribly) while individual 2 is typical.
This function takes a simple linear regression model and finds the transformation of x and y that results in the highest R2
find_transformations(M,powers=seq(from=-3,to=3,by=.25),threshold=0.02,...)
M |
A simple linear regression model fitted with lm |
powers |
A sequence of powers to try for x and y. By default this ranges from -3 to 3 in steps of 0.25. If 0 is a valid power, then the logarithm is used instead. |
threshold |
Report all models that have an R2 within this threshold of the highest R2 found |
... |
Additional arguments passed to plot, e.g., pch or cex |
The relationship between y and x may not be linear. However, some transformation of y may have a linear relationship with some transformation of x. This function considers simple linear regression with x and y raised to powers between -3 and 3 (in 0.25 increments) by default. The function outputs a list of the top models as gauged by R^2 (all models within 0.02 of the highest R^2). Note: there is no guarantee that these "best" transformations are actually good, since a large R^2 can be produced by outliers created during transformations. A plot of the transformation is also provided.
It is exceedingly rare that the "best" transformation is raising x and y to the 1 power (i.e., the original variables). Transformations are typically used only when there are issues in the residuals plots, highly skewed variables, or physical/logical justifications.
Note: if a variable has 0s or negative numbers, only integer transformations are considered.
Adam Petrie
Introduction to Regression and Modeling
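The search itself is easy to sketch by hand: try all pairs of powers, fit each simple regression, and rank by R2. The following is a minimal sketch assuming strictly positive x and y (log substitutes for the 0 power), using BULLDOZER as in the example below; it is not the package's exact implementation:

data(BULLDOZER)
x <- BULLDOZER$YearMade
y <- BULLDOZER$SalePrice
powers <- seq(from=-3,to=3,by=0.25)
tf <- function(v,p) if(p==0) log(v) else v^p
grid <- expand.grid(px=powers,py=powers)
grid$R2 <- apply(grid,1,function(r) summary(lm(tf(y,r["py"])~tf(x,r["px"])))$r.squared)
head(grid[order(-grid$R2),]) #pairs of powers giving the highest R2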
#Straightforward example
data(BULLDOZER)
M <- lm(SalePrice~YearMade,data=BULLDOZER)
find_transformations(M,pch=20,cex=0.3)

#Results are very misleading since selected models have high R2 due to outliers
data(MOVIE)
M <- lm(Total~Weekend,data=MOVIE)
find_transformations(M,powers=seq(-2,2,by=0.5),threshold=0.05)
Examining the relationship between how likely someone would be friends with a person based on that person's level of attractiveness
data("FRIEND")
A data frame with 54 observations on the following 2 variables.
Attractiveness
a numeric vector - the average scores (1-5) from about 80 male students who rated the attractiveness of the women in each picture
FriendshipPotential
a numeric vector - the average scores (1-5) from about 30 female students who rated how likely they would be to be friends with the pictured woman
The data contain information on 54 pictures of women posted on the (now defunct/renamed) site hotornot.com. The women in two classes of introductory statistics at the University of Tennessee rated how likely they would be friends with the pictured women (on a scale of 1-5, 1 being very unlikely and 5 being very likely). The men in three (different) classes of introductory statistics gave an attractiveness score to each woman (on a scale of 1-5, 1 being very unattractive and 5 being very attractive). The numbers presented are the averages over all student ratings.
Surveys administered to introductory statistics students at the University of Tennessee from 2008-2010.
Wins vs. Fumbles of an NFL team
data("FUMBLES")
A data frame with 352 observations on the following 2 variables.
Wins
a numeric vector, number of wins (0-16) of an NFL team over the course of a season
FumblesLost
a numeric vector, the number of fumbles lost by that team over the course of a season
This is a subset of the NFL
data. Data is from the 2002-2012 seasons.
Collected by an undergraduate student from available web data in 2013.
This function takes a linear regression from lm
, logistic regression from glm
, partition model from rpart
, or random forest from randomForest
and calculates the generalization error on a dataframe.
generalization_error(MODEL,HOLDOUT,Kfold=FALSE,K=5,R=10,seed=NA)
MODEL |
A linear regression model created using lm, a logistic regression created using glm, a partition model created using rpart, or a random forest created using randomForest |
HOLDOUT |
A dataset for which the generalization error will be calculated. If not given, the error on the data used to build the model is reported. |
Kfold |
If TRUE, the generalization error of the model is also estimated using repeated K-fold cross-validation |
K |
The number of folds used in repeated K-fold cross-validation for the estimation of the generalization error for the model |
R |
The number of repeats used in repeated K-fold cross-validation. |
seed |
an optional argument priming the random number seed for estimating the generalization error |
This function calculates the error on MODEL
, its estimated generalization error from repeated K-fold cross-validation (for regression models only), and the actual generalization error on HOLDOUT
. If the response is quantitative, the RMSE is reported. If the response is categorical, the confusion matrices and misclassification rates are returned.
Adam Petrie
Introduction to Regression and Modeling
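For a quantitative response, the quantity being reported can be sketched directly: fit on one subset of rows and compute the RMSE of the predictions on the held-out rows. A minimal sketch using the FUMBLES data (the seed and the 70/30 split are arbitrary):

data(FUMBLES)
set.seed(101)
train.rows <- sample(1:nrow(FUMBLES),0.7*nrow(FUMBLES))
M <- lm(Wins~FumblesLost,data=FUMBLES[train.rows,])
pred <- predict(M,newdata=FUMBLES[-train.rows,])
sqrt(mean((FUMBLES$Wins[-train.rows]-pred)^2)) #RMSE on the holdout rows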
#Education analytics
data(STUDENT)
set.seed(1010)
train.rows <- sample(1:nrow(STUDENT),0.7*nrow(STUDENT))
TRAIN <- STUDENT[train.rows,]
HOLDOUT <- STUDENT[-train.rows,]
M <- lm(CollegeGPA~.,data=TRAIN)
#Also estimate the generalization error of the model
generalization_error(M,HOLDOUT,Kfold=TRUE,seed=5020)
#Try partition and randomforest, though they do not perform as well as regression here
TREE <- rpart(CollegeGPA~.,data=TRAIN)
FOREST <- randomForest(CollegeGPA~.,data=TRAIN)
generalization_error(TREE,HOLDOUT)
generalization_error(FOREST,HOLDOUT)

#Wine
data(WINE)
set.seed(2020)
train.rows <- sample(1:nrow(WINE),0.7*nrow(WINE))
TRAIN <- WINE[train.rows,]
HOLDOUT <- WINE[-train.rows,]
M <- glm(Quality~.^2,data=TRAIN,family=binomial)
generalization_error(M,HOLDOUT)
#Random forest predicts best on the holdout sample
TREE <- rpart(Quality~.,data=TRAIN)
FOREST <- randomForest(Quality~.,data=TRAIN)
generalization_error(TREE,HOLDOUT)
generalization_error(FOREST,HOLDOUT)
A simple function to take the output of a partition model created with rpart
and return information about the complexity parameter and the performance of the various models.
getcp(TREE)
TREE |
An object of class rpart |
This function prints out a table of the complexity parameter, number of splits, relative error, cross-validation error, and standard deviation of the cross-validation error for a partition model. It also points out the value of CP for the tree with the lowest cross-validation error, as well as the value of CP for the simplest tree whose cross-validation error is at most 1 standard deviation above the lowest.
Further, a plot is made of the estimated generalization error (xerror
) versus the number of splits to illustrate when the tree stops improving. Vertical lines are drawn at the number of splits corresponding to the lowest estimated generalization error and at the number of splits of the tree selected by the one standard deviation rule.
Adam Petrie
Introduction to Regression and Modeling
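Both rules of thumb can be sketched directly from the cptable component of an rpart fit (a minimal sketch of the same calculations getcp formats and plots for you):

data(JUNK)
TREE <- rpart(Junk~.,data=JUNK,control=rpart.control(cp=0,xval=10,minbucket=5))
cp.table <- TREE$cptable
best <- which.min(cp.table[,"xerror"])
cp.table[best,"CP"] #CP of the tree with the lowest cross-validation error
cutoff <- cp.table[best,"xerror"] + cp.table[best,"xstd"]
cp.table[which(cp.table[,"xerror"] <= cutoff)[1],"CP"] #simplest tree within 1 SD of the lowest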
data(JUNK)
TREE <- rpart(Junk~.,data=JUNK,control=rpart.control(cp=0,xval=10,minbucket=5))
getcp(TREE)
This function plots the leverage vs. deleted studentized residuals for a regression model, highlighting points that are influential based on these two factors as well as on Cook's distance
influence_plot(M,large.cook,cooks=FALSE)
M |
A linear regression model fitted with lm() |
large.cook |
The threshold for a "large" Cook's distance. If not specified, a default of 4/n is used. |
cooks |
|
A point is influential if its addition to the data changes the regression substantially. One way of measuring influence is by looking at the point's leverage (distance from the center of the predictor data cloud with respect to its shape) and deleted studentized residual (relative size of the residual with respect to a regression made without that point). Points with leverages larger than 2(k+1)/n (where k is the number of predictors) and deleted studentized residuals larger than 2 in magnitude are considered influential.
Influence can also be measured by Cook's distance, which essentially combines the above two measures. This function considers a Cook's distance to be large when it exceeds 4/n, but the user can specify another cutoff.
The radius of a point is proportional to the square root of the Cook's distance. Influential points according to leverage/residual criteria have an X through them while influential points according to Cook's distance are bolded.
The function returns the row numbers of influential observations.
A list with the row numbers of influential points according to Cook's distance ($Cooks
) and according to leverage/residual criteria ($Leverage
).
Adam Petrie
Introduction to Regression and Modeling
cooks.distance
, hatvalues
, rstudent
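The criteria above can be sketched directly with these three functions (a minimal sketch using the TIPS model from the example below; influence_plot performs these calculations and draws the plot for you):

data(TIPS)
M <- lm(TipPercentage~.-Tip,data=TIPS)
n <- nrow(TIPS)
k <- length(coef(M))-1 #number of estimated coefficients besides the intercept
which( hatvalues(M) > 2*(k+1)/n & abs(rstudent(M)) > 2 ) #leverage/residual criteria
which( cooks.distance(M) > 4/n ) #Cook's distance criterion with the default cutoff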
data(TIPS)
M <- lm(TipPercentage~.-Tip,data=TIPS)
influence_plot(M)
Building a junk mail classifier based on word and character frequencies
data("JUNK")
A data frame with 4601 observations on the following 58 variables.
Junk
a factor with levels Junk
Safe
make
a numeric vector, the percentage (0-100) of words in the email that are the word make
address
a numeric vector
all
a numeric vector
X3d
a numeric vector, the percentage (0-100) of words in the email that are the word 3d
our
a numeric vector
over
a numeric vector
remove
a numeric vector
internet
a numeric vector
order
a numeric vector
mail
a numeric vector
receive
a numeric vector
will
a numeric vector
people
a numeric vector
report
a numeric vector
addresses
a numeric vector
free
a numeric vector
business
a numeric vector
email
a numeric vector
you
a numeric vector
credit
a numeric vector
your
a numeric vector
font
a numeric vector
X000
a numeric vector, the percentage (0-100) of words in the email that are the word 000
money
a numeric vector
hp
a numeric vector
hpl
a numeric vector
george
a numeric vector
X650
a numeric vector
lab
a numeric vector
labs
a numeric vector
telnet
a numeric vector
X857
a numeric vector
data
a numeric vector
X415
a numeric vector
X85
a numeric vector
technology
a numeric vector
X1999
a numeric vector
parts
a numeric vector
pm
a numeric vector
direct
a numeric vector
cs
a numeric vector
meeting
a numeric vector
original
a numeric vector
project
a numeric vector
re
a numeric vector
edu
a numeric vector
table
a numeric vector
conference
a numeric vector
semicolon
a numeric vector, the percentage (0-100) of characters in the email that are semicolons
parenthesis
a numeric vector
bracket
a numeric vector
exclamation
a numeric vector
dollarsign
a numeric vector
hashtag
a numeric vector
capital_run_length_average
a numeric vector, average length of uninterrupted sequence of capital letters
capital_run_length_longest
a numeric vector, length of longest uninterrupted sequence of capital letters
capital_run_length_total
a numeric vector, total number of capital letters in the email
The collection of junk emails came from the postmaster and individuals who classified the email as junk. The collection of safe emails came from work and personal emails. Note that most of the variables are percentages and can vary from 0-100, though most values are much less than 1 (1%).
Adapted from the Spambase Data Set at the UCI data repository https://archive.ics.uci.edu/ml/datasets/Spambase. Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt; Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304. Donor: George Forman (gforman at nospam hpl.hp.com)
Interest in frequent flier program (artificial)
data("LARGEFLYER")
A data frame with 100000 observations on the following 2 variables.
Gender
a factor with levels Female
Male
Interest
a factor with levels No
Yes
This artificial dataset tabulates interest in a new frequent flyer program based on gender. It illustrates that a statistically significant association may have absolutely no practical significance.
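A minimal sketch of that point: the proportions of interest are nearly identical across genders, yet with n = 100000 a chi-squared test can still flag the association as significant:

data(LARGEFLYER)
prop.table(table(LARGEFLYER$Gender,LARGEFLYER$Interest),margin=1) #practical size of the association
chisq.test(table(LARGEFLYER$Gender,LARGEFLYER$Interest)) #statistical significance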
The profit of newly released products over the first few months of their release
data("LAUNCH")
A data frame with 652 observations on the following 420 variables.
Profit
an anonymized numeric vector, the profit from the product over the first few months of release
x1
an anonymized numeric vector
x2
an anonymized numeric vector
x3
an anonymized numeric vector
x4
an anonymized numeric vector
x5
an anonymized numeric vector
x6
an anonymized numeric vector
x7
an anonymized numeric vector
x8
an anonymized numeric vector
x9
an anonymized numeric vector
x10
an anonymized numeric vector
x11
an anonymized numeric vector
x12
an anonymized numeric vector
x13
an anonymized numeric vector
x14
an anonymized numeric vector
x15
an anonymized numeric vector
x16
an anonymized numeric vector
x17
an anonymized numeric vector
x18
an anonymized numeric vector
x19
an anonymized numeric vector
x20
an anonymized numeric vector
x21
an anonymized numeric vector
x22
an anonymized numeric vector
x23
an anonymized numeric vector
x24
an anonymized numeric vector
x25
an anonymized numeric vector
x26
an anonymized numeric vector
x27
an anonymized numeric vector
x28
an anonymized numeric vector
x29
an anonymized numeric vector
x30
an anonymized numeric vector
x31
an anonymized numeric vector
x32
an anonymized numeric vector
x33
an anonymized numeric vector
x34
an anonymized numeric vector
x35
an anonymized numeric vector
x36
an anonymized numeric vector
x37
an anonymized numeric vector
x38
an anonymized numeric vector
x39
an anonymized numeric vector
x40
an anonymized numeric vector
x41
an anonymized numeric vector
x42
an anonymized numeric vector
x43
an anonymized numeric vector
x44
an anonymized numeric vector
x45
an anonymized numeric vector
x46
an anonymized numeric vector
x47
an anonymized numeric vector
x48
an anonymized numeric vector
x49
an anonymized numeric vector
x50
an anonymized numeric vector
x51
an anonymized numeric vector
x52
an anonymized numeric vector
x53
an anonymized numeric vector
x54
an anonymized numeric vector
x55
an anonymized numeric vector
x56
an anonymized numeric vector
x57
an anonymized numeric vector
x58
an anonymized numeric vector
x59
an anonymized numeric vector
x60
an anonymized numeric vector
x61
an anonymized numeric vector
x62
an anonymized numeric vector
x63
an anonymized numeric vector
x64
an anonymized numeric vector
x65
an anonymized numeric vector
x66
an anonymized numeric vector
x67
an anonymized numeric vector
x68
an anonymized numeric vector
x69
an anonymized numeric vector
x70
an anonymized numeric vector
x71
an anonymized numeric vector
x72
an anonymized numeric vector
x73 through x419
an anonymized numeric vector (each of the remaining 347 predictors, x73 through x419, has this same description)
This example is inspired by the Online Product Sales competition on kaggle.com. The goal is to isolate the minimum number of predictors required to accurately predict Profit. Since the data are based on an actual case, all predictors are anonymized (some were originally categorical but are treated as numerical for this example).
Inspired by https://www.kaggle.com/c/online-sales
This function finds the mode of a categorical variable
mode_factor(x)
x |
a factor |
The mode is the most frequently occurring level of a categorical variable. This function returns the mode of x. If there is a tie for the most frequent level, it returns all modes.
Adam Petrie
Introduction to Regression and Modeling
data(EX6.CLICK)
mode_factor(EX6.CLICK$DeviceModel)
#To see how often the mode appears, try sorting a table
sort( table(EX6.CLICK$DeviceModel),decreasing=TRUE )
#multimodal example: a, b, c, and d are all tied for most frequent
x <- c( rep(letters[1:4],5), "e", "f" )
mode_factor(x)
Provides a mosaic plot to visualize the association between two categorical variables
mosaic(formula,data,color=TRUE,labelat=c(),xlab=c(),ylab=c(), magnification=1,equal=FALSE,inside=FALSE,ordered=FALSE)
formula |
A standard R formula written as y~x, where y is the name of the variable playing the role of y and x is the name of the variable playing the role of x. |
data |
An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment. |
color |
TRUE or FALSE, whether the plot is drawn in color. |
labelat |
a vector of factor levels of x at which labels should be drawn. |
xlab |
Label of horizontal axis if you want something different than the name of the x variable. |
ylab |
Label of vertical axis if you want something different than the name of the y variable. |
magnification |
Magnification of the labels of the levels of the x variable. |
equal |
If TRUE, all bars are drawn with equal widths rather than widths proportional to the frequency of each level of x. |
inside |
If TRUE, labels of the levels are drawn inside the bars rather than along the axes. |
ordered |
If |
This function shows a mosaic plot to visualize the conditional distributions of y for each level of x, along with the marginal distribution of y to the right of the plot. The widths of the segmented bar charts are proportional to the frequency of each level of x. These plots are the same as those that appear when using associate.
Adam Petrie
Introduction to Regression and Modeling
data(ACCOUNT)
mosaic(Area.Classification~Purchase,data=ACCOUNT,color=TRUE)
data(EX6.CLICK)
#Default presentation: not very useful
mosaic(Click~DeviceModel,data=EX6.CLICK)
#Better presentation
mosaic(Click~DeviceModel,data=EX6.CLICK,equal=TRUE,inside=TRUE,magnification=0.8)
Movie grosses from the late 1990s
data("MOVIE")
data("MOVIE")
A data frame with 309 observations on the following 3 variables.
Movie
a factor giving the name of the movie
Weekend
a numeric vector, the opening weekend gross (millions of dollars)
Total
a numeric vector, the total US gross (millions of dollars)
The goal is to predict the total gross of a movie based on its opening weekend gross.
Compiled via information provided on https://www.imdb.com/
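The stated goal suggests a simple linear regression as a starting point; a minimal sketch using the variable names documented above:

data(MOVIE)
M <- lm(Total~Weekend,data=MOVIE)
summary(M)
plot(Total~Weekend,data=MOVIE)
abline(M)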
Statistics for NFL teams from the 2002-2012 seasons
data("NFL")
data("NFL")
A data frame with 352 observations on the following 113 variables.
X4.Wins
a numeric vector, number of wins (0-16) of an NFL team for the season
X5.OffTotPlays
a numeric vector, number of total plays made on offense for the season
X6.OffTotYdsperPly
a numeric vector
X7.OffTot1stDwns
a numeric vector
X8.OffPass1stDwns
a numeric vector
X9.OffRush1stDwns
a numeric vector
X10.OffFumblesLost
a numeric vector
X11.OffPassComp
a numeric vector
X12.OffPassComp
a numeric vector
X13.OffPassYds
a numeric vector
X14.OffPassTds
a numeric vector
X15.OffPassTD
a numeric vector
X16.OffPassINTs
a numeric vector
X17.OffPassINT
a numeric vector
X18.OffPassLongest
a numeric vector
X19.OffPassYdsperAtt
a numeric vector
X20.OffPassAdjYdsperAtt
a numeric vector
X21.OffPassYdsperComp
a numeric vector
X22.OffPasserRating
a numeric vector
X23.OffPassSacksAlwd
a numeric vector
X24.OffPassSackYds
a numeric vector
X25.OffPassNetYdsperAtt
a numeric vector
X26.OffPassAdjNetYdsperAtt
a numeric vector
X27.OffPassSack
a numeric vector
X28.OffRushYds
a numeric vector
X29.OffRushTds
a numeric vector
X30.OffRushLongest
a numeric vector
X31.OffRushYdsperAtt
a numeric vector
X32.OffFumbles
a numeric vector
X33.OffPuntReturns
a numeric vector
X34.OffPRYds
a numeric vector
X35.OffPRTds
a numeric vector
X36.OffPRLongest
a numeric vector
X37.OffPRYdsperAtt
a numeric vector
X38.OffKRTds
a numeric vector
X39.OffKRLongest
a numeric vector
X40.OffKRYdsperAtt
a numeric vector
X41.OffAllPurposeYds
a numeric vector
X42.1to19ydFGAtt
a numeric vector
X43.1to19ydFGMade
a numeric vector
X44.20to29ydFGAtt
a numeric vector
X45.20to29ydFGMade
a numeric vector
X46.1to29ydFG
a numeric vector
X47.30to39ydFGAtt
a numeric vector
X48.30to39ydFGMade
a numeric vector
X49.30to39ydFG
a numeric vector
X50.40to49ydFGAtt
a numeric vector
X51.40to49ydFGMade
a numeric vector
X52.50ydFGAtt
a numeric vector
X53.50ydFGAtt
a numeric vector
X54.40ydFG
a numeric vector
X55.OffTotFG
a numeric vector
X56.OffXP
a numeric vector
X57.OffTimesPunted
a numeric vector
X58.OffPuntYards
a numeric vector
X59.OffLongestPunt
a numeric vector
X60.OffTimesHadPuntBlocked
a numeric vector
X61.OffYardsPerPunt
a numeric vector
X62.FmblTds
a numeric vector
X63.DefINTTdsScored
a numeric vector
X64.BlockedKickorMissedFGRetTds
a numeric vector
X65.Off2ptConvMade
a numeric vector
X66.DefSafetiesScored
a numeric vector
X67.DefTotYdsAlwd
a numeric vector
X68.DefTotPlaysAlwd
a numeric vector
X69.DefTotYdsperPlayAlwd
a numeric vector
X70.DefTot1stDwnsAlwd
a numeric vector
X71.DefPass1stDwnsAlwd
a numeric vector
X72.DefRush1stDwnsAlwd
a numeric vector
X73.DefFumblesRecovered
a numeric vector
X74.DefPassCompAlwd
a numeric vector
X75.DefPassAttAlwd
a numeric vector
X76.DefPassCompAlwd
a numeric vector
X77.DefPassYdsAlwd
a numeric vector
X78.DefPassTdsAlwd
a numeric vector
X79.DefPassTDAlwd
a numeric vector
X80.DefPassINTs
a numeric vector
X81.DefPassINT
a numeric vector
X82.DefPassYdsperAttAlwd
a numeric vector
X83.DefPassAdjYdsperAttAlwd
a numeric vector
X84.DefPassYdsperCompAlwd
a numeric vector
X85.DefPasserRatingAlwd
a numeric vector
X86.DefPassSacks
a numeric vector
X87.DefPassSackYds
a numeric vector
X88.DefPassNetYdsperAttAlwd
a numeric vector
X89.DefPassAdjNetYdsperAttAlwd
a numeric vector
X90.DefPassSack
a numeric vector
X91.DefRushYdsAlwd
a numeric vector
X92.DefRushTdsAlwd
a numeric vector
X93.DefRushYdsperAttAlwd
a numeric vector
X94.DefPuntReturnsAlwd
a numeric vector
X95.DefPRTdsAlwd
a numeric vector
X96.DefKickReturnsAlwd
a numeric vector
X97.DefKRTdsAlwd
a numeric vector
X98.DefKRYdsperAttAlwd
a numeric vector
X99.DefTotFGAttAlwd
a numeric vector
X100.DefTotFGAlwd
a numeric vector
X101.DefXPAlwd
a numeric vector
X102.DefPuntsAlwd
a numeric vector
X103.DefPuntYdsAlwd
a numeric vector
X104.DefPuntYdsperAttAlwd
a numeric vector
X105.Def2ptConvAlwd
a numeric vector
X106.OffSafeties
a numeric vector
X107.OffRushSuccessRate
a numeric vector
X108.OffRunPassRatio
a numeric vector
X109.OffRunPly
a numeric vector
X110.OffYdsPt
a numeric vector
X111.DefYdsPt
a numeric vector
X112.HeadCoachDisturbance
a factor with levels No
Yes
, whether the head coach changed between this season and the last
X113.QBDisturbance
a factor with levels No
Yes
, whether the quarterback changed between this season and the last
X114.RBDisturbance
a factor with levels ?
No
Yes
, whether the running back changed between this season and the last
X115.OffPassDropRate
a numeric vector
X116.DefPassDropRate
a numeric vector
Data was collected from many sources on the internet by a student for use in an independent study in the spring of 2013. Abbreviations for predictor variables typically follow the full name in prior variables, e.g., KR = kick returns, PR = punt returns, XP = extra point. Data is organized by year, so rows 1-32 are from 2002, rows 33-64 are from 2003, etc.
Contact the originator Weller Ross ([email protected]) for further details.
NFL dataset
A subset of the NFL
dataset containing some statistics of teams on offense
data("OFFENSE")
data("OFFENSE")
A data frame with 352 observations on the following 10 variables.
Win
a numeric vector, number of wins of team over the season (0-16)
FirstDowns
a numeric vector, number of first downs made over the season
PassingYards
a numeric vector, number of passing yards over the season
Interceptions
a numeric vector, number of times ball was intercepted on offense
RushingYards
a numeric vector, number of rushing yards over the season
Fumbles
a numeric vector, number of fumbles made on offense
X1to19FGAttempts
a numeric vector, number of field goal attempts made from 1-19 yards
X20to29FGAttempts
a numeric vector, number of field goal attempts made from 20-29 yards
X30to39FGAttempts
a numeric vector
X40to50FGAttempts
a numeric vector
A small subset of the NFL
dataset containing select statistics. Seasons are from 2002-2012.
This function shows regression lines on user-defined data before and after adding an additional point.
outlier_demo(cex.leg=0.8)
cex.leg |
A number specifying the magnification of legends inside the plot. Smaller numbers mean smaller font. |
This function allows the user to generate data by clicking on a plot. Once two points are added, the least squares regression line is drawn. When an additional point is added, the regression line updates, while the line fit without that point is also shown. The effect of outliers on a regression line is easily illustrated. Pressing the red UNDO button on the plot takes away recently added points for further exploration.
Note: To end the demo, you MUST click on the red box labeled "End" (or press Escape, which will return an error)
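The same phenomenon can be illustrated non-interactively by fitting a line with and without a single extreme point; a minimal sketch with made-up data (all values are illustrative):

set.seed(1)
x <- 1:20
y <- 2 + 0.5*x + rnorm(20)
plot( c(x,30), c(y,-10) )   #the 21st point is a deliberate outlier
abline( lm(y~x), col="blue" )               #line without the outlier
abline( lm(c(y,-10)~c(x,30)), col="red" )   #line with the outlier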
Adam Petrie
Introduction to Regression and Modeling
This function gives a demonstration of how overfitting occurs on a user-inputted dataset by showing the estimated generalization error as additional variables are added to the regression model (up to all two-way interactions).
overfit_demo(DF,y=NA,seed=NA,aic=TRUE)
DF |
The data frame where demonstration will occur. |
y |
The response variable (in quotes) |
seed |
Optional argument setting the random number seed if results need to be reproduced |
aic |
logical; if TRUE, the AIC on the training set is plotted, and if FALSE, the RMSE on the training set is plotted. |
This function splits DF in half to obtain training and holdout samples. Regression models are constructed using a forward selection procedure (adding the variable that decreases the AIC the most on the training set), starting at the naive model and terminating at the full model with all two-way interactions.

The generalization error of each model is computed on the holdout sample. The AIC (or RMSE on the training set) and the generalization error are plotted versus the number of variables in the model to illustrate overfitting. Typically, the generalization error decreases at first as useful variables are added to the model, then increases once the newly added variables start to fit quirks present only in the training data. When this happens, the model is said to be overfit.
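The mechanics can be mimicked by hand; a minimal sketch using the OFFENSE data documented earlier in this reference, adding predictors in a fixed order rather than by the AIC-based forward selection the demo actually uses:

data(OFFENSE)
set.seed(1997)
train.rows <- sample( nrow(OFFENSE), nrow(OFFENSE)/2 )
TRAIN <- OFFENSE[train.rows,]
HOLDOUT <- OFFENSE[-train.rows,]
preds <- setdiff( names(OFFENSE), "Win" )
rmse <- function(M,D) { sqrt( mean( (D$Win-predict(M,D))^2 ) ) }
#Holdout RMSE as predictors are added one at a time
gen.error <- sapply( 1:length(preds), function(k) {
  rmse( lm( reformulate(preds[1:k],response="Win"), data=TRAIN ), HOLDOUT ) } )
plot( gen.error, type="b", xlab="Number of predictors", ylab="Holdout RMSE" )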
Adam Petrie
Introduction to Regression and Modeling
#Overfitting occurs after about 10 predictors (AIC begins to increase after 12/13)
data(BODYFAT)
overfit_demo(BODYFAT,y="BodyFat",seed=1010)
#Overfitting occurs after about 5 predictors
data(OFFENSE)
overfit_demo(OFFENSE,y="Win",seed=1997,aic=FALSE)
Diabetes among women aged 21+ with Pima heritage
data("PIMA")
data("PIMA")
A data frame with 392 observations on the following 8 variables.
Pregnant
a numeric vector, number of times the woman has been pregnant
Glucose
a numeric vector, plasma glucose concentration
BloodPressure
a numeric vector, diastolic blood pressure in mm Hg
BodyFat
a numeric vector, a measurement of the triceps skinfold thickness which is an indicator of body fat percentage
Insulin
a numeric vector, 2-hour serum insulin
BMI
a numeric vector, body mass index
Age
a numeric vector, years
Diabetes
a factor with levels No
Yes
Data from a study of 768 women belonging to the Pima tribe; the 392 records here are those with complete information. The purpose is to study the associations between having diabetes and various physiological characteristics. Although there are surely other factors (including genetic ones) that influence the chance of having diabetes, the hope is that, by using women who are genetically similar (all from the Pima tribe), these other factors are naturally accounted for.
Adapted from the UCI data repository https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes. A variable measuring the "diabetes pedigree function" has been omitted.
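Given the stated purpose, a logistic regression is a natural starting point; a minimal sketch (glm models the probability of the second factor level, Yes):

data(PIMA)
M <- glm(Diabetes~.,data=PIMA,family=binomial)
summary(M)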
Dosages and mortality of cockroaches
data("POISON")
data("POISON")
A data frame with 481 observations on the following 2 variables.
Dose
a numeric vector indicating the dosage of the poison administered to the cockroach
Outcome
a factor with levels Die
Live
Artificial data illustrating a dose-response curve. The probability of dying is well-modeled by a logistic regression model.
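A minimal sketch of the dose-response model described above; note that glm models the probability of the second factor level (Live here), so the fitted curve for Die is one minus the fitted probability:

data(POISON)
M <- glm(Outcome~Dose,data=POISON,family=binomial)
summary(M)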
This function gives a demonstration of what simple linear or logistic regression lines could have looked like "by chance" if x and y were unrelated. A scatterplot and fitted regression line are displayed, along with the regression lines produced when x and y are made unrelated via the permutation procedure. The reductions in the sum of squared errors for all lines (for linear regressions) are also displayed for an informal assessment of significance.
possible_regressions(M,permutations=100,sse=TRUE,reduction=TRUE)
M |
A simple linear regression model from lm, or a simple logistic regression model from glm. |
permutations |
The number of artificial samples generated with the permutation procedure to consider (each will have y and x be independent by design). |
sse |
Optional argument to either show or hide the histogram of sum of squared errors of the regression lines. |
reduction |
Optional argument that, if TRUE, shows the histogram of reductions in the sum of squared errors; if FALSE, the raw SSE of each line is shown instead. |
This function gives a scatterplot and fitted regression line for M (in red) for a linear regression, or the fitted logistic curve (in black) for logistic regression. Then, via the permutation procedure, it generates permutations artificial samples in which the observed values of x and y are paired up at random, ensuring that no relationship exists between them. A regression is fit on each permutation sample, and its line is drawn in grey to illustrate how the fit may look "by chance" when x and y are unrelated.

If requested, a histogram of the reductions in the sum of squared errors of the regressions on the permutation datasets (with the original regression's reduction marked in red) is displayed to allow for an informal assessment of the statistical significance of the regression.
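The permutation idea itself is easy to replicate by hand; a minimal sketch using the TIPS data from the examples below:

data(TIPS)
plot(TipPercentage~Bill,data=TIPS)
#Grey lines: fits after shuffling y, which destroys any relationship with x
for (i in 1:100) { abline( lm(sample(TipPercentage)~Bill,data=TIPS), col="grey" ) }
abline( lm(TipPercentage~Bill,data=TIPS), col="red", lwd=2 )   #actual fitted line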
Adam Petrie
Introduction to Regression and Modeling
#A weak but statistically significant relationship
data(TIPS)
M <- lm(TipPercentage~Bill,data=TIPS)
possible_regressions(M)
#A very strong relationship
data(SURVEY10)
M <- lm(PercMoreIntelligentThan~PercMoreAttractiveThan,data=SURVEY10)
possible_regressions(M,permutations=1000)
#Show raw SSE instead of reductions
M <- lm(TipPercentage~PartySize,data=TIPS)
possible_regressions(M,reduction=FALSE)
Sales of a product two quarters after release
data("PRODUCT")
data("PRODUCT")
A data frame with 2768 observations on the following 4 variables.
Outcome
a factor with levels fail
success
indicating whether the product was deemed a success or failure
Category
a factor with levels A
B
C
D
, the type of item (e.g., kitchen, toys, consumables)
Trend
a factor with levels down
up
, indicating whether the sales over the first 13 weeks had an upward trend or downward trend according to a simple linear regression
SoldWeek13
a numeric vector, the number of items sold 13 weeks after release
Inspired by the dunnhumby hackathon hosted at https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon. The goal is to predict whether a product will be a success or failure half a year after its release based on its characteristics and performance during the first quarter after its release.
Adapted from https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon
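A minimal logistic regression sketch for the stated goal (glm models the probability of the second factor level, success; the choice of predictors here is illustrative):

data(PRODUCT)
M <- glm(Outcome~Category+Trend+SoldWeek13,data=PRODUCT,family=binomial)
summary(M)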
Purchase habits of customers
data("PURCHASE")
data("PURCHASE")
A data frame with 27723 observations on the following 6 variables.
Purchase
a factor with levels Buy
No
, whether the customer made a purchase in the following 30 days
Visits
a numeric vector, number of visits customer has made to the chain in last 90 days
Spent
a numeric vector, amount of money customer has spent at the chain the last 90 days
PercentClose
a numeric vector, the percentage of customers' purchases that occur within 5 miles of their home
Closest
a numeric vector, the distance between the customer's home and the nearest store in the chain
CloseStores
a numeric vector, the number of stores in the chain within 5 miles of the customer's home
A nationwide chain is curious as to whether it can predict whether a former customer will make a purchase at one of its stores in the next 30 days based on the customer's spending habits. Some variables are known by the chain (e.g., Visits) and some are available to purchase from credit card companies (e.g., PercentClose). Is purchasing additional information about the customer worth it?
Adapted from real data on the condition that neither the name of the chain nor other parties be disclosed.
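A minimal logistic regression sketch for the chain's question. Purchase has levels Buy and No, so glm would model the probability of No; releveling first makes Buy the modeled outcome:

data(PURCHASE)
PURCHASE$Purchase <- relevel(PURCHASE$Purchase,ref="No")
M <- glm(Purchase~.,data=PURCHASE,family=binomial)
summary(M)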
A QQ plot designed with statistics students in mind
qq(x,ax=NA,leg=NA,cex.leg=0.8)
x |
A vector of data |
ax |
The name you want to call x; used to label the axis. |
leg |
Optional argument that places a legend in the top left of the plot with the text given by leg. |
cex.leg |
Optional argument that gives the magnification of the text in the legend |
This function gives a "QQ plot" that is more easily interpreted than the standard QQ plot. Instead of plotting quantiles, it plots the observed values of x
versus the values expected had x
come from a Normal distribution.
The distribution can be considered approximately Normal if the points stay within the upper/lower dashed red lines (with the possible exception at the far left/right) and if there is no overall global curvature.
Adam Petrie
Introduction to Regression and Modeling
#Distribution does not resemble a Normal
data(TIPS)
qq(TIPS$Bill,ax="Bill")
#Distribution resembles a Normal
data(ATTRACTF)
qq(ATTRACTF$Score,ax="Attractiveness Score")
Harris Bank Salary data
data("SALARY")
data("SALARY")
A data frame with 93 observations on the following 5 variables.
Salary
a numeric vector, starting monthly salary in dollars
Education
a numeric vector, years of schooling at the time of hire
Experience
a numeric vector, number of years of previous work experience
Months
a numeric vector, number of months after Jan 1 1969 that the individual was hired
Gender
a factor with levels Female
Male
Real data used in a court lawsuit. 93 randomly selected employees of Harris Bank Chicago from 1977. Values in this data have been scaled from the original values (e.g., Experience in years instead of months, Education starts at 0 instead of 8, etc.).
Adapted from the case study at http://www.stat.ualberta.ca/statslabs/casestudies/sexdiscrimination.htm
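The lawsuit context suggests asking whether Gender remains associated with Salary after adjusting for qualifications; a minimal sketch:

data(SALARY)
M <- lm(Salary~Education+Experience+Months+Gender,data=SALARY)
summary(M)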
Plots all pairwise interactions present in a regression model to allow for an informal assessment of their strength. When both variables are quantitative, the implicit regression lines of y vs. x1 for a small value, the median, and a large value of x2 are provided (and vice versa). If one of the variables is categorical, the implicit regression lines of y vs. x are displayed for each level of the categorical variable.
see_interactions(M,pos="bottomright",many=FALSE,level=0.95,...)
M |
A fitted linear regression model with interactions between quantitative variables. |
pos |
Where to put the legend, one of "topleft", "top", "topright", "left","center","right","bottomleft","bottom","bottomright" |
many |
If TRUE, the plots are displayed one at a time and user input is required to advance to the next (useful when the model has many interactions). |
level |
Defines what makes a "small" and "large" value of x1 and x2. By default level=0.95, so small and large correspond to the 2.5th and 97.5th percentiles. |
... |
Additional arguments to plot, e.g., cex. |
When determining the implicit regression lines, all variables not involved in the interaction are assumed to equal 0 (if quantitative) or to equal the level that comes first alphabetically (if categorical). Tickmarks on the y axis are thus irrelevant and are not displayed.
The plots allow an informal assessment of the presence of an interaction between the variables x1 and x2 in the model, after accounting for the other predictors. If the implicit regression lines are nearly parallel, then the interaction is weak if it exists at all. If the implicit regression lines have noticeably different slopes, then the interaction is strong.
When an interaction is present, then the strength of the relationship between y and x1 depends on the value of x2. In other words, the difference in the average value of y between two individuals who differ in x1 by 1 unit depends on their (common) value of x2 (sometimes the expected difference is large; sometimes it is small).
If one of the variables in the interaction is categorical, the presence of an interaction implies that the strength of the relationship between y and x differs between levels of the categorical variable. In other words, sometimes the difference in the expected value of y between an individual with level A and an individual with level B is large and sometimes it is small (and this depends on the common value of x of the individuals being compared).
The command visualize.model
gives a better representation when only two predictors are in the model.
Adam Petrie
Introduction to Regression and Modeling
data(SALARY)
M <- lm(Salary~.^2,data=SALARY)
#see_interactions(M,many=TRUE)  #not run since it requires user input
data(STUDENT)
M <- lm(CollegeGPA~(Gender+HSGPA+Family)^2+HSGPA*ACT,data=STUDENT)
see_interactions(M,cex=0.6)
This function takes the output of regsubsets and prints out a table of the top-performing models based on AIC criteria.
see_models(ALLMODELS,report=0,aicc=FALSE,reltomin=FALSE)
ALLMODELS |
An object of class regsubsets created from the regsubsets function in package leaps. |
report |
An optional argument specifying the number of top models to print out. If left at a default of 0, the function reports all models whose AICs are within 4 of the lowest overall AIC. |
aicc |
Either TRUE or FALSE, specifying whether the AICc (small-sample corrected AIC) is reported instead of the AIC. |
reltomin |
Either TRUE or FALSE, specifying whether the reported values are given relative to the minimum (so the best model shows 0) or as raw values. |
This function uses the summary function applied to the output of regsubsets. The AIC is calculated to be the one obtained via extractAIC to allow for easy comparison with build.model and step.
Although the model with the lowest AIC is typically chosen when making a descriptive model, models with AICs within 2 are essentially functionally equivalent. Any model with an AIC within 2 of the smallest is a reasonable choice since there is no statistical reason to prefer one over the other. The function returns a data frame of the AIC (or AICc), the number of variables, and the predictors in the "best" models.
Recall that the function regsubsets by default considers up to 8 predictors and does not preserve model hierarchy. Interactions may appear without both component terms. Further, only a subset of the indicator variables used to represent a categorical variable may appear.
Adam Petrie
Introduction to Regression and Modeling
data(SALARY)
ALL <- regsubsets(Salary~.^2,data=SALARY,method="exhaustive",nbest=4)
see_models(ALL)
#By default, regsubsets considers up to 8 predictors; here it looks at up to 15
data(ATTRACTF)
ALL <- regsubsets(Score~.,data=ATTRACTF,nvmax=15,nbest=1)
see_models(ALL,aicc=TRUE,report=5)
Produces a segmented barchart of the input variable, forcing it to be categorical if necessary
segmented_barchart(x)
x |
A vector. If numerical, it is treated as categorical variable in the form of a factor |
Standard segmented barchart. Shaded areas are labeled with the levels they represent, and the percentage of cases with that level is labeled on the axis to the right.
Adam Petrie
Introduction to Regression and Modeling
data(STUDENT)
segmented_barchart(STUDENT$Family)    #Categorical variable
data(TIPS)
segmented_barchart(TIPS$PartySize)    #Numerical variable treated as categorical
Interest in a frequent flier program (artificial)
data("SMALLFLYER")
data("SMALLFLYER")
A data frame with 100 observations on the following 2 variables.
Gender
a factor with levels Female
Male
Interest
a factor with levels No
Yes
This artificial dataset tabulates interest in a new frequent flyer program by gender. A larger version of the same data is in LARGEFLYER.
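A quick look at the association using mosaic (documented earlier in this reference):

data(SMALLFLYER)
mosaic(Interest~Gender,data=SMALLFLYER)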
Predicting future sales based on sales data in first quarter after release
data("SOLD26")
data("SOLD26")
A data frame with 2768 observations on the following 16 variables.
SoldWeek26
a numeric vector, the number of items sold 26 weeks after release and the quantity to predict
StoresSelling1
a numeric vector, the number of stores selling the item 1 week after release
StoresSelling3
a numeric vector
StoresSelling5
a numeric vector
StoresSelling7
a numeric vector
StoresSelling9
a numeric vector
StoresSelling11
a numeric vector
StoresSelling13
a numeric vector
StoresSelling26
a numeric vector, the planned number of stores selling the item 26 weeks after release
Sold1
a numeric vector, the number of items sold 1 week after release
Sold3
a numeric vector
Sold5
a numeric vector
Sold7
a numeric vector
Sold9
a numeric vector
Sold11
a numeric vector
Sold13
a numeric vector, the number of items sold 13 weeks after release
Inspired by the dunnhumby hackathon hosted at https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon. The goal is to predict the number of items sold 26 weeks after release based on the characteristics of its sales during the first 13 weeks after release (along with information about how many stores are planning to sell the product 26 weeks after release).
Adapted from https://www.kaggle.com/c/hack-reduce-dunnhumby-hackathon
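A minimal sketch for the stated prediction goal; the choice of predictors here is illustrative:

data(SOLD26)
M <- lm(SoldWeek26~Sold13+StoresSelling13+StoresSelling26,data=SOLD26)
summary(M)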
Speed vs. Fuel Efficiency
data("SPEED")
data("SPEED")
A data frame with 40 observations on the following 2 variables.
AverageSpeed
a numeric vector describing the average speed that the vehicle was driven
FuelEfficiency
a numeric vector describing the measured fuel efficiency
The relationship between fuel efficiency and speed is non-monotonic.
Artificial
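Because the relationship is non-monotonic, a straight line cannot fit it; a quadratic term is a minimal way to capture the bend:

data(SPEED)
plot(FuelEfficiency~AverageSpeed,data=SPEED)
M <- lm(FuelEfficiency~AverageSpeed+I(AverageSpeed^2),data=SPEED)
curve( predict(M,data.frame(AverageSpeed=x)), add=TRUE )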
Data on the College GPAs of students in an introductory statistics class
data("STUDENT")
data("STUDENT")
A data frame with 607 observations on the following 19 variables.
CollegeGPA
a numeric vector
Gender
a factor with levels Female
Male
HSGPA
a numeric vector, can range up to 5 if the high school allowed it
ACT
a numeric vector, ACT score
APHours
a numeric vector, number of AP hours student took in HS
JobHours
a numeric vector, number of hours student currently works on average
School
a factor with levels Private
Public
, type of HS
Languages
a numeric vector
Honors
a numeric vector, number of honors classes taken in HS
Smoker
a factor with levels No
Yes
AffordCollege
a factor with levels No
Yes
, can the student and his/her family pay for the University of Tennessee without taking out loans?
HSClubs
a numeric vector, number of clubs belonged to in HS
HSJob
a factor with levels No
Yes
, whether the student maintained a job at some point while in HS
Churchgoer
a factor with levels No
Yes
, answer to the question "Do you regularly attend church?"
Height
a numeric vector (inches)
Weight
a numeric vector (lbs)
Class
a factor with levels Junior
Senior
Sophomore
Family
what position they are in the family, a factor with levels Middle Child
Oldest Child
Only Child
Youngest Child
Pet
favorite pet, a factor with levels Both
Cat
Dog
Neither
Same data as EDUCATION with the addition of the Class variable and with slightly different names for the variables.
Responses are from students in an introductory statistics class at the University of Tennessee in 2010.
This function determines levels that are similar to each other either in terms of their average value of some quantitative variable or the percentages of each level of a two-level categorical variable. Use it to get a rough idea of what levels are "about the same" with regard to some variable.
suggest_levels(formula,data,maxlevels=NA,target=NA,recode=FALSE,plot=TRUE,...)
formula |
A standard R formula written as y~x. Here, x is the variable whose levels you wish to combine, and y is the quantitative or two-level categorical variable. |
data |
An optional argument giving the name of the data frame that contains x and y. If not specified, the function will use existing definitions in the parent environment. |
maxlevels |
The maximum number of combined levels to consider (cannot exceed 26). |
target |
The number of resulting levels into which the levels of x will be combined. Will default to the suggested value of the fewest number whose resulting BIC is no more than 4 above the lowest BIC of any combination. |
recode |
TRUE or FALSE. If TRUE, a list containing conversion tables and the recoded values is returned; see Details. |
plot |
TRUE or FALSE, whether to produce a plot illustrating the BIC of each combination scheme. |
... |
Additional arguments used to make the plot. Typically this will be |
This function calculates the average value (or percentage of each level) of y for each level of x. It then builds a partition model taking y to be this average value (or percentage) with x being the predictor variable. The first split yields the "best" scheme for combining levels of x into 2 values. The second split yields the "best" scheme for combining levels of x into 3 values, etc.
The argument maxlevels specifies the maximum number of levels in the combination scheme. By default, it will use the number of levels of x (i.e., no combination). Setting this to a lower number saves time, since most likely a small number of combined levels is desired. This is useful for seeing how different combination schemes compare.
The argument target forces the algorithm to produce exactly this number of combined levels. This is useful once you have determined how many levels of x you want.
If recode is FALSE, a table is produced showing the combined levels along with the "BIC" of the combination scheme (lower is better, but a difference of around 4 or less is negligible). The suggested combination is the fewest number of levels whose BIC is no more than 4 above that of the scheme with the lowest BIC.
If recode is TRUE, a list of three elements is produced. $Conversion1 gives a table of the Old and New levels alphabetized by Old, while $Conversion2 gives a table of the Old and New levels alphabetized by New. $newlevels gives a factor of the cases' levels under the new combination scheme. If target is not set, it will use the suggested number of levels.
Adam Petrie
Introduction to Regression and Modeling
data(DONOR)
#Can levels of URBANICITY be treated the same with regards to probability of donation?
#Analysis suggests yes (all levels in one)
suggest_levels(Donate~URBANICITY,data=DONOR)
#Can levels of URBANICITY be treated the same with regards to donation amount?
#Analysis suggests yes, but perhaps there are four "effective levels"
suggest_levels(Donation.Amount~URBANICITY,data=DONOR)
SL <- suggest_levels(Donation.Amount~URBANICITY,data=DONOR,target=4,recode=TRUE)
SL$Conversion1
#Add a column to the DONOR dataframe that contains these new cluster identities
DONOR$newCLUSTER_CODE <- SL$newlevels
Reports the RMSE, AIC, and variable importances for a partition model or the variable importances from a random forest.
summarize_tree(TREE)
TREE |
A partition model created with rpart, or a random forest created with randomForest. |
Extracts the RMSE and AIC of a partition model and the variable importances of partition models or random forests.
Adam Petrie
Introduction to Regression and Modeling
data(WINE)
TREE <- rpart(Quality~.,data=WINE,control=rpart.control(cp=0.01,xval=10,minbucket=5))
summarize_tree(TREE)
RF <- randomForest(Quality~.,data=WINE,ntree=50)
summarize_tree(RF)
data(NFL)
TREE <- rpart(X4.Wins~.,data=NFL,control=rpart.control(cp=0.002,xval=10,minbucket=5))
summarize_tree(TREE)
RF <- randomForest(X4.Wins~.,data=NFL,ntree=50)
summarize_tree(RF)
Characteristics of students in an introductory statistics class at the University of Tennessee in 2009
data("SURVEY09")
data("SURVEY09")
A data frame with 579 observations on the following 47 variables.
X01.ID
a numeric vector
X02.Gender
a factor with levels Female
Male
X03.Weight
a numeric vector, estimated weight
X04.DesiredWeight
a numeric vector
X05.Class
a factor with levels Freshman
Junior
Senior
Sophmore
X06.BornInTN
a factor with levels No
Yes
X07.Greek
a factor with levels No
Yes
, if the student belongs to a fraternity/sorority
X08.UTFirstChoice
a factor with levels No
Yes
X09.Churchgoer
a factor with levels No
Yes
, does student attend a religious service once a week
X10.ParentsMarried
a factor with levels No
Yes
X11.GPA
a numeric vector
X12.SittingLocation
a factor with levels Back
Front
Middle
Varies
X13.WeeklyHoursStudied
a numeric vector
X14.Scholarship
a factor with levels No
Yes
X15.FacebookFriends
a numeric vector
X16.AgeFirstKiss
a numeric vector, age at which student had their first romantic kiss
X17.CarYear
a numeric vector
X18.DaysPerWeekAlcohol
a numeric vector, how many days a week student typically drinks
X19.NumDrinksParty
a numeric vector, how many drinks student typically has when he or she goes to a party
X20.CellProvider
a factor with levels ATT
Sprint
USCellar
Verizon
X21.FreqDroppedCalls
a factor with levels Occasionally
Often
Rarely
X22.MarriedAt
a numeric vector, age by which student hopes to be married
X23.KidsBy
a numeric vector, age by which students hopes to have kids
X24.Computer
a factor with levels Mac
Windows
X25.FastestDrivingSpeed
a numeric vector
X26.BusinessMajor
a factor with levels No
Yes
X27.Major
a factor with levels Business
NonBusiness
X28.TxtsPerDay
a numeric vector
X29.FootballGames
a numeric vector, games student hopes to attend
X30.HoursWorkOut
a numeric vector, per week
X31.MilesToSchool
a numeric vector, each day
X32.MoneyInBank
a numeric vector
X33.MoneyOnHaircut
a numeric vector
X34.PercentTuitionYouPay
a numeric vector
X35.SongsDownloaded
a numeric vector, songs typically downloaded (legally/illegally) a month
X36.ParentCollegeGraduate
a factor with levels No
Yes
X37.HoursSleepPerNight
a numeric vector
X38.Last2DigitsPhone
a numeric vector
X39.NumClassesMissed
a numeric vector
X40.BooksReadThisYear
a numeric vector
X41.UseChopsticks
a factor with levels No
Yes
X42.YourAttractiveness
a numeric vector, 1 (unattractive) to 5 (very attractive)
X43.Obama
a factor with levels No
NotVote
Yes
X44.HoursWorkedPerWeek
a numeric vector, at a job outside of a school
X45.MoviesInTheater
a numeric vector, number watched in theater this year
X46.KnowSomeoneH1N1
a factor with levels No
Yes
X47.ReadBeacon
a factor with levels No
Yes
, the school newspaper
Students answered 47 questions to generate data for a project in an introductory statistics class at the University of Tennessee in the Fall of 2009. The responses here have had only minimal cleaning (negative numbers omitted), so some values are bad (e.g., a weight of 16). The questions were:
Stat 201 Fall 2009 Survey Questions
1. What section are you in?
2. Gender [Male, Female]
3. Your weight (in pounds) [0 to 500]
4. What is your desired weight (in pounds)? [0 to 1000]
5. What year are you? [Freshman, Sophomore, Junior, Senior, Other]
6. Were you born in Tennessee? [Yes, No]
7. Are you a member of a Greek social society (i.e., a Fraternity/Sorority)? [Yes, No]
8. Was UT your first choice? [Yes, No]
9. Do you usually attend a religious service once a week? [Yes, No]
10. Are your parents married? [Yes, No]
11. Thus far, what is your GPA (look up on CPO if you need to)? [0 to 4]
12. Given a choice, where do you like to sit in class? [The front row, Near the front, Around the middle, Near the back, The back row, Somewhere different all the time]
13. On average, how many hours per day do you study/do homework? [0 to 24]
14. Do you receive one or more scholarships? [Yes, No]
15. How many Facebook friends do you have? Type -1 if you don't use Facebook. [-1 to 5000]
16. How old were you when you had your first romantic kiss? Type -1 if it has not happened yet. [-1 to 100]
17. What is the year of the car you drive most often? Type a four digit number. Enter 1908 if you never drive a car. [1908 to 2011]
18. On average, how many days per week do you consume one or more alcoholic beverage? Type -1 if you never drink alcoholic beverages. [-1 to 7]
19. On average, how many alcoholic drinks do you have when you party? Type -1 if you never drink alcoholic beverages. [-1 to 100]
20. Which cell phone provider do you use (the most, if you have multiple services)? [ATT (Cingular), Cricket, Sprint, T-Mobile, U.S. Cellular, Verizon, Other, I don't use a cell phone]
21. How often do you have dropped calls? [Never, Rarely, Sometimes, Often, Constantly]
22. What is the age at which you hope to be married? Type -1 if you are already married and type -2 if you never want to get married. [-2 to 100]
23. What is the age at which you hope to have your first child? Type -1 if you already have one or more children, type -2 if you never want to have children. [-2 to 100]
24. What type of computer do you use most often? [PC running Windows, PC running linux, Mac running Mac OS, Mac running linux, Mac running Windows, Other, I don't understand the choices above]
25. What is the fastest speed (in miles per hour) you have ever achieved while driving a car? [0 to 300]
26. Do you plan on going into the Business School? [Yes, No]
27. What is your desired (or actual) major? [Accounting, Economics, Finance, Logistics, Marketing, Statistics, Other]
28. How many text messages do you typically send on any given day? Type -1 if you never send text messages. [-1 to 1000]
29. How many UT football games do you hope to attend this year? (Include games already attended this year. Do not include scrimmages.) [0 to 14]
30. How many hours a week do you work out/play sports/exercise, etc.? [0 to 168]
31. How many miles do you drive to school on a typical day? [0 to 500]
32. How much money do you have in your bank account? Type -999 if you think it's none of our business. [-999 to 10000000]
33. How much do you typically spend on a hair cut? [0 to 1000]
34. What percent of tuition are you personally responsible for? Type a number between 0 and 100. [0 to 100]
35. Typically, how many songs do you download a month (both legally and/or illegally)? [0 to 10000]
36. Did at least one of your parents graduate from college? [Yes, No]
37. On average, how many hours do you sleep a night? [0 to 24]
38. What are the last two digits of your phone number? (Type 0 for 00, 1 for 01, 2 for 02, etc.) [0 to 99]
39. Approximately how many classes have you missed/skipped so far this semester? (For all your courses, including absences for legitimate excuses) [0 to 150]
40. How many books (other than textbooks) have you read so far this year? [0 to 1000]
41. Are you proficient with a pair of chopsticks? [Yes, No]
42. How would you rate your attractiveness on a scale of 1 to 5, with 5 being the most attractive? [1 to 5]
43. Did you vote for Barack Obama in last November's election? [Yes, No I voted for someone else, No I didn't vote at all]
44. On average, how many hours do you work at a job per week? [0 to 168]
45. How many movies have you watched in theaters this year? [0 to 1000]
46. Do you personally know someone who has come down with H1N1 virus? [Yes, No]
47. Do you read the Daily Beacon on a regular basis? [Yes, No]
Characteristics of students in an introductory statistics class at the University of Tennessee in 2010
data("SURVEY10")
data("SURVEY10")
A data frame with 699 observations on the following 20 variables.
Gender
a factor with levels Female
Male
Height
a numeric vector
Weight
a numeric vector
DesiredWeight
a numeric vector
GPA
a numeric vector
TxtPerDay
a numeric vector
MinPerDayFaceBook
a numeric vector
NumTattoos
a numeric vector
NumBodyPiercings
a numeric vector
Handedness
a factor with levels Ambidextrous
Left
Right
WeeklyHrsVideoGame
a numeric vector
DistanceMovedToSchool
a numeric vector
PercentDateable
a numeric vector
NumPhoneContacts
a numeric vector
PercMoreAttractiveThan
a numeric vector
PercMoreIntelligentThan
a numeric vector
PercMoreAthleticThan
a numeric vector
PercFunnierThan
a numeric vector
SigificantOther
a factor with levels No
Yes
OwnAttractiveness
a numeric vector
Students answered 50 questions to generate data for a project in an introductory statistics class at the University of Tennessee in the Fall of 2010. The data here represent a selection of the questions. The responses have been somewhat cleaned (unlike SURVEY09): obviously bogus responses have been omitted, but there may still be issues.
The selected questions were:
Gender
Gender [Male, Female]
Height
Your height (in inches) [48 to 96]
Weight
Your weight (in pounds) [0 to 500]
DesiredWeight
What is your desired weight (in pounds)? [0 to 1000]
GPA
Thus far, what is your GPA (look up on CPO if you need to)? [0 to 4]
TxtPerDay
How many text messages do you typically send on any given day? Type 0 if you
never send text messages. [0 to 1000]
MinPerDayFaceBook
On average, how many minutes per day do you spend on internet social networks
(such as Facebook, MySpace, Twitter, LinkedIn, etc.)? [0 to 1440]
NumTattoos
How many tattoos do you have? [0 to 100]
NumBodyPiercings
How many body piercings do you have (do not include piercings you have let heal
up and are gone)? Count each piercing separately (i.e., pierced ears counts as 2
piercings). [0 to 100]
Handedness
Are you right-handed, left-handed, or ambidextrous? [Right-Handed, Left-
Handed, Ambidextrous]
WeeklyHrsVideoGame
About how many hours a week do you play video games? This includes console games like Wii, Playstation, Xbox, as well as gaming apps for your phone, online games in Facebook, general computer games, etc. [0 to 168]
DistanceMovedToSchool
Go to maps.google.com or another website that provides maps. Get directions from your home address (the house/apartment/etc. you most recently lived in before coming to college) and the zip code 37996. How many miles does it say the trip is? Type the smallest number if offered multiple routes. Type 0 if you are unable to get driving directions for any reason. [0 to 5000]
PercentDateable
What percentage of people around your age in your preferred gender do you
consider dateable? [0 to 100]
NumPhoneContacts
How many contacts do you have in your cell phone? Answer 0 if you don't use a
cell phone, or have no contacts in your cell phone. [0 to 1000]
PercMoreAttractiveThan
What percentage of people at UT of your own gender and class level do you think you are more attractive than? [0 to 100]
PercMoreIntelligentThan
What percentage of people at UT of your own gender and class level do you think you are more intelligent than? [0 to 100]
PercMoreAthleticThan
What percentage of people at UT of your own gender and class level do you think you are more athletic than? [0 to 100]
PercFunnierThan
What percentage of people at UT of your own gender and class level do you think you are funnier than? [0 to 100]
SigificantOther
Do you have a significant other? [Yes, No]
OwnAttractiveness
On a scale of 1-100, with 100 being the most attractive, rate your own
attractiveness. [1 to 100]
Characteristics of students in an introductory statistics class at the University of Tennessee in 2011
data("SURVEY11")
data("SURVEY11")
A data frame with 628 observations on the following 51 variables.
X01.ID
a numeric vector
X02.Gender
a factor with levels F
M
X03.Height
a numeric vector
X04.Weight
a numeric vector
X05.SatisfiedWithWeight
a factor with levels No I Wish I Weighed Less
No I Wish I Weighed More
Yes
X06.Class
a factor with levels Freshman
Junior
Senior
Sophomore
X07.GPA
a numeric vector
X08.Greek
a factor with levels No
Yes
X09.PoliticalBeliefs
a factor with levels Conservative
Liberal
Mix
X10.BornInTN
a factor with levels No
Yes
X11.HairColor
a factor with levels Black
Blonde
Brown
Red
X12.GrowUpInUS
a factor with levels No
Yes
X13.NumberHousemates
a numeric vector
X14.FacebookFriends
a numeric vector
X15.NumPeopleTalkToOnPhone
a numeric vector
X16.MinutesTalkOnPhone
a numeric vector
X17.PeopleSendTextsTo
a numeric vector
X18.NumSentTexts
a numeric vector
X19.Computer
a factor with levels Mac
PC
X20.Churchgoer
a factor with levels No
Yes
X21.HoursAtJob
a numeric vector
X22.FastestCarSpeed
a numeric vector
X23.NumTimesBrushTeeth
a numeric vector
X24.SleepPerNight
a numeric vector
X25.MinutesExercisingDay
a numeric vector
X26.BooksReadMonth
a numeric vector
X27.ShowerLength
a numeric vector
X28.PercentRecordedTV
a numeric vector
X29.MostMilesRunOneDay
a numeric vector
X30.MorningPerson
a factor with levels No
Yes
X31.PercentStudentsDateable
a numeric vector
X32.PercentYouAreMoreAttractive
a numeric vector
X33.PercentYouAreSmarter
a numeric vector
X34.RelationshipStatus
a factor with levels Complicated
Dating
Married
Single
X35.AgeFirstKiss
a numeric vector
X36.WeaponAttractMate
a factor with levels Humor
Intelligence
Looks
Other
X37.NumSignificantOthers
a numeric vector
X38.WeeksLongestRelationship
a numeric vector
X39.NumDrinksWeek
a numeric vector
X40.FavAlcohol
a factor with levels Beer
Liquor
None
Wine
X41.SpeedingTickets
a numeric vector
X42.Smoker
a factor with levels No
Yes
X43.IllegalDrugs
a factor with levels No
Yes
X44.DefendantInCourt
a factor with levels No
Yes
X45.NightInJail
a factor with levels No
Yes
X46.BrokenBone
a factor with levels No
Yes
X47.CentsCarrying
a numeric vector
X48.SawLastHarryPotter
a factor with levels No
Yes
X49.NumHarryPotterRead
a numeric vector
X50.HoursContinuouslyAwake
a numeric vector
X51.NumCountriesVisited
a numeric vector
Students answered 51 questions to generate data for a project in an introductory statistics class at the University of Tennessee in the Fall of 2011. The responses have been minimally modified or cleaned. The questions were:
1. What section are you in? (To be viewed only by the Stat 201 coordinator, and removed prior to distributing the data.)
2. What is your gender? [M, F]
3. What is your height (in inches)? [0, 100]
4. What is your weight (in pounds)? [0, 1000]
5. Are you satisfied with your current weight? [Yes, No I wish I weighed less, No I wish I weighed more]
6. What is your class level? [Freshman, Sophomore, Junior, Senior, 5+ year senior, Non-traditional]
7. What is your current GPA? [0, 4]
8. Are you a member of a fraternity/sorority? [Yes, No]
9. Overall, do you consider your social/political beliefs to be: [more liberal, more conservative, a mix of liberal and conservative views]
10. Were you born in Tennessee? [Yes, No]
11. What is your natural hair color? [Black, Brown, Red, Blond, Gray] (Note: a database error required Blond and Gray to be combined into one category.)
12. Did you grow up in the US? [Yes, No, Some time in the US but a significant time in another country]
13. How many people share your current residence? Count yourself, so if you live alone, answer 1. Also, if you live in a dorm, count yourself plus just your roommates/suitemates. [1, 1000]
14. How many Facebook friends do you currently have? (To see how many friends you have in Facebook, open a new tab or browser window and log in to Facebook, click the down arrow next to Account, select Edit Friends, and on the left of your screen your friends count is in parentheses.) [0, 10000]
15. How many people do you talk to on the phone in a typical day? [0, 1000]
16. How many MINUTES a day do you typically spend on the phone talking to people? [0, 1440]
17. How many different people do you typically send text messages to on a typical day? [0, 1000]
18. How many total texts do you think you send to people on a typical day? [0, 5000]
19. What type of computer do you use the most? [Mac, PC, Linux]
20. Do you currently attend religious services at least once a month? [Yes, No]
21. About how many HOURS PER WEEK do you work at a job? [0, 168]
22. What is the fastest speed you have achieved while driving a car (in miles per hour)? [0, 500]
23. How many times per day do you typically brush your teeth? [0, 100]
24. On a typical school night, how many HOURS do you sleep? [0, 24]
25. How many MINUTES PER DAY do you typically engage in physical activity (e.g., walking to and from class, working out at the gym, sports practice, etc.)? [0, 1440]
26. How many books have you read from cover to cover over the last month for pleasure? [0, 1000]
27. How many MINUTES do you typically spend when you take a shower? [0, 1440]
28. Advertisers are concerned that people are "fast forwarding" past their TV commercials, because more and more people are recording broadcast television and watching it later (for example, on a DVR). Approximately what percent of the TV that you watch (that HAS commercials in it) is something you recorded, and therefore you can "fast forward" past the commercials? [0, 100]
29. What is the longest that you've ever walked/run/hiked in a single day (in MILES)? [0, 189]
30. Do you consider yourself a "morning person"? [Yes, No]
31. What percentage of UT students in your preferred gender do you think are dateable? [0, 100]
32. What percentage of UT students do you think you are more attractive than? [0, 100]
33. What percentage of UT students do you think you are more intelligent than? [0, 100]
34. What is your relationship status? [Single, Casually dating one or more people, Dating someone regularly, Engaged, Married, It's complicated]
35. How old were you when you had your first romantic kiss? (Enter 0 if this has not yet happened.) [0, 99]
36. Which of the following would you consider to be your main weapon for attracting a potential mate? [Looks, Intelligence, Sense of Humor, Other]
37. How many boyfriends/girlfriends have you had? (We'll leave it up to you as to what constitutes a boyfriend or girlfriend.) [0, 1000]
38. What is the longest amount of time (in WEEKS) that you have been in a relationship with a significant other? (A shortcut: take the number of months and multiply by 4, or the number of years and multiply by 52.) [0, 4000]
39. How many alcoholic beverages do you typically consume PER WEEK? (Consider 1 alcoholic beverage a 12 oz. beer, a 4 oz. glass of wine, a 1 oz. shot of liquor, etc.) [0, 200]
40. What is your favorite kind of alcoholic beverage? [I don't drink alcoholic beverages, Beer, Wine, Whiskey, Vodka, Gin, Tequila, Rum, Other]
41. How many speeding tickets have you received? [0, 500]
42. Do you consider yourself a "smoker"? [Yes, No]
43. Have you ever used an illegal/controlled substance? (Exclude alcohol/cigarettes consumed when underaged.) [Yes, No]
44. Have you ever appeared before a judge/jury as a defendant? (Exclude speeding or parking tickets.) [Yes, No]
45. Have you ever spent the night in a jail cell? [Yes, No]
46. Have you ever broken a bone that required surgery or a cast (or both)? [Yes, No]
47. Check your pockets and/or purse and report how much money in coins (in CENTS) you are currently carrying. For example, if you have one quarter and one penny, type 26, not 0.26. [0, 1000]
48. Have you seen the latest Harry Potter movie that came out in July 2011? [Yes, No]
49. How many of the seven Harry Potter books have you completely read? [0, 7]
50. Estimate the longest amount of time (in HOURS) that you have continuously stayed awake. [0, 450]
51. How many countries have you ever stepped foot in outside an airport (include the US in your count)? [1, 196]
One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:
data("TIPS")
data("TIPS")
A data frame with 244 observations on the following 8 variables.
TipPercentage
a numeric vector, the tip written as a percentage (0-100) of the total bill
Bill
a numeric vector, the bill amount (dollars)
Tip
a numeric vector, the tip amount (dollars)
Gender
a factor with levels Female, Male; the gender of the payer of the bill
Smoker
a factor with levels No, Yes; whether the party included smokers
Weekday
a factor with levels Friday, Saturday, Sunday, Thursday; the day of the week
Time
a factor with levels Day, Night; the rough time of day
PartySize
a numeric vector, number of people in party
This is the Tips dataset in the reshape package, modified to include the tip percentage.
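The TipPercentage column is just the tip expressed as a share of the bill. As a quick sanity check (a sketch; the column was presumably derived this way, though the package does not say so explicitly):

library(regclass)
data(TIPS)
#TipPercentage should equal 100*Tip/Bill; compare against the stored column
recomputed <- with(TIPS, 100*Tip/Bill)
all.equal(recomputed, TIPS$TipPercentage)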
Calculates the variance inflation factors of all predictors in regression models
VIF(mod)
mod |
A linear or logistic regression model |
This function is a simple port of vif from the car package. The VIF of a predictor measures how well that predictor can be predicted by a linear regression on the other predictors. The square root of the VIF tells you how much larger the standard error of the estimated coefficient is compared to the case where that predictor is independent of the other predictors.
A general guideline is that a VIF larger than 5 or 10 is large, indicating that the model has trouble estimating the coefficient. However, large VIFs do not in general degrade the quality of predictions. If the VIF of a predictor is larger than 1/(1-R^2), where R^2 is the Multiple R-squared of the regression, then that predictor is more related to the other predictors than it is to the response.
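To make the definition concrete, here is a minimal sketch (not the package's implementation) that computes the VIF of one predictor by hand, using the SALARY data from the examples below: the R-squared from regressing that predictor on the others is converted into 1/(1-R^2).

library(regclass)
data(SALARY)
#Regress Education on the other predictors and extract that fit's R-squared
aux <- lm(Education~.-Salary, data=SALARY)
r2 <- summary(aux)$r.squared
1/(1-r2) #should match the Education entry of VIF(lm(Salary~.,data=SALARY))
sqrt(1/(1-r2)) #factor by which Education's standard error is inflated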
Adam Petrie
Introduction to Regression and Modeling with R
#A case where the VIFs are small
data(SALARY)
M <- lm(Salary~.,data=SALARY)
VIF(M)

#A case where (some of) the VIFs are large
data(BODYFAT)
M <- lm(BodyFat~.,data=BODYFAT)
VIF(M)
Provides useful plots to illustrate the inner workings of regression models with one or two predictors, or of a partition model without too many branches.
visualize_model(M,loc="topleft",level=0.95,cex.leg=0.7,midline=TRUE,...)
M |
A linear or logistic regression model with one or two predictors (not all categorical) produced by lm or glm, or a partition model produced by rpart |
loc |
The location for the legend, if one is to be displayed. Can also be "top", "topright", "left", "center", "right", "bottomleft", "bottom", or "bottomright". |
level |
The level of confidence for confidence and prediction intervals for the case of simple linear regression. |
cex.leg |
Magnification factor for text in legends. Smaller numbers indicate smaller text. Default is 0.7. |
midline |
logical, either TRUE or FALSE; when TRUE, a line marking the 50% probability level is drawn on logistic regression plots |
... |
Additional arguments passed to plot |
If M is a simple linear regression model, this provides a scatterplot, the fitted line, and confidence/prediction intervals.
If M is a simple logistic regression model, this provides the fitted logistic curve.
If M is a regression with two quantitative predictors, this provides the implicit regression lines when one of the variables equals its 5th (small), 50th (median), and 95th (large) percentiles. The model may have interaction terms; in that case, the p-value of the interaction is output. The definition of small and large can be changed with the level argument.
If M is a regression with a quantitative predictor and a categorical predictor (with or without interactions), this provides the implicit regression lines for each level of the categorical predictor. The p-value of the effect test is displayed if an interaction is in the model.
If M is a partition model from rpart, this shows the tree.
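For intuition, the implicit regression lines can be reproduced by hand (a sketch, not the function's actual code): hold one predictor at a chosen percentile and trace the model's predictions across the other. Here Experience is held at its 95th percentile ("large"), using the SALARY model from the examples below:

library(regclass)
data(SALARY)
M <- lm(Salary~Education*Experience, data=SALARY)
#Implicit line of Salary vs. Education when Experience is at its 95th percentile;
#visualize_model() draws such lines for the 5th, 50th, and 95th percentiles
exp95 <- quantile(SALARY$Experience, 0.95)
grid <- data.frame(Education=seq(min(SALARY$Education), max(SALARY$Education), length.out=100),
                   Experience=exp95)
plot(Salary~Education, data=SALARY)
lines(grid$Education, predict(M, newdata=grid), lwd=2)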
Adam Petrie
Introduction to Regression and Modeling
data(SALARY)

#Simple linear regression with 90% confidence and prediction intervals
M <- lm(Salary~Education,data=SALARY)
visualize_model(M,level=0.90,loc="bottomright")

#Multiple linear regression with two quantitative predictors (no interaction)
M <- lm(Salary~Education+Experience,data=SALARY)
visualize_model(M)

#Multiple linear regression with two quantitative predictors (with interaction)
#Take small and large to be the 25th and 75th percentiles
M <- lm(Salary~Education*Experience,data=SALARY)
visualize_model(M,level=0.75)

#Multiple linear regression with one categorical and one quantitative predictor
M <- lm(Salary~Education*Gender,data=SALARY)
visualize_model(M)

data(WINE)

#Simple logistic regression with expanded x limits
M <- glm(Quality~alcohol,data=WINE,family=binomial)
visualize_model(M,xlim=c(0,20))

#Multiple logistic regression with two quantitative predictors
M <- glm(Quality~alcohol*sulphates,data=WINE,family=binomial)
visualize_model(M,loc="left",midline=FALSE)

data(TIPS)

#Multiple logistic regression with one categorical and one quantitative predictor
#expanded x-limits to see more of the curve
M <- glm(Smoker~PartySize*Weekday,data=TIPS,family=binomial)
visualize_model(M,loc="topright",xlim=c(-5,15))

#Partition model predicting a quantitative response
TREE <- rpart(Salary~.,data=SALARY)
visualize_model(TREE)

#Partition model predicting a categorical response
TREE <- rpart(Quality~.,data=WINE)
visualize_model(TREE)
Attempts to show how the relationship between y and x is being modeled in a partition or random forest model
visualize_relationship(TREE,interest,on,smooth=TRUE,marginal=TRUE,nplots=5, seed=NA,pos="topright",...)
TREE |
A partition or random forest model (though it works with many regression models as well) |
interest |
The name of the predictor variable for which the plot of y vs. x is to be made. |
on |
A dataframe giving the values of the other predictor variables for which the relationship is to be visualized. Typically this is the dataframe on which the partition model was built. |
smooth |
If TRUE, the displayed relationship is drawn as a smoothed curve; if FALSE, the predicted values are connected directly by line segments |
marginal |
If TRUE, the marginal relationship between y and x (averaging over the rows of on) is shown; if FALSE, the relationship is drawn separately for individual rows of on, so differences between the curves reveal interactions |
nplots |
The number of rows of on for which the relationship is drawn when marginal is FALSE |
seed |
the seed for the random number generator, if reproducibility is required |
pos |
the location of the legend |
... |
additional arguments passed to plot |
The function shows a scatterplot of y vs. x for the on dataframe, then illustrates how TREE models the relationship between y and x by plotting the predicted value of y for each row of the data along with a curve summarizing the relationship. It is useful for seeing what the relationship between y and x, as modeled by TREE, "looks like", both as a whole and for particular combinations of the other variables. If marginal is FALSE, then differences between the curves indicate the presence of an interaction between x and another variable.
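Conceptually, the marginal curve resembles a partial dependence plot. A minimal hand-rolled sketch under that interpretation (an illustration, not the function's actual code), using the SALARY data and random forest from the examples below:

library(regclass)
data(SALARY)
FOREST <- randomForest(Salary~., data=SALARY)
#For each grid value of Experience, set every row's Experience to that value,
#predict, and average the predictions to trace a marginal curve
grid <- seq(min(SALARY$Experience), max(SALARY$Experience), length.out=25)
avg_pred <- sapply(grid, function(v) {
  tmp <- SALARY
  tmp$Experience <- v
  mean(predict(FOREST, newdata=tmp))
})
plot(Salary~Experience, data=SALARY, col="gray")
lines(grid, avg_pred, lwd=2)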
Adam Petrie
Introduction to Regression and Modeling
data(SALARY)
FOREST <- randomForest(Salary~.,data=SALARY)
visualize_relationship(FOREST,interest="Experience",on=SALARY)
visualize_relationship(FOREST,interest="Months",on=SALARY,xlim=c(1,15),ylim=c(2500,4500))

data(WINE)
TREE <- rpart(Quality~.,data=WINE)
visualize_relationship(TREE,interest="alcohol",on=WINE,smooth=FALSE)
visualize_relationship(TREE,interest="alcohol",on=WINE,marginal=FALSE,nplots=7,smooth=FALSE)
Predicting the quality of wine based on its chemical characteristics
data("WINE")
data("WINE")
A data frame with 2700 observations on the following 12 variables.
Quality
a factor with levels high, low
fixed.acidity
a numeric vector
volatile.acidity
a numeric vector
citric.acid
a numeric vector
residual.sugar
a numeric vector
chlorides
a numeric vector
free.sulfur.dioxide
a numeric vector
total.sulfur.dioxide
a numeric vector
density
a numeric vector
pH
a numeric vector
sulphates
a numeric vector
alcohol
a numeric vector
This is the famous wine dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality) with some modifications. Namely, quality in the original data was a score between 0 and 10; these scores have been recoded as either high or low. See the description on UCI for details about the variables.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
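The exact cutoff used for the high/low recoding is not stated here; the following sketch shows how such a recoding could be done, with a purely hypothetical threshold of 6:

#Hypothetical recoding of the original 0-10 quality score; the actual
#cutoff used to build WINE$Quality is not documented here
score <- c(3, 5, 6, 7, 4, 8) #made-up example scores
Quality <- factor(ifelse(score >= 6, "high", "low"), levels=c("high","low"))
table(Quality)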