There is no simple rule to determine whether someone - or something - is trustworthy. Unlike handy ditties for poison ivy, such as “leaflets three, let it be,” you need to examine multiple features of an alien lifeform to determine whether it is friendly or not.
This slide presentation gives the results of separating friendly from dangerous Boozonians on planet Hamiltus in the Allen galaxy. Truth be told, this data set is an adaptation of the UCI Mushroom Data Set. There are so many web-based solutions for UCI datasets that the team decided to obscure the data without affecting its underlying predictive value.
The dataset provides 8,123 observations from a planetary probe sent prior to mission execution. It contains 21 features. The dataset is fairly balanced between the two classes, which makes it a forgiving environment for predictive modeling techniques.
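As a quick sanity check, the claimed balance can be verified directly (a minimal sketch; it assumes the same data file and result column used in the modeling code later in this deck):

alien <- read.csv("./data/mushroomUCI_adapted.csv", stringsAsFactors = TRUE)
table(alien$result)              # counts per class
prop.table(table(alien$result))  # proportions per class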
This deck is written in R Slidy to demonstrate modeling approaches that accurately classify each observation into one of the two classes.
Machine learning can keep you alive in interstellar exploration use cases. Keeping yourself safe from the many other ways of becoming alien chow is on you.
Good luck, live long and prosper using this model.
If you had to choose your friends, which ones “work for you”?
Let’s split the data into a 70% training set for machine learning and hold out the remaining 30% to test the model.
library(dplyr)   # sample_frac()
library(rpart)   # recursive partitioning trees

# stringsAsFactors = TRUE so the class label is a factor, as the classifiers require
alien <- read.csv("./data/mushroomUCI_adapted.csv", stringsAsFactors = TRUE)
set.seed(524)
# Sample 70% of the rows for training; the leftover row names define the test set
train <- sample_frac(alien, 0.7, replace = FALSE)
rows <- as.numeric(row.names(train))
test <- alien[-rows, ]
# Fit a single classification tree and score the held-out 30%
fit <- rpart(result ~ . , data = train, method = "class")
predicted <- predict(fit, newdata = test, type = "class")
table(predicted, test$result)
##
## predicted dangerous friendly
## dangerous 1164 0
## friendly 11 1262
Not bad…unless you meet one of those 11 aliens.
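To put a number on that risk, the confusion matrix can be summarized directly (a minimal sketch; cm, accuracy, and false_negatives are illustrative names, not part of the original deck):

cm <- table(predicted, test$result)
accuracy <- sum(diag(cm)) / sum(cm)            # overall accuracy
false_negatives <- cm["friendly", "dangerous"] # dangerous aliens labeled friendly
c(accuracy = accuracy, false.negatives = false_negatives)

With the matrix above this reports roughly 99.5% accuracy and 11 false negatives, which is exactly the risk the next models try to squeeze out.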
library(rattle)   # fancyRpartPlot()

fancyRpartPlot(fit, main = "Alien Lifeform Classification Results",
               sub = "Feature legend is available in the data dictionary")
library(party)   # ctree(): conditional inference trees

fitC <- ctree(result ~ ., data = train)
table(predict(fitC, newdata = test), test$result)
##
## dangerous friendly
## dangerous 1169 0
## friendly 6 1262
This is a better predictor, but you still have 6 chances to…well…
(*_*) --> (X_X)
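Unlike the random forest coming up next, a conditional inference tree can still be visualized; party ships a plot method for ctree objects (a sketch, not shown in the original deck):

# Draw the fitted conditional inference tree with node-level class distributions
plot(fitC, main = "Alien Lifeform Classification: Conditional Inference Tree")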
library(randomForest)   # randomForest(): ensemble of bagged classification trees

fitRF <- randomForest(result ~ . , data = train)
predictedRF <- predict(fitRF, newdata = test, type = "class")
table(predictedRF, test$result)
##
## predictedRF dangerous friendly
## dangerous 1175 0
## friendly 0 1262
This table shows the results of the random forest model.
On this test set it is a perfect classifier: by aggregating the votes of many trees grown on bootstrap samples, the forest corrects the handful of mistakes any single tree makes.
A forest of hundreds of trees is not plotted easily, so this confusion matrix (plus the variable importance ranking below) is the way to assess the results.
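The importance ranking below can be pulled straight from the fitted forest (a sketch; the data-frame wrangling is illustrative, and varImpPlot() is randomForest's built-in dot-chart alternative):

imp <- importance(fitRF)   # mean decrease in Gini impurity, the default measure
vi <- data.frame(variable = rownames(imp),
                 importance = imp[, "MeanDecreaseGini"],
                 row.names = NULL)
vi[order(vi$importance, decreasing = TRUE), ]
# varImpPlot(fitRF) draws the same ranking as a dot chart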
…and here is the relative importance of the variables:
## variable importance
## 1 odor 977.350903
## 2 tattoo.print.color 466.871794
## 3 gill.color 242.682321
## 4 gill.size 178.217424
## 5 body.surface.above.neck 137.263940
## 6 body.surface.below.neck 131.236102
## 7 eye.type 130.456605
## 8 population 109.929538
## 9 habitat 75.269221
## 10 gill.spacing 67.248376
## 11 lesions 64.421793
## 12 head.color 46.858628
## 13 body.color.below.neck 45.909275
## 14 body.color.above.neck 44.478718
## 15 number.heads 41.185610
## 16 body.shape 38.037661
## 17 head.surface 18.353848
## 18 head.shape 9.771811
## 19 foot.color 1.925067
## 20 gill.attachment 1.544447
## 21 foot.type 0.000000