There is no simple rule to determine whether someone - or something - is trustworthy. Unlike handy ditties for poison ivy, such as “leaflets three, let it be,” you need to examine multiple features of an alien lifeform to determine whether it is friendly or not.
This slide presentation gives the results of separating friendly from dangerous Boozonians on planet Hamiltus in the Allen galaxy. Truth be told, this data set is an adaptation of the UCI Mushroom Data Set. There are so many web-based solutions for UCI datasets that the team decided to obscure the data without affecting its underlying predictive value.
The dataset provides 8,123 observations from a planetary probe sent prior to mission execution. It contains 21 features. The dataset is fairly balanced between the two classes, which makes it a forgiving environment for predictive modeling techniques.
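As a quick sanity check, the claimed balance can be verified directly (a minimal sketch; it assumes the same data file and result column used in the modeling code later in this deck):

alien <- read.csv("./data/mushroomUCI_adapted.csv", stringsAsFactors = TRUE)
table(alien$result)              # counts per class
prop.table(table(alien$result))  # proportions per class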
This deck is written in R Slidy to demonstrate modeling approaches that accurately classify each observation into one of the two classes.
Machine learning can keep you alive in interstellar exploration use cases. Keeping yourself safe from the many other ways of becoming alien chow is on you.
Good luck, live long and prosper using this model.
If you had to choose your friends, which ones “work for you”?
Let’s split the data into a 70% training set for machine learning and hold out the remaining 30% to test the model.
library(dplyr)   # sample_frac()
library(rpart)   # recursive partitioning trees

# stringsAsFactors = TRUE so the class label is a factor, as the classifiers require
alien <- read.csv("./data/mushroomUCI_adapted.csv", stringsAsFactors = TRUE)
set.seed(524)
# Sample 70% of the rows for training; the leftover row names define the test set
train <- sample_frac(alien, 0.7, replace = FALSE)
rows <- as.numeric(row.names(train))
test <- alien[-rows, ]
# Fit a single classification tree and score the held-out 30%
fit <- rpart(result ~ . , data = train, method = "class")
predicted <- predict(fit, newdata = test, type = "class")
table(predicted, test$result)
##
## predicted dangerous friendly
## dangerous 1164 0
## friendly 11 1262
Not bad…unless you meet one of those 11 aliens.
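To put a number on that risk, the confusion matrix can be summarized directly (a minimal sketch; cm, accuracy, and false_negatives are illustrative names, not part of the original deck):

cm <- table(predicted, test$result)
accuracy <- sum(diag(cm)) / sum(cm)            # overall accuracy
false_negatives <- cm["friendly", "dangerous"] # dangerous aliens labeled friendly
c(accuracy = accuracy, false.negatives = false_negatives)

With the matrix above this reports roughly 99.5% accuracy and 11 false negatives, which is exactly the risk the next models try to squeeze out.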
library(rattle)   # fancyRpartPlot()

fancyRpartPlot(fit, main = "Alien Lifeform Classification Results",
               sub = "Feature legend is available in the data dictionary")
library(party)   # ctree(): conditional inference trees

fitC <- ctree(result ~ ., data = train)
table(predict(fitC, newdata = test), test$result)
##
## dangerous friendly
## dangerous 1169 0
## friendly 6 1262
This is a better predictor, but you still have 6 chances to…well…
(*_*) --> (X_X)
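Unlike the random forest coming up next, a conditional inference tree can still be visualized; party ships a plot method for ctree objects (a sketch, not shown in the original deck):

# Draw the fitted conditional inference tree with node-level class distributions
plot(fitC, main = "Alien Lifeform Classification: Conditional Inference Tree")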
library(randomForest)   # randomForest(): ensemble of bagged classification trees

fitRF <- randomForest(result ~ . , data = train)
predictedRF <- predict(fitRF, newdata = test, type = "class")
table(predictedRF, test$result)
##
## predictedRF dangerous friendly
## dangerous 1175 0
## friendly 0 1262
This table shows the results of the random forest model.
On this test set it is a perfect classifier: by aggregating the votes of many trees grown on bootstrap samples, the forest corrects the handful of mistakes any single tree makes.
A forest of hundreds of trees is not plotted easily, so this confusion matrix (plus the variable importance ranking below) is the way to assess the results.
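The importance ranking below can be pulled straight from the fitted forest (a sketch; the data-frame wrangling is illustrative, and varImpPlot() is randomForest's built-in dot-chart alternative):

imp <- importance(fitRF)   # mean decrease in Gini impurity, the default measure
vi <- data.frame(variable = rownames(imp),
                 importance = imp[, "MeanDecreaseGini"],
                 row.names = NULL)
vi[order(vi$importance, decreasing = TRUE), ]
# varImpPlot(fitRF) draws the same ranking as a dot chart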
…and here is the relative importance of the variables:
## variable importance
## 1 odor 977.350903
## 2 tattoo.print.color 466.871794
## 3 gill.color 242.682321
## 4 gill.size 178.217424
## 5 body.surface.above.neck 137.263940
## 6 body.surface.below.neck 131.236102
## 7 eye.type 130.456605
## 8 population 109.929538
## 9 habitat 75.269221
## 10 gill.spacing 67.248376
## 11 lesions 64.421793
## 12 head.color 46.858628
## 13 body.color.below.neck 45.909275
## 14 body.color.above.neck 44.478718
## 15 number.heads 41.185610
## 16 body.shape 38.037661
## 17 head.surface 18.353848
## 18 head.shape 9.771811
## 19 foot.color 1.925067
## 20 gill.attachment 1.544447
## 21 foot.type 0.000000