# Machine Learning Tutorial 20: How Data Scientists Recover Missing Data (Part 2)

## How to impute continuous variables

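mice is short for Multivariate Imputation by Chained Equations.

The mice package also helps you understand the pattern of missing data. To illustrate, we again use the Titanic dataset (if you are unfamiliar with it, see "Part 18: Feature Engineering in Practice with R").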
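To make the "chained equations" idea concrete, here is a minimal base-R sketch. It is not how the mice package is implemented: the toy data frame, the mean-based starting fill, the use of `lm`, and the five iterations are all assumptions made for illustration. The real package draws imputed values from a predictive distribution and by default produces several completed datasets, whereas this sketch is deterministic.

```r
# Toy data (hypothetical): two correlated numeric columns with holes in each.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)
toy <- data.frame(x = x, y = y)
toy$x[sample(n, 20)] <- NA
toy$y[sample(n, 20)] <- NA

miss <- is.na(toy)  # remember which cells were originally missing

# Step 1: start from a crude fill (column means).
imp <- toy
for (v in names(imp)) {
  imp[[v]][miss[, v]] <- mean(toy[[v]], na.rm = TRUE)
}

# Step 2: cycle through the variables. Each one is re-predicted from the
# others with a model fitted only on rows where it was actually observed.
for (iter in 1:5) {
  for (v in names(imp)) {
    fit <- lm(reformulate(setdiff(names(imp), v), response = v),
              data = imp[!miss[, v], ])
    imp[[v]][miss[, v]] <- predict(fit, newdata = imp[miss[, v], ])
  }
}

sum(is.na(imp))  # number of missing cells left after imputation
```

Each variable's model uses the current fill of the other variables as predictors, which is the "chain": improving one column's imputations feeds into the next column's model.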

```r
> full1 <- cbind(PassengerId=full$PassengerId, Pclass=full$Pclass, Sex=full$Sex,
+                Age=full$Age, Fare=full$Fare, Embarked=full$Embarked,
+                Title=full$Title, Fsize=full$Fsize)
```

```r
> library(mice)
> md.pattern(full1)
     PassengerId Pclass Sex Embarked Title Fsize Fare Age
1045           1      1   1        1     1     1    1   1   0
 263           1      1   1        1     1     1    1   0   1
   1           1      1   1        1     1     1    0   1   1
               0      0   0        0     0     0    1 263 264
```
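In the `md.pattern` output, the left margin counts the rows that share a pattern, the right margin counts the missing variables in that pattern, and the bottom row counts the missing values per column. As a rough illustration, the same numbers can be derived in base R. The small `toy` data frame below is made up for this sketch and is not the Titanic data.

```r
# Hypothetical toy data with a few missing values.
toy <- data.frame(
  Age  = c(22, NA, 35, NA, 54),
  Fare = c(7.25, 71.28, NA, 8.05, 51.86),
  Sex  = c("male", "female", "female", "male", "male")
)

# Missing values per column (the bottom row of md.pattern).
na_per_column <- colSums(is.na(toy))

# Encode each row's pattern as a string of 1 (observed) / 0 (missing),
# then count how many rows share each pattern (the left margin of md.pattern).
pattern <- apply(!is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- table(pattern)

na_per_column
pattern_counts
```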

```r
> library(VIM)
> aggr_plot <- aggr(full1, col = c('navyblue', 'red'), numbers = TRUE,
+                   sortVars = TRUE, labels = names(full1), cex.axis = .7, gap = 3,
+                   ylab = c("Histogram of missing data", "Pattern"))
```

```r
> set.seed(129)
> mice_mod <- mice(full[, !names(full) %in% c('PassengerId','Name','Ticket','Cabin',
+                                               'Family','Surname','Survived')],
+                  method = 'rf')
```

```r
> mice_output <- complete(mice_mod)
```

```r
> par(mfrow=c(1,2))
> hist(full$Age, freq=FALSE, main='Age: Original Data',
+   col='darkgreen', ylim=c(0,0.04))
> hist(mice_output$Age, freq=FALSE, main='Age: MICE Output',
+   col='lightgreen', ylim=c(0,0.04))
```

```r
> library(lattice)
> xyplot(mice_mod, Fare ~ Age, pch=18, cex=1)
```

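The first argument to xyplot is the object returned by mice. The second, the formula Fare ~ Age, puts Age on the x-axis and Fare on the y-axis. The magenta points in the plot are imputed values, and they appear to follow the distribution of the observed data reasonably well. Here we are mainly looking at the imputed values of the y-axis variable; to inspect the imputation of Age instead, swap the two variables: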

```r
> xyplot(mice_mod, Age ~ Fare, pch=18, cex=1)
```

```r
> densityplot(mice_mod)
```

```r
> stripplot(mice_mod, pch = 20, cex = 1.2)
```

## Imputing missing data with a predictive model

```r
> library("rpart")
> library("rpart.plot")
> # Note: method = "class" makes rpart treat every distinct Fare value as a class label
> my_tree <- rpart(Fare ~ Pclass + Fsize + Embarked, data = train, method = "class",
+                  control = rpart.control(cp = 0.0001))
> prp(my_tree, type = 4, extra = 100)
```

```r
> full$PassengerId[is.na(full$Fare)]
[1] 1044
```

```r
> predict(my_tree, full[1044,], type = "class")
1044
8.05
248 Levels: 0 4.0125 5 6.2375 6.4375 6.45 6.4958 6.75 6.8583 6.95 ... 512.3292
```
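The "248 Levels" line shows that the tree above treats Fare as a categorical label, so it can only return a fare it has already seen. As an alternative sketch (not the method used above), rpart can fit a regression tree with `method = "anova"`, which predicts a numeric value directly. The toy data, its column values, and the fare formula below are all invented for illustration.

```r
library(rpart)

# Hypothetical toy data: fare depends on class and family size, plus noise.
set.seed(42)
n <- 300
toy <- data.frame(
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Fsize  = sample(1:6, n, replace = TRUE)
)
toy$Fare <- c(80, 25, 10)[as.integer(toy$Pclass)] + 2 * toy$Fsize + rnorm(n, sd = 3)
toy$Fare[c(5, 17)] <- NA  # knock out two fares to impute

# Fit a regression tree on the rows where Fare is observed.
missing_rows <- is.na(toy$Fare)
reg_tree <- rpart(Fare ~ Pclass + Fsize, data = toy[!missing_rows, ],
                  method = "anova")

# Fill the missing fares with the tree's numeric predictions.
toy$Fare[missing_rows] <- predict(reg_tree, newdata = toy[missing_rows, ])
```

With a regression tree the prediction is the mean fare of the leaf the passenger falls into, so it stays numeric and does not have to match an existing fare exactly.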