# Machine Learning Tutorial 20: How Data Scientists Recover Missing Data (Part 2)


## How missing values in continuous variables are imputed

mice is short for Multivariate Imputation by Chained Equations.
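To make the "chained equations" idea concrete, here is a minimal base-R sketch. It is not how mice is implemented: the two-column data set, the variable names, and the fixed ten iterations are all assumptions made for illustration, and it uses plain deterministic regression predictions, whereas mice draws imputations at random and produces several completed data sets.

```r
# Simulated toy data (hypothetical): y depends linearly on x
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Knock out 30 values in each column
dat$x[sample(n, 30)] <- NA
dat$y[sample(n, 30)] <- NA
miss_x <- is.na(dat$x)
miss_y <- is.na(dat$y)

# Step 1: start from a crude fill (column means)
filled <- dat
filled$x[miss_x] <- mean(dat$x, na.rm = TRUE)
filled$y[miss_y] <- mean(dat$y, na.rm = TRUE)

# Step 2: cycle through the variables, each time re-predicting the
# missing entries of one variable from the current state of the other
for (iter in 1:10) {
  fit_x <- lm(x ~ y, data = filled[!miss_x, ])
  filled$x[miss_x] <- predict(fit_x, newdata = filled[miss_x, ])

  fit_y <- lm(y ~ x, data = filled[!miss_y, ])
  filled$y[miss_y] <- predict(fit_y, newdata = filled[miss_y, ])
}

print(colSums(is.na(dat)))     # missing values before
print(colSums(is.na(filled)))  # missing values after
```

Each variable gets its own model, and the models are "chained" because every pass uses the values filled in by the previous one.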

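The mice package gives a good picture of the pattern of missing data. To illustrate, we again use the Titanic dataset (if you are not familiar with it, see "18: Feature Engineering in Practice with R").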

```
> full1 <- cbind(PassengerId=full$PassengerId, Pclass=full$Pclass, Sex=full$Sex,
+   Age=full$Age, Fare=full$Fare, Embarked=full$Embarked,
+   Title=full$Title, Fsize=full$Fsize)
```

```
> library(mice)
> md.pattern(full1)
     PassengerId Pclass Sex Embarked Title Fsize Fare Age    
1045           1      1   1        1     1     1    1   1   0
 263           1      1   1        1     1     1    1   0   1
   1           1      1   1        1     1     1    0   1   1
               0      0   0        0     0     0    1 263 264
```
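In the md.pattern output, 1 means observed and 0 means missing. The left-hand number is how many rows share a pattern, the last column is how many variables are missing in that pattern, and the bottom row is the missing count per variable. So here 1045 rows are complete, 263 rows lack only Age, and 1 row lacks only Fare. The same bookkeeping can be sketched in base R. The `toy` data frame below is a made-up example, not the Titanic data.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  Age  = c(22, NA, 35, NA, 54),
  Fare = c(7.25, 71.28, NA, 8.05, 51.86),
  Sex  = c("male", "female", "female", "male", "male")
)

# 1 = observed, 0 = missing, the same convention md.pattern uses
observed <- ifelse(is.na(toy), 0L, 1L)

# Collapse every row into a pattern string and count each pattern
pattern_key <- apply(observed, 1, paste, collapse = "")
pattern_counts <- table(pattern_key)

# Missing values per column (the bottom row of md.pattern)
missing_per_column <- colSums(is.na(toy))

print(pattern_counts)
print(missing_per_column)
```

The pattern strings follow the column order Age, Fare, Sex, so "011" means that only Age is missing.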

```
> library(VIM)
> # full1 is a matrix (built with cbind), so use colnames() rather than names()
> aggr_plot <- aggr(full1, col = c('navyblue', 'red'), numbers=TRUE, sortVars=TRUE,
+   labels=colnames(full1), cex.axis=.7, gap=3,
+   ylab=c("Histogram of missing data", "Pattern"))
```

```
> set.seed(129)
> mice_mod <- mice(full[, !names(full) %in% c('PassengerId','Name','Ticket','Cabin',
+   'Family','Surname','Survived')], method='rf')
```

```
> mice_output <- complete(mice_mod)
```
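The mice call above does not set `m`, so it uses the default of five imputed data sets, and `complete()` with no further arguments returns the first of them. The sketch below shows this on `nhanes`, a small example data set shipped with mice. It uses `method = "pmm"` instead of `'rf'` only so that it needs no extra packages. That swap is an assumption made for illustration, not the setup used in this tutorial.

```r
library(mice)

# nhanes is a small example data set bundled with mice
imp <- mice(nhanes, m = 5, method = "pmm", seed = 129, printFlag = FALSE)

# complete() fills the gaps using one of the m imputations
first_completed <- mice::complete(imp)      # same as complete(imp, 1)
third_completed <- mice::complete(imp, 3)   # a different set of imputed values

# All m completed data sets stacked on top of each other
stacked <- mice::complete(imp, action = "long")

print(anyNA(nhanes))            # the original data has gaps
print(anyNA(first_completed))   # the completed copy does not
print(dim(stacked))
```

Calling `mice::complete` explicitly avoids a clash with other packages that export a function of the same name.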

```
> par(mfrow=c(1,2))
> hist(full$Age, freq=F, main='Age: Original Data',
+   col='darkgreen', ylim=c(0,0.04))
> hist(mice_output$Age, freq=F, main='Age: MICE Output',
+   col='lightgreen', ylim=c(0,0.04))
```

```
> library(lattice)
> xyplot(mice_mod,Fare ~ Age,pch=18,cex=1)
```

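The first argument to xyplot is the imputation object produced by mice. The second argument, the formula Fare ~ Age, puts Age on the x-axis and Fare on the y-axis. The magenta points in the plot are imputed values, and they appear to follow the distribution of the observed data reasonably well. Here we are mainly looking at the imputed values on the y-axis. To inspect the imputations for Age instead, swap the two variables, as follows: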

```
> xyplot(mice_mod,Age ~ Fare,pch=18,cex=1)
```

```
> densityplot(mice_mod)
```

```
> stripplot(mice_mod, pch = 20, cex = 1.2)
```
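The plots compare observed and imputed values visually. The same check can be done numerically by pulling the imputed values out of the mice object. The sketch below again uses the bundled `nhanes` data and `method = "pmm"` as stand-ins, so the variable `bmi` and the settings are assumptions for illustration rather than part of the Titanic workflow.

```r
library(mice)

imp <- mice(nhanes, m = 5, method = "pmm", seed = 129, printFlag = FALSE)

# Values that were actually observed
observed_bmi <- nhanes$bmi[!is.na(nhanes$bmi)]

# imp$imp$bmi holds one row per missing bmi value and one column per imputation
imputed_bmi <- unlist(imp$imp$bmi)

# Compare the two distributions side by side
print(summary(observed_bmi))
print(summary(imputed_bmi))
```

If the imputed summary sits far from the observed one, that is a hint to revisit the imputation method or the predictors.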

## Imputing missing values with a predictive model

```
> library("rpart")
> library("rpart.plot")
> # method = "class" treats every distinct Fare value as its own class label
> my_tree <- rpart(Fare ~ Pclass + Fsize + Embarked, data = train, method = "class",
+   control=rpart.control(cp=0.0001))
> prp(my_tree, type = 4, extra = 100)
```

```
> full$PassengerId[is.na(full$Fare)]
[1] 1044
```

```
> predict(my_tree, full[1044,], type = "class")
1044
8.05
248 Levels: 0 4.0125 5 6.2375 6.4375 6.45 6.4958 6.75 6.8583 6.95 ... 512.3292
```
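Because the tree above is fit with `method = "class"`, the prediction comes back as a factor with 248 levels, one per distinct fare. For a continuous target, rpart also offers `method = "anova"`, which fits a regression tree and returns a plain number. The sketch below shows that alternative on the built-in `mtcars` data. The data set, the choice of predictors, and the row treated as missing are all assumptions made for illustration, not the approach taken in this tutorial.

```r
library(rpart)

# Toy setup: pretend one car's mpg is missing
cars <- mtcars
missing_row <- "Fiat 128"          # arbitrary row chosen for illustration
cars[missing_row, "mpg"] <- NA

# Fit a regression tree on the rows where mpg is observed
observed <- cars[!is.na(cars$mpg), ]
reg_tree <- rpart(mpg ~ cyl + hp + wt, data = observed, method = "anova")

# Predict the missing value; the result is numeric, not a factor
predicted_mpg <- predict(reg_tree, newdata = cars[missing_row, ])

# Write the prediction back into the gap
cars[missing_row, "mpg"] <- predicted_mpg
print(predicted_mpg)
```

A regression tree predicts the mean of the training rows in the matching leaf, so the filled-in value does not have to equal any single observed value.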