Data Cleansing in Azure Machine Learning
Published: 2019-05-11


Introduction

After discussing the basic features of Azure Machine Learning in my previous article, we will look at data cleansing techniques in Azure Machine Learning. Data cleansing, also called data cleaning, matters for prediction because better-quality data improves the quality of the predictions.

Azure Machine Learning offers several data cleansing options, such as removing duplicate data, replacing missing values, and normalizing data. Before performing any data cleansing, it is important to summarize the data, which can be done with the Summarize Data control.

Summarize Data

The Summarize Data control has no parameters to configure. Let us connect Summarize Data to sample data, as shown in the screenshot below.

Summarize data control in Azure Machine Learning.

After the summarization, you will see statistics for each column, such as Count, Unique Value Count, Missing Value Count, Min, Max, Mean, Mean Deviation, 1st Quartile, Median, 3rd Quartile, Mode, Range, Sample Variance, Sample Standard Deviation, Sample Skewness, and Sample Kurtosis.
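As a rough illustration of what these statistics measure, the following plain-Python sketch computes several of them for one column using only the standard library. It is not the Summarize Data implementation: the `summarize` helper and the sample Rose values are made up for this example, and counting only non-missing values in `count` is an assumption.

```python
import statistics


def summarize(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # quantiles() returns the three cut points that split the data into quarters.
    q1, _, q3 = statistics.quantiles(present, n=4)
    return {
        "count": len(present),
        "unique_value_count": len(set(present)),
        "missing_value_count": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "1st_quartile": q1,
        "median": statistics.median(present),
        "3rd_quartile": q3,
        "mode": statistics.mode(present),
        "range": max(present) - min(present),
        "sample_variance": statistics.variance(present),
        "sample_standard_deviation": statistics.stdev(present),
    }


# Invented sample values for a column named Rose, with two missing entries.
rose = [112, 118, None, 129, 99, 116, None, 118]
stats = summarize(rose)
for name, value in stats.items():
    print(name, value)
```

Quartile values depend on the interpolation method, so they may differ slightly from what a given tool reports.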

The following screen shows a few columns for the summarized data.


Summarise data values to decide Data Cleaning in Azure Machine Learning.

The Skewness and Kurtosis measures describe the shape of each column's distribution, which helps you decide which columns need to be normalized.
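To make the skewness measure concrete, here is a minimal sketch of one common definition of sample skewness (the adjusted Fisher-Pearson coefficient). Whether Summarize Data uses exactly this formula is an assumption; the function name and the sample lists are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (needs at least 3 values)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    cubed = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(sample_skewness(symmetric))     # close to zero for symmetric data
print(sample_skewness(right_skewed))  # positive: long tail on the right
```

A value near zero suggests a roughly symmetric column, while a large positive or negative value points to a column that may benefit from normalization.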

Selecting the required columns

In machine learning, you may not need every attribute for prediction, so you can select only the columns that you need. For example, you may not need the address columns to predict bike buyer patterns. You can exclude those columns using the Select Columns in Dataset control in Azure Machine Learning.

Using Select Columns in Dataset control in the Azure Machine Learning.

Let us use the AdventureWorks data set. For this purpose, the vTargetMail data in AdventureWorksDW is exported to a CSV file and imported into Azure Machine Learning.

Let us drag and drop Select Columns in Dataset onto a new experiment. The control has a few configuration options: you can select columns either by name or with rules.

Selecting columns by name.

As shown in the above screenshot, you can choose the required columns. In the area marked AVAILABLE COLUMNS, you can filter columns by data type or by typing a column name.

If you want to remove only a few columns, you can use the WITH RULES option. In this configuration, you can choose the columns that you want to exclude, as shown in the below screenshot.


Selecting columns by rules.

Depending on the number of columns that you want to eliminate, you can choose whichever option is more convenient. The following is the final experiment once it is configured.

Configuration for Select Columns in Dataset.

After you run the experiment, you will see that the eliminated columns no longer exist in the data stream.
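The two selection modes, by name and by exclusion rule, can be sketched in plain Python as follows. This is only an analogy for what the control does in the designer; `select_columns` is a hypothetical helper and the customer rows contain invented values.

```python
def select_columns(rows, include=None, exclude=None):
    """Keep only `include` columns, or drop the `exclude` columns."""
    if include is not None:
        return [{name: row[name] for name in include} for row in rows]
    excluded = set(exclude or [])
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# Invented rows that borrow a few vTargetMail-style column names.
customers = [
    {"CustomerKey": 1, "AddressLine1": "1 Example St", "Age": 40, "BikeBuyer": 1},
    {"CustomerKey": 2, "AddressLine1": "2 Example St", "Age": 35, "BikeBuyer": 0},
]

by_name = select_columns(customers, include=["Age", "BikeBuyer"])
by_rule = select_columns(customers, exclude=["AddressLine1"])
print(by_name)
print(by_rule)
```

Selecting by name suits cases where only a few columns are kept; excluding by rule is shorter when only a few columns are dropped.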

Please note that the Select Columns in Dataset control was renamed; in previous versions it was named Project Columns. If you are watching older videos or reading older articles, you might find it referred to as the Project Columns control.

Let us look at another control for data cleansing in Azure Machine Learning: Clean Missing Data.

Clean missing data

Like Select Columns in Dataset, Clean Missing Data is an improved version of an older control, Missing Values Scrubber. Handling missing values is an important data cleansing technique in Azure Machine Learning.

Since vTargetMail is a well-cleaned data set, let us use a different one. The following example uses the Wine data set from Weka. Let us create a data set and visualize it.

Finding missing values in the data set.

In the above data set, you will see that there are 2 missing values in the Rose column. Let us drag and drop Clean Missing Data from the control panel and connect it to the data set, as shown below.

Clean Missing Data control.

The following is the screen that you will see after the Clean Missing Data module is connected to the data set.


Clean Missing Data control.

Now let us configure the Clean Missing Data control.

Configuring Clean Missing Data control.

First, use Launch column selector to select the columns that should be cleaned of missing data. This is similar to the configuration that we did for Select Columns in Dataset.

Select columns for missing data values.

With the above configuration, you are now ready to configure how missing values in the Rose column are handled.

Next, configure the method used to replace missing values. The cleaning options are Replace using MICE, Custom substitution value, Replace with mean, Replace with median, Replace with mode, Remove entire row, Remove entire column, and Replace using Probabilistic PCA. Replace with mean, Replace with median, and Replace with mode substitute the statistic named in the option. For example, Replace with median replaces each missing value with the median of the column's non-missing values.
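The three statistical replacement options can be sketched in plain Python as follows. This is an illustration of the idea rather than the control's implementation; `replace_missing` is a hypothetical helper and the Rose values are invented.

```python
import statistics

# Map each cleaning mode to the statistic it substitutes.
STATISTICS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def replace_missing(values, method):
    """Replace None entries with the mean, median, or mode of the rest."""
    present = [v for v in values if v is not None]
    fill = STATISTICS[method](present)
    return [fill if v is None else v for v in values]


# Invented Rose column with two missing values.
rose = [112, 118, None, 129, 99, 116, None, 118]

with_median = replace_missing(rose, "median")
with_mode = replace_missing(rose, "mode")
with_mean = replace_missing(rose, "mean")
print(with_median)
print(with_mode)
print(with_mean)
```

The statistic is computed from the non-missing values only, so the number of missing entries does not distort the fill value.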

Now let us look at the Remove entire row and Remove entire column options. Both are viable options for data cleansing in Azure Machine Learning, and both can be configured as shown in the screenshot below.

Removing the entire row and Remove entire Column options.

If you compare both outputs, you will see that Remove entire column has removed the Rose column, whereas Remove entire row has reduced the data set by two rows.
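The difference between the two options can be sketched like this. The helpers and the wine rows are hypothetical; only the column names Sweet-white and Rose come from the example above.

```python
def remove_rows_with_missing(rows, column):
    """Drop every row whose value in `column` is missing (None)."""
    return [row for row in rows if row[column] is not None]


def remove_column_if_missing(rows, column):
    """Drop `column` from all rows if any of its values is missing."""
    if any(row[column] is None for row in rows):
        return [
            {name: value for name, value in row.items() if name != column}
            for row in rows
        ]
    return rows


# Invented rows; two of them have no Rose value.
wine = [
    {"Sweet-white": 85, "Rose": 112},
    {"Sweet-white": 89, "Rose": None},
    {"Sweet-white": 109, "Rose": 129},
    {"Sweet-white": 95, "Rose": None},
    {"Sweet-white": 91, "Rose": 99},
]

fewer_rows = remove_rows_with_missing(wine, "Rose")
fewer_columns = remove_column_if_missing(wine, "Rose")
print(len(fewer_rows), len(fewer_columns))
```

Removing rows keeps every column but loses observations; removing the column keeps every observation but loses the attribute entirely.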

Another option is to replace missing values with a custom value, configured as shown in the following screenshot.

Configuring Clean Missing Data with custom substitution value.

In the above configuration, each missing value is replaced with the value 129. Another setting is Generate missing value indicator. This adds a column that indicates which values were replaced by the Clean Missing Data control, as shown in the following screen.
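A custom substitution with an indicator column can be sketched as follows. The helper is hypothetical, the rows are invented, and the `_IsMissing` suffix for the indicator column is an assumption made for this example.

```python
def substitute_missing(rows, column, value, add_indicator=False):
    """Replace missing values in `column` with `value`.

    With add_indicator=True, also add a boolean column recording
    which rows originally had a missing value.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        was_missing = new_row[column] is None
        if was_missing:
            new_row[column] = value
        if add_indicator:
            new_row[column + "_IsMissing"] = was_missing
        cleaned.append(new_row)
    return cleaned


# Invented rows; the second one has no Rose value.
wine = [
    {"Sweet-white": 85, "Rose": 112},
    {"Sweet-white": 89, "Rose": None},
]

cleaned = substitute_missing(wine, "Rose", 129, add_indicator=True)
print(cleaned)
```

Keeping the indicator lets a later step distinguish substituted values from values that were really observed.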

Inclusion of missing value indicator.

MICE stands for Multivariate Imputation using Chained Equations, and PCA stands for Principal Component Analysis. Both are more advanced statistical methods that can be used to replace missing values.

Remove duplicate rows

Duplicate data is another headache for data scientists, so Remove Duplicate Rows is an important control for data cleansing in Azure Machine Learning. Let us drag and drop the Remove Duplicate Rows control, as shown below.

Remove Duplicate Rows for Data Cleaning in Azure Machine learning.

Let us assume that we want to remove rows that repeat the same values of Sweet-white and Rose. This can be done by selecting those columns in the Remove Duplicate Rows control.

Column selection in Remove Duplicate Rows.

Since the Retain first duplicate row option is set, the first row of each duplicate group is kept, and two duplicate rows were removed from the data stream.
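The effect of de-duplicating on selected key columns can be sketched as follows. The helper and the rows are hypothetical, and the behavior when the option is turned off (keeping the last duplicate instead) is an assumption of this sketch.

```python
def remove_duplicate_rows(rows, key_columns, retain_first=True):
    """Keep one row per distinct combination of the key columns."""
    ordered = rows if retain_first else list(reversed(rows))
    seen = set()
    kept = []
    for row in ordered:
        key = tuple(row[name] for name in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    # Restore the original order when scanning from the end.
    return kept if retain_first else list(reversed(kept))


# Invented rows: months 3 and 5 repeat the key values of months 1 and 2.
wine = [
    {"Month": 1, "Sweet-white": 85, "Rose": 112},
    {"Month": 2, "Sweet-white": 89, "Rose": 118},
    {"Month": 3, "Sweet-white": 85, "Rose": 112},
    {"Month": 4, "Sweet-white": 109, "Rose": 129},
    {"Month": 5, "Sweet-white": 89, "Rose": 118},
]

first_kept = remove_duplicate_rows(wine, ["Sweet-white", "Rose"])
last_kept = remove_duplicate_rows(wine, ["Sweet-white", "Rose"], retain_first=False)
print([row["Month"] for row in first_kept])
print([row["Month"] for row in last_kept])
```

Only the key columns decide whether two rows are duplicates, so rows that differ in other columns (Month here) can still be removed.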

Edit metadata

Data type conversion is another data cleansing task in Azure Machine Learning. In the vTargetMail data set, columns such as TotalChildren, NumberChildrenAtHome, HouseOwnerFlag, NumberCarsOwned, Age, and BikeBuyer are imported as numeric fields. Since these are categorical variables, we need to convert them using the Edit Metadata control. After choosing the necessary columns as we did before, the next step is to change them to the categorical type, as shown in the screenshot below.

Edit Metadata configuration.

Clip values

Clip Values is another control for data cleansing in Azure Machine Learning. You can clip values that are greater than a threshold. Say you want to replace every Rose value greater than 150 with 150. This can be achieved with the Clip Values control configured as follows.

Clip Values configuration in Data Cleaning for Azure Machine Learning.

With this configuration, Rose values greater than 150 are replaced with 150, and an indicator column marks the values that were clipped.
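Clipping peaks at a threshold can be sketched as follows. The helper name and the sample values are invented; the threshold of 150 is the one used in the example above.

```python
def clip_peaks(values, upper):
    """Cap values at `upper`; also return flags for the clipped entries."""
    clipped = [min(v, upper) for v in values]
    was_clipped = [v > upper for v in values]
    return clipped, was_clipped


# Invented Rose values, two of which exceed the threshold.
rose = [112, 151, 169, 150]

clipped, was_clipped = clip_peaks(rose, 150)
print(clipped)
print(was_clipped)
```

A value equal to the threshold is left unchanged and is not flagged.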

Output of clipped values.

Normalize data

When the data is skewed, you can use the Normalize Data control. It offers several transformation methods: ZScore, MinMax, Logistic, LogNormal, and Tanh.
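Four of these transformations can be sketched with their textbook definitions. Whether the control uses exactly these formulas is an assumption, in particular the use of the population standard deviation in the z-score; LogNormal is left out of this sketch. All function names and sample values are hypothetical.

```python
import math
import statistics


def zscore(values):
    """Centre on the mean and scale by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def minmax(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def logistic(values):
    """Squash each value into (0, 1) with the logistic function."""
    return [1 / (1 + math.exp(-v)) for v in values]


def tanh(values):
    """Squash each value into (-1, 1) with the hyperbolic tangent."""
    return [math.tanh(v) for v in values]


sample = [10, 20, 30]
print(zscore(sample))
print(minmax(sample))
print(logistic([-1, 0, 1]))
print(tanh([-1, 0, 1]))
```

ZScore and MinMax depend on the statistics of the whole column, while Logistic and Tanh transform each value independently.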

Conclusion

Data cleansing in Azure Machine Learning is an important process for improving data quality. To achieve this, there are controls such as Select Columns in Dataset, Clean Missing Data, Remove Duplicate Rows, Clip Values, and Normalize Data.
