B站上面那个翻译我有点看不懂,打算自己啃英文翻译了(有自己意译的部分),然后懒得做字幕,就丢在博客上面了,2.2之前的章节结合那个机翻字幕能看懂
2.2监督学习-part-2
So supervised learning algorithms learn to predict input, output or X to Y mapping. And in the last video you saw that regression algorithms, which is a type of supervised learning algorithm learns to predict numbers out of infinitely many possible numbers. There’s a second major type of supervised learning algorithm called a classification algorithm. Let’s take a look at what this means.
监督学习算法能够学习预测输入到输出或者X到Y的映射关系。在上一个视频中你看到的是回归算法,它是一种从无数个可能的数字中预测一个数字的监督学习算法。还有第二种主要类型的监督学习算法,它叫做分类算法。让我们看看分类算法是什么意思。
Take breast cancer detection as an example of a classification problem. Say you’re building a machine learning system so that doctors can have a diagnostic tool to detect breast cancer. This is important because early detection could potentially save a patient’s life.
举一个分类问题的例子:乳腺癌检测。假设你正在构建一个机器学习系统使得医生们能够拥有一个诊断工具去检测乳腺癌。这非常重要,因为早期检测有可能拯救病人的生命。
Using a patient’s medical records your machine learning system tries to figure out if a tumor that is a lump is malignant meaning cancerous or dangerous. Or if that tumor, that lump is benign, meaning that it’s just a lump that isn’t cancerous and isn’t that dangerous? Some of my friends have actually been working on this specific problem.
你的机器学习系统通过利用患者的医疗记录(病例)试图计算出当前患者的肿瘤是否是恶性肿瘤,也就是会癌变或者危及生命的。或者,判断那个肿块是否良性,也就是说它只是一个无害的肿块,不是癌性,也不太危险。我有一些朋友实际上一直在研究这个具体的问题。
So maybe your dataset has tumors of various sizes.
And these tumors are labeled as either benign, which I will designate in this example with a 0, or malignant, which will designate in this example with a 1. You can then plot your data on a graph like this where the horizontal axis represents the size of the tumor and the vertical axis takes on only two values 0 or 1 depending on whether the tumor is benign, 0 or malignant 1.
所以,也许你的数据集中有各种大小的肿瘤。在这个例子中,我们用0表示标记为良性的肿瘤,用1表示表示恶行肿瘤。然后,你可以将数据绘制成这样的图表,在这个图表中,横轴代表肿瘤的大小,纵轴只有两个取值,0或者1,即0代表良性,1代表恶性。
One reason that this is different from regression is that we’re trying to predict only a small number of possible outputs or categories. In this case two possible outputs 0 or 1, benign or malignant. This is different from regression which tries to predict any number, all of the infinitely many number of possible numbers.
与回归算法不同的是,在分类算法中,我们试图预测一小部分可能的输出或者类别。在这个例子中,只可能有两个类别(输出),即0或者1,亦即良性或者恶性。在这一点上,分类算法与回归算法完全不同,回归算法试图从无数个可能的数字中预测出一个数字。
And so the fact that there are only two possible outputs is what makes this classification. Because there are only two possible outputs or two possible categories in this example, you can also plot this data set on a line like this.
正因为在这个例子中只有两种可能的输出结果,所以这是一个分类问题。因为在这个例子中只有两个可能的输出结果或者说只有两种可能的类别,所以你可以可以像这样把数据画在一条线上。
Right now, I’m going to use two different symbols to denote the category using a circle an O to denote the benign examples and a cross to denote the malignant examples. And if new patients walks in for a diagnosis and they have a lump that is this size, then the question is, will your system classify this tumor as benign or malignant?
现在,我将使用两个不同的符合去标记类别,使用圆圈O表示良性案例,使用X代表恶性案例。如果新的病人们走进来寻求医学诊断并且他们有一个这个大小的肿块,那么问题是,你的系统会将这个肿瘤分类为良性还是恶性?
It turns out that in classification problems you can also have more than two possible output categories. Maybe you’re learning algorithm can output multiple types of cancer diagnosis if it turns out to be malignant. So let’s call two different types of cancer type 1 and type 2.
实际上,在分类问题中,可能的输出类别数是可以多于2个的。如果检测的结果是恶性,也许你的学习算法能够输出多种类型的癌症诊断。那么我们将不同的癌症类型称为类型1和类型2。
In this case the average would have three possible output categories it could predict. And by the way in classification, the terms output classes and output categories are often used interchangeably. So what I say class or category when referring to the output, it means the same thing.
在这种情况下就有了三个可以预测的类别。顺便说一下,在分类问题中,术语"output classes"和"output categories"(中文只有一个意思,就是输出类别)经常可以互换使用。所以当我提到上面两个单词的时候,它们表示一个意思。
So to summarize classification algorithms predict categories. Categories don’t have to be numbers. It could be non numeric for example, it can predict whether a picture is that of a cat or a dog. And it can predict if a tumor is benign or malignant. Categories can also be numbers like 0, 1 or 0, 1, 2. But what makes classification different from regression when you’re interpreting the numbers is that classification predicts a small finite limited set of possible output categories such as 0, 1 and 2 but not all possible numbers in between like 0.5 or 1.7.
所以,总结一下,分类算法用于预测类别。类别不一定是数字,可以是非数值的,例如,它可以预测一张图片的内容是猫还是狗。它也可以预测一个肿瘤是良性还是恶性。类别也可以是数字,比如0、1或者0、1、2。但分类问题与回归问题的不同之处在于,当你解释这些数字时,分类问题预测的是一组有限的可能输出类别,比如0、1和2,而不是介于之间的所有可能数字,如0.5或1.7
In the example of supervised learning that we’ve been looking at, we had only one input value the size of the tumor. But you can also use more than one input value to predict an output. Here’s an example, instead of just knowing the tumor size, say you also have each patient’s age in years.
在我们一直在研究的有监督学习示例中,只有一个输入值,即肿瘤的大小。但你也可以使用多个输入值来预测一个输出值。这里有一个例子,除了知道肿瘤的大小之外,假设你还知道每个患者的年龄,以年为单位。
Your new data set now has two inputs, age and tumor size. What in this new dataset we’re going to use circles to show patients whose tumors are benign and crosses to show the patients with a tumor that was malignant. So when a new patient comes in, the doctor can measure the patient’s tumor size and also record the patient’s age.
你的新数据集现在有两个输入值,即年龄和肿瘤大小。在这个新数据集中,我们将使用O表示肿瘤为良性的患者,使用X表示肿瘤为恶性的患者。因此,当一个新的患者来就诊时,医生可以测量患者的肿瘤大小并记录患者的年龄。
And so given this, how can we predict if this patient’s tumor is benign or malignant? Well, given the day said like this, what the learning algorithm might do is find some boundary that separates out the malignant tumors from the benign ones. So the learning algorithm has to decide how to fit a boundary line through this data. The boundary line found by the learning algorithm would help the doctor with the diagnosis.
根据上文,我们如何预测患者的肿瘤是恶性还是良性呢?根据之前所说的,学习算法可能会找到一些界限来区分恶性肿瘤和良性肿瘤。因此,学习算法需要决定如何通过这些数据拟合一个界限线。学习算法所找到的界限线将会帮助医生进行诊断。
In this case the tumor is more likely to be benign. From this example we have seen how to inputs the patient’s age and tumor size can be used. In other machine learning problems often many more input values are required. My friends who worked on breast cancer detection use many additional inputs, like the thickness of the tumor clump, uniformity of the cell size, uniformity of the cell shape and so on. So to recap supervised learning maps input x to output y, where the learning algorithm learns from the quote right answers.
在这个例子里,病人的肿瘤可能是良性的。通过这个例子,我们看到了如何使用患者的年龄和肿瘤大小这两个输入值。在其他机器学习问题中,通常需要更多的输入值。我有些朋友从事乳腺癌检测的研究,他们使用了很多额外的输入值,比如肿瘤团块的厚度、细胞大小的一致性、细胞形状的一致性等等。所以,回顾一下,监督学习将输入x映射到输出y,学习算法会从引用的正确答案(也就是提供给监督学习算法的示例,先提供包含输入x和正确的输出y的案例,监督算法才能学习)中学习。
The two major types of supervised learning our regression and classification. In a regression application like predicting prices of houses, the learning algorithm has to predict numbers from infinitely many possible output numbers. Whereas in classification the learning algorithm has to make a prediction of a category, all of a small set of possible outputs.
监督学习主要分为两类,即回归和分类。在回归算法的应用中,比如房价预测, 学习算法必须从无数个可能的输出结果的数字中预测数值。而在分类算法中,学习算法需要预测类别,分类算法输出的预测类别是极小的,是有限的。
So you now know what is supervised learning, including both regression and classification. I hope you’re having fun. Next there’s a second major type of machine learning called unsupervised learning. Let’s go on to the next video to see what that is.
现在你已经了解了监督学习的内容,包括回归和分类。希望你觉得很有趣。接下来,还有第二个主要类型的机器学习,称为无监督学习。让我们继续下一个视频,看看无监督学习是什么。