Predicting World Cup Football Matches

news2024/9/23 17:22:02

Contents

1. Download the dataset

2. Data preprocessing

3. Model training and selection

4. Prediction


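1. Download the dataset

Download the dataset from Kaggle (FIFA World Cup | Kaggle). The code below reads the file datasets/WorldCupMatches.csv from it.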

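2. Data preprocessing

reprocess_dataset() preprocesses the raw match data. It determines the winner of each match, keeps only matches that involve a team playing in the 2022 World Cup, and labels each match 2 (home win), 1 (draw), or 0 (away win). The result is written to datasets/raw_train_data.csv.

save_dataset() then vectorizes the preprocessed data: DictVectorizer one-hot encodes the home and away team names.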
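To make the vectorization step concrete, here is a minimal sketch on two made-up matches (the rows are illustrative, not taken from the dataset). DictVectorizer creates one 0/1 column per "field=team" pair, and a team it never saw during fitting is encoded as all zeros for that field rather than raising an error.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up matches, for illustration only
records = [
    {'Home Team Name': 'France', 'Away Team Name': 'Mexico'},
    {'Home Team Name': 'Brazil', 'Away Team Name': 'France'},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records).astype('int')

# One column per "field=value" pair, sorted by name
feature_names = list(vec.feature_names_)
print(feature_names)
print(X)

# 'Ghana' was not seen during fit, so no home-team column is set for it
unseen = vec.transform(
    [{'Home Team Name': 'Ghana', 'Away Team Name': 'Mexico'}]
).astype('int')
print(unseen)
```

Because unknown names are ignored silently, a misspelled team name at prediction time still produces a prediction, just from a partly empty feature vector.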

The complete code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
import joblib
root_path = "models"

def reprocess_dataset():
    #load data
    results = pd.read_csv('datasets/WorldCupMatches.csv', encoding='gbk')

    #Adding goal difference and establishing who is the winner
    winner = []
    for i in range(len(results['Home Team Name'])):
        if results['Home Team Goals'][i] > results['Away Team Goals'][i]:
            winner.append(results['Home Team Name'][i])
        elif results['Home Team Goals'][i] < results['Away Team Goals'][i]:
            winner.append(results['Away Team Name'][i])
        else:
            winner.append('Draw')
    results['winning_team'] = winner

    #adding goal difference column
    results['goal_difference'] = np.absolute(results['Home Team Goals'] - results['Away Team Goals'])

    # narrowing to the teams participating in the World Cup; there are 32 teams in 2022
    worldcup_teams = ['Qatar','Germany','Denmark', 'Brazil','France','Belgium', 'Serbia',
                      'Spain','Croatia', 'Switzerland', 'England','Netherlands', 'Argentina',' Iran',
                      'Korea Republic','Saudi Arabia', 'Japan', 'Uruguay','Ecuador','Canada',
                      'Senegal', 'Poland', 'Portugal','Tunisia',  'Morocco','Cameroon','USA',
                      'Mexico','Wales','Australia','Costa Rica', 'Ghana']
    df_teams_home = results[results['Home Team Name'].isin(worldcup_teams)]
    df_teams_away = results[results['Away Team Name'].isin(worldcup_teams)]
    df_teams = pd.concat((df_teams_home, df_teams_away))
    # A match between two listed teams appears in both frames, so drop the repeats
    df_teams = df_teams.drop_duplicates()

    # dropping columns that will not affect match outcomes

    df_teams_new = df_teams[['Home Team Name', 'Away Team Name', 'winning_team']]
    print(df_teams_new.head())

    # Building the model
    # The prediction label: winning_team is 2 if the home team won, 1 for a draw, and 0 if the away team won.

    df_teams_new = df_teams_new.reset_index(drop=True)
    df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Home Team Name'],'winning_team']=2
    df_teams_new.loc[df_teams_new.winning_team == 'Draw', 'winning_team']=1
    df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Away Team Name'], 'winning_team']=0

    print(df_teams_new.count()   )
    df_teams_new.to_csv('datasets/raw_train_data.csv', encoding='gbk', index =False)

def save_dataset():
    df_teams_new = pd.read_csv('datasets/raw_train_data.csv', encoding='gbk')

    feature = df_teams_new[[ 'Home Team Name','Away Team Name']]
    vec = DictVectorizer(sparse=False)

    print(feature.to_dict(orient='records'))
    X =vec.fit_transform(feature.to_dict(orient='records'))
    X = X.astype('int')
    print("===")
    print(vec.get_feature_names_out())
    print(vec.feature_names_)
    # 1-D label vector, the shape scikit-learn estimators expect
    y = df_teams_new['winning_team'].astype('int')
    print(X.shape)
    print(y.shape)
    joblib.dump(vec, root_path+"/vec.joblib")
    np.savez('datasets/train_data', x= X, y = y)

if __name__ == '__main__':
    reprocess_dataset()
    save_dataset()
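The row-by-row loop in reprocess_dataset() can also be written in vectorized form. The following is a minimal sketch, not the author's code: it assumes the same column names as WorldCupMatches.csv, uses three made-up results, and assigns the 2/1/0 labels directly with np.select instead of going through team names.

```python
import numpy as np
import pandas as pd

# Three made-up results, for illustration only
matches = pd.DataFrame({
    'Home Team Name': ['France', 'Brazil', 'Japan'],
    'Away Team Name': ['Mexico', 'Spain', 'Ghana'],
    'Home Team Goals': [2, 1, 0],
    'Away Team Goals': [0, 1, 3],
})

home_goals = matches['Home Team Goals']
away_goals = matches['Away Team Goals']

# 2 = home win, 0 = away win, 1 = draw (the default when neither condition holds)
matches['winning_team'] = np.select(
    [home_goals > away_goals, home_goals < away_goals],
    [2, 0],
    default=1,
)
matches['goal_difference'] = (home_goals - away_goals).abs()

print(matches[['Home Team Name', 'Away Team Name', 'winning_team', 'goal_difference']])
```

Comparing whole columns at once avoids indexing each row by position, which matters once rows have been filtered or reordered.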


3. Model training and selection

We train several classical machine learning models on the same data and compare their accuracy:

Model                  Training Accuracy   Test Accuracy
Logistic Regression    67.40%              61.60%
SVM                    67.30%              62.70%
Naive Bayes            65.50%              63.80%
Random Forest          90.80%              65.50%
XGB                    75.30%              62.00%

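The random forest model has the highest accuracy on the test set (65.5%), so we use it for prediction. Its training accuracy (90.8%) is much higher than its test accuracy, which suggests it overfits the training split.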
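As a quick check on that choice, the gap between training and test accuracy can be computed for each model. This is only a sketch of the arithmetic, with the numbers copied from the table above; the variable names are illustrative.

```python
# (training accuracy %, test accuracy %) copied from the table above
scores = {
    'Logistic Regression': (67.4, 61.6),
    'SVM': (67.3, 62.7),
    'Naive Bayes': (65.5, 63.8),
    'Random Forest': (90.8, 65.5),
    'XGB': (75.3, 62.0),
}

# Gap between training and test accuracy, in percentage points
gaps = {name: round(train - test, 1) for name, (train, test) in scores.items()}
best_by_test = max(scores, key=lambda name: scores[name][1])
smallest_gap = min(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f'{name}: gap {gap} points')
print('Highest test accuracy:', best_by_test)
print('Smallest gap:', smallest_gap)
```

Random forest wins on test accuracy but also has by far the largest gap, while Naive Bayes is within two points of it on the test set with almost no gap. With differences this small, the ranking could plausibly change under a different random split.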

The complete training code:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier


root_path = "models"

def get_dataset():
    train_data = np.load('datasets/train_data.npz')

    return train_data

def train_by_LogisticRegression(train_data):
    X = train_data['x']
    y = train_data['y']

    # Separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    joblib.dump(logreg, root_path+'/LogisticRegression_model.joblib')

    score = logreg.score(X_train, y_train)
    score2 = logreg.score(X_test, y_test)

    print("LogisticRegression Training set accuracy: ", '%.3f'%(score))
    print("LogisticRegression Test set accuracy: ", '%.3f'%(score2))

def train_by_svm(train_data):
    X = train_data['x']
    y = train_data['y']

    # Separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

    model = svm.SVC(kernel='linear', verbose=True, probability=True)
    model.fit(X_train, y_train)
    joblib.dump(model, root_path+'/svm_model.joblib')

    score = model.score(X_train, y_train)
    score2 = model.score(X_test, y_test)

    print("SVM Training set accuracy: ", '%.3f' % (score))
    print("SVM Test set accuracy: ", '%.3f' % (score2))

def train_by_naive_bayes(train_data):
    X = train_data['x']
    y = train_data['y']

    # Separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

    model = MultinomialNB()
    model.fit(X_train, y_train)
    joblib.dump(model, root_path+'/naive_bayes_model.joblib')

    score = model.score(X_train, y_train)
    score2 = model.score(X_test, y_test)

    print("naive_bayes Training set accuracy: ", '%.3f' % (score))
    print("naive_bayes Test set accuracy: ", '%.3f' % (score2))

def train_by_random_forest(train_data):
    X = train_data['x']
    y = train_data['y']

    # Separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

    model = RandomForestClassifier(criterion='gini', max_features='sqrt')
    model.fit(X_train, y_train)
    joblib.dump(model, root_path+'/random_forest_model.joblib')

    score = model.score(X_train, y_train)
    score2 = model.score(X_test, y_test)

    print("random forest Training set accuracy: ", '%.3f' % (score))
    print("random forest Test set accuracy: ", '%.3f' % (score2))


def train_by_xgb(train_data):
    X = train_data['x']
    y = train_data['y']

    # Separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

    model = XGBClassifier(use_label_encoder=False)
    model.fit(X_train, y_train)
    joblib.dump(model, root_path+'/xgb_model.joblib')

    score = model.score(X_train, y_train)

    score2 = model.score(X_test, y_test)

    print("xgb Training set accuracy: ", '%.3f' % (score))
    print("xgb Test set accuracy: ", '%.3f' % (score2))

    y_pred = model.predict(X_test)

    report = classification_report(y_test, y_pred, output_dict=True)
    # show_confusion_matrix(y_test, y_pred)
    print(report)

def show_confusion_matrix(y_true, y_pred, pic_name = "confusion_matrix"):
    confusion = confusion_matrix(y_true=y_true, y_pred=y_pred)
    print(confusion)

    sns.heatmap(confusion, annot=True, cmap= 'Blues', xticklabels=['0','1','2'], yticklabels=['0','1','2'], fmt = '.20g')
    plt.xlabel('Predicted class')
    plt.ylabel('Actual Class')
    plt.title(pic_name)
    # plt.savefig('pic/' + pic_name)
    plt.show()

if __name__ == '__main__':
    train_data = get_dataset()
    train_by_LogisticRegression(train_data)
    train_by_svm(train_data)
    train_by_naive_bayes(train_data)
    train_by_random_forest(train_data)
    train_by_xgb(train_data)
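The five train_by_* functions repeat the same split, fit, and score steps. One way to condense them is a single loop over a dictionary of models. This is a sketch only: compare_models is a hypothetical helper, the data is a random stand-in for datasets/train_data.npz, and SVM and XGBoost are left out to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def compare_models(models, X, y, test_size=0.30, random_state=42):
    """Fit each model on the same split and return {name: (train_acc, test_acc)}."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = (model.score(X_train, y_train),
                         model.score(X_test, y_test))
    return results


# Synthetic stand-in for train_data.npz: random 0/1 features and labels 0/1/2
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = rng.integers(0, 3, size=200)

models = {
    'LogisticRegression': LogisticRegression(),
    'naive_bayes': MultinomialNB(),
    'random forest': RandomForestClassifier(
        criterion='gini', max_features='sqrt', random_state=0),
}
results = compare_models(models, X, y)
for name, (train_acc, test_acc) in results.items():
    print(name, 'train: %.3f' % train_acc, 'test: %.3f' % test_acc)
```

Using one shared split guarantees that every model is scored on exactly the same test rows. On the random labels used here the test accuracies carry no meaning; the point is the structure.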

4. Prediction

Running the prediction code below, the model predicts that Ecuador beats Qatar and that England beats Iran:

[2]
[[0.05       0.22033333 0.72966667]]
Probability of  Ecuador  winning: 0.730
Probability of Draw: 0.220
Probability of  Qatar  winning: 0.050
[2]
[[0.02342857 0.21770455 0.75886688]]
Probability of  England  winning: 0.759
Probability of Draw: 0.218
Probability of   Iran  winning: 0.023
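In this output, the columns of the probability array follow the order of the classifier's classes_ attribute, which here is 0 (away win), 1 (draw), 2 (home win). The sketch below shows one way to read the probabilities by label instead of by hard-coded position. describe_prediction and OUTCOMES are hypothetical names, and the numbers are copied from the Ecuador vs Qatar output above.

```python
import numpy as np

# Label meaning used in preprocessing: 2 = home win, 1 = draw, 0 = away win
OUTCOMES = {2: 'home win', 1: 'draw', 0: 'away win'}


def describe_prediction(classes, proba_row):
    """Map each class label to its probability and return the most likely outcome."""
    by_label = {int(label): float(p) for label, p in zip(classes, proba_row)}
    best_label = max(by_label, key=by_label.get)
    return by_label, OUTCOMES[best_label]


# Values copied from the Ecuador vs Qatar output; classes mimics model.classes_
classes = np.array([0, 1, 2])
proba_row = np.array([0.05, 0.22033333, 0.72966667])

by_label, outcome = describe_prediction(classes, proba_row)
print(by_label)
print(outcome)
```

Looking probabilities up by label keeps the report correct even if a model was trained on data where one of the three outcomes never occurred, in which case classes_ would have fewer entries.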

Complete code:

import joblib

worldcup_teams = ['Qatar','Germany','Denmark', 'Brazil','France','Belgium', 'Serbia',
                  'Spain','Croatia', 'Switzerland', 'England','Netherlands', 'Argentina',' Iran',
                  'Korea Republic','Saudi Arabia', 'Japan', 'Uruguay','Ecuador','Canada',
                  'Senegal', 'Poland', 'Portugal','Tunisia',  'Morocco','Cameroon','USA',
                  'Mexico','Wales','Australia','Costa Rica', 'Ghana']
root_path = "models"
def verify_team_name(team_name):
    # A name is valid only if it matches an entry in worldcup_teams exactly
    return team_name in worldcup_teams


def predict(model_dir =root_path+'/LogisticRegression_model.joblib', team_a='France', team_b = 'Mexico'):

    if not verify_team_name(team_a):
        print(team_a, ' is not correct')
        return
    if not verify_team_name(team_b) :
        print(team_b, ' is not correct')
        return

    # Load the trained classifier and the fitted DictVectorizer
    model = joblib.load(model_dir)
    vec = joblib.load(root_path+"/vec.joblib")

    input_x = vec.transform([{'Home Team Name': team_a, 'Away Team Name': team_b}])

    # Predicted class: 2 = team_a (home) wins, 1 = draw, 0 = team_b (away) wins
    result = model.predict(input_x)
    print(result)

    # Class probabilities, in the order of model.classes_ (0, 1, 2)
    result1 = model.predict_proba(input_x)
    print(result1)
    print('Probability of ',team_a , ' winning:', '%.3f'%result1[0][2])
    print('Probability of Draw:', '%.3f' % result1[0][1])
    print('Probability of ', team_b, ' winning:', '%.3f' % result1[0][0])

if __name__ == '__main__':
    team_a = 'Ecuador'
    team_b = 'Qatar'
    predict('models/random_forest_model.joblib', team_a, team_b)
    team_a = 'England'
    team_b = ' Iran'
    predict('models/random_forest_model.joblib', team_a, team_b)
