[R] Why is data manipulation crucial and sensitive?

news2024/11/26 21:46:10

What does a data scientist really do?

Identifying patterns in cultural consumption, making fancy graphs, engaging a dialogue between the data and the existing literature, refining hypotheses... (done within one month, with three to four online meetings with partners = no more than 35 hours to agree on the main assertions)

Task                                                                          Hours   Share
Literature review                                                                60     20%
Primary definition of the hypothesis                                              5      2%
Getting familiar with the codebook and the survey                                10      3%
Explore the potential variables of interest                                      20      7%
Rename the variables of interest                                                 15      5%
Recode the variables of interest and translate into English                      70     23%
Non-answer cleaning                                                               5      2%
Rename the labels (levels)                                                       10      3%
Primary analysis of the outputs (inspect the recoded variables, bivariate analysis)   20      7%
Reformulation of some hypotheses                                                  5      2%
Plotting the first MCAs and analyzing them                                       15      5%
Compare model strength and understand the primary outputs                         5      2%
Refining hypotheses and assertions                                               15      5%
Writing the article                                                              50     16%
Total                                                                           305    100%

 What is “cleaning and organizing data”?

Definition:

  • Cleaning and organizing data refer to the process of preparing raw data for analysis by identifying and correcting errors, inconsistencies, and inaccuracies, and structuring it in a way that facilitates effective analysis.

Steps Involved:

  • Data Cleaning:
    • Handling missing values.
    • Removing duplicates.
    • Correcting errors and inconsistencies.
  • Data Organization:
    • Structuring data in a readable format.
    • Categorizing and labeling data.
    • Creating meaningful variables.

Removing Duplicates:

# Removing duplicate rows
unique_data <- unique(raw_data)

Correcting Errors and Inconsistencies:

# Replacing incorrect values (incorrect_condition is a logical vector or an index)
corrected_data <- replace(raw_data, incorrect_condition, replacement_value)

Structuring Data:

# Creating a data frame from vectors of equal length (add further columns as needed)
structured_data <- data.frame(variable1 = vector1, variable2 = vector2)

Categorizing and Labeling Data:

# Creating a factor for a categorical variable (list every expected category in levels)
categorized_data <- factor(raw_data$variable, levels = c("Category1", "Category2"))

Creating Meaningful Variables:

# Creating a new variable based on existing ones
raw_data$new_variable <- raw_data$variable1 + raw_data$variable2
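
The snippets above use placeholder names, and the list of steps also mentions missing values. As a minimal sketch, here is how the steps could be chained on a small made-up data frame using base R only. The column names (id, gender, books, films), the values, and the convention that -1 marks an impossible answer are assumptions made for illustration; they do not come from the survey.

```r
# Made-up survey extract: every name and value here is illustrative only
raw_data <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  gender = c("F", "M", "M", "f", "F"),
  books  = c(3, 10, 10, -1, 7),   # -1 is treated as an impossible value
  films  = c(5, 2, 2, 4, NA),
  stringsAsFactors = FALSE
)

# Removing duplicates: respondent 2 appears twice
clean <- unique(raw_data)

# Correcting errors and inconsistencies
clean$books  <- replace(clean$books, clean$books < 0, NA)  # impossible value becomes missing
clean$gender <- toupper(clean$gender)                      # "f" and "F" become one category

# Handling missing values: count them per column, then keep complete rows separately
n_missing     <- colSums(is.na(clean))
complete_data <- clean[complete.cases(clean), ]

# Categorizing and labeling
clean$gender <- factor(clean$gender, levels = c("F", "M"),
                       labels = c("Female", "Male"))

# Creating a meaningful variable (NA wherever one of the inputs is missing)
clean$total_items <- clean$books + clean$films
```

Keeping both clean (all rows, with NA where the answer is unusable) and complete_data (complete rows only) is one possible design choice: it postpones the decision about dropping respondents until the analysis step.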

Common issues with online surveys

Data not written in the right format: the typical issue with year of birth

家庭状况与教育经历
47、您的出生年份是?(请填写整数,例如:1984) (填空题 *必答)
________________________

Section: Family situation and educational background
47. In which year were you born? (Please enter a whole number, e.g. 1984) (fill-in question, *required)

 

In the raw data, 2 people were born in 1898, which would make them 120 years old.

Given the average age of the population, they were most likely born in 1998.

25 respondents answered using the format Year/Month/Day

Ex: 19940105

2 respondents answered using very unusual formats

Ex: 930524 / 197674

1 respondent just answered 1
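
The cases listed above can be handled with a few explicit rules. The following is a minimal sketch in base R, not the cleaning script actually used for this survey: the helper name clean_birth_year, the survey year of 2018 (inferred from "1898 = 120 years old"), and the maximum age of 100 are all assumptions. It extracts the year from the two unambiguous formats and leaves everything else as NA for manual review.

```r
# Hypothetical helper: the name, survey_year and max_age are assumptions
clean_birth_year <- function(x, survey_year = 2018, max_age = 100) {
  x <- trimws(as.character(x))
  year <- rep(NA_integer_, length(x))

  # Plain four-digit year, e.g. "1984"
  is_year <- grepl("^[0-9]{4}$", x)
  year[is_year] <- as.integer(x[is_year])

  # Full date written as YYYYMMDD, e.g. "19940105": keep the first four digits
  is_date <- grepl("^[0-9]{8}$", x)
  year[is_date] <- as.integer(substr(x[is_date], 1, 4))

  # Anything else ("930524", "197674", "1") stays NA for manual review

  # Implausible years (such as 1898) are also set to NA rather than guessed
  implausible <- !is.na(year) &
    (year < survey_year - max_age | year > survey_year)
  year[implausible] <- NA_integer_

  year
}

answers <- c("1984", "1898", "19940105", "930524", "197674", "1")
cleaned <- clean_birth_year(answers)

# Side-by-side view of raw answers, cleaned years, and rows needing a human decision
review_table <- data.frame(answer = answers,
                           birth_year = cleaned,
                           needs_review = is.na(cleaned))
```

Flagging the doubtful answers instead of correcting them automatically keeps judgment calls, such as reading 1898 as 1998, visible and documented rather than buried in the code.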

How to clean efficiently with R?

tidyr

tidyr is a key package for reshaping tables, for example turning a wide table into a long one.

tidyr is not covered here, but basic operations are explained on this website: https://mgimond.github.io/ES218/Week03b.html

dplyr

dplyr is a very important package that lets you select specific variables and observations, and transform them.

dplyr Package in R:
  1. Selection of Specific Variables:

    • select() function: It allows you to choose specific columns (variables) from a data frame.
      # Example: Selecting columns "variable1" and "variable2"
      selected_data <- select(your_data_frame, variable1, variable2)
      

  2. Filtering Data:

    • filter() function: Enables you to subset your data based on specific conditions.
      # Example: Filtering data where "variable1" is greater than 10
      filtered_data <- filter(your_data_frame, variable1 > 10)

  3. Transformation (Mutating) Data:

    • mutate() function: Allows you to create new variables or modify existing ones.
      # Example: Creating a new variable "new_variable" as a transformation of existing variables
      mutated_data <- mutate(your_data_frame, new_variable = variable1 + variable2)

  4. Arranging Data:

    • arrange() function: Sorts rows based on specified variables.
      # Example: Sorting data based on "variable1" in ascending order
      sorted_data <- arrange(your_data_frame, variable1)

  5. Summarizing Data:

    • summarize() function: Aggregates data, often used with functions like mean, sum, etc.
      # Example: Calculating the mean of "variable1"
      summary_stats <- summarize(your_data_frame, mean_variable1 = mean(variable1))
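
To see the five verbs work together, here is a minimal sketch that applies them one after another to a small made-up data frame. The values and the extra column variable3 are assumptions for illustration; only the function names come from dplyr itself.

```r
library(dplyr)

# Made-up data frame; column names follow the placeholders used above
your_data_frame <- data.frame(
  variable1 = c(12, 5, 20, 8),
  variable2 = c(1, 2, 3, 4),
  variable3 = c("a", "b", "c", "d")
)

# 1. Keep only the two numeric columns
selected_data <- select(your_data_frame, variable1, variable2)

# 2. Keep rows where variable1 is greater than 10 (the rows with 12 and 20)
filtered_data <- filter(selected_data, variable1 > 10)

# 3. Add a new column computed from the existing ones
mutated_data <- mutate(filtered_data, new_variable = variable1 + variable2)

# 4. Sort in descending order of variable1 using desc()
sorted_data <- arrange(mutated_data, desc(variable1))

# 5. Collapse the remaining rows into a single mean
summary_stats <- summarize(sorted_data, mean_variable1 = mean(variable1))
```

Each step stores its result in a new object, which makes intermediate results easy to inspect while learning, at the cost of many temporary variables.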
      

The magrittr package

The magrittr package offers a set of operators which make your code more readable by:

structuring sequences of data operations left-to-right, avoiding nested function calls, minimizing the need for local variables and function definitions, and making it easy to add steps anywhere in the sequence of operations.

The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator.

https://magrittr.tidyverse.org/
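
As a minimal sketch of the f(x) versus x %>% f() equivalence, the example below computes the same value twice, once with nested calls and once with the pipe. The vector x and the choice of functions are made up for illustration.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out: sqrt, then mean, then round
nested <- round(mean(sqrt(x)), 1)

# The same steps, piped left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms give the same result
same_result <- identical(nested, piped)
```

In the piped version, the left-hand value becomes the first argument of each call, so round(1) here means round(value, 1).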
