诸神缄默不语 - personal CSDN blog post index
Last updated: 2023.6.7
First published: 2023.6.7
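Contents
- 1. Legal judgment prediction
- 2. General corpora
- 3. Other aggregated projects
- 4. Reasoning
- 5. NLU
- 6. NLG
- 1 QA
- 2 Text summarization
- 7. Information extraction
- 1 Named entity recognition
- 2 Sentence boundary detection (sentence splitting)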
1. Legal judgment prediction
Chinese:
- CAIL2018 (criminal law)
- Original papers: CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction; Overview of CAIL2018: Legal Judgment Prediction Competition
- Data download: https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip (for a detailed description of the data beyond the papers above, see thunlp/CAIL: Chinese AI & Law Challenge)
- Task: (classification) predict the applicable law articles, the charges, and the prison term
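As a rough illustration of the three prediction targets (law articles, charges, prison term), here is a minimal sketch that pulls them out of one JSON-lines record. The field names (`fact`, `meta`, `relevant_articles`, `accusation`, `term_of_imprisonment`) and the sample record are assumptions about the file layout, not something stated in this post, so check them against the downloaded data and the thunlp/CAIL description before relying on them. `extract_labels` is a hypothetical helper name.

```python
import json

# Hypothetical sample record. The field names are an ASSUMPTION about the
# CAIL2018 JSON-lines layout; verify them against the real files.
SAMPLE_LINE = json.dumps({
    "fact": "placeholder fact description",
    "meta": {
        "relevant_articles": [234],
        "accusation": ["故意伤害"],
        "term_of_imprisonment": {
            "death_penalty": False,
            "life_imprisonment": False,
            "imprisonment": 12,
        },
    },
}, ensure_ascii=False)


def extract_labels(line):
    """Return (fact, articles, charges, term) for one JSON-lines record."""
    record = json.loads(line)
    meta = record["meta"]
    term_info = meta["term_of_imprisonment"]
    if term_info["death_penalty"]:
        term = "death"
    elif term_info["life_imprisonment"]:
        term = "life"
    else:
        # Fixed-term sentence; the unit (assumed: months) should be verified.
        term = term_info["imprisonment"]
    return record["fact"], meta["relevant_articles"], meta["accusation"], term


fact, articles, charges, term = extract_labels(SAMPLE_LINE)
print(articles, charges, term)
```

Mapping death penalty and life imprisonment to separate labels is one possible design choice; a classifier could also bucket fixed terms into ranges.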
2. General corpora
Multilingual:
- MultiLegalPile
- Original paper: (2023) MultiLegalPile: A 689GB Multilingual Legal Corpus
- Data download: https://huggingface.co/datasets/joelito/Multi_Legal_Pile
- Data included in the project:
- https://huggingface.co/datasets/joelito/eurlex_resources
- https://huggingface.co/datasets/joelito/legal-mc4
- Pile of Law
- LexFiles
- Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
Spanish:
- Spanish Legal Domain Corpora
- Original paper: (2021) Spanish Legalese Language Model and Corpora
- Data download: Spanish Legal Domain Corpora | Zenodo
English:
- CaseHOLD
English Harvard Law case corpus (1965-2021)
- Original paper: (2021 ICAIL) When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
- Pile of Law
- Original paper: (2022 NeurIPS) Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
- Data download: https://huggingface.co/datasets/pile-of-law/pile-of-law
Chinese:
- Legal consultation data from 华律网 (China law network) and the corpus required by the paper; accompanying paper: 法律咨询文本分类系统设计与研究 (Design and research of legal consultation text classification system)
The legal consultation data and corpus of the thesis from China law network. Replication Data for: Design and research of legal consultation text classification system. - Data Driven Innovation Research Competition for University of China
3. Other aggregated projects
Multilingual:
- LexGLUE
coastalcph/lex-glue: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
- Original paper: (2021) LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
- LEXTREME
- Original paper: (2023) LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
- Data download: https://huggingface.co/datasets/joelito/lextreme
Not yet sorted through:
- https://github.com/neelguha/legal-ml-datasets
4. Reasoning
- legalbench
- Original paper: (2022) LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning
- Data download: https://github.com/HazyResearch/legalbench
English:
- SARA: roughly, reasoning about whether a given situation falls under a given statute (9 Sections of the US tax code)
- Original paper: (2020) A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering
5. NLU
- SemEval 2023 Task 6: LegalEval - Understanding Legal Texts
- Tasks: Rhetorical Roles Labeling, named entity recognition, explainable legal judgment prediction
6. NLG
1 QA
Chinese:
- JEC-QA
Dataset built from China's legal professional qualification exam (法考)
https://jecqa.thunlp.org/
- Original paper: (2020 AAAI) JEC-QA: A Legal-Domain Question Answering Dataset
2 Text summarization
English:
- BillSum
7. Information extraction
1 Named entity recognition
Portuguese (Brazil):
- CDJUR-BR
- Original paper: (2023) CDJUR-BR – A Golden Collection of Legal Document from Brazilian Justice with Fine-Grained Named Entities
2 Sentence boundary detection (sentence splitting)
Multilingual:
- MultiLegalSBD (English, Spanish, German, Italian, Portuguese, French)
- Original paper: (2023 ICAIL) MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
- Data download: https://huggingface.co/datasets/rcds/MultiLegalSBD