1 Requirements
Requirement 1: from transformers import AutoTokenizer
Requirement 2: from transformers import BertTokenizer
2 Interface
Key parameters
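- text: the string, or list of strings, to encode
- padding: True or "longest" pads to the longest sequence in the batch; "max_length" pads to max_length; False (the default) disables padding
- truncation: True truncates to max_length, or to the model's maximum input length when max_length is not given; False (the default) disables truncation
- return_tensors: "pt" returns PyTorch tensors, "np" returns NumPy arrays; when omitted, plain Python lists are returned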
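To make the effect of padding and truncation concrete, here is a minimal pure-Python sketch. It is not the transformers implementation: the function name pad_and_truncate, the pad id 0, and the sample ids are assumptions made for illustration, and unlike a real BERT tokenizer it does not preserve special tokens when truncating.

```python
from typing import Dict, List, Optional

PAD_ID = 0  # assumed pad id for this toy example


def pad_and_truncate(
    batch: List[List[int]],
    max_length: Optional[int] = None,
    pad_id: int = PAD_ID,
) -> Dict[str, List[List[int]]]:
    """Toy stand-in for truncation=True plus padding to the longest sequence."""
    # Truncation: cut every sequence down to at most max_length ids.
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one left in the batch.
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

After padding, every row has the same length, which is what allows a batch to be converted into a single rectangular tensor when return_tensors is set.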
3 Examples
BertTokenizer inherits from_pretrained() through this chain:
BertTokenizer.from_pretrained()
PreTrainedTokenizer
PreTrainedTokenizerBase.from_pretrained()
text = "今天天气不错"  # "The weather is nice today"
# Step 1: data preprocessing (raw text -> input IDs)
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('./model')
# Method 1: tokenize(), then convert_tokens_to_ids() (no special tokens are added)
tokens = tokenizer.tokenize(text)
print(tokenizer.convert_tokens_to_ids(tokens))
# Method 2: encode() (adds [CLS] and [SEP] unless add_special_tokens=False)
print(tokenizer.encode(text))
print(tokenizer.encode(text, add_special_tokens=False))
# Method 3: encode_plus() (older API; the documentation recommends calling the tokenizer directly instead)
print(tokenizer.encode_plus(text, padding=True, truncation=True, return_tensors="pt"))
# Method 4: call the tokenizer directly (__call__)
print(tokenizer(text, padding=True, truncation=True, return_tensors="pt"))
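How the four methods relate to each other can be sketched without loading a model. The code below is a toy imitation, not the library's code: the vocabulary and its ids are hypothetical (real ids come from the vocab file in ./model), the function names only mirror the tokenizer methods, and encode_full is a made-up name standing in for tokenizer(text) on a single sentence without tensor conversion.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; real ids come from the model's vocab file.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "今": 4, "天": 5, "气": 6, "不": 7, "错": 8,
}


def tokenize(text: str) -> List[str]:
    # BERT splits Chinese text into single characters; this toy does only that.
    return list(text)


def convert_tokens_to_ids(tokens: List[str]) -> List[int]:
    # Unknown tokens fall back to the [UNK] id.
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]


def encode(text: str, add_special_tokens: bool = True) -> List[int]:
    ids = convert_tokens_to_ids(tokenize(text))
    if add_special_tokens:
        # Wrap the sentence as [CLS] ... [SEP].
        return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids


def encode_full(text: str) -> Dict[str, List[int]]:
    # Same ids as encode(), plus the extra fields a BERT tokenizer returns.
    input_ids = encode(text)
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),
        "attention_mask": [1] * len(input_ids),
    }


text = "今天天气不错"
method_1 = convert_tokens_to_ids(tokenize(text))
method_2 = encode(text)
method_4 = encode_full(text)
print(method_1)  # [4, 5, 5, 6, 7, 8]
print(method_2)  # [2, 4, 5, 5, 6, 7, 8, 3]
print(method_4["input_ids"] == method_2)  # True
```

In this sketch, Method 1 matches encode(text, add_special_tokens=False), while encode() and the direct call share the same input_ids; the direct call just returns them inside a dictionary together with token_type_ids and attention_mask.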
4 References
https://huggingface.co/docs/transformers/main_classes/tokenizer
https://hf-mirror.com/docs/transformers/main_classes/tokenizer