Download the data:
aws s3 cp s3://applied-nlp-book/data/ data --recursive --no-sign-request
aws s3 cp s3://applied-nlp-book/models/ag_dataset/ models/ag_dataset --recursive --no-sign-request
The first download above is close to 1 GB and the second is close to 3 GB.
Example code:
import spacy
# Load the pretrained transformer pipeline (built on RoBERTa-base, which uses the BERT-base architecture)
nlp = spacy.load("en_core_web_trf")
# Tokenize the sentence using only the pipeline's tokenizer
sentence = nlp.tokenizer("We live in Paris.")
print("The tokens:")
for token in sentence:
    print(token)
import os
import pandas as pd
cwd = os.getcwd()
# Read the Jeopardy questions from the CSV file
data = pd.read_csv(os.path.join(cwd, "data", "jeopardy_questions", "jeopardy_questions.csv"))
# Normalize the column names: lowercase them and strip the space that follows each comma in the header
data.columns = [col.lower().strip() for col in data.columns]
# Keep the first 1,000 rows and run the full spaCy pipeline on each question
data = data.iloc[:1000].copy()
data["question_tokens"] = data["question"].apply(nlp)
# Look at the first item (index 0)
example_question = data.question[0]
example_question_tokens = data.question_tokens[0]
print("The first question is:")
print(example_question)
print("The tokens from the first question are:")
for token in example_question_tokens:
    print(token)
An excerpt from the file
jeopardy_questions.csv:
Show Number, Air Date, Round, Category, Value, Question, Answer
4680,2004-12-31,Jeopardy!,"HISTORY","$200","For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory","Copernicus"
4680,2004-12-31,Jeopardy!,"ESPN's TOP 10 ALL-TIME ATHLETES","$200","No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves","Jim Thorpe"
4680,2004-12-31,Jeopardy!,"EVERYBODY TALKS ABOUT IT...","$200","The city of Yuma in this state has a record average of 4,055 hours of sunshine each year","Arizona"
4680,2004-12-31,Jeopardy!,"THE COMPANY LINE","$200","In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger","McDonald's"
4680,2004-12-31,Jeopardy!,"EPITAPHS & TRIBUTES","$200","Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States","John Adams"
4680,2004-12-31,Jeopardy!,"3-LETTER WORDS","$200","In the title of an Aesop fable, this insect shared billing with a grasshopper","the ant"
4680,2004-12-31,Jeopardy!,"HISTORY","$400","Built in 312 B.C. to link Rome & the South of Italy, it's still in use today","the Appian Way"
4680,2004-12-31,Jeopardy!,"ESPN's TOP 10 ALL-TIME ATHLETES","$400","No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls","Michael Jordan"
4680,2004-12-31,Jeopardy!,"EVERYBODY TALKS ABOUT IT...","$400","In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state","Washington"
4680,2004-12-31,Jeopardy!,"THE COMPANY LINE","$400","This housewares store was named for the packaging its merchandise came in & was first displayed on","Crate & Barrel"
4680,2004-12-31,Jeopardy!,"EPITAPHS & TRIBUTES","$400","""And away we go""","Jackie Gleason"
4680,2004-12-31,Jeopardy!,"3-LETTER WORDS","$400","Cows regurgitate this from the first stomach to the mouth & chew it again","the cud"
4680,2004-12-31,Jeopardy!,"HISTORY","$600","In 1000 Rajaraja I of the Cholas battled to take this Indian Ocean island now known for its tea","Ceylon (or Sri Lanka)"
4680,2004-12-31,Jeopardy!,"ESPN's TOP 10 ALL-TIME ATHLETES","$600","No. 1: Lettered in hoops, football & lacrosse at Syracuse & if you think he couldn't act, ask his 11 ""unclean"" buddies","Jim Brown"
4680,2004-12-31,Jeopardy!,"EVERYBODY TALKS ABOUT IT...","$600","On June 28, 1994 the nat'l weather service began issuing this index that rates the intensity of the sun's radiation","the UV index"
4680,2004-12-31,Jeopardy!,"THE COMPANY LINE","$600","This company's Accutron watch, introduced in 1960, had a guarantee of accuracy to within one minute a month","Bulova"
4680,2004-12-31,Jeopardy!,"EPITAPHS & TRIBUTES","$600","Outlaw: ""Murdered by a traitor and a coward whose name is not worthy to appear here""","Jesse James"
4680,2004-12-31,Jeopardy!,"3-LETTER WORDS","$600","A small demon, or a mischievous child (who might be a little demon!)","imp"
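The header row in the excerpt has a space after each comma, so the raw column names come out as " Air Date", " Question", and so on. That is presumably why the example code applies lower().strip() to the column names before accessing data["question"]. Below is a minimal sketch of that normalization using only Python's standard csv module instead of pandas; the sample string is copied from the first two lines of the excerpt, and the variable names are only illustrative.

```python
import csv
import io

# Header and first data row, copied from the jeopardy_questions.csv excerpt.
sample = (
    'Show Number, Air Date, Round, Category, Value, Question, Answer\n'
    '4680,2004-12-31,Jeopardy!,"HISTORY","$200",'
    '"For the last 8 years of his life, Galileo was under house arrest '
    'for espousing this man\'s theory","Copernicus"\n'
)

reader = csv.reader(io.StringIO(sample))
raw_header = next(reader)

# Same normalization as the example code: lowercase and strip whitespace.
columns = [name.lower().strip() for name in raw_header]

# Pair the cleaned column names with the values of the first data row.
first_row = dict(zip(columns, next(reader)))

print(raw_header)             # raw names keep their leading spaces
print(columns)                # cleaned names
print(first_row["question"])  # the lookup works only after cleaning
```

Without the strip step, the key would be " Question" (with a leading space), and a lookup by "question" would fail.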
Output: