We are sitting in https://schizyfos.files.wordpress.com/2020/05/img_5045.jpg
and giving ourselves a technical presentation; the example is taken from https://realpython.com/python-keras-text-classification/
The raw material of data science is the DATASET, downloaded from here: https://realpython.com/python-keras-text-classification/ https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences It is a DATASET of labelled review sentences from Yelp, Amazon and IMDb (restaurant, product and movie reviews). We load each file into a DataFrame, collect the DataFrames in a list, and display them.
import pandas as pd
filepath_dict = {'yelp': '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/yelp_labelled.txt',
'amazon': '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/amazon_cells_labelled.txt',
'imdb': '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/imdb_labelled.txt'}
df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)
df = pd.concat(df_list)
print(df.iloc[0])
print(df_list)
[     sentence  label  source
0  Wow... Loved this place.  1  yelp
1  Crust is not good.  0  yelp
2  Not tasty and the texture was just nasty.  0  yelp
3  Stopped by during the late May bank holiday of...  1  yelp
4  The selection on the menu was great and so wer...  1  yelp
5  Now I am getting angry and I want my damn pho.  0  yelp
6  Honeslty it didn't taste THAT fresh.)  0  yelp
7  The potatoes were like rubber and you could te...  0  yelp
8  The fries were great too.  1  yelp
9  A great touch.  1  yelp
10  Service was very prompt.  1  yelp
11  Would not go back.  0  yelp
12  The cashier had no care what so ever on what I...  0  yelp
13  I tried the Cape Cod ravoli, chicken,with cran...  1  yelp
14  I was disgusted because I was pretty sure that...  0  yelp
15  I was shocked because no signs indicate cash o...  0  yelp
16  Highly recommended.  1  yelp
17  Waitress was a little slow in service.  0  yelp
18  This place is not worth your time, let alone V...  0  yelp
19  did not like at all.  0  yelp
20  The Burrittos Blah!  0  yelp
21  The food, amazing.  1  yelp
22  Service is also cute.  1  yelp
23  I could care less... The interior is just beau...  1  yelp
24  So they performed.  1  yelp
25  That's right....the red velvet cake.....ohhh t...  1  yelp
26  - They never brought a salad we asked for.  0  yelp
27  This hole in the wall has great Mexican street...  1  yelp
28  Took an hour to get our food only 4 tables in ...  0  yelp
29  The worst was the salmon sashimi.  0  yelp
..  ...  ...  ...
970  I immediately said I wanted to talk to the man...  0  yelp
971  The ambiance isn't much better.  0  yelp
972  Unfortunately, it only set us up for disapppoi...  0  yelp
973  The food wasn't good.  0  yelp
974  Your servers suck, wait, correction, our serve...  0  yelp
975  What happened next was pretty....off putting.  0  yelp
976  too bad cause I know it's family owned, I real...  0  yelp
977  Overpriced for what you are getting.  0  yelp
978  I vomited in the bathroom mid lunch.  0  yelp
979  I kept looking at the time and it had soon bec...  0  yelp
980  I have been to very few places to eat that und...  0  yelp
981  We started with the tuna sashimi which was bro...  0  yelp
982  Food was below average.  0  yelp
983  It sure does beat the nachos at the movies but...  0  yelp
984  All in all, Ha Long Bay was a bit of a flop.  0  yelp
985  The problem I have is that they charge $11.99 ...  0  yelp
986  Shrimp- When I unwrapped it (I live only 1/2 a...  0  yelp
987  It lacked flavor, seemed undercooked, and dry.  0  yelp
988  It really is impressive that the place hasn't ...  0  yelp
989  I would avoid this place if you are staying in...  0  yelp
990  The refried beans that came with my meal were ...  0  yelp
991  Spend your money and time some place else.  0  yelp
992  A lady at the table next to us found a live gr...  0  yelp
993  the presentation of the food was awful.  0  yelp
994  I can't tell you how disappointed I was.  0  yelp
995  I think food should have flavor and texture an...  0  yelp
996  Appetite instantly gone.  0  yelp
997  Overall I was not impressed and would not go b...  0  yelp
998  The whole experience was underwhelming, and I ...  0  yelp
999  Then, as if I hadn't wasted enough of my life ...  0  yelp
[1000 rows x 3 columns],
      sentence  label  source
0  So there is no way for me to plug it in here i...  0  amazon
1  Good case, Excellent value.  1  amazon
2  Great for the jawbone.  1  amazon
3  Tied to charger for conversations lasting more...  0  amazon
4  The mic is great.  1  amazon
5  I have to jiggle the plug to get it to line up...  0  amazon
6  If you have several dozen or several hundred c...  0  amazon
7  If you are Razr owner...you must have this!  1  amazon
8  Needless to say, I wasted my money.  0  amazon
9  What a waste of money and time!.  0  amazon
10  And the sound quality is great.  1  amazon
11  He was very impressed when going from the orig...  1  amazon
12  If the two were seperated by a mere 5+ ft I st...  0  amazon
13  Very good quality though  1  amazon
14  The design is very odd, as the ear "clip" is n...  0  amazon
15  Highly recommend for any one who has a blue to...  1  amazon
16  I advise EVERYONE DO NOT BE FOOLED!  0  amazon
17  So Far So Good!.  1  amazon
18  Works great!.  1  amazon
19  It clicks into place in a way that makes you w...  0  amazon
20  I went on Motorola's website and followed all ...  0  amazon
21  I bought this to use with my Kindle Fire and a...  1  amazon
22  The commercials are the most misleading.  0  amazon
23  I have yet to run this new battery below two b...  1  amazon
24  I bought it for my mother and she had a proble...  0  amazon
25  Great Pocket PC / phone combination.  1  amazon
26  I've owned this phone for 7 months now and can...  1  amazon
27  I didn't think that the instructions provided ...  0  amazon
28  People couldnt hear me talk and I had to pull ...  0  amazon
29  Doesn't hold charge.  0  amazon
..  ...  ...  ...
970  I plugged it in only to find out not a darn th...  0  amazon
971  Excellent product.  1  amazon
972  Earbud piece breaks easily.  0  amazon
973  Lousy product.  0  amazon
974  This phone tries very hard to do everything bu...  0  amazon
975  It is the best charger I have seen on the mark...  1  amazon
976  SWEETEST PHONE!!!  1  amazon
977  :-)Oh, the charger seems to work fine.  1  amazon
978  It fits so securely that the ear hook does not...  1  amazon
979  Not enough volume.  0  amazon
980  Echo Problem....Very unsatisfactory  0  amazon
981  you could only take 2 videos at a time and the...  0  amazon
982  don't waste your money.  0  amazon
983  I am going to have to be the first to negative...  0  amazon
984  Adapter does not provide enough charging current.  0  amazon
985  There was so much hype over this phone that I ...  0  amazon
986  You also cannot take pictures with it in the c...  0  amazon
987  Phone falls out easily.  0  amazon
988  It didn't work, people can not hear me when I ...  0  amazon
989  The text messaging feature is really tricky to...  0  amazon
990  I'm really disappointed all I have now is a ch...  0  amazon
991  Painful on the ear.  0  amazon
992  Lasted one day and then blew up.  0  amazon
993  disappointed.  0  amazon
994  Kind of flops around.  0  amazon
995  The screen does get smudged easily because it ...  0  amazon
996  What a piece of junk.. I lose more calls on th...  0  amazon
997  Item Does Not Match Picture.  0  amazon
998  The only thing that disappoint me is the infra...  0  amazon
999  You can not answer calls with the unit, never ...  0  amazon
[1000 rows x 3 columns],
      sentence  label  source
0  A very, very, very slow-moving, aimless movie ...  0  imdb
1  Not sure who was more lost - the flat characte...  0  imdb
2  Attempting artiness with black & white and cle...  0  imdb
3  Very little music or anything to speak of.  0  imdb
4  The best scene in the movie was when Gerardo i...  1  imdb
5  The rest of the movie lacks art, charm, meanin...  0  imdb
6  Wasted two hours.  0  imdb
7  Saw the movie today and thought it was a good ...  1  imdb
8  A bit predictable.  0  imdb
9  Loved the casting of Jimmy Buffet as the scien...  1  imdb
10  And those baby owls were adorable.  1  imdb
11  The movie showed a lot of Florida at it's best...  1  imdb
12  The Songs Were The Best And The Muppets Were S...  1  imdb
13  It Was So Cool.  1  imdb
14  This is a very "right on case" movie that deli...  1  imdb
15  It had some average acting from the main perso...  0  imdb
16  This review is long overdue, since I consider ...  1  imdb
17  I'll put this gem up against any movie in term...  1  imdb
18  It's practically perfect in all of them a tr...  1  imdb
19  The structure of this film is easily the most...  0  imdb
20  This if the first movie I've given a 10 to in ...  1  imdb
21  If there was ever a movie that needed word-of-...  1  imdb
22  Overall, the film is interesting and thought-p...  1  imdb
23  Plus, it was well-paced and suited its relativ...  1  imdb
24  Give this one a look.  1  imdb
25  I gave it a 10  1  imdb
26  The Wind and the Lion is well written and supe...  1  imdb
27  It is a true classic.  1  imdb
28  It actually turned out to be pretty decent as ...  1  imdb
29  Definitely worth checking out.  1  imdb
..  ...  ...  ...
718  Enough can not be said of the remarkable anima...  1  imdb
719  The art style has the appearance of crayon/pen...  1  imdb
720  If you act in such a film, you should be glad ...  0  imdb
721  This one wants to surf on the small wave of sp...  0  imdb
722  If you haven't choked in your own vomit by the...  0  imdb
723  Still, it makes up for all of this with a supe...  1  imdb
724  Just consider the excellent story, solid actin...  1  imdb
725  Instead, we got a bore fest about a whiny, spo...  0  imdb
726  Then I watched it again two Sundays ago (March...  1  imdb
727  It is a very well acted and done TV Movie.  1  imdb
728  Judith Light is one of my favorite actresses a...  1  imdb
729  I keep watching it over and over.  1  imdb
730  It's a sad movie, but very good.  1  imdb
731  If you have not seen this movie, I definitely ...  1  imdb
732  She is as lovely as usual, this cutie!  1  imdb
733  Still it's quite interesting and entertaining ...  1  imdb
734  ;) Recommend with confidence!  1  imdb
735  This movie is well-balanced with comedy and dr...  1  imdb
736  It was a riot to see Hugo Weaving play a sex-o...  1  imdb
737  :) Anyway, the plot flowed smoothly and the ma...  1  imdb
738  The opening sequence of this gem is a classic,...  1  imdb
739  Fans of the genre will be in heaven.  1  imdb
740  Lange had become a great actress.  1  imdb
741  It looked like a wonderful story.  1  imdb
742  I never walked out of a movie faster.  0  imdb
743  I just got bored watching Jessice Lange take h...  0  imdb
744  Unfortunately, any virtue in this film's produ...  0  imdb
745  In a word, it is embarrassing.  0  imdb
746  Exceptionally bad!  0  imdb
747  All in all its an insult to one's intelligence...  0  imdb
[748 rows x 3 columns]]
Two model sentences show what text vectorization is.
from sklearn.feature_extraction.text import CountVectorizer
sentences = ['John likes ice cream because of ice cream and ice cream likes John', 'John hates chocolate because of chocolate and chocolate likes John.']
vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectorizer.fit(sentences)
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=False, max_df=1.0, max_features=None, min_df=0, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
vectorizer.vocabulary_
{'John': 0, 'likes': 7, 'ice': 6, 'cream': 4, 'because': 2, 'of': 8, 'and': 1, 'hates': 5, 'chocolate': 3}
vectorizer.transform(sentences).toarray()
array([[2, 1, 1, 0, 3, 0, 3, 2, 1], [2, 1, 1, 3, 0, 1, 0, 1, 1]])
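To make the counting step concrete, here is a minimal standard-library sketch of what a bag-of-words vectorizer does with the two sentences above. It is not scikit-learn's implementation; it only assumes the token pattern printed in the `CountVectorizer` output (words of two or more characters) and a vocabulary sorted in plain string order, which is why the capitalised `John` comes first.

```python
import re
from collections import Counter

sentences = [
    'John likes ice cream because of ice cream and ice cream likes John',
    'John hates chocolate because of chocolate and chocolate likes John.',
]

# Token pattern as printed in the CountVectorizer repr above:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def tokenize(text):
    return TOKEN_PATTERN.findall(text)


# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = {token for sentence in sentences for token in tokenize(sentence)}
vocabulary = {word: index for index, word in enumerate(sorted(all_tokens))}


def count_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]


vectors = [count_vector(sentence) for sentence in sentences]
print(vocabulary)
print(vectors)
```

Each row has one column per vocabulary word, so the two rows can be compared with the array printed by `vectorizer.transform(sentences).toarray()`.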
A simple NLP MODEL learns to tell positive reviews from negative ones. It is then tested and validated on held-out data.
from sklearn.model_selection import train_test_split
df_yelp = df[df['source'] == 'yelp']
sentences = df_yelp['sentence'].values
y = df_yelp['label'].values
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.25, random_state=1000)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
X_train
<750x1714 sparse matrix of type '<class 'numpy.int64'>' with 7368 stored elements in Compressed Sparse Row format>
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning. FutureWarning)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)
score = classifier.score(X_test, y_test)
print(score)
0.796
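The score printed above is an accuracy: the fraction of test sentences whose predicted label matches the true label. The sketch below illustrates the idea in plain Python, assuming a logistic model that predicts class 1 when the sigmoid of a weighted sum exceeds 0.5. The three-word vocabulary, the weights and the tiny data set are hypothetical and chosen only for illustration; they are not taken from the fitted classifier.

```python
import math


def predict_proba(weights, bias, x):
    """Probability of the positive class for one count vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(weights, bias, X, y):
    """Fraction of samples whose thresholded prediction equals the label."""
    predictions = [1 if predict_proba(weights, bias, x) > 0.5 else 0 for x in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Hypothetical weights for a made-up vocabulary ['bad', 'good', 'great'].
weights = [-2.0, 1.5, 2.0]
bias = 0.0

# Made-up count vectors and labels (1 = positive, 0 = negative).
X = [[0, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y = [1, 0, 1, 1]

toy_score = accuracy(weights, bias, X, y)
print(toy_score)
```

The third sample contains both a negative and a positive word and is misclassified by these weights, so three of the four predictions are correct.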
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)
    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test = vectorizer.transform(sentences_test)
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))
Accuracy for yelp data: 0.7960
Accuracy for amazon data: 0.7960
Accuracy for imdb data: 0.7487
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning. FutureWarning) /usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning. FutureWarning) /usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning. FutureWarning)
A more complex NLP MODEL in https://keras.io/
from keras.models import Sequential
Using TensorFlow backend.
from keras import layers
input_dim = X_train.shape[1]
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 10)                25060
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11
=================================================================
Total params: 25,071
Trainable params: 25,071
Non-trainable params: 0
_________________________________________________________________
history = model.fit(X_train, y_train,
                    epochs=100,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
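The parameter counts in this summary can be checked by hand: a dense layer has one weight per input per unit, plus one bias per unit. In the short sketch below, the value `input_dim = 2505` is not stated anywhere in the notebook; it is inferred from the summary (25060 = input_dim * 10 + 10) and should be read as an assumption.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Assumption: inferred from the summary, since 2505 * 10 + 10 = 25060.
input_dim = 2505

hidden_params = dense_params(input_dim, 10)   # first Dense layer, 10 units
output_params = dense_params(10, 1)           # output Dense layer, 1 unit
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```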
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
Overfitting is when the model fits the training data almost perfectly (training accuracy) but cannot apply what it has learned to unseen data (testing accuracy).
print("Training Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 1.0000
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
Testing Accuracy: 0.7861
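The two numbers above can be turned into a simple overfitting indicator: the gap between training and testing accuracy. A related check is to find the epoch with the lowest validation loss in the training history. The sketch below shows both in plain Python; the `hypothetical_val_loss` list is made up for illustration and is not output from this notebook.

```python
def generalization_gap(train_accuracy, test_accuracy):
    """How much better the model does on data it has already seen."""
    return train_accuracy - test_accuracy


def best_epoch(val_loss):
    """1-based epoch at which the validation loss was lowest."""
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1


# Accuracies reported above for the dense model.
gap = generalization_gap(1.0000, 0.7861)
print('Gap: {:.4f}'.format(gap))

# Hypothetical validation-loss curve: it falls, then rises again
# once the model starts memorising the training data.
hypothetical_val_loss = [0.69, 0.55, 0.48, 0.47, 0.52, 0.61]
print('Best epoch:', best_epoch(hypothetical_val_loss))
```

With a real run, the list would come from `history.history['val_loss']`.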
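There was a KERAS variable-name error here (a KeyError on 'acc' / 'val_acc'; the history keys used below are 'accuracy' / 'val_accuracy'): https://towardsdatascience.com/fixing-the-keyerror-acc-and-keyerror-val-acc-errors-in-keras-2-3-x-or-newer-b29b52609af9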
import matplotlib.pyplot as plt
plt.style.use('ggplot')
def plot_history(history):
    accuracy = history.history['accuracy']
    val_accuracy = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(accuracy) + 1)
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, accuracy, 'b', label='Training accuracy')
    plt.plot(x, val_accuracy, 'r', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
plot_history(history)
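If the same notebook has to run on both older and newer Keras installations, one option is to look the metric up under either key name. The helper below is a hypothetical sketch, not part of Keras; it only assumes that the history is a plain dict of lists, as `history.history` is in the code above.

```python
# Hypothetical helper: maps the newer key names to the older ones.
OLD_KEY_NAMES = {'accuracy': 'acc', 'val_accuracy': 'val_acc'}


def get_metric(history_dict, name):
    """Return the metric values stored under the new or the old key name."""
    if name in history_dict:
        return history_dict[name]
    old_name = OLD_KEY_NAMES.get(name)
    if old_name in history_dict:
        return history_dict[old_name]
    raise KeyError(name)


# Made-up history dicts in the two naming styles.
new_style = {'accuracy': [0.6, 0.8], 'val_accuracy': [0.55, 0.7]}
old_style = {'acc': [0.6, 0.8], 'val_acc': [0.55, 0.7]}

print(get_metric(new_style, 'accuracy'))
print(get_metric(old_style, 'accuracy'))
```

Inside `plot_history`, a call such as `get_metric(history.history, 'accuracy')` would then replace the direct dictionary lookup.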
Changing the vectorization method: words are mapped (transformed) into a geometric space (embedding space). The first step is tokenizing.
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)
X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)
vocab_size = len(tokenizer.word_index) + 1 # Adding 1 because of reserved 0 index
print(sentences_train[2])
print(X_train[2])
I am a fan of his ... This movie sucked really bad.
[7, 150, 2, 932, 4, 49, 6, 11, 563, 45, 30]
for word in ['the', 'all', 'looks', 'good']:
print('{}: {}'.format(word, tokenizer.word_index[word]))
the: 1
all: 27
looks: 431
good: 33
from keras.preprocessing.sequence import pad_sequences
maxlen = 100
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
print(X_train[0, :])
[170 116 390 35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
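The two preprocessing steps above, turning words into integer indices and padding every sequence to the same length, can be sketched in plain Python. This is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their implementation: it splits on whitespace only and ignores punctuation handling and the `num_words` limit. The three example sentences are made up.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1.

    Index 0 is left unused so that it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_post(sequences, maxlen):
    """Pad with zeros at the end; longer sequences keep their last maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded


texts = ['the food was good', 'the service was slow', 'good food']
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = pad_post(sequences, maxlen=6)
print(word_index)
print(padded)
```

Every row of `padded` has the same length, which is what allows the sequences to be stacked into one array for the model.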
Keras embedding
from keras.models import Sequential
from keras import layers
embedding_dim = 50
model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 100, 50)           128750
_________________________________________________________________
flatten_1 (Flatten)          (None, 5000)              0
_________________________________________________________________
dense_3 (Dense)              (None, 10)                50010
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11
=================================================================
Total params: 178,771
Trainable params: 178,771
Non-trainable params: 0
_________________________________________________________________
history = model.fit(X_train, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
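Conceptually, the embedding layer is a lookup table with one row per word index, and the flatten layer concatenates the looked-up rows into one long vector. The sketch below shows this with toy sizes (6 words, 3 dimensions, sequence length 4) instead of the real ones. The random table values merely stand in for weights that Keras would learn during training.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy sizes; the model above uses embedding_dim=50 and maxlen=100.
vocab_size, embedding_dim, maxlen = 6, 3, 4

# One row of embedding_dim numbers per word index (row 0 is the padding index).
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]


def embed(sequence):
    """Look up the table row for every index in the padded sequence."""
    return [table[index] for index in sequence]


def flatten(matrix):
    """Concatenate all rows into a single flat list."""
    return [value for row in matrix for value in row]


sequence = [1, 4, 0, 0]          # a padded sequence of word indices
embedded = embed(sequence)       # shape (maxlen, embedding_dim)
flat = flatten(embedded)         # shape (maxlen * embedding_dim,)

print(len(embedded), len(embedded[0]), len(flat))
```

The same arithmetic explains the summary above: 100 positions times 50 dimensions gives the 5000 flattened values that feed the dense layer.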
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
plot_history(history)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:434: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Training Accuracy: 1.0000
Testing Accuracy: 0.7112
from keras.models import Sequential
from keras import layers
embedding_dim = 50
model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 100, 50)           128750
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 50)                0
_________________________________________________________________
dense_5 (Dense)              (None, 10)                510
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11
=================================================================
Total params: 129,271
Trainable params: 129,271
Non-trainable params: 0
_________________________________________________________________
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
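Global max pooling replaces the flatten step: instead of keeping every position, it keeps only the largest value of each embedding feature across the whole sequence. That is why the output shape shrinks from (100, 50) to (50,) and the following dense layer needs only 50 * 10 + 10 = 510 parameters. A minimal plain-Python sketch with a made-up 3 x 3 input:

```python
def global_max_pool(embedded):
    """Maximum of each feature (column) over all sequence positions (rows)."""
    return [max(column) for column in zip(*embedded)]


# Made-up embedded sequence: 3 positions, 3 features each.
embedded = [
    [0.1, -0.3, 0.5],
    [0.4, 0.2, -0.1],
    [0.0, 0.0, 0.0],
]

pooled = global_max_pool(embedded)
print(pooled)
```

The result has one value per feature, regardless of how long the input sequence is.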
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
plot_history(history)
Training Accuracy: 1.0000
Testing Accuracy: 0.7433
Keras convolutional model
embedding_dim = 50  # matches the (None, 100, 50) embedding shape in the summary below
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_3 (Embedding)      (None, 100, 50)           128750
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 96, 128)           32128
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0
_________________________________________________________________
dense_7 (Dense)              (None, 10)                1290
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 11
=================================================================
Total params: 162,179
Trainable params: 162,179
Non-trainable params: 0
_________________________________________________________________
history = model.fit(X_train, y_train,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
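The Conv1D row of this summary follows from how a one-dimensional convolution slides a window over the sequence. The sketch below is plain Python rather than Keras. It assumes no padding and a stride of 1, which is what the output length of 96 implies, and it uses a single made-up input channel for the sliding-window demonstration.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal and sum the element-wise products."""
    k = len(kernel)
    return [sum(s * w for s, w in zip(signal[i:i + k], kernel))
            for i in range(len(signal) - k + 1)]


def conv1d_output_length(input_length, kernel_size):
    """Number of window positions with no padding and stride 1."""
    return input_length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """One weight per kernel position, input channel and filter, plus biases."""
    return kernel_size * in_channels * filters + filters


# Made-up single-channel example.
windowed = conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1])
print(windowed)

# Values from the summary above: 100 positions, kernel size 5,
# 50 embedding channels, 128 filters.
print(conv1d_output_length(100, 5))
print(conv1d_params(5, 50, 128))
```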
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
plot_history(history)
Training Accuracy: 1.0000
Testing Accuracy: 0.7914
This is already the third model that fails to improve on the results. The reason may be that we have too few training samples, that the data generalizes poorly, or that the hyperparameters need tuning.
def create_model(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[5000],
                  embedding_dim=[50],
                  maxlen=[100])
A model that tunes its own hyperparameters
import time
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
# Main settings
epochs = 20
embedding_dim = 50
maxlen = 100
output_file = 'data/output.txt'
# Run grid search for each source (yelp, amazon, imdb)
for source, frame in df.groupby('source'):
    print('Running grid search for data set :', source)
    # NOTE: `df` (all sources) is used here rather than the per-source `frame`,
    # so every run sees the same combined data; this matches the identical
    # vocab_size of 4603 in the output below.
    sentences = df['sentence'].values
    y = df['label'].values
    # Train-test split
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)
    # Tokenize words
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(sentences_train)
    X_train = tokenizer.texts_to_sequences(sentences_train)
    X_test = tokenizer.texts_to_sequences(sentences_test)
    # Adding 1 because of reserved 0 index
    vocab_size = len(tokenizer.word_index) + 1
    # Pad sequences with zeros
    X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
    X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
    # Parameter grid for grid search
    param_grid = dict(num_filters=[32, 64, 128],
                      kernel_size=[3, 5, 7],
                      vocab_size=[vocab_size],
                      embedding_dim=[embedding_dim],
                      maxlen=[maxlen])
    model = KerasClassifier(build_fn=create_model,
                            epochs=epochs, batch_size=10,
                            verbose=False)
    grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                              cv=4, verbose=1, n_iter=5)
    grid_result = grid.fit(X_train, y_train)
    # Evaluate testing set
    test_accuracy = grid.score(X_test, y_test)
    # Save and evaluate results
    # prompt = input(f'finished {source}; write to file and proceed? [y/n]')
    # if prompt.lower() not in {'y', 'true', 'yes'}:
    #     break
    # with open(output_file, 'a') as f:
    s = ('Running {} data set\nBest Accuracy : '
         '{:.4f}\n{}\nTest Accuracy : {:.4f}\n\n')
    output_string = s.format(
        source,
        grid_result.best_score_,
        grid_result.best_params_,
        test_accuracy)
    print(output_string)
    time.sleep(5.5)
    # f.write(output_string)
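The size of the search can be worked out without training anything. Only `num_filters` and `kernel_size` have more than one value, so the full grid has 3 x 3 = 9 combinations. A randomized search with `n_iter=5` tries five of them, and with `cv=4` each candidate is fitted four times. The sketch below reproduces that arithmetic with the standard library. It draws the candidates with Python's `random` module under a fixed seed, which is an illustrative stand-in and not scikit-learn's own sampler.

```python
import itertools
import random

# Only the two parameters that actually vary in the grid above.
param_grid = dict(num_filters=[32, 64, 128], kernel_size=[3, 5, 7])

keys = sorted(param_grid)
all_candidates = [dict(zip(keys, values))
                  for values in itertools.product(*(param_grid[k] for k in keys))]

n_iter, cv = 5, 4
rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
sampled = rng.sample(all_candidates, n_iter)  # distinct candidates
total_fits = len(sampled) * cv

print(len(all_candidates), 'possible combinations')
print(total_fits, 'model fits for', n_iter, 'candidates with', cv, 'folds')
```

The 20 fits computed here correspond to the "Fitting 4 folds for each of 5 candidates, totalling 20 fits" line in the search output.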
Running grid search for data set : amazon
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 7.7min finished
Running amazon data set
Best Accuracy : 0.8151
{'vocab_size': 4603, 'num_filters': 128, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50}
Test Accuracy : 0.8384

Running grid search for data set : imdb
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 7.1min finished
Running imdb data set
Best Accuracy : 0.8185
{'vocab_size': 4603, 'num_filters': 32, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50}
Test Accuracy : 0.8443

Running grid search for data set : yelp
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 7.2min finished
Running yelp data set
Best Accuracy : 0.8161
{'vocab_size': 4603, 'num_filters': 64, 'maxlen': 100, 'kernel_size': 5, 'embedding_dim': 50}
Test Accuracy : 0.8282
Improved model
After the hyperparameter search, the three runs reach the following test accuracies: 0.8384 (amazon run, num_filters 128, kernel_size 3), 0.8443 (imdb run, num_filters 32, kernel_size 3) and 0.8282 (yelp run, num_filters 64, kernel_size 5).