Sedíme v https://schizyfos.files.wordpress.com/2020/05/img_5045.jpg

a prezentujeme si technickú prezentáciu, príklad prevzatý z https://realpython.com/python-keras-text-classification/

surovinou dátovej vedy je DATASET, stiahnutý odtiaľto: https://realpython.com/python-keras-text-classification/ https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences Je to DATASET recenzii filmov. Uložíme ho do premennej typu pole a zobrazíme si ho.

In [ ]:

import pandas as pd

filepath_dict = {'yelp':   '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/yelp_labelled.txt',
                 'amazon': '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/amazon_cells_labelled.txt',
                 'imdb':   '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

In [59]:

print(df_list)

[ sentence label source
0 Wow... Loved this place. 1 yelp
1 Crust is not good. 0 yelp
2 Not tasty and the texture was just nasty. 0 yelp
3 Stopped by during the late May bank holiday of... 1 yelp
4 The selection on the menu was great and so wer... 1 yelp
5 Now I am getting angry and I want my damn pho. 0 yelp
6 Honeslty it didn't taste THAT fresh.) 0 yelp
7 The potatoes were like rubber and you could te... 0 yelp
8 The fries were great too. 1 yelp
9 A great touch. 1 yelp
10 Service was very prompt. 1 yelp
11 Would not go back. 0 yelp
12 The cashier had no care what so ever on what I... 0 yelp
13 I tried the Cape Cod ravoli, chicken,with cran... 1 yelp
14 I was disgusted because I was pretty sure that... 0 yelp
15 I was shocked because no signs indicate cash o... 0 yelp
16 Highly recommended. 1 yelp
17 Waitress was a little slow in service. 0 yelp
18 This place is not worth your time, let alone V... 0 yelp
19 did not like at all. 0 yelp
20 The Burrittos Blah! 0 yelp
21 The food, amazing. 1 yelp
22 Service is also cute. 1 yelp
23 I could care less... The interior is just beau... 1 yelp
24 So they performed. 1 yelp
25 That's right....the red velvet cake.....ohhh t... 1 yelp
26 - They never brought a salad we asked for. 0 yelp
27 This hole in the wall has great Mexican street... 1 yelp
28 Took an hour to get our food only 4 tables in ... 0 yelp
29 The worst was the salmon sashimi. 0 yelp
.. ... ... ...
970 I immediately said I wanted to talk to the man... 0 yelp
971 The ambiance isn't much better. 0 yelp
972 Unfortunately, it only set us up for disapppoi... 0 yelp
973 The food wasn't good. 0 yelp
974 Your servers suck, wait, correction, our serve... 0 yelp
975 What happened next was pretty....off putting. 0 yelp
976 too bad cause I know it's family owned, I real... 0 yelp
977 Overpriced for what you are getting. 0 yelp
978 I vomited in the bathroom mid lunch. 0 yelp
979 I kept looking at the time and it had soon bec... 0 yelp
980 I have been to very few places to eat that und... 0 yelp
981 We started with the tuna sashimi which was bro... 0 yelp
982 Food was below average. 0 yelp
983 It sure does beat the nachos at the movies but... 0 yelp
984 All in all, Ha Long Bay was a bit of a flop. 0 yelp
985 The problem I have is that they charge $11.99 ... 0 yelp
986 Shrimp- When I unwrapped it (I live only 1/2 a... 0 yelp
987 It lacked flavor, seemed undercooked, and dry. 0 yelp
988 It really is impressive that the place hasn't ... 0 yelp
989 I would avoid this place if you are staying in... 0 yelp
990 The refried beans that came with my meal were ... 0 yelp
991 Spend your money and time some place else. 0 yelp
992 A lady at the table next to us found a live gr... 0 yelp
993 the presentation of the food was awful. 0 yelp
994 I can't tell you how disappointed I was. 0 yelp
995 I think food should have flavor and texture an... 0 yelp
996 Appetite instantly gone. 0 yelp
997 Overall I was not impressed and would not go b... 0 yelp
998 The whole experience was underwhelming, and I ... 0 yelp
999 Then, as if I hadn't wasted enough of my life ... 0 yelp

[1000 rows x 3 columns], sentence label source
0 So there is no way for me to plug it in here i... 0 amazon
1 Good case, Excellent value. 1 amazon
2 Great for the jawbone. 1 amazon
3 Tied to charger for conversations lasting more... 0 amazon
4 The mic is great. 1 amazon
5 I have to jiggle the plug to get it to line up... 0 amazon
6 If you have several dozen or several hundred c... 0 amazon
7 If you are Razr owner...you must have this! 1 amazon
8 Needless to say, I wasted my money. 0 amazon
9 What a waste of money and time!. 0 amazon
10 And the sound quality is great. 1 amazon
11 He was very impressed when going from the orig... 1 amazon
12 If the two were seperated by a mere 5+ ft I st... 0 amazon
13 Very good quality though 1 amazon
14 The design is very odd, as the ear "clip" is n... 0 amazon
15 Highly recommend for any one who has a blue to... 1 amazon
16 I advise EVERYONE DO NOT BE FOOLED! 0 amazon
17 So Far So Good!. 1 amazon
18 Works great!. 1 amazon
19 It clicks into place in a way that makes you w... 0 amazon
20 I went on Motorola's website and followed all ... 0 amazon
21 I bought this to use with my Kindle Fire and a... 1 amazon
22 The commercials are the most misleading. 0 amazon
23 I have yet to run this new battery below two b... 1 amazon
24 I bought it for my mother and she had a proble... 0 amazon
25 Great Pocket PC / phone combination. 1 amazon
26 I've owned this phone for 7 months now and can... 1 amazon
27 I didn't think that the instructions provided ... 0 amazon
28 People couldnt hear me talk and I had to pull ... 0 amazon
29 Doesn't hold charge. 0 amazon
.. ... ... ...
970 I plugged it in only to find out not a darn th... 0 amazon
971 Excellent product. 1 amazon
972 Earbud piece breaks easily. 0 amazon
973 Lousy product. 0 amazon
974 This phone tries very hard to do everything bu... 0 amazon
975 It is the best charger I have seen on the mark... 1 amazon
976 SWEETEST PHONE!!! 1 amazon
977 :-)Oh, the charger seems to work fine. 1 amazon
978 It fits so securely that the ear hook does not... 1 amazon
979 Not enough volume. 0 amazon
980 Echo Problem....Very unsatisfactory 0 amazon
981 you could only take 2 videos at a time and the... 0 amazon
982 don't waste your money. 0 amazon
983 I am going to have to be the first to negative... 0 amazon
984 Adapter does not provide enough charging current. 0 amazon
985 There was so much hype over this phone that I ... 0 amazon
986 You also cannot take pictures with it in the c... 0 amazon
987 Phone falls out easily. 0 amazon
988 It didn't work, people can not hear me when I ... 0 amazon
989 The text messaging feature is really tricky to... 0 amazon
990 I'm really disappointed all I have now is a ch... 0 amazon
991 Painful on the ear. 0 amazon
992 Lasted one day and then blew up. 0 amazon
993 disappointed. 0 amazon
994 Kind of flops around. 0 amazon
995 The screen does get smudged easily because it ... 0 amazon
996 What a piece of junk.. I lose more calls on th... 0 amazon
997 Item Does Not Match Picture. 0 amazon
998 The only thing that disappoint me is the infra... 0 amazon
999 You can not answer calls with the unit, never ... 0 amazon

[1000 rows x 3 columns], sentence label source
0 A very, very, very slow-moving, aimless movie ... 0 imdb
1 Not sure who was more lost - the flat characte... 0 imdb
2 Attempting artiness with black & white and cle... 0 imdb
3 Very little music or anything to speak of. 0 imdb
4 The best scene in the movie was when Gerardo i... 1 imdb
5 The rest of the movie lacks art, charm, meanin... 0 imdb
6 Wasted two hours. 0 imdb
7 Saw the movie today and thought it was a good ... 1 imdb
8 A bit predictable. 0 imdb
9 Loved the casting of Jimmy Buffet as the scien... 1 imdb
10 And those baby owls were adorable. 1 imdb
11 The movie showed a lot of Florida at it's best... 1 imdb
12 The Songs Were The Best And The Muppets Were S... 1 imdb
13 It Was So Cool. 1 imdb
14 This is a very "right on case" movie that deli... 1 imdb
15 It had some average acting from the main perso... 0 imdb
16 This review is long overdue, since I consider ... 1 imdb
17 I'll put this gem up against any movie in term... 1 imdb
18 It's practically perfect in all of them a tr... 1 imdb
19 The structure of this film is easily the most... 0 imdb
20 This if the first movie I've given a 10 to in ... 1 imdb
21 If there was ever a movie that needed word-of-... 1 imdb
22 Overall, the film is interesting and thought-p... 1 imdb
23 Plus, it was well-paced and suited its relativ... 1 imdb
24 Give this one a look. 1 imdb
25 I gave it a 10 1 imdb
26 The Wind and the Lion is well written and supe... 1 imdb
27 It is a true classic. 1 imdb
28 It actually turned out to be pretty decent as ... 1 imdb
29 Definitely worth checking out. 1 imdb
.. ... ... ...
718 Enough can not be said of the remarkable anima... 1 imdb
719 The art style has the appearance of crayon/pen... 1 imdb
720 If you act in such a film, you should be glad ... 0 imdb
721 This one wants to surf on the small wave of sp... 0 imdb
722 If you haven't choked in your own vomit by the... 0 imdb
723 Still, it makes up for all of this with a supe... 1 imdb
724 Just consider the excellent story, solid actin... 1 imdb
725 Instead, we got a bore fest about a whiny, spo... 0 imdb
726 Then I watched it again two Sundays ago (March... 1 imdb
727 It is a very well acted and done TV Movie. 1 imdb
728 Judith Light is one of my favorite actresses a... 1 imdb
729 I keep watching it over and over. 1 imdb
730 It's a sad movie, but very good. 1 imdb
731 If you have not seen this movie, I definitely ... 1 imdb
732 She is as lovely as usual, this cutie! 1 imdb
733 Still it's quite interesting and entertaining ... 1 imdb
734 ;) Recommend with confidence! 1 imdb
735 This movie is well-balanced with comedy and dr... 1 imdb
736 It was a riot to see Hugo Weaving play a sex-o... 1 imdb
737 :) Anyway, the plot flowed smoothly and the ma... 1 imdb
738 The opening sequence of this gem is a classic,... 1 imdb
739 Fans of the genre will be in heaven. 1 imdb
740 Lange had become a great actress. 1 imdb
741 It looked like a wonderful story. 1 imdb
742 I never walked out of a movie faster. 0 imdb
743 I just got bored watching Jessice Lange take h... 0 imdb
744 Unfortunately, any virtue in this film's produ... 0 imdb
745 In a word, it is embarrassing. 0 imdb
746 Exceptionally bad! 0 imdb
747 All in all its an insult to one's intelligence... 0 imdb

[748 rows x 3 columns]]

Na modelových vetách pochopíme, čo je to vektorizácia textu

In [4]:

from sklearn.feature_extraction.text import CountVectorizer

In [15]:

sentences = ['John likes ice cream because of ice cream and ice cream likes John', 'John hates chocolate because of chocolate and chocolate likes John.']

In [12]:

vectorizer.fit(sentences)

Out[12]:

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [7]:

vectorizer = CountVectorizer(min_df=0, lowercase=False)

In [16]:

vectorizer.fit(sentences)

Out[16]:

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [17]:

vectorizer.vocabulary_

Out[17]:

{'John': 0,
 'likes': 7,
 'ice': 6,
 'cream': 4,
 'because': 2,
 'of': 8,
 'and': 1,
 'hates': 5,
 'chocolate': 3}

In [18]:

vectorizer.transform(sentences).toarray()

Out[18]:

array([[2, 1, 1, 0, 3, 0, 3, 2, 1],
       [2, 1, 1, 3, 0, 1, 0, 1, 1]])

Jednoduchý NLP MODEL sa učí rozlišovať recenzie na pozitívne a negatívne. Potom sa samotestuje a samovaliduje.

In [19]:

from sklearn.model_selection import train_test_split

In [20]:

df_yelp = df[df['source'] == 'yelp']

In [21]:

sentences = df_yelp['sentence'].values

In [22]:

y = df_yelp['label'].values

In [25]:

sentences_train, sentences_test, y_train, y_test = train_test_split(
...    sentences, y, test_size=0.25, random_state=1000)

In [26]:

from sklearn.feature_extraction.text import CountVectorizer

In [27]:

vectorizer = CountVectorizer()

In [30]:

vectorizer.fit(sentences_train)

Out[30]:

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [31]:

X_train = vectorizer.transform(sentences_train)

In [32]:

X_test  = vectorizer.transform(sentences_test)

In [33]:

X_train

Out[33]:

<750x1714 sparse matrix of type '<class 'numpy.int64'>'
	with 7368 stored elements in Compressed Sparse Row format>

In [36]:

from sklearn.linear_model import LogisticRegression

In [37]:

classifier = LogisticRegression()

In [38]:

classifier.fit(X_train, y_train)

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Out[38]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [39]:

score = classifier.score(X_test, y_test)

In [40]:

print(score)

0.796

In [41]:

for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values

    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test  = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))

Accuracy for yelp data: 0.7960
Accuracy for amazon data: 0.7960
Accuracy for imdb data: 0.7487

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Zložitejší NLP MODEL v https://keras.io/

In [42]:

from keras.models import Sequential

Using TensorFlow backend.

In [43]:

from keras import layers

In [44]:

input_dim = X_train.shape[1]

In [45]:

model = Sequential()

In [46]:

model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))

In [47]:

model.add(layers.Dense(1, activation='sigmoid'))

In [48]:

model.compile(loss='binary_crossentropy', 
...               optimizer='adam', 
...               metrics=['accuracy'])

In [49]:

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 10)                25060     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
=================================================================
Total params: 25,071
Trainable params: 25,071
Non-trainable params: 0
_________________________________________________________________

In [51]:

history = model.fit(X_train, y_train,
...                     epochs=100,
...                     verbose=False,
...                     validation_data=(X_test, y_test),
...                     batch_size=10)

In [52]:

loss, accuracy = model.evaluate(X_train, y_train, verbose=False)

Overfitting je keď sa učiaci model učí úplne presne (training accuracy), ale nevie to dobre použiť (testing accuracy)

In [53]:

print("Training Accuracy: {:.4f}".format(accuracy))

Training Accuracy: 1.0000

In [54]:

loss, accuracy = model.evaluate(X_test, y_test, verbose=False)

In [55]:

print("Testing Accuracy:  {:.4f}".format(accuracy))

Testing Accuracy:  0.7861

Tu bola KERAS chyba premenných https://towardsdatascience.com/fixing-the-keyerror-acc-and-keyerror-val-acc-errors-in-keras-2-3-x-or-newer-b29b52609af9

In [68]:

import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    accuracy = history.history['accuracy']
    val_accuracy = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(accuracy) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, accuracy, 'b', label='Training accuracy')
    plt.plot(x, val_accuracy, 'r', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

In [69]:

plot_history(history)

Zmena vektorizačnej metódy mapovaním (transformáciou) do geometrického priestoru (embedding space)(tokenizing)

In [72]:

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)
X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
print(sentences_train[2])
print(X_train[2])

I am a fan of his ... This movie sucked really bad.  
[7, 150, 2, 932, 4, 49, 6, 11, 563, 45, 30]

In [85]:

for word in ['the', 'all', 'looks', 'good']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))

the: 1
all: 27
looks: 431
good: 33

In [86]:

from keras.preprocessing.sequence import pad_sequences

In [87]:

maxlen = 100

In [88]:

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)

In [89]:

X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

In [90]:

print(X_train[0, :])

[170 116 390  35   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]

Keras embeddovanie

In [93]:

from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
flatten_1 (Flatten)          (None, 5000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                50010     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
=================================================================
Total params: 178,771
Trainable params: 178,771
Non-trainable params: 0
_________________________________________________________________

In [94]:

history = model.fit(X_train, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:434: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Training Accuracy: 1.0000
Testing Accuracy:  0.7112

In [95]:

from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 50)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                510       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11        
=================================================================
Total params: 129,271
Trainable params: 129,271
Non-trainable params: 0
_________________________________________________________________

In [96]:

history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

Training Accuracy: 1.0000
Testing Accuracy:  0.7433

Keras konvolučný model

In [97]:

mbedding_dim = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 96, 128)           32128     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 10)                1290      
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 11        
=================================================================
Total params: 162,179
Trainable params: 162,179
Non-trainable params: 0
_________________________________________________________________

In [98]:

history = model.fit(X_train, y_train,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

Training Accuracy: 1.0000
Testing Accuracy:  0.7914

Už tretí model stále nevie zlepšiť svoje učenie. Môže to byť preto, lebo máme málo učiacich vzoriek, alebo data majú malý zovšeobecňovací potenciál, alebo potrebujeme vylaďovať hyperparametre

In [99]:

def create_model(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

In [100]:

param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[5000], 
                  embedding_dim=[50],
                  maxlen=[100])

Model, ktorý si vylaďuje hyperparametre

In [109]:

import time
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV

# Main settings
epochs = 20
embedding_dim = 50
maxlen = 100
output_file = 'data/output.txt'

# Run grid search for each source (yelp, amazon, imdb)
for source, frame in df.groupby('source'):
    print('Running grid search for data set :', source)
    sentences = df['sentence'].values
    y = df['label'].values

    # Train-test split
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    # Tokenize words
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(sentences_train)
    X_train = tokenizer.texts_to_sequences(sentences_train)
    X_test = tokenizer.texts_to_sequences(sentences_test)

    # Adding 1 because of reserved 0 index
    vocab_size = len(tokenizer.word_index) + 1

    # Pad sequences with zeros
    X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
    X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

    # Parameter grid for grid search
    param_grid = dict(num_filters=[32, 64, 128],
                      kernel_size=[3, 5, 7],
                      vocab_size=[vocab_size],
                      embedding_dim=[embedding_dim],
                      maxlen=[maxlen])
    model = KerasClassifier(build_fn=create_model,
                            epochs=epochs, batch_size=10,
                            verbose=False)
    grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                              cv=4, verbose=1, n_iter=5)
    grid_result = grid.fit(X_train, y_train)

    # Evaluate testing set
    test_accuracy = grid.score(X_test, y_test)

    # Save and evaluate results
 #   prompt = input(f'finished {source}; write to file and proceed? [y/n]')
 #   if prompt.lower() not in {'y', 'true', 'yes'}:
 #       break
 #   with open(output_file, 'a') as f:
    s = ('Running {} data set\nBest Accuracy : '
             '{:.4f}\n{}\nTest Accuracy : {:.4f}\n\n')
    output_string = s.format(
            source,
            grid_result.best_score_,
            grid_result.best_params_,
            test_accuracy)
    print(output_string)
    time.sleep(5.5)
#        f.write(output_string)

Running grid search for data set : amazon
Fitting 4 folds for each of 5 candidates, totalling 20 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  7.7min finished

Running amazon data set
Best Accuracy : 0.8151
{'vocab_size': 4603, 'num_filters': 128, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50}
Test Accuracy : 0.8384


Running grid search for data set : imdb
Fitting 4 folds for each of 5 candidates, totalling 20 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  7.1min finished

Running imdb data set
Best Accuracy : 0.8185
{'vocab_size': 4603, 'num_filters': 32, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50}
Test Accuracy : 0.8443


Running grid search for data set : yelp
Fitting 4 folds for each of 5 candidates, totalling 20 fits

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  7.2min finished

Running yelp data set
Best Accuracy : 0.8161
{'vocab_size': 4603, 'num_filters': 64, 'maxlen': 100, 'kernel_size': 5, 'embedding_dim': 50}
Test Accuracy : 0.8282

Zlepšený model

Running grid search for data set : amazon Fitting 4 folds for each of 5 candidates, totalling 20 fits [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 7.7min finished Running amazon data set Best Accuracy : 0.8151 {'vocab_size': 4603, 'num_filters': 128, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50} Test Accuracy : 0.8384

Running grid search for data set : imdb Fitting 4 folds for each of 5 candidates, totalling 20 fits [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 7.1min finished Running imdb data set Best Accuracy : 0.8185 {'vocab_size': 4603, 'num_filters': 32, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50} Test Accuracy : 0.8443

Running grid search for data set : yelp Fitting 4 folds for each of 5 candidates, totalling 20 fits [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 7.2min finished Running yelp data set Best Accuracy : 0.8161 {'vocab_size': 4603, 'num_filters': 64, 'maxlen': 100, 'kernel_size': 5, 'embedding_dim': 50} Test Accuracy : 0.8282

Koniec https://www.youtube.com/watch?v=NUbYXitLcBM&list=PLyogAS9FyNeQM1u2X4Dlop-bPoxqwe-Aj&index=6 https://www.youtube.com/watch?v=wJIGOuojvf8