We are sitting in https://schizyfos.files.wordpress.com/2020/05/img_5045.jpg

and walking through a technical presentation; the example is taken from https://realpython.com/python-keras-text-classification/

The raw material of data science is a DATASET, downloaded from here: https://realpython.com/python-keras-text-classification/ https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences It is a DATASET of labelled review sentences from three sources: Yelp, Amazon and IMDb. We load each file into a pandas DataFrame, collect the frames in a list, and display them.

In [ ]:
import pandas as pd

filepath_dict = {'yelp':   '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/yelp_labelled.txt',
                 'amazon': '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/amazon_cells_labelled.txt',
                 'imdb':   '~/sharedfolder/210219/Strom/Private/Ivan/Python/datasets/sentiment labelled sentences/sentiment labelled sentences/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])
In [59]:
print(df_list)
[                                              sentence  label source
0                             Wow... Loved this place.      1   yelp
1                                   Crust is not good.      0   yelp
2            Not tasty and the texture was just nasty.      0   yelp
3    Stopped by during the late May bank holiday of...      1   yelp
4    The selection on the menu was great and so wer...      1   yelp
5       Now I am getting angry and I want my damn pho.      0   yelp
6                Honeslty it didn't taste THAT fresh.)      0   yelp
7    The potatoes were like rubber and you could te...      0   yelp
8                            The fries were great too.      1   yelp
9                                       A great touch.      1   yelp
10                            Service was very prompt.      1   yelp
11                                  Would not go back.      0   yelp
12   The cashier had no care what so ever on what I...      0   yelp
13   I tried the Cape Cod ravoli, chicken,with cran...      1   yelp
14   I was disgusted because I was pretty sure that...      0   yelp
15   I was shocked because no signs indicate cash o...      0   yelp
16                                 Highly recommended.      1   yelp
17              Waitress was a little slow in service.      0   yelp
18   This place is not worth your time, let alone V...      0   yelp
19                                did not like at all.      0   yelp
20                                 The Burrittos Blah!      0   yelp
21                                  The food, amazing.      1   yelp
22                               Service is also cute.      1   yelp
23   I could care less... The interior is just beau...      1   yelp
24                                  So they performed.      1   yelp
25   That's right....the red velvet cake.....ohhh t...      1   yelp
26          - They never brought a salad we asked for.      0   yelp
27   This hole in the wall has great Mexican street...      1   yelp
28   Took an hour to get our food only 4 tables in ...      0   yelp
29                   The worst was the salmon sashimi.      0   yelp
..                                                 ...    ...    ...
970  I immediately said I wanted to talk to the man...      0   yelp
971                    The ambiance isn't much better.      0   yelp
972  Unfortunately, it only set us up for disapppoi...      0   yelp
973                              The food wasn't good.      0   yelp
974  Your servers suck, wait, correction, our serve...      0   yelp
975      What happened next was pretty....off putting.      0   yelp
976  too bad cause I know it's family owned, I real...      0   yelp
977               Overpriced for what you are getting.      0   yelp
978               I vomited in the bathroom mid lunch.      0   yelp
979  I kept looking at the time and it had soon bec...      0   yelp
980  I have been to very few places to eat that und...      0   yelp
981  We started with the tuna sashimi which was bro...      0   yelp
982                            Food was below average.      0   yelp
983  It sure does beat the nachos at the movies but...      0   yelp
984       All in all, Ha Long Bay was a bit of a flop.      0   yelp
985  The problem I have is that they charge $11.99 ...      0   yelp
986  Shrimp- When I unwrapped it (I live only 1/2 a...      0   yelp
987     It lacked flavor, seemed undercooked, and dry.      0   yelp
988  It really is impressive that the place hasn't ...      0   yelp
989  I would avoid this place if you are staying in...      0   yelp
990  The refried beans that came with my meal were ...      0   yelp
991         Spend your money and time some place else.      0   yelp
992  A lady at the table next to us found a live gr...      0   yelp
993            the presentation of the food was awful.      0   yelp
994           I can't tell you how disappointed I was.      0   yelp
995  I think food should have flavor and texture an...      0   yelp
996                           Appetite instantly gone.      0   yelp
997  Overall I was not impressed and would not go b...      0   yelp
998  The whole experience was underwhelming, and I ...      0   yelp
999  Then, as if I hadn't wasted enough of my life ...      0   yelp

[1000 rows x 3 columns],                                               sentence  label  source
0    So there is no way for me to plug it in here i...      0  amazon
1                          Good case, Excellent value.      1  amazon
2                               Great for the jawbone.      1  amazon
3    Tied to charger for conversations lasting more...      0  amazon
4                                    The mic is great.      1  amazon
5    I have to jiggle the plug to get it to line up...      0  amazon
6    If you have several dozen or several hundred c...      0  amazon
7          If you are Razr owner...you must have this!      1  amazon
8                  Needless to say, I wasted my money.      0  amazon
9                     What a waste of money and time!.      0  amazon
10                     And the sound quality is great.      1  amazon
11   He was very impressed when going from the orig...      1  amazon
12   If the two were seperated by a mere 5+ ft I st...      0  amazon
13                            Very good quality though      1  amazon
14   The design is very odd, as the ear "clip" is n...      0  amazon
15   Highly recommend for any one who has a blue to...      1  amazon
16                 I advise EVERYONE DO NOT BE FOOLED!      0  amazon
17                                    So Far So Good!.      1  amazon
18                                       Works great!.      1  amazon
19   It clicks into place in a way that makes you w...      0  amazon
20   I went on Motorola's website and followed all ...      0  amazon
21   I bought this to use with my Kindle Fire and a...      1  amazon
22            The commercials are the most misleading.      0  amazon
23   I have yet to run this new battery below two b...      1  amazon
24   I bought it for my mother and she had a proble...      0  amazon
25                Great Pocket PC / phone combination.      1  amazon
26   I've owned this phone for 7 months now and can...      1  amazon
27   I didn't think that the instructions provided ...      0  amazon
28   People couldnt hear me talk and I had to pull ...      0  amazon
29                                Doesn't hold charge.      0  amazon
..                                                 ...    ...     ...
970  I plugged it in only to find out not a darn th...      0  amazon
971                                 Excellent product.      1  amazon
972                        Earbud piece breaks easily.      0  amazon
973                                     Lousy product.      0  amazon
974  This phone tries very hard to do everything bu...      0  amazon
975  It is the best charger I have seen on the mark...      1  amazon
976                                  SWEETEST PHONE!!!      1  amazon
977             :-)Oh, the charger seems to work fine.      1  amazon
978  It fits so securely that the ear hook does not...      1  amazon
979                                 Not enough volume.      0  amazon
980                Echo Problem....Very unsatisfactory      0  amazon
981  you could only take 2 videos at a time and the...      0  amazon
982                            don't waste your money.      0  amazon
983  I am going to have to be the first to negative...      0  amazon
984  Adapter does not provide enough charging current.      0  amazon
985  There was so much hype over this phone that I ...      0  amazon
986  You also cannot take pictures with it in the c...      0  amazon
987                            Phone falls out easily.      0  amazon
988  It didn't work, people can not hear me when I ...      0  amazon
989  The text messaging feature is really tricky to...      0  amazon
990  I'm really disappointed all I have now is a ch...      0  amazon
991                                Painful on the ear.      0  amazon
992                   Lasted one day and then blew up.      0  amazon
993                                      disappointed.      0  amazon
994                              Kind of flops around.      0  amazon
995  The screen does get smudged easily because it ...      0  amazon
996  What a piece of junk.. I lose more calls on th...      0  amazon
997                       Item Does Not Match Picture.      0  amazon
998  The only thing that disappoint me is the infra...      0  amazon
999  You can not answer calls with the unit, never ...      0  amazon

[1000 rows x 3 columns],                                               sentence  label source
0    A very, very, very slow-moving, aimless movie ...      0   imdb
1    Not sure who was more lost - the flat characte...      0   imdb
2    Attempting artiness with black & white and cle...      0   imdb
3         Very little music or anything to speak of.        0   imdb
4    The best scene in the movie was when Gerardo i...      1   imdb
5    The rest of the movie lacks art, charm, meanin...      0   imdb
6                                  Wasted two hours.        0   imdb
7    Saw the movie today and thought it was a good ...      1   imdb
8                                 A bit predictable.        0   imdb
9    Loved the casting of Jimmy Buffet as the scien...      1   imdb
10                And those baby owls were adorable.        1   imdb
11   The movie showed a lot of Florida at it's best...      1   imdb
12   The Songs Were The Best And The Muppets Were S...      1   imdb
13                                   It Was So Cool.        1   imdb
14   This is a very "right on case" movie that deli...      1   imdb
15   It had some average acting from the main perso...      0   imdb
16   This review is long overdue, since I consider ...      1   imdb
17   I'll put this gem up against any movie in term...      1   imdb
18   It's practically perfect in all of them – a tr...      1   imdb
19    The structure of this film is easily the most...      0   imdb
20   This if the first movie I've given a 10 to in ...      1   imdb
21   If there was ever a movie that needed word-of-...      1   imdb
22   Overall, the film is interesting and thought-p...      1   imdb
23   Plus, it was well-paced and suited its relativ...      1   imdb
24                             Give this one a look.        1   imdb
25                                    I gave it a 10        1   imdb
26   The Wind and the Lion is well written and supe...      1   imdb
27                             It is a true classic.        1   imdb
28   It actually turned out to be pretty decent as ...      1   imdb
29                    Definitely worth checking out.        1   imdb
..                                                 ...    ...    ...
718  Enough can not be said of the remarkable anima...      1   imdb
719  The art style has the appearance of crayon/pen...      1   imdb
720  If you act in such a film, you should be glad ...      0   imdb
721  This one wants to surf on the small wave of sp...      0   imdb
722  If you haven't choked in your own vomit by the...      0   imdb
723  Still, it makes up for all of this with a supe...      1   imdb
724  Just consider the excellent story, solid actin...      1   imdb
725  Instead, we got a bore fest about a whiny, spo...      0   imdb
726  Then I watched it again two Sundays ago (March...      1   imdb
727       It is a very well acted and done TV Movie.        1   imdb
728  Judith Light is one of my favorite actresses a...      1   imdb
729                I keep watching it over and over.        1   imdb
730                 It's a sad movie, but very good.        1   imdb
731  If you have not seen this movie, I definitely ...      1   imdb
732           She is as lovely as usual, this cutie!        1   imdb
733  Still it's quite interesting and entertaining ...      1   imdb
734                    ;) Recommend with confidence!        1   imdb
735  This movie is well-balanced with comedy and dr...      1   imdb
736  It was a riot to see Hugo Weaving play a sex-o...      1   imdb
737  :) Anyway, the plot flowed smoothly and the ma...      1   imdb
738  The opening sequence of this gem is a classic,...      1   imdb
739             Fans of the genre will be in heaven.        1   imdb
740                Lange had become a great actress.        1   imdb
741                It looked like a wonderful story.        1   imdb
742            I never walked out of a movie faster.        0   imdb
743  I just got bored watching Jessice Lange take h...      0   imdb
744  Unfortunately, any virtue in this film's produ...      0   imdb
745                   In a word, it is embarrassing.        0   imdb
746                               Exceptionally bad!        0   imdb
747  All in all its an insult to one's intelligence...      0   imdb

[748 rows x 3 columns]]

Using two model sentences, we will see what text vectorization is

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
In [15]:
sentences = ['John likes ice cream because of ice cream and ice cream likes John', 'John hates chocolate because of chocolate and chocolate likes John.']
In [7]:
vectorizer = CountVectorizer(min_df=0, lowercase=False)
In [16]:
vectorizer.fit(sentences)
Out[16]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
In [17]:
vectorizer.vocabulary_
Out[17]:
{'John': 0,
 'likes': 7,
 'ice': 6,
 'cream': 4,
 'because': 2,
 'of': 8,
 'and': 1,
 'hates': 5,
 'chocolate': 3}
In [18]:
vectorizer.transform(sentences).toarray()
Out[18]:
array([[2, 1, 1, 0, 3, 0, 3, 2, 1],
       [2, 1, 1, 3, 0, 1, 0, 1, 1]])
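
The counts above can be reproduced by hand, which may make it clearer what the vectorizer does. The following is a minimal sketch in plain Python, not scikit-learn's implementation: the helper names `tokenize` and `count_vector` are made up for illustration, and the token pattern is copied from the `token_pattern` shown in the output above.

```python
import re
from collections import Counter

sentences = [
    'John likes ice cream because of ice cream and ice cream likes John',
    'John hates chocolate because of chocolate and chocolate likes John.',
]

# Token pattern as reported by CountVectorizer above: words of 2+ characters.
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')


def tokenize(text):
    # Case is kept, which corresponds to lowercase=False in the notebook.
    return TOKEN_PATTERN.findall(text)


# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = {token for s in sentences for token in tokenize(s)}
vocabulary = {token: index for index, token in enumerate(sorted(all_tokens))}


def count_vector(text):
    # One row of the matrix: how often each vocabulary word occurs in text.
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for token, n in counts.items():
        row[vocabulary[token]] = n
    return row


matrix = [count_vector(s) for s in sentences]
print(vocabulary)
print(matrix)
```

Each column stands for one vocabulary word and each row for one sentence, so word order is lost and only frequencies remain (a bag-of-words representation).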

A simple NLP MODEL learns to classify reviews as positive or negative. It is then tested on sentences it has not seen during training.

In [19]:
from sklearn.model_selection import train_test_split
In [20]:
df_yelp = df[df['source'] == 'yelp']
In [21]:
sentences = df_yelp['sentence'].values
In [22]:
y = df_yelp['label'].values
In [25]:
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.25, random_state=1000)
In [26]:
from sklearn.feature_extraction.text import CountVectorizer
In [27]:
vectorizer = CountVectorizer()
In [30]:
vectorizer.fit(sentences_train)
Out[30]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
In [31]:
X_train = vectorizer.transform(sentences_train)
In [32]:
X_test  = vectorizer.transform(sentences_test)
In [33]:
X_train
Out[33]:
<750x1714 sparse matrix of type '<class 'numpy.int64'>'
	with 7368 stored elements in Compressed Sparse Row format>
In [36]:
from sklearn.linear_model import LogisticRegression
In [37]:
classifier = LogisticRegression()
In [38]:
classifier.fit(X_train, y_train)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[38]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
In [39]:
score = classifier.score(X_test, y_test)
In [40]:
print(score)
0.796
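
The score printed above is the mean accuracy on the test set, so 0.796 on 250 test sentences corresponds to 199 correct predictions. As a rough illustration of what happens behind `fit` and `score`, here is a minimal sketch of a logistic-regression prediction and of the accuracy computation. The three-word vocabulary, the weights and the bias are invented for the example and are not the values the classifier learned.

```python
import math


def predict(weights, bias, x):
    # Weighted sum of word counts, squashed to a probability by the sigmoid.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1 if probability >= 0.5 else 0


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical model over the vocabulary ['bad', 'good', 'great'].
weights = [-2.0, 1.5, 2.0]
bias = -0.1

# Hypothetical count vectors and labels for four test sentences.
X_test_demo = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
y_test_demo = [1, 0, 1, 1]

y_pred_demo = [predict(weights, bias, x) for x in X_test_demo]
print(y_pred_demo)
print(accuracy(y_test_demo, y_pred_demo))
```

The last sentence contains both 'bad' and 'good'; with these made-up weights the negative word wins, which gives one wrong prediction out of four.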
In [41]:
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    y = df_source['label'].values

    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test  = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print('Accuracy for {} data: {:.4f}'.format(source, score))
Accuracy for yelp data: 0.7960
Accuracy for amazon data: 0.7960
Accuracy for imdb data: 0.7487
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

A more complex NLP MODEL in Keras (https://keras.io/)

In [42]:
from keras.models import Sequential
Using TensorFlow backend.
In [43]:
from keras import layers
In [44]:
input_dim = X_train.shape[1]
In [45]:
model = Sequential()
In [46]:
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
In [47]:
model.add(layers.Dense(1, activation='sigmoid'))
In [48]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
In [49]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 10)                25060     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
=================================================================
Total params: 25,071
Trainable params: 25,071
Non-trainable params: 0
_________________________________________________________________
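
The parameter counts in the summary can be checked by hand. This is a small sketch, assuming a Dense layer holds one weight per input-unit pair plus one bias per unit. The input size of 2505 is not printed in the notebook; it is inferred here from the reported 25,060 parameters of the first layer.

```python
def dense_params(input_dim, units):
    # One weight for every (input, unit) pair plus one bias per unit.
    return input_dim * units + units


# Assumption: input_dim = 2505, inferred from 25,060 = 10 * (input_dim + 1).
hidden_layer = dense_params(input_dim=2505, units=10)
output_layer = dense_params(input_dim=10, units=1)
total = hidden_layer + output_layer

print(hidden_layer, output_layer, total)
```

Almost all parameters sit in the first layer, because every vocabulary word gets its own weight for each of the 10 hidden units.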
In [51]:
history = model.fit(X_train, y_train,
                    epochs=100,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
In [52]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)

Overfitting is when a model fits the training data almost perfectly (training accuracy) but cannot apply what it learned to unseen data (testing accuracy)

In [53]:
print("Training Accuracy: {:.4f}".format(accuracy))
Training Accuracy: 1.0000
In [54]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
In [55]:
print("Testing Accuracy:  {:.4f}".format(accuracy))
Testing Accuracy:  0.7861
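
One common way to spot overfitting in a training history is to look for the epoch where the validation loss stops falling while the training loss keeps going down. The sketch below illustrates that idea with invented loss values; the helper `best_epoch` is hypothetical and the numbers are not taken from the run above, only the two accuracies are.

```python
def best_epoch(val_loss):
    # 1-based number of the epoch with the lowest validation loss.
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1


# Invented curves: training loss keeps falling, validation loss turns upward.
history_demo = {
    'loss':     [0.69, 0.50, 0.30, 0.15, 0.06, 0.02],
    'val_loss': [0.68, 0.58, 0.52, 0.50, 0.55, 0.63],
}

stop_at = best_epoch(history_demo['val_loss'])
print('Validation loss is lowest at epoch', stop_at)

# Gap between the two accuracies reported in the notebook.
generalization_gap = 1.0000 - 0.7861
print('Accuracy gap: {:.4f}'.format(generalization_gap))
```

Training beyond that epoch improves the fit to the training set only, which is the pattern the two accuracy values above suggest.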
In [68]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    accuracy = history.history['accuracy']
    val_accuracy = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(accuracy) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, accuracy, 'b', label='Training accuracy')
    plt.plot(x, val_accuracy, 'r', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
In [69]:
plot_history(history)

Changing the vectorization method: the text is tokenized into word indices so that it can be mapped (transformed) into a geometric space (embedding space)

In [72]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)
X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
print(sentences_train[2])
print(X_train[2])
I am a fan of his ... This movie sucked really bad.  
[7, 150, 2, 932, 4, 49, 6, 11, 563, 45, 30]
In [85]:
for word in ['the', 'all', 'looks', 'good']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))
the: 1
all: 27
looks: 431
good: 33
In [86]:
from keras.preprocessing.sequence import pad_sequences
In [87]:
maxlen = 100
In [88]:
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
In [89]:
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
In [90]:
print(X_train[0, :])
[170 116 390  35   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]
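
To make the two steps above more concrete, here is a simplified sketch of word indexing and post-padding in plain Python. It is a stand-in for illustration and not the Keras implementation: the helpers `fit_word_index`, `texts_to_sequences` and `pad_post` are made up, the regular expression is a rough assumption about what counts as a word, and the three training sentences are invented.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9']+")  # assumed, simplified notion of a word


def fit_word_index(texts):
    # Rank words by frequency; index 0 stays reserved for padding.
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(text.lower()))
    # most_common() keeps first-seen order for words with equal counts.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    # Replace each known word by its index; unknown words are dropped.
    return [[word_index[w] for w in WORD.findall(t.lower()) if w in word_index]
            for t in texts]


def pad_post(sequence, maxlen):
    # Keep at most the last maxlen items, then fill the tail with zeros.
    # maxlen is expected to be a positive integer.
    sequence = sequence[-maxlen:]
    return sequence + [0] * (maxlen - len(sequence))


train_demo = ['The food was good', 'The service was slow', 'Good food']
word_index = fit_word_index(train_demo)
sequences = texts_to_sequences(['The pizza was good'], word_index)
padded = [pad_post(s, 6) for s in sequences]

print(word_index)
print(padded)
```

Unlike the count vectors used earlier, these sequences keep the word order, and every sentence ends up with the same length so that a batch can be fed to the network.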

Keras embedding

In [93]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
flatten_1 (Flatten)          (None, 5000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                50010     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
=================================================================
Total params: 178,771
Trainable params: 178,771
Non-trainable params: 0
_________________________________________________________________
In [94]:
history = model.fit(X_train, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/indexed_slices.py:434: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Training Accuracy: 1.0000
Testing Accuracy:  0.7112
In [95]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 50)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                510       
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11        
=================================================================
Total params: 129,271
Trainable params: 129,271
Non-trainable params: 0
_________________________________________________________________
In [96]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)
Training Accuracy: 1.0000
Testing Accuracy:  0.7433
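
The two embedding models above differ only in how the (100, 50) embedding output is reduced before the Dense layer. The sketch below shows the difference on a tiny hand-made example with 4 positions and 3 embedding dimensions; the numbers are invented, and the functions are plain-Python illustrations rather than the Keras layers themselves.

```python
def flatten(embedded):
    # Concatenate all position vectors: length = positions * embedding_dim.
    return [value for vector in embedded for value in vector]


def global_max_pool(embedded):
    # Largest value per embedding dimension across all positions.
    return [max(column) for column in zip(*embedded)]


# Invented embedded sentence: two real words followed by two padding rows.
embedded_demo = [
    [0.1, -0.4, 0.3],
    [0.7, 0.2, -0.1],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

flat = flatten(embedded_demo)
pooled = global_max_pool(embedded_demo)
print(len(flat), flat)
print(len(pooled), pooled)

# Size of the following Dense(10) layer in the two summaries above.
dense_after_flatten = 100 * 50 * 10 + 10
dense_after_pooling = 50 * 10 + 10
print(dense_after_flatten, dense_after_pooling)
```

Pooling makes the next layer's input independent of the sequence length, which is why the Dense layer shrinks from 50,010 to 510 parameters.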

Keras convolutional model

In [97]:
embedding_dim = 50  # value in effect for the summary below: Embedding output is (None, 100, 50)

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 100, 50)           128750    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 96, 128)           32128     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 10)                1290      
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 11        
=================================================================
Total params: 162,179
Trainable params: 162,179
Non-trainable params: 0
_________________________________________________________________
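
The shapes and parameter counts in this summary follow from a few lines of arithmetic. This is a sketch under stated assumptions: the convolution uses no padding and stride 1 (the Keras defaults), and the vocabulary size of 2575 is not printed in the notebook but inferred from 128,750 = vocab_size * 50.

```python
def conv1d_output_length(input_length, kernel_size):
    # No padding, stride 1: the kernel fits in this many positions.
    return input_length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights plus one bias.
    return kernel_size * in_channels * filters + filters


def dense_params(input_dim, units):
    return input_dim * units + units


vocab_size_demo = 2575  # assumption, inferred from the summary
embedding = vocab_size_demo * 50
conv_length = conv1d_output_length(input_length=100, kernel_size=5)
conv = conv1d_params(kernel_size=5, in_channels=50, filters=128)
dense_hidden = dense_params(128, 10)
dense_output = dense_params(10, 1)
total = embedding + conv + dense_hidden + dense_output

print(conv_length, embedding, conv, dense_hidden, dense_output, total)
```

Each of the 128 filters looks at 5 consecutive word vectors at a time, so it can react to short phrases rather than to single words.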
In [98]:
history = model.fit(X_train, y_train,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)
Training Accuracy: 1.0000
Testing Accuracy:  0.7914

Even this third model still does not learn any better. The reason may be that we have too few training samples, that the data generalize poorly, or that the hyperparameters need tuning.

In [99]:
def create_model(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
In [100]:
param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[5000], 
                  embedding_dim=[50],
                  maxlen=[100])

A model that tunes its own hyperparameters

In [109]:
import time
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV

# Main settings
epochs = 20
embedding_dim = 50
maxlen = 100
output_file = 'data/output.txt'

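# Run grid search for each source (yelp, amazon, imdb)
for source, frame in df.groupby('source'):
    print('Running grid search for data set :', source)
    # NOTE: the next two lines read the full `df`, not the per-source `frame`,
    # so every iteration trains on all three sources combined (hence the
    # identical vocab_size in the output). Use frame['sentence'] and
    # frame['label'] to search each source separately.
    sentences = df['sentence'].values
    y = df['label'].values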

    # Train-test split
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    # Tokenize words
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(sentences_train)
    X_train = tokenizer.texts_to_sequences(sentences_train)
    X_test = tokenizer.texts_to_sequences(sentences_test)

    # Adding 1 because of reserved 0 index
    vocab_size = len(tokenizer.word_index) + 1

    # Pad sequences with zeros
    X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
    X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

    # Parameter grid for grid search
    param_grid = dict(num_filters=[32, 64, 128],
                      kernel_size=[3, 5, 7],
                      vocab_size=[vocab_size],
                      embedding_dim=[embedding_dim],
                      maxlen=[maxlen])
    model = KerasClassifier(build_fn=create_model,
                            epochs=epochs, batch_size=10,
                            verbose=False)
    grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                              cv=4, verbose=1, n_iter=5)
    grid_result = grid.fit(X_train, y_train)

    # Evaluate testing set
    test_accuracy = grid.score(X_test, y_test)

    # Save and evaluate results (writing to output_file is disabled here;
    # the results are only printed)
    # prompt = input(f'finished {source}; write to file and proceed? [y/n]')
    # if prompt.lower() not in {'y', 'true', 'yes'}:
    #     break
    # with open(output_file, 'a') as f:
    #     f.write(output_string)
    s = ('Running {} data set\nBest Accuracy : '
         '{:.4f}\n{}\nTest Accuracy : {:.4f}\n\n')
    output_string = s.format(
        source,
        grid_result.best_score_,
        grid_result.best_params_,
        test_accuracy)
    print(output_string)
    time.sleep(5.5)
Running grid search for data set : amazon
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  7.7min finished
Running amazon data set
Best Accuracy : 0.8151
{'vocab_size': 4603, 'num_filters': 128, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50}
Test Accuracy : 0.8384


Running grid search for data set : imdb
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  7.1min finished
Running imdb data set
Best Accuracy : 0.8185
{'vocab_size': 4603, 'num_filters': 32, 'maxlen': 100, 'kernel_size': 3, 'embedding_dim': 50}
Test Accuracy : 0.8443


Running grid search for data set : yelp
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  7.2min finished
Running yelp data set
Best Accuracy : 0.8161
{'vocab_size': 4603, 'num_filters': 64, 'maxlen': 100, 'kernel_size': 5, 'embedding_dim': 50}
Test Accuracy : 0.8282
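
The line "Fitting 4 folds for each of 5 candidates, totalling 20 fits" can be traced back to the settings in the cell above. The sketch below enumerates the parameter grid in plain Python and draws 5 candidates from it; it only illustrates the bookkeeping and is not how RandomizedSearchCV is implemented. The fixed random seed is an arbitrary choice made for this example.

```python
import itertools
import random

param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[4603],
                  embedding_dim=[50],
                  maxlen=[100])

# Every combination of the listed values (a full grid search would try all).
keys = list(param_grid)
combinations = [dict(zip(keys, values))
                for values in itertools.product(*(param_grid[k] for k in keys))]

# A randomized search tries only n_iter of them, each with cv-fold validation.
n_iter, cv = 5, 4
rng = random.Random(0)  # arbitrary seed, for repeatability of this example
candidates = rng.sample(combinations, n_iter)
total_fits = n_iter * cv

print(len(combinations), len(candidates), total_fits)
for candidate in candidates:
    print(candidate['num_filters'], candidate['kernel_size'])
```

Only `num_filters` and `kernel_size` actually vary, so the grid has 9 combinations and the search above evaluated 5 of them per run.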


Improved model

Results of the randomized search, with vocab_size=4603, embedding_dim=50 and maxlen=100 in every run:

amazon: best cross-validation accuracy 0.8151, test accuracy 0.8384 (num_filters=128, kernel_size=3)
imdb: best cross-validation accuracy 0.8185, test accuracy 0.8443 (num_filters=32, kernel_size=3)
yelp: best cross-validation accuracy 0.8161, test accuracy 0.8282 (num_filters=64, kernel_size=5)