Multi-class categorization

Que ce soit binary ou multi-class categorization, le problème ne change pas fondamentalement.

L’exemple donné est celui de la classification de textes de la BBC en 5 catégories différentes ‘tech’, ‘business’, ‘sport’, ‘entertainment’, ‘politics’.

import csv
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
    -O /tmp/bbc-text.csv
--2020-05-15 13:32:57--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128, 2607:f8b0:400e:c05::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5057493 (4.8M) [application/octet-stream]
Saving to: ‘/tmp/bbc-text.csv’

/tmp/bbc-text.csv   100%[===================>]   4.82M  --.-KB/s    in 0.03s   

2020-05-15 13:32:57 (161 MB/s) - ‘/tmp/bbc-text.csv’ saved [5057493/5057493]
vocab_size = 10000
embedding_dim = 16
max_length = 250
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = 0.8

Ce qu’il y a de nouveau dans cet exemple c’est l’ajout de stopwords, des mots qui sont ignorés.

sentences = []
labels = []
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]
print(len(stopwords))
with open("/tmp/bbc-text.csv", 'r') as csvfile:
  reader = csv.reader(csvfile, delimiter=',')
  next(reader)
  for row in reader:
    labels.append(row[0])
    sentence = row[1]
    for word in stopwords:
      token = " " + word + " "
      sentence = sentence.replace(token, " ")
    sentences.append(sentence)

print(len(labels))
print(len(sentences))
print(sentences[0])
2225
2225
tv future hands viewers home theatre systems  plasma high-definition tvs  digital video recorders moving living room  ...
import pandas as pd
dfSentences=pd.DataFrame(sentences)
dfSentences.head()

0
0	tv future hands viewers home theatre systems ...
1	worldcom boss left books alone former worldc...
2	tigers wary farrell gamble leicester say wil...
3	yeading face newcastle fa cup premiership side...
4	ocean s twelve raids box office ocean s twelve...
dfSentences.describe()

0
count	2225
unique	2123
top	kennedy questions trust blair lib dem leader c...
freq	2
dfLabels=pd.DataFrame(labels)
dfLabels[0].unique()
array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)
dfLabels.head(10)

0
0	tech
1	business
2	sport
3	sport
4	entertainment
5	politics
6	politics
7	sport
8	sport
9	entertainment
train_size = int(training_portion * len(labels))

train_sentences = sentences[:train_size]
train_labels = labels[:train_size]

validation_sentences = sentences[train_size:]
validation_labels = labels[train_size:]

print(train_size)
print(len(train_sentences))
print(len(train_labels))
print(len(validation_sentences))
print(len(validation_labels))
1780
1780
1780
445
445
tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length)

print(len(train_sequences[0]))
print(len(train_padded[0]))

print(len(train_sequences[1]))
print(len(train_padded[1]))

print(len(train_sequences[10]))
print(len(train_padded[10]))
449
250
200
250
192
250
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
validation_padded = pad_sequences(validation_sequences, padding = padding_type, maxlen=max_length)

print(len(validation_labels))
print(validation_padded.shape)
445
(445, 250)
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)

training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
label_tokenizer.index_word
{1: 'sport', 2: 'business', 3: 'politics', 4: 'tech', 5: 'entertainment'}
print(training_label_seq[0])
print(training_label_seq[1])
print(training_label_seq[2])
print(training_label_seq.shape)

print(validation_label_seq[0])
print(validation_label_seq[1])
print(validation_label_seq[2])
print(validation_label_seq.shape)
[4]
[2]
[1]
(1780, 1)
[5]
[4]
[3]
(445, 1)
model = tf.keras.Sequential([
                             tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
                             tf.keras.layers.GlobalAveragePooling1D(),
                             tf.keras.layers.Dense(24, activation='relu'),
                             tf.keras.layers.Dense(6, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, 250, 16)           160000    
_________________________________________________________________
global_average_pooling1d_5 ( (None, 16)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 24)                408       
_________________________________________________________________
dense_11 (Dense)             (None, 6)                 150       
=================================================================
Total params: 160,558
Trainable params: 160,558
Non-trainable params: 0
_________________________________________________________________
num_epochs = 20
history = model.fit(train_padded, training_label_seq, epochs = num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)

Il y a du sur-apprentissage, un problème que nous étudierons plus tard.

Epoch 1/20
56/56 - 0s - loss: 1.7662 - accuracy: 0.2551 - val_loss: 1.7294 - val_accuracy: 0.2966
Epoch 2/20
56/56 - 0s - loss: 1.6599 - accuracy: 0.4000 - val_loss: 1.5865 - val_accuracy: 0.4157
Epoch 3/20
56/56 - 0s - loss: 1.4570 - accuracy: 0.5652 - val_loss: 1.3540 - val_accuracy: 0.6315
Epoch 4/20
56/56 - 0s - loss: 1.1777 - accuracy: 0.7034 - val_loss: 1.0757 - val_accuracy: 0.7281
Epoch 5/20
56/56 - 0s - loss: 0.8998 - accuracy: 0.8197 - val_loss: 0.8481 - val_accuracy: 0.8315
Epoch 6/20
56/56 - 0s - loss: 0.6894 - accuracy: 0.8871 - val_loss: 0.6863 - val_accuracy: 0.8652
Epoch 7/20
56/56 - 0s - loss: 0.5346 - accuracy: 0.9354 - val_loss: 0.5708 - val_accuracy: 0.8697
Epoch 8/20
56/56 - 0s - loss: 0.4141 - accuracy: 0.9478 - val_loss: 0.4696 - val_accuracy: 0.9056
Epoch 9/20
56/56 - 0s - loss: 0.3157 - accuracy: 0.9697 - val_loss: 0.3930 - val_accuracy: 0.9258
Epoch 10/20
56/56 - 0s - loss: 0.2396 - accuracy: 0.9820 - val_loss: 0.3326 - val_accuracy: 0.9393
Epoch 11/20
56/56 - 0s - loss: 0.1816 - accuracy: 0.9888 - val_loss: 0.2854 - val_accuracy: 0.9461
Epoch 12/20
56/56 - 0s - loss: 0.1398 - accuracy: 0.9927 - val_loss: 0.2518 - val_accuracy: 0.9528
Epoch 13/20
56/56 - 0s - loss: 0.1093 - accuracy: 0.9949 - val_loss: 0.2289 - val_accuracy: 0.9528
Epoch 14/20
56/56 - 0s - loss: 0.0870 - accuracy: 0.9961 - val_loss: 0.2101 - val_accuracy: 0.9528
Epoch 15/20
56/56 - 0s - loss: 0.0702 - accuracy: 0.9966 - val_loss: 0.1969 - val_accuracy: 0.9551
Epoch 16/20
56/56 - 0s - loss: 0.0574 - accuracy: 0.9978 - val_loss: 0.1851 - val_accuracy: 0.9551
Epoch 17/20
56/56 - 0s - loss: 0.0473 - accuracy: 0.9994 - val_loss: 0.1778 - val_accuracy: 0.9551
Epoch 18/20
56/56 - 0s - loss: 0.0397 - accuracy: 0.9994 - val_loss: 0.1706 - val_accuracy: 0.9551
Epoch 19/20
56/56 - 0s - loss: 0.0335 - accuracy: 1.0000 - val_loss: 0.1630 - val_accuracy: 0.9528
Epoch 20/20
56/56 - 0s - loss: 0.0286 - accuracy: 1.0000 - val_loss: 0.1595 - val_accuracy: 0.9551