Build natural language processing systems using TensorFlow

The reference tutorial for this first NLP part is the online course Natural Language Processing in TensorFlow by Laurence Moroney (Google Brain), covered in week #2 of the course.

The same code is used in several tutorials.

The dataset used in this tutorial is imdb_reviews.

We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing

These data, split into two sets of 25,000 records (training/test), are highly polar (strongly positive or strongly negative) movie reviews collected before 2011.

The goal of the exercise is to classify a review as positive or negative from its text.

The principle is simple: first we build a vocabulary from the dataset, then we tokenize (in every record, each word is replaced by its index in the vocabulary), and we pad the sequences so that they all have the same length. We then choose the embedding dimension and build the network. The rest is similar to what was seen previously. A small sketch of this pipeline on toy sentences is given just below.
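As an illustration only, here is a minimal sketch of that tokenize/pad pipeline on two made-up sentences (the sentences and the small num_words value are assumptions, not part of the original notebook):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two toy sentences, just to show the mechanics.
toy_sentences = ["the movie was great", "the movie was really boring"]

# Build the vocabulary; unseen words will map to the "<OOV>" token.
toy_tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
toy_tokenizer.fit_on_texts(toy_sentences)
print(toy_tokenizer.word_index)   # e.g. {'<OOV>': 1, 'the': 2, 'movie': 3, ...}

# Replace each word by its index, then pad every sequence to the same length.
toy_sequences = toy_tokenizer.texts_to_sequences(toy_sentences)
toy_padded = pad_sequences(toy_sequences, maxlen=6, padding='post')
print(toy_padded)                 # every row now has length 6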

import tensorflow as tf
print(tf.__version__)
2.2.0-rc3
import tensorflow_datasets as tfds

Reading the data

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=False)

with_info=True makes tfds.load also return the dataset's metadata; as_supervised=False makes each example a dictionary (here with 'text' and 'label' keys) rather than a (text, label) tuple.

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dl Completed...: 100% 1/1 [00:09<00:00, 9.99s/ url]
Dl Size...: 100% 80/80 [00:09<00:00, 8.04 MiB/s]
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9U88OW/imdb_reviews-train.tfrecord
37% 9199/25000 [00:00<00:00, 91988.33 examples/s]
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9U88OW/imdb_reviews-test.tfrecord
30% 7391/25000 [00:00<00:00, 73900.20 examples/s]
Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete9U88OW/imdb_reviews-unsupervised.tfrecord
90% 44837/50000 [00:07<00:00, 54557.99 examples/s]
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
info
tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142--150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }""",
    redistribution_info=,
)
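Before reloading with as_supervised=True, one can peek at a single example to see the dictionary format returned by as_supervised=False (a quick check, not shown in the original outputs; the 'text' and 'label' keys match the features listed in info):

# Each example of a split is a dict when as_supervised=False.
for example in imdb['train'].take(1):
    print(example['text'])   # tf.string tensor holding the review
    print(example['label'])  # tf.int64 tensor holding the label (0 or 1)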

Reading with as_supervised=True

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []
# Convert the eager tensors to plain Python/numpy values.
for s,l in train_data:
  training_sentences.append(str(s.numpy()))
  training_labels.append(l.numpy())

for s,l in test_data:
  testing_sentences.append(str(s.numpy()))
  testing_labels.append(l.numpy())
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

Preparing the data

  • vocab_size sets the size of the vocabulary
  • embedding_dim is discussed later in the notebook
  • max_length is the maximum length allowed for a text
  • trunc_type indicates whether sequences are truncated at the beginning ('pre') or at the end ('post')
  • oov_tok is the string used for out-of-vocabulary words
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<OOV>"
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
len(word_index)
86539
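Note that word_index contains every word seen during fitting (86,539 here); the num_words limit is only applied when texts are converted to sequences, where any word outside the vocab_size most frequent words is replaced by the <OOV> index (1). A quick check (the word "unseenwordxyz" is hypothetical, chosen so it cannot be in the vocabulary):

# Words outside the top vocab_size words (or never seen at all) map to the OOV index.
print(tokenizer.word_index[oov_tok])                            # 1
print(tokenizer.texts_to_sequences(["an utterly unseenwordxyz film"]))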

Let's look at the data with Pandas.

import pandas as pd
word_index_df=pd.DataFrame(word_index.items())
word_index_df.head(10)

0	1
0	<OOV>	1
1	the	2
2	and	3
3	a	4
4	of	5
5	to	6
6	is	7
7	br	8
8	in	9
9	it	10
sequences = tokenizer.texts_to_sequences(training_sentences)
sequences_df=pd.DataFrame(sequences)
sequences_df.head(5)

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	...	2487	2488	2489	2490	2491	2492	2493	2494	2495	2496	2497	2498	2499	2500	2501	2502	2503	2504	2505	2506	2507	2508	2509	2510	2511	2512	2513	2514	2515	2516	2517	2518	2519	2520	2521	2522	2523	2524	2525	2526
0	59	12	14	35	439	400	18	174	29	1	9	33.0	1378.0	3401.0	42.0	496.0	1.0	197.0	25.0	88.0	156.0	19.0	12.0	211.0	340.0	29.0	70.0	248.0	213.0	9.0	486.0	62.0	70.0	88.0	116.0	99.0	24.0	5740.0	12.0	3317.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	256	28	78	585	6	815	2383	317	109	19	12	7.0	643.0	696.0	6.0	4.0	2249.0	5.0	183.0	599.0	68.0	1483.0	114.0	2289.0	3.0	4005.0	22.0	2.0	1.0	3.0	263.0	43.0	4754.0	4.0	173.0	190.0	22.0	12.0	4126.0	11.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1	6175	2	1	4916	4029	9	4	912	1622	3	1969.0	1307.0	3.0	2384.0	8836.0	201.0	746.0	361.0	15.0	34.0	208.0	308.0	6.0	83.0	8.0	8.0	19.0	214.0	22.0	352.0	4.0	1.0	990.0	2.0	82.0	5.0	3608.0	545.0	1.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	360	7	2	239	5	20	16	4	8837	2705	2679	55.0	2.0	367.0	5.0	2.0	179.0	58.0	141.0	1419.0	17.0	94.0	203.0	980.0	15.0	23.0	1.0	86.0	4.0	193.0	3134.0	3069.0	3.0	1.0	16.0	4.0	383.0	5.0	640.0	395.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	3049	414	28	1058	31	2	370	13	141	2541	9	12.0	20.0	25.0	677.0	439.0	1517.0	2.0	115.0	54.0	1.0	287.0	2.0	1.0	5.0	2.0	674.0	1.0	55.0	347.0	25.0	187.0	34.0	182.0	6.0	29.0	7038.0	19.0	55.0	61.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5 rows × 2527 columns
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)
sequences_df=pd.DataFrame(padded)
sequences_df.head(5)

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	...	80	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95	96	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111	112	113	114	115	116	117	118	119
0	0	0	59	12	14	35	439	400	18	174	29	1	9	33	1378	3401	42	496	1	197	25	88	156	19	12	211	340	29	70	248	213	9	486	62	70	88	116	99	24	5740	...	3401	14	163	19	4	1253	927	7986	9	4	18	13	14	4200	5	102	148	1237	11	240	692	13	44	25	101	39	12	7232	1	39	1378	1	52	409	11	99	1214	874	145	10
1	0	0	0	0	0	0	0	256	28	78	585	6	815	2383	317	109	19	12	7	643	696	6	4	2249	5	183	599	68	1483	114	2289	3	4005	22	2	1	3	263	43	4754	...	11	200	28	1059	171	5	2	20	19	11	298	2	2182	5	10	3	285	43	477	6	602	5	94	203	1	206	102	148	4450	16	228	336	11	2510	392	12	20	32	31	47
2	1	6175	2	1	4916	4029	9	4	912	1622	3	1969	1307	3	2384	8836	201	746	361	15	34	208	308	6	83	8	8	19	214	22	352	4	1	990	2	82	5	3608	545	1	...	2	3652	317	2	1	1835	3445	451	4030	3	1168	985	6	28	4091	3608	545	16	1	2	2297	2430	16	2	299	1357	1259	8	8	2297	803	29	2871	16	4	1	3028	564	5	746
3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	360	7	2	239	5	20	16	4	8837	2705	...	2	115	376	44	25	61	1	6	1681	61	1846	4127	43	4	2289	3	1963	1	145	159	784	113	32	94	120	4	215	20	9	175	282	3	30	13	1027	2	2846	10	2020	47
4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3049	414	...	187	34	421	2	1	5	4	2436	281	154	430	3	2	430	469	4	129	68	713	75	144	31	29	37	2071	32	12	568	27	95	212	57	2	3184	6	6665	26	284	119	47
5 rows × 120 columns
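To sanity-check the tokenization, one can map the indices back to words with a reversed word index (a small sketch, not part of the original outputs; padded zeros are skipped and out-of-vocabulary words come back as <OOV>):

# Invert the word index so a sequence of ints can be turned back into text.
reverse_word_index = {v: k for k, v in word_index.items()}

def decode_review(sequence):
    # 0 is the padding value and has no entry in word_index.
    return ' '.join(reverse_word_index.get(i, '?') for i in sequence if i != 0)

print(decode_review(padded[0]))
print(training_sentences[0])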

Model

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 11526     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
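The parameter counts follow directly from the layer shapes: the Embedding layer holds vocab_size × embedding_dim = 10,000 × 16 = 160,000 weights; Flatten turns the (120, 16) output into a vector of 120 × 16 = 1,920 values; the first Dense layer has 1,920 × 6 + 6 = 11,526 parameters and the output layer 6 × 1 + 1 = 7, for a total of 171,533.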
The test sentences must also be tokenized and padded (with the same tokenizer and settings) before they can be used as validation data:

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, truncating=trunc_type)
num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Epoch 1/10
782/782 [==============================] - 7s 9ms/step - loss: 0.4882 - accuracy: 0.7537 - val_loss: 0.3452 - val_accuracy: 0.8480
Epoch 2/10
782/782 [==============================] - 7s 8ms/step - loss: 0.2444 - accuracy: 0.9032 - val_loss: 0.3724 - val_accuracy: 0.8376
Epoch 3/10
782/782 [==============================] - 7s 9ms/step - loss: 0.1054 - accuracy: 0.9710 - val_loss: 0.4594 - val_accuracy: 0.8211
Epoch 4/10
782/782 [==============================] - 7s 9ms/step - loss: 0.0276 - accuracy: 0.9963 - val_loss: 0.5343 - val_accuracy: 0.8233
Epoch 5/10
782/782 [==============================] - 7s 9ms/step - loss: 0.0065 - accuracy: 0.9996 - val_loss: 0.6053 - val_accuracy: 0.8230
Epoch 6/10
782/782 [==============================] - 7s 8ms/step - loss: 0.0021 - accuracy: 1.0000 - val_loss: 0.6627 - val_accuracy: 0.8234
Epoch 7/10
782/782 [==============================] - 7s 8ms/step - loss: 9.7176e-04 - accuracy: 1.0000 - val_loss: 0.6994 - val_accuracy: 0.8262
Epoch 8/10
782/782 [==============================] - 7s 9ms/step - loss: 5.1851e-04 - accuracy: 1.0000 - val_loss: 0.7461 - val_accuracy: 0.8256
Epoch 9/10
782/782 [==============================] - 7s 8ms/step - loss: 2.9132e-04 - accuracy: 1.0000 - val_loss: 0.7815 - val_accuracy: 0.8262
Epoch 10/10
782/782 [==============================] - 7s 8ms/step - loss: 1.7517e-04 - accuracy: 1.0000 - val_loss: 0.8207 - val_accuracy: 0.8258
<tensorflow.python.keras.callbacks.History at 0x7f8dbcc1db38>
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
(10000, 16)
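As a possible follow-up (a sketch, not part of the outputs above; the file names vecs.tsv and meta.tsv are just a convention), these weights can be exported in the tab-separated format accepted by the TensorFlow Embedding Projector (https://projector.tensorflow.org), one vector per line plus a matching file of words:

import io

# Invert the word index so row k of the embedding matrix can be labelled with its word.
reverse_word_index = {v: k for k, v in word_index.items()}

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')  # one embedding vector per line
out_m = io.open('meta.tsv', 'w', encoding='utf-8')  # the corresponding word per line
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embedding = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join(str(x) for x in embedding) + "\n")
out_v.close()
out_m.close()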