ELMo Contextual Word Representations Trained on 1B Word Benchmark

Represent words as contextual word-embedding vectors

Released in 2018 by researchers at the Allen Institute for Artificial Intelligence (AI2), this representation was trained using a deep bidirectional language model. It produces three vectors per token, two of which are contextual, meaning that they depend on the entire sentence in which the token appears. The three vectors are intended to be linearly combined. The model is character-based and case-sensitive, so it has no token dictionary.

Number of layers: 127 | Parameter count: 93,600,864 | Trained size: 375 MB

Training Set Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"]

Basic usage

For each token, the net produces three length-1024 feature vectors: one that is context-independent (port "Embedding") and two that are contextual (ports "ContextualEmbedding/1" and "ContextualEmbedding/2").

Input strings are tokenized, meaning they are split into tokens that are words and punctuation marks:

In[2]:=
embeddings = NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"]["Hello world"]
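
As a quick check of the shapes described above, and assuming embeddings holds the association returned by In[2], the dimensions of each port can be listed. For a two-token input, each port should contain two vectors of length 1024:

In[ ]:=
(* map Dimensions over the three output ports *)
Dimensions /@ embeddings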

Pre-tokenized inputs can be given using TextElement:

In[3]:=
embeddings = NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][TextElement[{"Hello", "world"}]]

The same word is represented differently in different sentences. To see this, extract the embeddings for a second sentence:

In[4]:=
embeddings2 = NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][TextElement[{"Hello", "neighbor"}]]

The context-independent embedding of a word is the same whatever the surrounding text. For instance, for the word "Hello":

In[5]:=
embeddings[["Embedding", 1]] == embeddings2[["Embedding", 1]]
Out[5]=
True

The context-dependent embeddings are different for the same word in two different sentences:

In[6]:=
embeddings[["ContextualEmbedding/1", 1]] == embeddings2[["ContextualEmbedding/1", 1]]
Out[6]=
False
In[7]:=
embeddings[["ContextualEmbedding/2", 1]] == embeddings2[["ContextualEmbedding/2", 1]]
Out[7]=
False

The recommended usage is to take a (possibly weighted) average of the embeddings:

In[8]:=
Mean@Values[embeddings]
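
The unweighted mean is the special case of equal weights. In the ELMo paper, the weights of the linear combination are learned for each downstream task; the following sketch uses hypothetical fixed weights purely for illustration and assumes embeddings from the inputs above:

In[ ]:=
(* hypothetical mixing weights, one per port, in the order returned by Values[embeddings] *)
weights = {0.2, 0.3, 0.5};
(* contract the weights with the three {tokens, 1024} matrices *)
weighted = weights . Values[embeddings]

With weights {1/3, 1/3, 1/3}, this reduces to the mean computed above.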

Word analogies without context

Extract the non-contextual part of the net:

In[9]:=
netNonContextual = NetTake[NetModel[
   "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], {NetPort["Input"], "embedding"}]

Precompute the context-independent embeddings for a list of common words (if a GPU is available, set TargetDevice -> "GPU" for faster evaluation):

In[10]:=
word2vec = With[{words = WordList[]}, AssociationThread[
    words -> netNonContextual[words, TargetDevice -> "CPU"][[All, 1]]]];

Find the five nearest words to "king":

In[11]:=
Nearest[word2vec, word2vec["king"], 5]
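
For numeric vectors, Nearest uses Euclidean distance by default. Cosine distance is a common alternative for comparing word vectors; as a sketch, it can be selected with the DistanceFunction option (the resulting neighbors may differ from those above):

In[ ]:=
(* same query as above, ranked by cosine distance instead of Euclidean distance *)
nearestCosine = Nearest[word2vec, word2vec["king"], 5, DistanceFunction -> CosineDistance]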

Man is to king as woman is to:

In[12]:=
Nearest[word2vec, word2vec["king"] - word2vec["man"] + word2vec["woman"], 5]

Visualize the similarity between the words using the net as a feature extractor:

In[13]:=
animals = {"alligator", "bear", "bird", "bee", "camel", "zebra", "crocodile", "rhinoceros", "giraffe", "dolphin", "duck", "eagle", "elephant", "fish", "fly"};
In[14]:=
fruits = {"apple", "apricot", "avocado", "banana", "blackberry", "cherry", "coconut", "cranberry", "grape", "mango", "melon", "papaya", "peach", "pineapple", "raspberry", "strawberry", "fig"};
In[15]:=
FeatureSpacePlot[Join[animals, fruits], FeatureExtractor -> Function[w, word2vec[w]]]

Word analogies in context

Define a function that shows the word in context along with the average of its embeddings:

In[16]:=
netevaluateWithContext[sentence_String] := With[{tokenizedSentence = TextElement[StringSplit[sentence]]},
  AssociationThread[
   Thread[{First[tokenizedSentence], sentence}],
   Mean@Values@(NetModel[
        "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][tokenizedSentence])
   ]
  ]
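
The keys of the resulting association are {word, sentence} pairs, which keeps the same word in different sentences distinct. A minimal illustration of how Thread builds these keys, using a hand-tokenized sentence:

In[ ]:=
(* pair every token with the full sentence it came from *)
keys = Thread[{{"I", "play", "the", "piano"}, "I play the piano"}]

This evaluates to {{"I", "I play the piano"}, {"play", "I play the piano"}, {"the", "I play the piano"}, {"piano", "I play the piano"}}. Note that a word occurring twice in one sentence produces duplicate keys, in which case AssociationThread keeps only the last value.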

Check the result on a sentence:

In[17]:=
Dataset[netevaluateWithContext["I play the piano"]]

Define a function to find the nearest word in context in a set of sentences, for a given word in context:

In[18]:=
findSemanticNearestWord[{word_, context_}, otherSentences_] := First@Nearest[
   Association[Join @@ Map[netevaluateWithContext, otherSentences]],
   netevaluateWithContext[context][{word, context}]
   ]
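
Nearest compares the query vector against the stored vectors and returns the labels attached to the closest ones. A minimal sketch with hypothetical 2D vectors standing in for averaged ELMo embeddings, written in the explicit vectors -> labels form:

In[ ]:=
(* toy labelled vectors; the labels and values are illustrative only *)
toy = <|"play (sentence 1)" -> {0., 1.}, "played (sentence 2)" -> {1., 0.}|>;
(* compare against the vectors and return the label of the closest one *)
toyNearest = First@Nearest[Values[toy] -> Keys[toy], {0.9, 0.2}]

Here the query {0.9, 0.2} is closest to {1., 0.}, so the result is "played (sentence 2)".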

Find the semantically nearest word to the word "play" in "I play the piano":

In[19]:=
findSemanticNearestWord[{"play", "I play the piano"},
 {"This was a nice play", "Guitar can be played with a pick"}
 ]

Find the semantically nearest word to the word "set" in "The set of values higher than a threshold":

In[20]:=
findSemanticNearestWord[{"set", "The set of values higher than a threshold"},
 {"They set the clock", "This ensemble of items belongs to her"}
 ]

Train a model with the word embeddings

Get a text-processing dataset:

In[21]:=
trainingData = ExampleData[{"MachineLearning", "MovieReview"}, "TrainingData"];
validationData = ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];

Precompute the ELMo vectors for the training and validation datasets (a GPU is recommended if available):

In[22]:=
trainingDataELMo = Total[Values[
      NetModel[
        "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][Keys[trainingData], TargetDevice -> "CPU"]]/3.] -> Values[trainingData];
In[23]:=
validationDataELMo = Total[Values[
      NetModel[
        "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"][Keys[validationData], TargetDevice -> "CPU"]]/3.] -> Values[validationData];
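
The expression Total[Values[...]/3.] averages the three port outputs token by token, and it works on a batch of texts of different lengths because the arithmetic threads over the nested lists. A sketch on a hypothetical toy output (two texts with one and two tokens, and 2-dimensional vectors instead of length-1024 ones):

In[ ]:=
(* toy stand-in for the net output on a batch of two texts; all values are illustrative *)
toyOutput = <|
   "Embedding" -> {{{3., 3.}}, {{0., 0.}, {3., 3.}}},
   "ContextualEmbedding/1" -> {{{6., 6.}}, {{3., 3.}, {3., 3.}}},
   "ContextualEmbedding/2" -> {{{9., 9.}}, {{6., 6.}, {3., 3.}}}|>;
(* same expression as above: per-token average over the three ports *)
toyAverage = Total[Values[toyOutput]/3.]

Each token vector in the result is the mean of its three port vectors, here {{{6., 6.}}, {{3., 3.}, {3., 3.}}}.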

Define a network that takes word vectors instead of strings for the text-processing task:

In[24]:=
netArchitecture = NetChain[{DropoutLayer[], NetMapOperator[2], AggregationLayer[Max, 1], SoftmaxLayer[]}, "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]]

Train the network on the pre-computed ELMo vectors:

In[25]:=
trainResultsELMo = NetTrain[netArchitecture, trainingDataELMo, All, ValidationSet -> validationDataELMo, MaxTrainingRounds -> 20]

Check the classification error rate on the validation data:

In[26]:=
Min@trainResultsELMo["ValidationMeasurementsLists", "ErrorRate"]

Compare the results with the performance of the same model trained on context-independent embeddings:

In[27]:=
trainingDataGlove = Thread@Rule[
    NetModel[
      "GloVe 300-Dimensional Word Vectors Trained on Wikipedia and \
Gigaword 5 Data"][Keys[trainingData]],
    Values[trainingData]
    ];
In[28]:=
validationDataGlove = Thread@Rule[
    NetModel[
      "GloVe 300-Dimensional Word Vectors Trained on Wikipedia and \
Gigaword 5 Data"][Keys[validationData]],
    Values[validationData]
    ];
In[29]:=
trainResultsGlove = NetTrain[netArchitecture, trainingDataGlove, All, ValidationSet -> validationDataGlove, MaxTrainingRounds -> 20]
In[30]:=
Min@trainResultsGlove["ValidationMeasurementsLists", "ErrorRate"]

Net information

Inspect the number of parameters of all arrays in the net:

In[31]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "ArraysElementCounts"]

Obtain the total number of parameters:

In[32]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "ArraysTotalElementCount"]

Obtain the layer type counts:

In[33]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "LayerTypeCounts"]

Display the summary graphic:

In[34]:=
NetInformation[
 NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "SummaryGraphic"]

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[35]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"], "MXNet"]

Export also creates a net.params file containing parameters:

In[36]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]

Get the size of the parameter file:

In[37]:=
FileByteCount[paramPath]

The size is similar to the byte count of the resource object:

In[38]:=
ResourceObject[
  "ELMo Contextual Word Representations Trained on 1B Word \
Benchmark"]["ByteCount"]

Requirements

Wolfram Language 11.3 (March 2018) or above

Resource History

Reference

  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, "Deep Contextualized Word Representations," arXiv:1802.05365, NAACL (2018)
  • Available from: http://allennlp.org/elmo
  • Rights: Apache 2.0 License