Code & Models – Clément Dalloux

ESSAI & CAS experiments

French word2vec model & French fastText model

These 100-dimensional word embedding models were trained with the Skip-Gram algorithm and negative sampling, using a window of 5 words to the left and right and a minimum count of five occurrences per word. The training data consisted of French Wikipedia articles and biomedical data. The latter includes the ESSAI and CAS corpora, the French Medical Corpus from CRTT and the Corpus QUAERO Médical du français.
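For readers who want to train comparable models, the settings above can be written as gensim keyword arguments. This is only a minimal sketch: it assumes gensim 4.x (the toolkit actually used is not stated here), the number of negative samples is an assumption taken from gensim's default, and `train_models` is a hypothetical helper name.

```python
# Hyperparameters stated in the text, expressed as gensim 4.x keyword arguments.
SKIPGRAM_PARAMS = {
    "vector_size": 100,  # 100-dimensional embeddings
    "window": 5,         # up to 5 words of context on each side
    "min_count": 5,      # ignore words seen fewer than five times
    "sg": 1,             # 1 selects Skip-Gram (0 would select CBOW)
    "hs": 0,             # hierarchical softmax disabled
    "negative": 5,       # assumption: gensim's default; the count is not given in the text
}


def train_models(sentences):
    """Train a word2vec and a fastText model on a list of token lists (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without gensim installed.
    from gensim.models import FastText, Word2Vec

    return (
        Word2Vec(sentences=sentences, **SKIPGRAM_PARAMS),
        FastText(sentences=sentences, **SKIPGRAM_PARAMS),
    )
```

Here `sentences` would be the tokenized Wikipedia and biomedical text, one list of tokens per sentence. Tokenization and preprocessing are not described above and are left to the reader.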

CD-SCO experiments

We experimented with several bidirectional recurrent neural networks on the *SEM-2012 shared task data. Our results were published in the Natural Language Engineering article. Our code is available on my GitHub. Our fastText model is available below.

fastText-Doyle

A 100-dimensional fastText model that we trained on Conan Doyle’s novels for 100 epochs, on the assumption that domain-specific word embeddings will outperform those trained on generic data. It was trained using CBOW and hierarchical softmax.
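A comparable model could be trained with the fastText command-line tool. The sketch below only builds the command. It assumes the `fasttext` binary is installed, the file names `doyle.txt` and `fasttext-doyle` are hypothetical, and every option not mentioned above is left at the tool's default, which may differ from what we used.

```python
import shlex


def doyle_command(corpus_path, output_prefix):
    """Build a fastText CLI call with the settings described above."""
    return [
        "fasttext", "cbow",        # CBOW training mode
        "-input", corpus_path,     # plain-text training corpus
        "-output", output_prefix,  # prefix for the .bin and .vec output files
        "-dim", "100",             # 100-dimensional vectors
        "-epoch", "100",           # 100 passes over the corpus
        "-loss", "hs",             # hierarchical softmax
    ]


# Hypothetical file names, for illustration only.
cmd = doyle_command("doyle.txt", "fasttext-doyle")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed directly to `subprocess.run(cmd, check=True)`.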