Spanish Sentence Embeddings

Spanish sentence embeddings trained with sent2vec on the Spanish Unannotated Corpora.

Pre-Processing

The data had already been preprocessed in the Spanish Unannotated Corpora repository: text was lowercased, repeated spaces were collapsed, and URLs were removed, among other steps. We also applied the punctuation-splitting script included in that repository.

With this tokenization, the 2.6B-word corpus yields 3.4B tokens.
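
As a rough illustration of the steps described above, the following Python sketch lowercases text, drops URLs, splits punctuation into separate tokens, and collapses repeated spaces. It is not the script from the Spanish Unannotated Corpora repository; the function name `preprocess` and the regular expressions are assumptions chosen for this example, and the original pipeline includes other steps not shown here.

```python
import re

# Assumed patterns for illustration; the original scripts may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"([^\w\s])")  # any single non-word, non-space character
SPACE_RE = re.compile(r"\s+")


def preprocess(text: str) -> str:
    """Lowercase, remove URLs, split on punctuation, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    # Surround each punctuation mark with spaces so it becomes its own token.
    text = PUNCT_RE.sub(r" \1 ", text)
    text = SPACE_RE.sub(" ", text)
    return text.strip()


if __name__ == "__main__":
    print(preprocess("Hola,  Mundo! Visita https://example.com ahora."))
```

Because each punctuation mark becomes a separate token, the token count after this kind of splitting is higher than the whitespace-delimited word count, which is consistent with the corpus growing from 2.6B words to 3.4B tokens.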

sent2vec Parameters

We used the default sent2vec parameters to train a unigram + bigram model.

Download

Spanish sent2vec (700 dim sentence embeddings, unigram+bigram model, 14.4 GB)
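
Once downloaded, the model can in principle be loaded with the Python bindings from the sent2vec project. The sketch below assumes those bindings are installed. The helper names `load_model` and `embed` and the file name in the usage comment are hypothetical. Input is lowercased and whitespace-normalized to match the corpus preprocessing; punctuation splitting is assumed to have been done beforehand.

```python
from typing import Sequence


def load_model(path: str):
    """Load a sent2vec binary model (requires the sent2vec Python bindings)."""
    import sent2vec  # imported lazily so embed() can be used without it

    model = sent2vec.Sent2vecModel()
    model.load_model(path)
    return model


def embed(model, sentences: Sequence[str]):
    """Lowercase and normalize spaces, then embed each sentence."""
    cleaned = [" ".join(s.lower().split()) for s in sentences]
    return model.embed_sentences(cleaned)


# Example usage (the file name is a placeholder, not the actual download name):
#   model = load_model("spanish_sent2vec.bin")
#   vectors = embed(model, ["hola mundo .", "¿ cómo estás ?"])
```

Given the model description above, each returned sentence vector would be expected to have 700 dimensions.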

References

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. NAACL 2018.
