Word embeddings with code2vec, GloVe and spaCy

Maria Malitckaya

Data Scientist

24th Mar 2020

This article was originally published in Towards Data Science on March 18, 2020.

source: lukasbieri via pixabay (CC0)

One powerful way to improve your machine learning model is to use word embeddings. With word embeddings, you're able to capture the context of a word in a document and then find semantic and syntactic similarities.

In this post, we'll cover an unusual application of word embedding techniques: we'll try to find the best word embedding technique for OpenAPI specifications. As an example source of OpenAPI specifications, we'll use the free collection from apis-guru 😎.

The biggest challenge is that OpenAPI specifications are neither natural language nor code. But this also means that we're free to use any of the available embedding models. For this experiment, we'll look into three possible candidates that may work: code2vec, GloVe, and spaCy.

code2vec is a neural model that learns distributed representations of source code. The model was trained on a Java code database, but you can apply it to any codebase.

Then there's GloVe. GloVe is a commonly used algorithm for natural language processing (NLP). It was trained on Wikipedia and Gigaword.

Finally, we have spaCy. While spaCy is a relatively recent development, it already has a reputation for providing some of the fastest word embeddings around.

Let's see which of these algorithms is better for OpenAPI datasets and which one works faster for OpenAPI specifications 👀. I've divided this post into seven sections; each contains code examples and some tips for future use, and the last one is a conclusion.

  1. Download the dataset
  2. Download vocabularies
  3. Extract the field names
  4. Tokenize keys
  5. Create a dataset of the field names
  6. Test embeddings
  7. Conclusion

Now, we can start.

1. Download the dataset✅#

First, we'll need to download the whole apis-guru database.
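A rough sketch of this step, assuming the dataset lives in the APIs-guru openapi-directory repository on GitHub and that the specifications sit under its `APIs/` folder (both are assumptions about the layout):

```python
# Clone the APIs-guru directory (repository URL and layout assumed) and
# collect the paths of all specification files found under it.
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/APIs-guru/openapi-directory.git"  # assumed location
subprocess.run(["git", "clone", "--depth", "1", REPO_URL], check=True)

spec_paths = list(Path("openapi-directory/APIs").rglob("*.yaml"))
print(f"Found {len(spec_paths)} specification files")
```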

You'll notice that most of the apis-guru specifications are in the Swagger 2.0 format. But the latest version of the OpenAPI specification is OpenAPI 3.0, so let's convert the whole dataset to this format by using Unmock scripts! You can follow the instructions for how to complete this in the unmock-openapi-scripts README.

This may take a while (you won’t become 🧓, but we’re talking hours ⏰) and in the end, you will get a big dataset with various specifications🎓.

2. Download vocabularies✅#

code2vec#

  1. Download the code2vec model from its GitHub page. Follow the instructions in the Quickstart section of README.md and then export the trained tokens.
  2. Load it by using the gensim library, as in the sketch below.
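A minimal sketch of the loading step; `token_vecs.txt` is an assumed file name for the exported tokens in word2vec text format:

```python
# Read the exported code2vec token vectors into gensim.
from gensim.models import KeyedVectors

code2vec_model = KeyedVectors.load_word2vec_format("token_vecs.txt", binary=False)
print(code2vec_model.most_similar("user", topn=5))
```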

GloVe#

  1. Download one of the GloVe vocabularies from the website. We took the largest one because then there's a higher chance of it finding all of our words. You can choose where you want to download it but, for convenience, it's better to store it in the working directory.
  2. Load the GloVe vocabulary manually, for example as in the sketch below.
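A minimal sketch of a manual loader; the file name and the 300-dimensional vectors are assumptions based on the glove.840B.300d download:

```python
# Parse a GloVe text file into a dictionary mapping each word to its vector.
import numpy as np

def load_glove(path, dim=300):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])            # a few GloVe tokens contain spaces
            embeddings[word] = np.asarray(parts[-dim:], dtype="float32")
    return embeddings

glove = load_glove("glove.840B.300d.txt")
```

If you'd rather reuse gensim's similarity utilities later on, recent gensim versions can also read the same file directly with `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`.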

spaCy#

Load the large spaCy vocabulary:
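```python
# The large English model ships with word vectors; download it once with
# `python -m spacy download en_core_web_lg`.
import spacy

nlp = spacy.load("en_core_web_lg")
print(nlp("user").vector.shape)  # (300,)
```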

3. Extract the field names✅#

The whole list of OpenAPI specification names can be obtained from the scripts/fetch-list.sh file or by using the following function (for Windows):
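A sketch of such a function (it works on Windows as well); the directory layout and file extensions are assumptions:

```python
# Walk the downloaded directory and collect the relative path of every
# specification file.
import os

def get_spec_list(root_dir):
    spec_names = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith((".yaml", ".yml", ".json")):
                spec_names.append(os.path.relpath(os.path.join(dirpath, name), root_dir))
    return spec_names

file_names = get_spec_list("openapi-directory/APIs")
```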

The next big task is to get the field names out of our OpenAPI specifications. For this purpose, we'll use the openapi-typed library.

Let's define a get_fields function that takes the OpenAPI specification and returns a list of field names:
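A minimal sketch of what `get_fields` can look like; the original post goes through the openapi-typed types, while this sketch simply treats the document as plain parsed YAML/JSON:

```python
# Recursively walk the parsed specification and collect every dictionary key.
import yaml

def get_fields(node):
    fields = []
    if isinstance(node, dict):
        for key, value in node.items():
            fields.append(str(key))   # keys such as response codes may parse as integers
            fields.extend(get_fields(value))
    elif isinstance(node, list):
        for item in node:
            fields.extend(get_fields(item))
    return fields
```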

Congrats! Now our dataset is ready.

4. Tokenize keys✅#

The field names may contain punctuation, such as _ and - symbols, or camel case words. We can chop these words up into pieces called tokens. The following camel_case function identifies camel case words. First, it checks if there's any punctuation. If yes, then it's not camel case. Then, it checks if there are any capital letters inside the word (excluding the first and last characters).
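A sketch of this check; the exact implementation in the original post may differ:

```python
import re

# A word containing punctuation is not camel case; otherwise look for a
# capital letter strictly inside the word (first and last characters excluded).
def camel_case(word):
    if re.search(r"[^A-Za-z0-9]", word):   # punctuation present
        return False
    return any(ch.isupper() for ch in word[1:-1])
```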

The next function (camel_case_split) splits the camel case word into pieces. For this purpose, we identify the uppercase letters and mark the places where the case changes. The function returns a list of the words after splitting. For example, the field name BodyAsJson transforms to the list ['Body', 'As', 'Json'].
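A sketch of `camel_case_split` under the same assumptions:

```python
# Cut the word wherever the case switches to upper,
# e.g. "BodyAsJson" -> ['Body', 'As', 'Json'].
def camel_case_split(word):
    pieces, start = [], 0
    for i in range(1, len(word)):
        if word[i].isupper() and not word[i - 1].isupper():
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces

print(camel_case_split("BodyAsJson"))  # ['Body', 'As', 'Json']
```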

This camel_case_split function is then used in the following tokenization algorithm. Here, we first check if there's punctuation in the word. Then, we split the word into pieces. There's a chance that these pieces are camel case words; if so, we split them into smaller pieces. Finally, after splitting each element, the entire list is converted to lower case.
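Putting it together, a sketch of the tokenization step, reusing the `camel_case` and `camel_case_split` helpers from above:

```python
# Split on punctuation, split camel-case pieces further, then lower-case everything.
def tokenize(field_name):
    tokens = []
    for piece in re.split(r"[^A-Za-z0-9]+", field_name):
        if not piece:
            continue
        tokens.extend(camel_case_split(piece) if camel_case(piece) else [piece])
    return [token.lower() for token in tokens]

print(tokenize("request_BodyAsJson"))  # ['request', 'body', 'as', 'json']
```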

5. Create a dataset of the field names✅#

Now, let's create a big dataset with the field names from all the specifications. The following dict_dataset function takes a list of file names and a path, and opens each specification file. For each file, the get_fields function returns a list of field names. Some of the field names may repeat within one specification. To get rid of this repetition, we convert the list of field names to a dictionary and back by using list(dict.fromkeys(col)). Then we can tokenize the list. In the end, we create a dictionary with the file name as a key and the list of field names as a value.
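A sketch of `dict_dataset`, reusing `get_fields` and `tokenize` from the previous sections. Whether the tokens are kept per field name or flattened into one list per file is a design choice; here they're flattened to keep the vocabulary tests below simple:

```python
# Build {file name: list of tokens} for every specification file.
def dict_dataset(file_names, path):
    dataset = {}
    for file_name in file_names:
        with open(os.path.join(path, file_name), encoding="utf-8") as f:
            spec = yaml.safe_load(f)
        col = get_fields(spec)
        col = list(dict.fromkeys(col))            # drop repeated field names
        dataset[file_name] = [t for name in col for t in tokenize(name)]
    return dataset

dataset = dict_dataset(file_names, "openapi-directory/APIs")
```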

6. Test embeddings✅#

code2vec and GloVe#

Now we can find out-of-vocabulary words (not_identified_c2v) and count the percentage of these words for the code2vec vocabulary.
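A sketch of this check, assuming the tokenized dataset from the previous section and the gensim model loaded earlier:

```python
# Count tokens from our dataset that the code2vec vocabulary does not contain.
all_tokens = {t for tokens in dataset.values() for t in tokens}

not_identified_c2v = [t for t in all_tokens if t not in code2vec_model]
print(f"code2vec OOV: {100 * len(not_identified_c2v) / len(all_tokens):.2f}%")
```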

The previous code will also work for GloVe.

spaCy#

The spaCy vocabulary is different, so we need to modify our code accordingly:
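A sketch of the same check with spaCy, using the vocabulary's has_vector lookup:

```python
# A token is out of vocabulary when the spaCy model has no vector for it.
not_identified_spacy = [t for t in all_tokens if not nlp.vocab.has_vector(t)]
print(f"spaCy OOV: {100 * len(not_identified_spacy) / len(all_tokens):.2f}%")
```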

The resulting percentages of not identified words are 3.39, 2.33, and 2.09 for code2vec, GloVe, and spaCy, respectively. Since the percentages are relatively small and similar for each algorithm, we can run another test. First, let's create a test dictionary with words that should be similar across all API specifications:
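A sketch of such a test dictionary. The 'user'/'client' and 'balance'/'amount' pairs are the ones discussed below; the remaining pairs are only illustrative and may differ from the original experiment:

```python
# Each key should have the corresponding value among its most similar words.
test_dictionary = {
    "user": "client",
    "balance": "amount",
    "error": "fail",
    "id": "key",
}
```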

For GloVe and code2vec, we can use the similar_by_vector method provided by the gensim library. spaCy doesn't implement this method yet, but we can find the most similar words on our own. To do this, we need to format the input vector for use in a distance function. For each key in the dictionary, we'll check whether the corresponding value is among the 100 most similar words. To start, we'll format the vocabulary for use in the distance.cdist function, which computes the distance between each pair of vectors in the vocabulary. Then, we'll sort the list from the smallest distance to the largest and take the first 100 words.
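A sketch of both checks. It assumes the GloVe vectors were also loaded into a gensim KeyedVectors (for example with the no_header=True option mentioned earlier), so similar_by_vector is available for both code2vec and GloVe; for spaCy we build the vocabulary matrix once and rank it with scipy's distance.cdist:

```python
import numpy as np
from scipy.spatial import distance

# gensim models (code2vec, GloVe-as-KeyedVectors): 100 most similar words.
def top100_gensim(model, word):
    return [w for w, _ in model.similar_by_vector(model[word], topn=100)]

# spaCy: collect the vocabulary vectors into one matrix.
keys = list(nlp.vocab.vectors.keys())
words = [nlp.vocab.strings[key] for key in keys]
matrix = np.vstack([nlp.vocab.vectors[key] for key in keys])

def top100_spacy(word):
    query = nlp.vocab.get_vector(word).reshape(1, -1)
    dists = distance.cdist(query, matrix, metric="cosine")[0]
    return [words[i] for i in np.argsort(dists)[:100]]

for key, value in test_dictionary.items():
    print(key, "->", value, "in top 100:", value in top100_spacy(key))
```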

The results are summarized below. spaCy shows that the word 'client' is among the first 100 most similar words for the word 'user'. This is useful for almost all of the OpenAPI specifications and can be used for future analysis of OpenAPI specification similarity. The vector for the word 'balance' is close to the vector for the word 'amount', which we find especially useful for payment APIs.

Conclusion#

We've tried three different word embedding algorithms for OpenAPI specifications. Despite the fact that all three perform quite well on this dataset, an extra comparison of the most similar words shows that spaCy works better for our case.

spaCy is also faster than the other algorithms: the spaCy vocabulary can be loaded five times faster than the GloVe or code2vec vocabularies. However, the lack of built-in functions such as similar_by_vector and similar_word is an obstacle when using this algorithm.

Also, the fact that spaCy works well with our dataset doesn't mean that spaCy will be better for every dataset in the world. So, feel free to try different word embeddings for your own dataset and let us know which one works better for you in the comments!

Thanks for reading!
