gatelfdata.vocab module

Module for the Vocab class

class gatelfdata.vocab.Vocab(counts=None, max_size=None, emb_id=None, emb_train=None, emb_dims=0, emb_file=None, emb_minfreq=1, no_special_indices=False, pad_index_only=False, emb_dir=None, pad_string='', oov_string='<<oov>>')[source]

Bases: object

From the given counter object, create string-to-id and id-to-string mappings.
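
A minimal construction sketch (the counts and the resulting indices are purely illustrative; any counter of string frequencies accepted by the constructor should behave the same way):

    from collections import Counter
    from gatelfdata.vocab import Vocab

    counts = Counter(["the", "the", "the", "cat", "sat"])
    vocab = Vocab(counts=counts)
    vocab.finish()                    # freeze the vocabulary before look-ups
    idx = vocab.string2idx("cat")     # id assigned to "cat"
    word = vocab.idx2string(idx)      # back to "cat"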

add_counts(counts)[source]

Incrementally add additional counts to the vocabulary. This can only be done before the finish() method has been called.
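
For illustration, counts might be accumulated batch by batch before finishing (a sketch with hypothetical data):

    from collections import Counter
    from gatelfdata.vocab import Vocab

    vocab = Vocab(counts=Counter(["a", "b"]))
    vocab.add_counts(Counter(["b", "c", "c"]))   # allowed: finish() not called yet
    vocab.finish()
    # calling vocab.add_counts(...) after finish() is an error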

check_finished(method='method')[source]
check_nonfinished(method='method')[source]
count(strng)[source]

Return the count/frequency for the given word. NOTE: after finish() this will return 0 for any word that has been removed by one of the filter criteria!
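
A sketch of that behaviour, under the assumption that emb_minfreq removes words occurring less often than the given frequency:

    from collections import Counter
    from gatelfdata.vocab import Vocab

    vocab = Vocab(counts=Counter({"common": 5, "rare": 1}), emb_minfreq=2)
    vocab.finish()
    vocab.count("common")   # 5
    vocab.count("rare")     # 0 -- removed by the minimum-frequency filter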

embs4line(line, fromidx, dims)[source]
finish(remove_counts=True, remove_embs=True)[source]

Build the actual vocab instance. Look-ups only work properly after this method has been called, but once it has been called no parameters can be changed and no further counts can be added.

get_embeddings()[source]

Return a numpy matrix of the embeddings, in the order of the indices. If this is called before finish(), an exception is raised.
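
A sketch (hypothetical counts; it assumes one row per vocabulary index, as described above, and that setting emb_dims yields embedding vectors of that dimensionality):

    from collections import Counter
    from gatelfdata.vocab import Vocab

    vocab = Vocab(counts=Counter(["cat", "sat", "cat"]), emb_dims=50)
    vocab.finish()
    emb = vocab.get_embeddings()   # numpy array with one row per index
    # expected shape: (vocab.size(), 50)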

idx2string(idx)[source]

Return the string for the given index.

load_embeddings(emb_file, filterset=None)[source]

Load pre-calculated embeddings from the given file. This will update emb_dims as needed. Currently supported formats are text, compressed text, or a two-file format in which the file with extension ".vocab" contains one word per line and the file with extension ".npy" contains a matrix with as many rows as there are words and as many columns as there are dimensions. The format is identified by the presence of one of the extensions ".txt", ".vec", ".txt.gz", or ".vocab" and ".npy" in the given emb_file (".vec" is an alias for ".txt"). The text formats may or may not have a first line indicating the number of words and the number of dimensions. If filterset is non-empty, only embeddings for words in that set are loaded; otherwise embeddings for all words already in the vocabulary are loaded. NOTE: this does not check whether the case or other conventions (e.g. hyphenation) used for the tokens in our vocabulary are compatible with the conventions used for the embeddings.
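
A sketch of a typical call (the file name and filterset are hypothetical; embeddings are assumed to be loaded before finish() is called):

    from collections import Counter
    from gatelfdata.vocab import Vocab

    vocab = Vocab(counts=Counter(["cat", "sat", "cat"]))
    # file name is hypothetical; the format is detected from the extension
    vocab.load_embeddings("embeddings.50d.txt", filterset={"cat", "sat"})
    vocab.finish()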

onehot2string(vec)[source]
static rnd_vec(dims, strng=None, as_numpy=True)[source]

Returns a random vector with the given number of dimensions, where each dimension is drawn from a Gaussian(0, 1). If strng is None, the vector depends on the current numpy random state. If a string is given, the random state is first seeded with a number derived from the string, so the random vector will always be the same for that string and number of dimensions.
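
Since the method is static and optionally seeded from the string, the same string and dimensionality always yield the same vector (a sketch):

    import numpy as np
    from gatelfdata.vocab import Vocab

    v1 = Vocab.rnd_vec(50, strng="cat")
    v2 = Vocab.rnd_vec(50, strng="cat")
    assert np.allclose(v1, v2)       # deterministic for the same string and dims
    v3 = Vocab.rnd_vec(50)           # depends on the current numpy random state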

set_emb_dims(dim)[source]
set_emb_file(file)[source]
set_emb_id(embid)[source]
set_emb_minfreq(min_freq=1)[source]
set_max_size(max_size=None)[source]
size()[source]

Return the total number of entries in the vocab, including any special symbols.

string2emb(string)[source]
string2idx(string)[source]
string2onehot(thestring)[source]

Return a one-hot vector for the string. If we have an OOV index, return that for unknown words; otherwise raise an exception. If the string is the padding string, return an all-zero vector. NOTE: this can be called even if the emb_train parameter was not equal to 'onehot' when creating the vocabulary. In that case there may be an OOV symbol in the vocab, and the one-hot vector generated will contain it as its first dimension.
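
A sketch of the behaviour described above (hypothetical counts; the position of the 1 depends on the index assigned to each word, and the default padding string is assumed to be the empty string, as in the constructor signature above):

    from collections import Counter
    from gatelfdata.vocab import Vocab

    vocab = Vocab(counts=Counter({"cat": 3, "dog": 2}), emb_train="onehot")
    vocab.finish()
    vec = vocab.string2onehot("cat")    # 1.0 at the index assigned to "cat"
    word = vocab.onehot2string(vec)     # back to "cat"
    pad = vocab.string2onehot("")       # padding string -> all-zero vector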

zero_onehotvec()[source]
zero_vec(as_numpy=True)[source]