Global Dense Vector Representations for Words or Items Using Shared Parameter Alternating Tweedie Model
Figure 1. Illustration of model input and desired output. Left panel: model input, the natural log of the (weighted occurrence count + 1) matrix for the top 300 words from Reuters Business news data. Right panel: shared parameter Tweedie modeling process and output.

Figure 2. Computed log(loss) and log(overall loss) from a simulated dataset using Fisher scoring with or without learning rate adjustment, and a gradient descent algorithm with the Adam method for parameter updates. The left panel depicts how the loss changes over 10 epochs for one row of the parameter update. As the epoch number grows, the loss shows a general decreasing trend, but Adam's loss is higher and decreases more slowly than the other two updates. The right panel shows the overall loss versus the number of iterations on a log scale. All losses decrease as the iteration number increases, but the Adam update has higher overall loss values.

Figure 3. Relationship between the log of the sample mean and the log of the sample variance from Wikipedia data with a 50 K vocabulary size. The three lines in each interval are the fitted linear regression line and upper and lower bounds with the same slope.

Figure 4. Loss reduction within epochs compared among three different updates: the alternating Tweedie regression algorithm with and without learning rate adjustment, and the Adam update. The results are from the first iteration and the first row of the data matrix in Algorithm 1.

Figure 5. Overall loss over iterations for three different update methods: with or without learning rate adjustment, and the Adam update. The Fisher scoring type updates, with or without learning rate adjustment, start with a lower overall loss than the Adam update and reduce the overall loss faster as the iteration number increases.

Figure 6. The log10-scaled norm of the score vector and the overall loss as iterations proceed on simulated data. The top panel shows the norm of the score vector on a log10 scale for two cases: with and without learning rate. The bottom panel shows the log10 overall loss versus iteration for the two cases. Both cases reduce the overall loss and the norm of the score vector. The case without learning rate reduces the overall loss faster in earlier iterations but may not reach the minimum overall loss in the end. The case with learning rate moves slowly in earlier iterations but finds a smaller overall loss in the end.

Figure 7. The overall loss on a log10 scale for iterations 110 through 170 for the two cases: with and without learning rate. The update without learning rate adjustment stabilizes at a certain value before reaching the minimum overall loss. The algorithm with learning rate adjustment continues to reduce the overall loss until the convergence criterion is satisfied, even though it was less effective in earlier iterations than the one without learning rate.

Figure 8. Performance of the alternating Tweedie regression algorithm with and without learning rate over eight simulated datasets. Each row corresponds to one dataset. A label such as +3.279e2 in the upper-left corner of the plots in the right-most column means that 327.9 must be added to the values on the vertical axis.

Figure 9. Histogram of skewness for each row in the raw co-occurrence count matrix (left panel) and the log co-occurrence count (right panel) constructed from a Wikipedia dump.

Figure 10. Trajectory of the training process of SA-Tweedie. Two embedding dimensions (100 and 300) are considered. The model achieved lower loss with the higher embedding dimension.

Figure 11. Weighted F1 score on the NER test set for different settings with seeds 12, 42, and 111. All embeddings used 300-dimensional representations.

Figure 12. Training and validation loss along with training and validation weighted F1 score for 15 epochs. Top row: random embedding. Middle row: GloVe embedding. Bottom row: SA-Tweedie embedding. All parameter initialization used the identical global seed 42. The loss and weighted F1 score for the test data are marked with crosses, with values given beside them.
Abstract
1. Introduction
1.1. Recent Advances in NLP and Efficient Embedding Learning
1.2. Motivation for the SA-Tweedie Model
1.3. SA-Tweedie Versus SA-ZIG as Probabilistic Models for Word Representations
1.4. Contributions and Organization of the Paper
2. The Probability Distribution for the Proposed SA-Tweedie Model
2.1. MLE for Alternating Tweedie Regression
- The Tweedie distribution belongs to the exponential family. All exponential dispersion models that are closed under scale transformation belong to the Tweedie family. In particular, if X has a Tweedie distribution with mean μ and dispersion parameter φ, then cX (for c > 0) has a Tweedie distribution with mean cμ and dispersion parameter c^(2−p)φ. The mean and variance are related by Var(X) = φμ^p.
- The Compound Poisson Gamma distribution corresponds to the case in which the power parameter satisfies 1 < p < 2. In this case, the distribution has non-negative support and can have a discrete mass at zero, making it useful for modeling responses that are a mixture of zeros and positive values.
- The distributions that correspond to other values of the power parameter p:
  – When p < 0, the distribution is an Extreme stable distribution.
  – When p = 0, the distribution is the Normal distribution.
  – When p = 1, the distribution coincides with the Poisson distribution.
  – When p = 2, the distribution is the Gamma distribution.
  – When 2 < p < 3, the distribution is a Positive stable distribution.
  – When p = 3, the distribution is the Inverse Gaussian distribution.
  – When p > 3, the distribution is a Positive stable distribution.
  – When p = ∞, the distribution is an Extreme stable distribution.
- The mean domain is R if p is 0 or ∞, and is (0, ∞) for all other p. Therefore, except when the distribution is Normal or Extreme stable with p = ∞, the log link is frequently used as the link function for Tweedie regression.
- A positive mass at zero is allowed only when 1 < p < 2. Since our co-occurrence counts contain many zeros, we focus our attention on the Compound Poisson Gamma distribution, i.e., the case 1 < p < 2, whose variance function and compound representation are summarized in the display below.
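For reference, the variance function and the compound Poisson–gamma representation of this case can be written out explicitly. The display below states standard Tweedie facts in a common parameterization rather than notation taken verbatim from the paper.

```latex
\[
\operatorname{Var}(X) = \phi\,\mu^{p}, \qquad 1 < p < 2,
\qquad
X = \sum_{k=1}^{N} G_k, \quad N \sim \operatorname{Poisson}(\lambda), \quad
G_k \overset{\text{iid}}{\sim} \operatorname{Gamma}(\alpha, \gamma),
\]
\[
\lambda = \frac{\mu^{2-p}}{\phi\,(2-p)}, \qquad
\alpha = \frac{2-p}{p-1}, \qquad
\gamma = \phi\,(p-1)\,\mu^{\,p-1}.
\]
```

In particular, P(X = 0) = P(N = 0) = exp(−λ) > 0, which is the discrete mass at zero mentioned above.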
- Specify values of φ and p. ▹ (see Section 2.2)
- Initialize the elements of the row and column parameter matrices, for example, with i.i.d. Uniform(−0.5, 0.5) values.
- It is natural to use the negative log likelihood as the loss for fitting each row and column. However, the log-likelihood functions for the row fits and the column fits both contain the normalizing term log a(y, φ) of the Tweedie density. This term does not involve the regression parameters, so computing it when estimating them is redundant. This is especially relevant for Tweedie regression because log a(y, φ) is the log of an infinite series. In the Python package sklearn.linear_model.TweedieRegressor, the loss is defined as the Half Tweedie deviance, which for an observed response y and fitted mean μ equals y^(2−p)/((1−p)(2−p)) − yμ^(1−p)/(1−p) + μ^(2−p)/(2−p) (see the Python source code at https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/_loss/loss.py, accessed on 30 December 2024). The first term does not depend on the regression parameters. The other two terms are the negatives of the corresponding terms in the log likelihood (6). Therefore, using the Half Tweedie deviance as the loss is equivalent to using the negative of the parameter-dependent part of the log likelihood. We define our loss as the sum of these parameter-dependent deviance terms over the entries being fitted; a minimal code sketch of this loss is given after this list. Note that the loss is a function of both the row and column parameters, which change as the algorithm proceeds.
- Set a convergence threshold and a maximum number of iterations. We use a relative convergence criterion on the loss function, checking whether the relative change in the loss between consecutive iterations falls below the threshold. This is consistent with the corresponding R package, which uses a relative convergence criterion on the log likelihood.
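As a concrete illustration of the loss described above, here is a minimal PyTorch sketch of the parameter-dependent part of the half Tweedie deviance under a log link. The function name, argument names, and the use of a sum rather than a mean are our choices for illustration, not the paper's exact definition.

```python
import torch

def half_tweedie_loss(y, eta, p):
    """Parameter-dependent part of the half Tweedie deviance (1 < p < 2).

    y   : observed (weighted) co-occurrence counts, shape (m,)
    eta : linear predictor; the log link gives mu = exp(eta)
    p   : Tweedie power parameter

    The term y**(2 - p) / ((1 - p) * (2 - p)) is dropped because it does not
    involve the regression parameters.
    """
    mu = torch.exp(eta)
    return torch.sum(-y * mu.pow(1.0 - p) / (1.0 - p) + mu.pow(2.0 - p) / (2.0 - p))
```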
Algorithm 1: Alternating Tweedie regression
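Algorithm 1 in the paper updates the parameters with Fisher scoring and an optional learning rate adjustment. The sketch below only illustrates the alternating structure, using scikit-learn's TweedieRegressor for each row and column fit; the matrix names Y, U, and V and the fixed number of outer iterations are our illustrative choices.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

def alternating_tweedie(Y, dim=50, p=1.5, n_iter=10, seed=0):
    """Alternate log-link Tweedie GLM fits for row and column embeddings of Y."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.uniform(-0.5, 0.5, size=(n, dim))   # row embeddings
    V = rng.uniform(-0.5, 0.5, size=(m, dim))   # column embeddings
    for _ in range(n_iter):
        for i in range(n):                      # fix V, refit row i
            glm = TweedieRegressor(power=p, alpha=0.0, link="log",
                                   fit_intercept=False, max_iter=1000)
            U[i] = glm.fit(V, Y[i]).coef_
        for j in range(m):                      # fix U, refit column j
            glm = TweedieRegressor(power=p, alpha=0.0, link="log",
                                   fit_intercept=False, max_iter=1000)
            V[j] = glm.fit(U, Y[:, j]).coef_
    return U, V
```

In the paper, convergence is instead monitored through the relative change of the loss defined above rather than a fixed number of outer iterations.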
2.2. Impact of the Parameters p and φ
3. A Small Simulation Study
- We first retrieved the 300 most frequently used words in 2000 Reuters business news articles. For each word, a 50-dimensional vector needs to be generated for use in the Tweedie regression model. The model components are as follows:
- The Tweedie power parameter was generated from two quantities independently sampled from Uniform(0.5, 1) so that the power lies between 1 and 2 and the count data therefore follow a Compound Poisson Gamma distribution.
- The dispersion parameter was generated using the values estimated from the Reuters news dataset in Table 2.
- The count data were independently generated from the Tweedie distribution with the specified mean, power parameter, and dispersion parameter (a simulation sketch is given after this list).
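One way to generate such Compound Poisson Gamma counts is through the Poisson-sum-of-gammas representation given earlier. The helper below is a sketch under that standard parameterization; the function and argument names are ours, and the paper may generate the data differently.

```python
import numpy as np

def rtweedie_cpg(mu, p, phi, rng=None):
    """Draw Compound Poisson Gamma (Tweedie, 1 < p < 2) variates, elementwise in mu."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))      # Poisson rate
    alpha = (2.0 - p) / (p - 1.0)                  # Gamma shape of each summand
    scale = phi * (p - 1.0) * mu ** (p - 1.0)      # Gamma scale of each summand
    n = rng.poisson(lam)                           # number of Gamma summands
    # A sum of n iid Gamma(alpha, scale) variables is Gamma(n * alpha, scale);
    # n = 0 yields an exact zero, the discrete mass at zero.
    return np.where(n > 0, rng.gamma(np.maximum(n, 1) * alpha, scale), 0.0)
```

For example, `rtweedie_cpg(mu=np.exp(U @ V.T), p=1.5, phi=1.0)` produces a full count matrix from row and column embeddings under the log link.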
Interval Index | Lower Bound | Upper Bound | ||||
---|---|---|---|---|---|---|
0 | 2.955 | 2.955 | 2.955 | −1.500 | −1.000 | 2.264 * |
1 | −1.009 | −0.896 | −1.125 | −1.000 | −0.500 | −1.501 |
2 | −0.202 | 0.152 | −0.450 | −0.500 | 0.000 | −0.842 |
3 | −0.127 | 0.416 | −0.432 | 0.000 | 0.500 | 0.561 |
4 | 0.077 | 0.480 | −0.237 | 0.500 | 1.000 | 0.197 |
5 | 0.776 | 1.129 | 0.503 | 1.000 | 1.500 | −0.475 |
6 | −1.133 | −1.039 | −1.288 | 1.500 | 2.180 | 0.744 |
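Figure 3 and the interval summaries above examine the relationship between the log sample mean and the log sample variance. Under the Tweedie mean-variance relation Var(Y) = φμ^p, the log variance is linear in the log mean with slope p and intercept log φ, which suggests a simple moment-based way to read off p and φ. The sketch below is a minimal single-interval version of such a fit; the function name and the row-wise pooling are our assumptions, whereas the paper fits separate lines within intervals.

```python
import numpy as np

def estimate_power_and_dispersion(counts):
    """Estimate Tweedie power p and dispersion phi from row-wise sample moments.

    Var(Y) = phi * mu**p implies log(var) = log(phi) + p * log(mean), so the
    slope of a least-squares fit of log sample variance on log sample mean
    estimates p and the intercept estimates log(phi).
    """
    means = counts.mean(axis=1)
    variances = counts.var(axis=1, ddof=1)
    keep = (means > 0) & (variances > 0)           # drop rows with degenerate moments
    slope, intercept = np.polyfit(np.log(means[keep]), np.log(variances[keep]), deg=1)
    return slope, np.exp(intercept)                # (p_hat, phi_hat)
```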
4. Scalability and Application to Named Entity Recognition
4.1. Scalability to Data with Large Vocabulary Size & Training Corpus
- Utilizing the ‘rowid’ column in SQLite: The rowid is automatically generated for most SQLite tables and serves as a fast-access key. SQLite stores data in a B-tree structure, which ensures quick lookups when using the rowid. According to the SQLite documentation, searching for a record by its rowid is approximately twice as fast as searching by other indexed values. This optimization significantly reduces search time, especially with large tables, since locating a row in a B-tree requires at most O(log n) steps, where n is the total number of rows in the table.
- Restricting the number of entries: To further enhance efficiency, we restrict the search to only a portion of the rowids rather than scanning the entire table. For instance, when retrieving the first row of a co-occurrence matrix with a vocabulary size of 400 K, we specify a rowid range from 0 to 400 K using the WHERE command. This is possible because the row indices (distinct from rowids) are already sorted in non-decreasing order. Therefore, when fetching data with a rowid of 0, there is no need to search beyond the first 400 K entries. After retrieving each row, we update the starting rowid for the next retrieval based on the entries already processed.
- Data retrieval: a child process handles data retrieval using Python’s multiprocessing package.
- Parameter estimation: the main process, utilizing the GPU through PyTorch, handles the parameter updates.
- Queue-based data transfer: Once a row is fetched, it is stored in a shared queue object that connects the child and main processes. The GPU retrieves data from the queue and removes it once used.
- Parallel execution: While the GPU is processing the current row’s parameter update, the child process retrieves the next row of data. These two operations run in parallel, which reduces time and improves efficiency.
- Efficient resource management: Data retrieval on the CPU only proceeds when the queue is empty, ensuring that memory resources are devoted to the computation and reducing overall memory usage. A minimal sketch of this retrieval pipeline is given after this list.
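The following sketch combines the rowid-range SQLite query with the queue-based producer-consumer pipeline described above. The database path, the table name `cooc`, and the column names `row`, `col`, and `weight` are illustrative assumptions, not the paper's actual schema, and the GPU update is left as a placeholder.

```python
import sqlite3
from multiprocessing import Process, Queue

def fetch_rows(db_path, vocab_size, queue):
    """Child process: stream one row of the co-occurrence matrix at a time."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    start = 0                                     # first rowid not yet consumed
    for i in range(vocab_size):
        # Row indices are stored in non-decreasing order, so the entries for
        # word i lie inside a bounded rowid window starting at `start`.
        cur.execute(
            "SELECT rowid, col, weight FROM cooc "
            "WHERE rowid >= ? AND rowid < ? AND row = ?",
            (start, start + vocab_size, i),
        )
        entries = cur.fetchall()
        if entries:
            start = entries[-1][0] + 1            # advance past what was just read
        queue.put((i, entries))
    queue.put(None)                               # sentinel: no more rows
    con.close()

if __name__ == "__main__":
    q = Queue(maxsize=1)                          # at most one prefetched row in memory
    producer = Process(target=fetch_rows, args=("cooc.db", 400_000, q))
    producer.start()
    while (item := q.get()) is not None:
        row_id, entries = item
        # ... update the parameters for this row on the GPU here ...
    producer.join()
```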
Algorithm 2: Pseudo code to compute …
4.2. Application to NER Task on CoNLL-2003 Data
4.2.1. Experiment Detail and Main Result for the NER Task
4.2.2. Effect Sizes and Significance of the Comparison Results
4.2.3. Comparison with BERT on the NER Task
- We examine the feature-based approach, which estimates only the BiLSTM model parameters while keeping the BERT parameters fixed. The BERT large model consists of 24 hidden layers with a 1024-dimensional embedding, while the BERT base model has 12 hidden layers with a 768-dimensional embedding; the total number of parameters is 345 million for BERT large and 110 million for BERT base. In contrast, our SA-Tweedie uses only a 300-dimensional embedding with no hidden layers. Even when using the 768-dimensional BiLSTM with BERT, the number of parameters remains much larger than that of our simpler model. Specifically, we use a one-layer, 512-dimensional BiLSTM with SA-Tweedie's 300-dimensional embedding and a max pooling layer, which significantly reduces the number of parameters (a sketch of such a tagger is given after this list).
- The GloVe and BERT embeddings were trained on much larger corpora than our SA-Tweedie model. The training corpus for SA-Tweedie is 4 billion tokens from Wikipedia, whereas the pretrained GloVe embeddings were trained on 42 billion tokens, and BERT was pretrained on English Wikipedia plus the additional BooksCorpus (800 M words) data, to which we do not have access. Therefore, SA-Tweedie's embedding has seen less training data.
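For concreteness, here is a minimal PyTorch sketch of the kind of tagger described above: frozen pretrained 300-dimensional embeddings feeding a one-layer, 512-dimensional BiLSTM with a per-token classifier over the nine CoNLL-2003 tags listed in the results table. The class name is ours, and the max pooling layer mentioned in the text is omitted for brevity.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """One-layer BiLSTM tagger on top of frozen pretrained embeddings."""

    def __init__(self, embedding_matrix, hidden_size=512, num_tags=9):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape      # e.g., emb_dim = 300
        self.embed = nn.Embedding.from_pretrained(
            torch.as_tensor(embedding_matrix, dtype=torch.float32), freeze=True
        )
        self.lstm = nn.LSTM(emb_dim, hidden_size, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):                          # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))            # (batch, seq_len, 2 * hidden)
        return self.classifier(h)                          # per-token tag logits
```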
4.2.4. More Configuration Settings to Examine SA-Tweedie’s Performance
5. Summary, Discussion, and Future Research
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Available online: https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf (accessed on 30 December 2024).
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. arXiv 2016, arXiv:1607.04606. [Google Scholar] [CrossRef]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365v2. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. Available online: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf (accessed on 30 December 2024).
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Available online: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf (accessed on 30 December 2024).
- Freestone, M.; Karmaker, S.S. Word Embeddings Revisited: Do LLMs Offer Something New? arXiv 2024, arXiv:2402.11094v2. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288v2. [Google Scholar]
- Greene, R.; Sanders, T.; Weng, L.; Neelakantan, A. New and Improved Embedding Model, 15 December 2022. OpenAI. Available online: https://openai.com/blog/new-and-improved-embedding-model (accessed on 30 December 2024).
- Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403v3. [Google Scholar]
- Cer, D.; Yang, Y.; Kong, S.; Hua, N.; Limtiaco, N.; John, R.S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal Sentence Encoder. arXiv 2018, arXiv:1803.11175v2. [Google Scholar]
- Raza, S.; Rahman, M.; Kamawal, S.; Toroghi, A.; Raval, A.; Navah, F.; Kazemeini, A. A Comprehensive Review of Recommender Systems: Transitioning from Theory to Practice. arXiv 2024, arXiv:cs.IR/2407.13699. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108v4. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Tanaka, H.; Kunin, D.; Yamins, D.L.; Ganguli, S. Pruning Neural Networks Without any Data by Iteratively Conserving Synaptic Flow. Available online: https://proceedings.neurips.cc/paper_files/paper/2020/file/46a4378f835dc8040c8057beb6a2da52-Paper.pdf (accessed on 30 December 2024).
- Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 2021, 461, 370–403. [Google Scholar] [CrossRef]
- Kusupati, A.; Bhatt, G.; Rege, A.; Wallingford, M.; Sinha, A.; Ramanujan, V.; Howard-Snyder, W.; Chen, K.; Kakade, S.; Jain, P.; et al. Matryoshka representation learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Heidari, A.; Jafari Navimipour, N.; Dag, H.; Unal, M. Deepfake detection using deep learning methods: A systematic and comprehensive review. WIREs Data Min. Knowl. Discov. 2024, 14, e1520. [Google Scholar] [CrossRef]
- Heidari, A.; Navimipour, N.J.; Dag, H.; Talebi, S.; Unal, M. A Novel Blockchain-Based Deepfake Detection Method Using Federated and Deep Learning Models. Cogn. Comput. 2024, 16, 1073–1091. [Google Scholar] [CrossRef]
- Amiri, Z.; Heidari, A.; Navimipour, N.J.; Esmaeilpour, M.; Yazdani, Y. The deep learning applications in IoT-based bio- and medical informatics: A systematic literature review. Neural Comput. Appl. 2024, 36, 5757–5797. [Google Scholar] [CrossRef]
- Amiri, Z.; Heidari, A.; Zavvar, M.; Navimipour, N.J.; Esmaeilpour, M. The applications of nature-inspired algorithms in Internet of Things-based healthcare service: A systematic literature review. Trans. Emerg. Telecommun. Technol. 2024, 35, e4969. [Google Scholar] [CrossRef]
- Amiri, Z.; Heidari, A.; Navimipour, N.J. Comprehensive survey of artificial intelligence techniques and strategies for climate change mitigation. Energy 2024, 308, 132827. [Google Scholar] [CrossRef]
- Heidari, A.; Jafari Navimipour, N.; Unal, M. A Secure Intrusion Detection Platform Using Blockchain and Radial Basis Function Neural Networks for Internet of Drones. IEEE Internet Things J. 2023, 10, 8445–8454. [Google Scholar] [CrossRef]
- Heidari, A.; Jafari Navimipour, N.; Unal, M.; Zhang, G. Machine Learning Applications in Internet-of-Drones: Systematic Review, Recent Deployments, and Open Issues. ACM Comput. Surv. 2023, 55. [Google Scholar] [CrossRef]
- Zanbouri, K.; Darbandi, M.; Nassr, M.; Heidari, A.; Navimipour, N.J.; Yalcın, S. A GSO-based multi-objective technique for performance optimization of blockchain-based industrial Internet of things. Int. J. Commun. Syst. 2024, 37, e5886. [Google Scholar] [CrossRef]
- Kim, T.; Wang, H. Matrix Factorization and Prediction for High-Dimensional Co-Occurrence Count Data via Shared Parameter Alternating Zero Inflated Gamma Model. Mathematics 2024, 12, 3365. [Google Scholar] [CrossRef]
- Bonat, W.H.; Kokonendji, C.C. Flexible Tweedie regression models for continuous data. J. Stat. Comput. Simul. 2017, 87, 2138–2152. [Google Scholar] [CrossRef]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Wang, M.; Manning, C.D. Effect of Non-linear Deep Architecture in Sequence Labeling. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–19 October 2013; pp. 1285–1291. [Google Scholar]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–22. [Google Scholar] [CrossRef]
Interval Index | Lower Bound | Upper Bound | ||||
---|---|---|---|---|---|---|
0 | 3.030 | 3.306 | 2.264 | −7.000 | −6.000 | 1.230 |
1 | −1.483 | −0.970 | −2.607 | −6.000 | −5.000 | 0.485 |
2 | −0.438 | 0.194 | −1.180 | −5.000 | −4.000 | 0.688 |
3 | −0.115 | 0.576 | −0.607 | −4.000 | −3.000 | 0.811 |
4 | 0.197 | 1.099 | −0.188 | −3.000 | −2.000 | 0.943 |
5 | 0.554 | 1.221 | 0.161 | −2.000 | −1.000 | 1.098 |
6 | 0.451 | 0.984 | 0.142 | −1.000 | 0.000 | 0.990 |
7 | 0.457 | 0.887 | 0.145 | 0.000 | 1.000 | 0.642 |
8 | 1.105 | 1.680 | 0.809 | 1.000 | 2.582 | −0.053 |
| Metric | Random (Validation) | Random (Test) | GloVe (Validation) | GloVe (Test) | SA-Tweedie (Validation) | SA-Tweedie (Test) |
|---|---|---|---|---|---|---|
| Loss | 0.30 | 0.52 | 1.29 | 1.99 | 0.29 | 0.44 |
| Precision B-LOC | 33.82 | 29.17 | 34.78 | 29.29 | 32.53 | 27.85 |
| Precision B-MISC | 16.14 | 11.19 | 16.09 | 11.17 | 15.81 | 10.89 |
| Precision B-ORG | 21.12 | 23.39 | 20.37 | 22.55 | 20.34 | 23.26 |
| Precision B-PER | 25.86 | 19.31 | 23.97 | 15.56 | 26.30 | 21.90 |
| Precision I-LOC | 4.83 | 4.29 | 5.34 | 4.54 | 5.14 | 4.01 |
| Precision I-MISC | 4.92 | 2.54 | 4.75 | 2.50 | 4.92 | 2.77 |
| Precision I-ORG | 8.20 | 9.34 | 7.13 | 8.24 | 8.04 | 9.16 |
| Precision I-PER | 19.50 | 15.67 | 12.80 | 6.03 | 19.49 | 16.19 |
| Precision O | 94.56 | 91.05 | 94.02 | 91.15 | 95.81 | 94.36 |
| Precision-weighted | 95.24 | 90.88 | 93.38 | 88.53 | 94.66 | 92.16 |
| Recall B-LOC | 32.56 | 28.78 | 34.17 | 28.71 | 31.99 | 27.20 |
| Recall B-MISC | 15.98 | 10.87 | 15.55 | 10.79 | 15.44 | 10.51 |
| Recall B-ORG | 21.69 | 23.42 | 19.49 | 21.58 | 19.77 | 22.72 |
| Recall B-PER | 24.69 | 17.89 | 22.79 | 13.69 | 27.41 | 22.17 |
| Recall I-LOC | 4.78 | 4.22 | 5.32 | 4.46 | 5.01 | 3.93 |
| Recall I-MISC | 4.83 | 2.60 | 4.62 | 2.47 | 4.73 | 2.71 |
| Recall I-ORG | 8.27 | 9.20 | 6.74 | 7.79 | 7.64 | 8.93 |
| Recall I-PER | 18.93 | 15.32 | 11.59 | 4.81 | 19.47 | 15.52 |
| Recall O | 94.21 | 90.41 | 98.06 | 97.59 | 96.65 | 95.66 |
| Recall-weighted | 94.27 | 89.67 | 95.17 | 91.17 | 94.57 | 91.73 |
| F1 B-LOC | 32.92 | 28.68 | 34.22 | 28.72 | 31.99 | 27.25 |
| F1 B-MISC | 15.90 | 10.92 | 15.67 | 10.86 | 15.45 | 10.58 |
| F1 B-ORG | 21.12 | 23.05 | 19.73 | 21.77 | 19.80 | 22.64 |
| F1 B-PER | 24.98 | 18.28 | 23.09 | 14.24 | 26.35 | 21.68 |
| F1 I-LOC | 4.76 | 4.22 | 5.31 | 4.48 | 5.04 | 3.94 |
| F1 I-MISC | 4.79 | 2.51 | 4.63 | 2.42 | 4.77 | 2.71 |
| F1 I-ORG | 8.05 | 9.10 | 6.78 | 7.87 | 7.71 | 8.87 |
| F1 I-PER | 19.04 | 15.28 | 11.94 | 5.16 | 19.29 | 15.65 |
| F1 O | 94.14 | 90.23 | 95.70 | 93.75 | 96.00 | 94.51 |
| F1-weighted | 94.41 | 89.70 | 93.92 | 89.27 | 94.32 | 91.46 |
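The per-class and weighted precision, recall, and F1 values of the kind reported above can be computed from token-level predictions with scikit-learn. The toy labels below are purely illustrative; real runs use the flattened gold and predicted tags of the NER test set with padding removed.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy token-level labels; replace with the flattened gold/predicted NER tags.
y_true = ["B-LOC", "O", "O", "B-PER", "I-PER", "O"]
y_pred = ["B-LOC", "O", "B-PER", "B-PER", "O", "O"]

labels = sorted(set(y_true) | set(y_pred))
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
print(dict(zip(labels, zip(prec, rec, f1))))       # per-class (precision, recall, F1)

p_w, r_w, f1_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"weighted: P={p_w:.3f} R={r_w:.3f} F1={f1_w:.3f}")
```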
[Table: effect sizes with lower and upper confidence limits and p-values for the GloVe vs. SA-Tweedie and Random vs. SA-Tweedie comparisons of per-class and weighted precision, recall, and F1 on the NER task; numeric entries omitted.]
| System | Dev F1 | Test F1 |
|---|---|---|
| Fine-tuning approach | | |
| BERT large | 96.6 | 92.8 |
| BERT base | 96.4 | 92.4 |
| Feature-based approach (BERT base) | | |
| Embeddings | 91.0 | - |
| Second-to-Last Hidden | 95.6 | - |
| Last Hidden | 94.9 | - |
| Weighted Sum Last Four Hidden | 95.9 | - |
| Concat Last Four Hidden | 96.1 | - |
| Weighted Sum All 12 Layers | 95.5 | - |