Global Dense Vector Representations for Words or Items Using Shared Parameter Alternating Tweedie Model
Figure 1. Illustration of model input and desired output. Left panel: model input, the natural log of the (weighted occurrence count + 1) matrix for the top 300 words from Reuters Business news data. Right panel: shared parameter Tweedie modeling process and output.

Figure 2. Computed log(loss) and log(overall loss) from a simulated dataset using Fisher scoring with or without learning rate adjustment, and a gradient descent algorithm with the Adam method for parameter updates. The left panel depicts how the loss changes over 10 epochs for one row of the parameter update. As the epoch number grows, the loss shows a general decreasing trend, but Adam's loss is higher and decreases more slowly than the other two updates. The right panel shows the overall loss versus the number of iterations on a log scale. All losses decrease as the iteration number increases, but the Adam update has higher overall loss values.

Figure 3. Relationship between the log of the sample mean and the log of the sample variance from Wikipedia data with a 50 K vocabulary size. The three lines in each interval are the fitted linear regression line and upper and lower bounds with the same slope.

Figure 4. Loss reduction within epochs compared among three different updates: the alternating Tweedie regression algorithm with and without learning rate adjustment, and the Adam update. The results are from the first iteration and the first row of the data matrix in Algorithm 1.

Figure 5. Overall loss over iterations for three different update methods: with or without learning rate adjustment, and the Adam update. The Fisher scoring type updates, with or without learning rate adjustment, start with a lower overall loss than the Adam update and reduce the overall loss faster as the iteration number increases.

Figure 6. The log10-scaled norm of the score vector and the overall loss as iterations proceed on simulated data. The top panel shows the norm of the score vector on a log10 scale for two cases: with and without learning rate. The bottom panel shows the log10 overall loss versus iteration for the two cases. Both cases reduce the overall loss and the norm of the score vector. The case without learning rate reduces the overall loss faster in earlier iterations but may not reach the minimum overall loss in the end. The case with learning rate moves slowly in earlier iterations but finds a smaller overall loss in the end.

Figure 7. The overall loss on a log10 scale for iterations 110 through 170 for the two cases: with and without learning rate. The update without learning rate adjustment stabilizes at a certain value before reaching the minimum overall loss. The algorithm with learning rate adjustment continues to reduce the overall loss until the convergence criterion is satisfied, even though it was less effective in earlier iterations than the one without learning rate.

Figure 8. Performance of the alternating Tweedie regression algorithm with and without learning rate over eight simulated datasets. Each row corresponds to one dataset. A label such as +3.279e2 in the upper-left corner of the plots in the right-most column means that 327.9 must be added to the values on the vertical axis.

Figure 9. Histogram of skewness for each row in the raw co-occurrence count matrix (left panel) and the log co-occurrence count (right panel) constructed from a Wikipedia dump.

Figure 10. Trajectory of the training process of SA-Tweedie. Two embedding dimensions (100 and 300) are considered. The model achieved lower loss with the higher embedding dimension.

Figure 11. Weighted F1 score on the NER test set for different settings with seeds 12, 42, and 111. All embeddings used 300-dimensional representations.

Figure 12. Training and validation loss along with training and validation weighted F1 score for 15 epochs. Top row: random embedding. Middle row: GloVe embedding. Bottom row: SA-Tweedie embedding. All parameter initialization used the identical global seed 42. The loss and weighted F1 score for the test data are marked with crosses, with values given beside them.
Abstract
1. Introduction
1.1. Recent Advances in NLP and Efficient Embedding Learning
1.2. Motivation for the SA-Tweedie Model
1.3. SA-Tweedie Versus SA-ZIG as Probabilistic Models for Word Representations
1.4. Contributions and Organization of the Paper
2. The Probability Distribution for the Proposed SA-Tweedie Model
2.1. MLE for Alternating Tweedie Regression
- The Tweedie distribution belongs to the exponential family. All exponential dispersion models that are closed under scale transformation belong to the Tweedie family. In particular, if X has a Tweedie distribution with mean μ and dispersion parameter φ, then cX (for c > 0) has a Tweedie distribution with mean cμ and dispersion parameter c^(2−p)φ. The mean and variance are related by Var(X) = φμ^p.
- The Compound Poisson Gamma distribution corresponds to the case in which the power parameter satisfies 1 < p < 2. In this case, the distribution has non-negative support and can have a discrete mass at zero, making it useful for modeling responses that are a mixture of zeros and positive values.
- The distributions that correspond to other values of the power parameter p:
  – When p < 0, the distribution is an Extreme stable distribution.
  – When p = 0, the distribution is the Normal distribution.
  – When p = 1, the distribution coincides with the Poisson distribution.
  – When p = 2, the distribution is the Gamma distribution.
  – When 2 < p < 3, the distribution is a Positive stable distribution.
  – When p = 3, the distribution is the Inverse Gaussian distribution.
  – When p > 3, the distribution is a Positive stable distribution.
  – When p = ∞, the distribution is an Extreme stable distribution.
- The mean domain is R if p is 0 or ∞, and is (0, ∞) for all other p. Therefore, except when the distribution is Normal or Extreme stable with p = ∞, the log link is frequently used as the link function for Tweedie regression.
- A positive mass at zero is allowed only when 1 < p < 2. Since our co-occurrence counts contain many zeros, we focus our attention on the Compound Poisson Gamma distribution, i.e., the case 1 < p < 2, whose variance function and compound representation are summarized in the display below.
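For reference, the variance function and the compound Poisson–gamma representation of this case can be written out explicitly. The display below states standard Tweedie facts in a common parameterization rather than notation taken verbatim from the paper.

```latex
\[
\operatorname{Var}(X) = \phi\,\mu^{p}, \qquad 1 < p < 2,
\qquad
X = \sum_{k=1}^{N} G_k, \quad N \sim \operatorname{Poisson}(\lambda), \quad
G_k \overset{\text{iid}}{\sim} \operatorname{Gamma}(\alpha, \gamma),
\]
\[
\lambda = \frac{\mu^{2-p}}{\phi\,(2-p)}, \qquad
\alpha = \frac{2-p}{p-1}, \qquad
\gamma = \phi\,(p-1)\,\mu^{\,p-1}.
\]
```

In particular, P(X = 0) = P(N = 0) = exp(−λ) > 0, which is the discrete mass at zero mentioned above.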
- Specify values of φ and p. ▹ (see Section 2.2)
- Initialize the elements of the row and column parameter matrices, for example, with i.i.d. Uniform(−0.5, 0.5) values.
- It is natural to use the negative log likelihood as the loss for fitting each row and column. However, the log-likelihood functions for the row fits and the column fits both contain the normalizing term log a(y, φ) of the Tweedie density. This term does not involve the regression parameters, so computing it when estimating them is redundant. This is especially relevant for Tweedie regression because log a(y, φ) is the log of an infinite series. In the Python package sklearn.linear_model.TweedieRegressor, the loss is defined as the Half Tweedie deviance, which for an observed response y and fitted mean μ equals y^(2−p)/((1−p)(2−p)) − yμ^(1−p)/(1−p) + μ^(2−p)/(2−p) (see the Python source code at https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/_loss/loss.py, accessed on 30 December 2024). The first term does not depend on the regression parameters. The other two terms are the negatives of the corresponding terms in the log likelihood (6). Therefore, using the Half Tweedie deviance as the loss is equivalent to using the negative of the parameter-dependent part of the log likelihood. We define our loss as the sum of these parameter-dependent deviance terms over the entries being fitted; a minimal code sketch of this loss is given after this list. Note that the loss is a function of both the row and column parameters, which change as the algorithm proceeds.
- Set a convergence threshold and a maximum number of iterations. We use a relative convergence criterion on the loss function, checking whether the relative change in the loss between consecutive iterations falls below the threshold. This is consistent with the corresponding R package, which uses a relative convergence criterion on the log likelihood.
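As a concrete illustration of the loss described above, here is a minimal PyTorch sketch of the parameter-dependent part of the half Tweedie deviance under a log link. The function name, argument names, and the use of a sum rather than a mean are our choices for illustration, not the paper's exact definition.

```python
import torch

def half_tweedie_loss(y, eta, p):
    """Parameter-dependent part of the half Tweedie deviance (1 < p < 2).

    y   : observed (weighted) co-occurrence counts, shape (m,)
    eta : linear predictor; the log link gives mu = exp(eta)
    p   : Tweedie power parameter

    The term y**(2 - p) / ((1 - p) * (2 - p)) is dropped because it does not
    involve the regression parameters.
    """
    mu = torch.exp(eta)
    return torch.sum(-y * mu.pow(1.0 - p) / (1.0 - p) + mu.pow(2.0 - p) / (2.0 - p))
```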
Algorithm 1: Alternating Tweedie regression
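Algorithm 1 in the paper updates the parameters with Fisher scoring and an optional learning rate adjustment. The sketch below only illustrates the alternating structure, using scikit-learn's TweedieRegressor for each row and column fit; the matrix names Y, U, and V and the fixed number of outer iterations are our illustrative choices.

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

def alternating_tweedie(Y, dim=50, p=1.5, n_iter=10, seed=0):
    """Alternate log-link Tweedie GLM fits for row and column embeddings of Y."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.uniform(-0.5, 0.5, size=(n, dim))   # row embeddings
    V = rng.uniform(-0.5, 0.5, size=(m, dim))   # column embeddings
    for _ in range(n_iter):
        for i in range(n):                      # fix V, refit row i
            glm = TweedieRegressor(power=p, alpha=0.0, link="log",
                                   fit_intercept=False, max_iter=1000)
            U[i] = glm.fit(V, Y[i]).coef_
        for j in range(m):                      # fix U, refit column j
            glm = TweedieRegressor(power=p, alpha=0.0, link="log",
                                   fit_intercept=False, max_iter=1000)
            V[j] = glm.fit(U, Y[:, j]).coef_
    return U, V
```

In the paper, convergence is instead monitored through the relative change of the loss defined above rather than a fixed number of outer iterations.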
2.2. Impact of the Parameters p and φ
3. A Small Simulation Study
- We first retrieved the 300 most frequently used words in 2000 Reuters business news articles. For each word, a 50-dimensional vector needs to be generated for use in the Tweedie regression model. The model components are as follows:
- The Tweedie power parameter was generated from two quantities independently sampled from Uniform(0.5, 1) so that the power lies between 1 and 2 and the count data therefore follow a Compound Poisson Gamma distribution.
- The dispersion parameter was generated using the values estimated from the Reuters news dataset in Table 2.
- The count data were independently generated from the Tweedie distribution with the specified mean, power parameter, and dispersion parameter (a simulation sketch is given after this list).
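One way to generate such Compound Poisson Gamma counts is through the Poisson-sum-of-gammas representation given earlier. The helper below is a sketch under that standard parameterization; the function and argument names are ours, and the paper may generate the data differently.

```python
import numpy as np

def rtweedie_cpg(mu, p, phi, rng=None):
    """Draw Compound Poisson Gamma (Tweedie, 1 < p < 2) variates, elementwise in mu."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))      # Poisson rate
    alpha = (2.0 - p) / (p - 1.0)                  # Gamma shape of each summand
    scale = phi * (p - 1.0) * mu ** (p - 1.0)      # Gamma scale of each summand
    n = rng.poisson(lam)                           # number of Gamma summands
    # A sum of n iid Gamma(alpha, scale) variables is Gamma(n * alpha, scale);
    # n = 0 yields an exact zero, the discrete mass at zero.
    return np.where(n > 0, rng.gamma(np.maximum(n, 1) * alpha, scale), 0.0)
```

For example, `rtweedie_cpg(mu=np.exp(U @ V.T), p=1.5, phi=1.0)` produces a full count matrix from row and column embeddings under the log link.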
Interval Index | Lower Bound | Upper Bound | ||||
---|---|---|---|---|---|---|
0 | 2.955 | 2.955 | 2.955 | −1.500 | −1.000 | 2.264 * |
1 | −1.009 | −0.896 | −1.125 | −1.000 | −0.500 | −1.501 |
2 | −0.202 | 0.152 | −0.450 | −0.500 | 0.000 | −0.842 |
3 | −0.127 | 0.416 | −0.432 | 0.000 | 0.500 | 0.561 |
4 | 0.077 | 0.480 | −0.237 | 0.500 | 1.000 | 0.197 |
5 | 0.776 | 1.129 | 0.503 | 1.000 | 1.500 | −0.475 |
6 | −1.133 | −1.039 | −1.288 | 1.500 | 2.180 | 0.744 |
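Figure 3 and the interval summaries above examine the relationship between the log sample mean and the log sample variance. Under the Tweedie mean-variance relation Var(Y) = φμ^p, the log variance is linear in the log mean with slope p and intercept log φ, which suggests a simple moment-based way to read off p and φ. The sketch below is a minimal single-interval version of such a fit; the function name and the row-wise pooling are our assumptions, whereas the paper fits separate lines within intervals.

```python
import numpy as np

def estimate_power_and_dispersion(counts):
    """Estimate Tweedie power p and dispersion phi from row-wise sample moments.

    Var(Y) = phi * mu**p implies log(var) = log(phi) + p * log(mean), so the
    slope of a least-squares fit of log sample variance on log sample mean
    estimates p and the intercept estimates log(phi).
    """
    means = counts.mean(axis=1)
    variances = counts.var(axis=1, ddof=1)
    keep = (means > 0) & (variances > 0)           # drop rows with degenerate moments
    slope, intercept = np.polyfit(np.log(means[keep]), np.log(variances[keep]), deg=1)
    return slope, np.exp(intercept)                # (p_hat, phi_hat)
```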
4. Scalability and Application to Named Entity Recognition
4.1. Scalability to Data with Large Vocabulary Size & Training Corpus
- Utilizing the ‘rowid’ column in SQLite: The rowid is automatically generated for most SQLite tables and serves as a fast-access key. SQLite stores data in a B-tree structure, which ensures quick lookups when using the rowid. According to the SQLite documentation, searching for a record by its rowid is approximately twice as fast as searching by other indexed values. This optimization significantly reduces search time, especially with large tables, since locating a row in a B-tree requires at most O(log n) steps, where n is the total number of rows in the table.
- Restricting the number of entries: To further enhance efficiency, we restrict the search to only a portion of the rowids rather than scanning the entire table. For instance, when retrieving the first row of a co-occurrence matrix with a vocabulary size of 400 K, we specify a rowid range from 0 to 400 K using the WHERE command. This is possible because the row indices (distinct from rowids) are already sorted in non-decreasing order. Therefore, when fetching data with a rowid of 0, there is no need to search beyond the first 400 K entries. After retrieving each row, we update the starting rowid for the next retrieval based on the entries already processed.
- Data retrieval: a child process handles data retrieval using Python’s multiprocessing package.
- Parameter estimation: the main process, utilizing the GPU through PyTorch, handles the parameter updates.
- Queue-based data transfer: Once a row is fetched, it is stored in a shared queue object that connects the child and main processes. The GPU retrieves data from the queue and removes it once used.
- Parallel execution: While the GPU is processing the current row’s parameter update, the child process retrieves the next row of data. These two operations run in parallel, which reduces time and improves efficiency.
- Efficient resource management: Data retrieval on the CPU only proceeds when the queue is empty, ensuring that memory resources are devoted to the computation and reducing overall memory usage. A minimal sketch of this retrieval pipeline is given after this list.
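The following sketch combines the rowid-range SQLite query with the queue-based producer-consumer pipeline described above. The database path, the table name `cooc`, and the column names `row`, `col`, and `weight` are illustrative assumptions, not the paper's actual schema, and the GPU update is left as a placeholder.

```python
import sqlite3
from multiprocessing import Process, Queue

def fetch_rows(db_path, vocab_size, queue):
    """Child process: stream one row of the co-occurrence matrix at a time."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    start = 0                                     # first rowid not yet consumed
    for i in range(vocab_size):
        # Row indices are stored in non-decreasing order, so the entries for
        # word i lie inside a bounded rowid window starting at `start`.
        cur.execute(
            "SELECT rowid, col, weight FROM cooc "
            "WHERE rowid >= ? AND rowid < ? AND row = ?",
            (start, start + vocab_size, i),
        )
        entries = cur.fetchall()
        if entries:
            start = entries[-1][0] + 1            # advance past what was just read
        queue.put((i, entries))
    queue.put(None)                               # sentinel: no more rows
    con.close()

if __name__ == "__main__":
    q = Queue(maxsize=1)                          # at most one prefetched row in memory
    producer = Process(target=fetch_rows, args=("cooc.db", 400_000, q))
    producer.start()
    while (item := q.get()) is not None:
        row_id, entries = item
        # ... update the parameters for this row on the GPU here ...
    producer.join()
```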
Algorithm 2: Pseudo code to compute …
4.2. Application to NER Task on CoNLL-2003 Data
4.2.1. Experiment Detail and Main Result for the NER Task
4.2.2. Effect Sizes and Significance of the Comparison Results
4.2.3. Comparison with BERT on the NER Task
- We examine the feature-based approach, which estimates only the BiLSTM model parameters while keeping the BERT parameters fixed. The BERT large model consists of 24 hidden layers with a 1024-dimensional embedding, while the BERT base model has 12 hidden layers with a 768-dimensional embedding; the total number of parameters is 345 million for BERT large and 110 million for BERT base. In contrast, our SA-Tweedie uses only a 300-dimensional embedding with no hidden layers. Even when using the 768-dimensional BiLSTM with BERT, the number of parameters remains much larger than that of our simpler model. Specifically, we use a one-layer, 512-dimensional BiLSTM with SA-Tweedie's 300-dimensional embedding and a max pooling layer, which significantly reduces the number of parameters (a sketch of such a tagger is given after this list).
- The GloVe and BERT embeddings were trained on much larger corpora than our SA-Tweedie model. The training corpus for SA-Tweedie is 4 billion tokens from Wikipedia, whereas the pretrained GloVe embeddings were trained on 42 billion tokens, and BERT was pretrained on English Wikipedia plus the additional BooksCorpus (800 M words) data, to which we do not have access. Therefore, SA-Tweedie's embedding has seen less training data.
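For concreteness, here is a minimal PyTorch sketch of the kind of tagger described above: frozen pretrained 300-dimensional embeddings feeding a one-layer, 512-dimensional BiLSTM with a per-token classifier over the nine CoNLL-2003 tags listed in the results table. The class name is ours, and the max pooling layer mentioned in the text is omitted for brevity.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """One-layer BiLSTM tagger on top of frozen pretrained embeddings."""

    def __init__(self, embedding_matrix, hidden_size=512, num_tags=9):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape      # e.g., emb_dim = 300
        self.embed = nn.Embedding.from_pretrained(
            torch.as_tensor(embedding_matrix, dtype=torch.float32), freeze=True
        )
        self.lstm = nn.LSTM(emb_dim, hidden_size, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):                          # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))            # (batch, seq_len, 2 * hidden)
        return self.classifier(h)                          # per-token tag logits
```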
4.2.4. More Configuration Settings to Examine SA-Tweedie’s Performance
5. Summary, Discussion, and Future Research
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Available online: https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf (accessed on 30 December 2024).
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. arXiv 2016, arXiv:1607.04606. [Google Scholar] [CrossRef]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365v2. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. Available online: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf (accessed on 30 December 2024).
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Available online: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf (accessed on 30 December 2024).
- Freestone, M.; Karmaker, S.S. Word Embeddings Revisited: Do LLMs Offer Something New? arXiv 2024, arXiv:2402.11094v2. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288v2. [Google Scholar]
- Greene, R.; Sanders, T.; Weng, L.; Neelakantan, A. New and Improved Embedding Model, 15 December 2022. OpenAI. Available online: https://openai.com/blog/new-and-improved-embedding-model (accessed on 30 December 2024).
- Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403v3. [Google Scholar]
- Cer, D.; Yang, Y.; Kong, S.; Hua, N.; Limtiaco, N.; John, R.S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal Sentence Encoder. arXiv 2018, arXiv:1803.11175v2. [Google Scholar]
- Raza, S.; Rahman, M.; Kamawal, S.; Toroghi, A.; Raval, A.; Navah, F.; Kazemeini, A. A Comprehensive Review of Recommender Systems: Transitioning from Theory to Practice. arXiv 2024, arXiv:cs.IR/2407.13699. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108v4. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Tanaka, H.; Kunin, D.; Yamins, D.L.; Ganguli, S. Pruning Neural Networks Without any Data by Iteratively Conserving Synaptic Flow. Available online: https://proceedings.neurips.cc/paper_files/paper/2020/file/46a4378f835dc8040c8057beb6a2da52-Paper.pdf (accessed on 30 December 2024).
- Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 2021, 461, 370–403. [Google Scholar] [CrossRef]
- Kusupati, A.; Bhatt, G.; Rege, A.; Wallingford, M.; Sinha, A.; Ramanujan, V.; Howard-Snyder, W.; Chen, K.; Kakade, S.; Jain, P.; et al. Matryoshka representation learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, Red Hook, NY, USA, 28 November–9 December 2022. [Google Scholar]
- Heidari, A.; Jafari Navimipour, N.; Dag, H.; Unal, M. Deepfake detection using deep learning methods: A systematic and comprehensive review. WIREs Data Min. Knowl. Discov. 2024, 14, e1520. [Google Scholar] [CrossRef]
- Heidari, A.; Navimipour, N.J.; Dag, H.; Talebi, S.; Unal, M. A Novel Blockchain-Based Deepfake Detection Method Using Federated and Deep Learning Models. Cogn. Comput. 2024, 16, 1073–1091. [Google Scholar] [CrossRef]
- Amiri, Z.; Heidari, A.; Navimipour, N.J.; Esmaeilpour, M.; Yazdani, Y. The deep learning applications in IoT-based bio- and medical informatics: A systematic literature review. Neural Comput. Appl. 2024, 36, 5757–5797. [Google Scholar] [CrossRef]
- Amiri, Z.; Heidari, A.; Zavvar, M.; Navimipour, N.J.; Esmaeilpour, M. The applications of nature-inspired algorithms in Internet of Things-based healthcare service: A systematic literature review. Trans. Emerg. Telecommun. Technol. 2024, 35, e4969. [Google Scholar] [CrossRef]
- Amiri, Z.; Heidari, A.; Navimipour, N.J. Comprehensive survey of artificial intelligence techniques and strategies for climate change mitigation. Energy 2024, 308, 132827. [Google Scholar] [CrossRef]
- Heidari, A.; Jafari Navimipour, N.; Unal, M. A Secure Intrusion Detection Platform Using Blockchain and Radial Basis Function Neural Networks for Internet of Drones. IEEE Internet Things J. 2023, 10, 8445–8454. [Google Scholar] [CrossRef]
- Heidari, A.; Jafari Navimipour, N.; Unal, M.; Zhang, G. Machine Learning Applications in Internet-of-Drones: Systematic Review, Recent Deployments, and Open Issues. ACM Comput. Surv. 2023, 55. [Google Scholar] [CrossRef]
- Zanbouri, K.; Darbandi, M.; Nassr, M.; Heidari, A.; Navimipour, N.J.; Yalcın, S. A GSO-based multi-objective technique for performance optimization of blockchain-based industrial Internet of things. Int. J. Commun. Syst. 2024, 37, e5886. [Google Scholar] [CrossRef]
- Kim, T.; Wang, H. Matrix Factorization and Prediction for High-Dimensional Co-Occurrence Count Data via Shared Parameter Alternating Zero Inflated Gamma Model. Mathematics 2024, 12, 3365. [Google Scholar] [CrossRef]
- Bonat, W.H.; Kokonendji, C.C. Flexible Tweedie regression models for continuous data. J. Stat. Comput. Simul. 2017, 87, 2138–2152. [Google Scholar] [CrossRef]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Wang, M.; Manning, C.D. Effect of Non-linear Deep Architecture in Sequence Labeling. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–19 October 2013; pp. 1285–1291. [Google Scholar]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–22. [Google Scholar] [CrossRef]
Interval Index | Lower Bound | Upper Bound | ||||
---|---|---|---|---|---|---|
0 | 3.030 | 3.306 | 2.264 | −7.000 | −6.000 | 1.230 |
1 | −1.483 | −0.970 | −2.607 | −6.000 | −5.000 | 0.485 |
2 | −0.438 | 0.194 | −1.180 | −5.000 | −4.000 | 0.688 |
3 | −0.115 | 0.576 | −0.607 | −4.000 | −3.000 | 0.811 |
4 | 0.197 | 1.099 | −0.188 | −3.000 | −2.000 | 0.943 |
5 | 0.554 | 1.221 | 0.161 | −2.000 | −1.000 | 1.098 |
6 | 0.451 | 0.984 | 0.142 | −1.000 | 0.000 | 0.990 |
7 | 0.457 | 0.887 | 0.145 | 0.000 | 1.000 | 0.642 |
8 | 1.105 | 1.680 | 0.809 | 1.000 | 2.582 | −0.053 |
| Metric | Random (Validation) | Random (Test) | GloVe (Validation) | GloVe (Test) | SA-Tweedie (Validation) | SA-Tweedie (Test) |
|---|---|---|---|---|---|---|
| Loss | 0.30 | 0.52 | 1.29 | 1.99 | 0.29 | 0.44 |
| Precision B-LOC | 33.82 | 29.17 | 34.78 | 29.29 | 32.53 | 27.85 |
| Precision B-MISC | 16.14 | 11.19 | 16.09 | 11.17 | 15.81 | 10.89 |
| Precision B-ORG | 21.12 | 23.39 | 20.37 | 22.55 | 20.34 | 23.26 |
| Precision B-PER | 25.86 | 19.31 | 23.97 | 15.56 | 26.30 | 21.90 |
| Precision I-LOC | 4.83 | 4.29 | 5.34 | 4.54 | 5.14 | 4.01 |
| Precision I-MISC | 4.92 | 2.54 | 4.75 | 2.50 | 4.92 | 2.77 |
| Precision I-ORG | 8.20 | 9.34 | 7.13 | 8.24 | 8.04 | 9.16 |
| Precision I-PER | 19.50 | 15.67 | 12.80 | 6.03 | 19.49 | 16.19 |
| Precision O | 94.56 | 91.05 | 94.02 | 91.15 | 95.81 | 94.36 |
| Precision-weighted | 95.24 | 90.88 | 93.38 | 88.53 | 94.66 | 92.16 |
| Recall B-LOC | 32.56 | 28.78 | 34.17 | 28.71 | 31.99 | 27.20 |
| Recall B-MISC | 15.98 | 10.87 | 15.55 | 10.79 | 15.44 | 10.51 |
| Recall B-ORG | 21.69 | 23.42 | 19.49 | 21.58 | 19.77 | 22.72 |
| Recall B-PER | 24.69 | 17.89 | 22.79 | 13.69 | 27.41 | 22.17 |
| Recall I-LOC | 4.78 | 4.22 | 5.32 | 4.46 | 5.01 | 3.93 |
| Recall I-MISC | 4.83 | 2.60 | 4.62 | 2.47 | 4.73 | 2.71 |
| Recall I-ORG | 8.27 | 9.20 | 6.74 | 7.79 | 7.64 | 8.93 |
| Recall I-PER | 18.93 | 15.32 | 11.59 | 4.81 | 19.47 | 15.52 |
| Recall O | 94.21 | 90.41 | 98.06 | 97.59 | 96.65 | 95.66 |
| Recall-weighted | 94.27 | 89.67 | 95.17 | 91.17 | 94.57 | 91.73 |
| F1 B-LOC | 32.92 | 28.68 | 34.22 | 28.72 | 31.99 | 27.25 |
| F1 B-MISC | 15.90 | 10.92 | 15.67 | 10.86 | 15.45 | 10.58 |
| F1 B-ORG | 21.12 | 23.05 | 19.73 | 21.77 | 19.80 | 22.64 |
| F1 B-PER | 24.98 | 18.28 | 23.09 | 14.24 | 26.35 | 21.68 |
| F1 I-LOC | 4.76 | 4.22 | 5.31 | 4.48 | 5.04 | 3.94 |
| F1 I-MISC | 4.79 | 2.51 | 4.63 | 2.42 | 4.77 | 2.71 |
| F1 I-ORG | 8.05 | 9.10 | 6.78 | 7.87 | 7.71 | 8.87 |
| F1 I-PER | 19.04 | 15.28 | 11.94 | 5.16 | 19.29 | 15.65 |
| F1 O | 94.14 | 90.23 | 95.70 | 93.75 | 96.00 | 94.51 |
| F1-weighted | 94.41 | 89.70 | 93.92 | 89.27 | 94.32 | 91.46 |
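The per-class and weighted precision, recall, and F1 values of the kind reported above can be computed from token-level predictions with scikit-learn. The toy labels below are purely illustrative; real runs use the flattened gold and predicted tags of the NER test set with padding removed.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy token-level labels; replace with the flattened gold/predicted NER tags.
y_true = ["B-LOC", "O", "O", "B-PER", "I-PER", "O"]
y_pred = ["B-LOC", "O", "B-PER", "B-PER", "O", "O"]

labels = sorted(set(y_true) | set(y_pred))
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
print(dict(zip(labels, zip(prec, rec, f1))))       # per-class (precision, recall, F1)

p_w, r_w, f1_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"weighted: P={p_w:.3f} R={r_w:.3f} F1={f1_w:.3f}")
```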
[Table: effect sizes with lower and upper confidence limits and p-values for the GloVe vs. SA-Tweedie and Random vs. SA-Tweedie comparisons of per-class and weighted precision, recall, and F1 on the NER task; numeric entries omitted.]
| System | Dev F1 | Test F1 |
|---|---|---|
| Fine-tuning approach | | |
| BERT large | 96.6 | 92.8 |
| BERT base | 96.4 | 92.4 |
| Feature-based approach (BERT base) | | |
| Embeddings | 91.0 | - |
| Second-to-Last Hidden | 95.6 | - |
| Last Hidden | 94.9 | - |
| Weighted Sum Last Four Hidden | 95.9 | - |
| Concat Last Four Hidden | 96.1 | - |
| Weighted Sum All 12 Layers | 95.5 | - |