Abstract
This study sheds light on interconnected topic dynamics across traditional news sources and social media platforms, emphasizing the influential role of topicality in shaping content popularity in social media. Using the Latent Dirichlet Allocation and BERTopic models, we define sets of 120 New York Times (NYT) topics to compare with 899,766 image-with-text memes from Reddit, showing that social media content aligns with many of the same topical patterns observed in news outlets. Topicality is formalized based on the temporal distributions of topics over the past 5 years. Using these topicality features, the investigation reveals significant correlations between the rising popularity of NYT topics and increased average upvotes on Reddit, particularly evident in “innovator” memes posted during the early stages of a topic’s prevalence in the NYT. Furthermore, topicality features show significant predictive power over other content-based control features in a CatBoost classifier prediction of viral Reddit memes.
1 Introduction
Popular content from social media sites such as Reddit provides a window into the opinions and interests of internet users. Now that people spend more time on the internet, the question of what they are engaging with has come to the forefront. However, the forces underlying the emergence of popular internet content are still poorly understood. In particular, the influence of memes’ topics on their popularity has not yet been analyzed. Do social media sites discuss the same issues as verified news sources? Are topical subjects that are currently prevailing in the news more likely to go viral on social media too? Is it worth “riding the wave” of topicality, or does interest in a topical subject decline rapidly on social media sites?
We investigate these questions using a dataset of image-with-text memes from Reddit and published articles from the New York Times. Image-with-text memes are usually images with overlaid text that are copied and shared extensively across the internet. The widespread prevalence of image-with-text memes online shows how well-suited the internet is to multi-modal reasoning and communication. Video, sound, and image data are more complicated to analyze than pure text data, making them less studied in the scientific literature, but our understanding of online communication would be incomplete without these analyses (Halevy et al. 2022). To address this gap, we analyze both the image and text content of meme data collected from Reddit.
In the digital age, many are concerned about losing the depth of conversation in the short-form discourse of the internet. However, social media content may simply achieve nuance in different ways than longer-form written communication. Multi-modality is one way of conveying nuance online. The level of nuance in online communication is further enhanced by referencing relatable and popular topics in society. Often, social media users are participating in larger discourses that are happening across many internet platforms. Topic analysis algorithms such as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) and BERTopic (Grootendorst 2022) provide the means for analyzing the larger conversations referenced.
The mechanism of reference is especially important to image-with-text memes, which have elsewhere been defined precisely as a reference to reality followed by a punchline (Spitzberg 2014). By identifying meme topics, we connect individual posts to the broader discourse they are part of and analyze overall trends in internet discourse across 5 years of data. Our method moves beyond individual posts to analyze the ideas propagated by memes, the importance of which is highlighted by Mascarenhas et al. (2024).
Our previous work highlighted the prevalence of COVID-19-related content in viral Reddit memes in March 2020, at the beginning of the pandemic (Barnes et al. 2021). Other events likely exhibit similar properties. We assume that, in general, topical posts about subjects that are currently widely discussed in society are more likely to go viral online. In the present work, we investigate this hypothesis using the New York Times (NYT) as a reference document for topicality. Importantly, this study does not establish a causal relationship between NYT and Reddit. The correlations found simply indicate that the two data sources are tuned to similar topics. Specific contributions are listed below.
- Identifying Reddit- and NYT-based topics.
- Comparing the temporal distributions of 120 topics on both platforms.
- Showing that posts about topics that are gaining prevalence in the NYT receive more upvotes on Reddit, including an innovator’s advantage for being among the first to post about a topic.
- Demonstrating that topic-based features have significant predictive power in the machine learning identification of viral memes.
The research workflow used to establish these findings is summarized in Fig. 1. In Sect. 2, we review related literature. Section 3 discusses the data sets and engineering of control features describing the Reddit memes. Topicality features, describing topics extrinsic and intrinsic to the Reddit data, are discussed in Sect. 4. Section 5 discusses the training of a Categorical Boost classification model (Prokhorenkova et al. 2018) to predict viral memes using all engineered features. We analyze the incremental predictive power of topic-based features over other content-based control features, as well as the importance of intrinsic vs. extrinsic topicality features in the same section. Our findings are discussed and summarized in Sects. 6 and 7.
2 Related work
Previous work has shown that image-with-text memes often reference cultural and political topics (Brodie 2009; Du et al. 2020) such as those discussed in NYT. Memes can be used to consolidate group identities (Dynel and Messerli 2020), influence political opinions (McClure 2016), and inspire political actions (Leach and Allen 2017). To give a current example, TikTok has been used to increase public engagement in the Israeli-Palestinian conflict (Yarchi and Boxman-Shabtai 2023). The work of Zannettou et al. (2018) investigated political content on social media more broadly. The authors labeled memes from four social media sites with annotations from the Know Your Meme (KYM) website, indicating whether a meme template has political content. They found that memes on many social media sites contain political content, and on Reddit, political memes are more popular than non-political memes. These findings suggest that the NYT article corpus can appropriately serve to model some of the topics found in Reddit memes. Moreover, today, over half of U.S. adults get news from social media (Liedke and Wang 2023), making work highlighting the connections between social media and the news increasingly important.
A prior model by Leskovec et al. (2009) showed that mimicry and novelty are key features of the news cycle. The dynamics of news stories in internet discourse can be traced through the replication of quotes across the web. Using this method, Leskovec et al. (2009) showed that new stories are more widely shared than out-of-date stories. In fact, the window within which a news story is relevant is typically very small. Most news-related quotes maintain a nontrivial volume for only 40 h around the moment of peak publication volume (Yang and Leskovec 2011), and blog adoption of a quote peaks, on average, only 2.5 h after the phrase peaks in the news media (Leskovec et al. 2009). Yang and Leskovec (2011) furthermore found that temporal features achieve the best accuracy for predicting how the popularity of a news-related quote will vary with time. Importantly, online information is not always trustworthy, and memes containing quotes and factual information have been shown to change as they replicate across the internet (Simmons et al. 2011; Leskovec et al. 2009). To the best of our knowledge, our study is the first to analyze image-with-text memes in relation to the news cycle, but topic analysis has been used in connection with social media data in other ways.
Popularity is a common framework through which to analyze social media data. Liking and re-posting are implicit endorsements of the memes’ topics and opinions (Spitzberg 2014), implying that topics may influence memes’ popularity and diffusion across the web. Indeed, previous research shows that individuals are more likely to spread information about topics of interest to them. For example, Twitter users are more likely to adopt hashtags that align with their own topical interests (Grabowicz et al. 2016) and retweet hashtags that they have tweeted about in the past (Weng et al. 2012). Typically, the diffusion of memes through the internet is modeled as a simple contagion process, similar to the spread of disease (Wang and Wood 2011); however, topical content clearly exhibits more complex diffusion mechanisms. Poux-Médard et al. (2022) studied the interactions between topics in the diffusion of information on Reddit news pages, showing that the topics of previous posts have minor explanatory power for predicting the topics of subsequent posts. These analyses suggest that topics are interrelated and exhibit popularity trends.
The majority of previous research using topic analysis models such as LDA or BERTopic to model topics in social media data focuses on the topical interests of internet users. For example, preferential attachment can be described via topical analysis. Instagram users tend to connect with other users who have the same interests as them (Ferrara et al. 2014). Weng and Menczer (2015) show that if a particular hashtag on Twitter is adopted by people with diverse topical interests, the hashtag is more likely to go viral, due to exposure to a diverse, and presumably broadly connected, community. In other cases, homogeneous communities with strong central figures promote content virality when the topic of the shared content matches the interests of the community members, implying the importance of shared beliefs and attitudes (Cialdini and Goldstein 2004).
Our analysis combines themes from previous research about the topicality of posts on social media with research about political events and the news cycle. Many studies compare multiple data sources to gain a fuller picture of internet content (Zannettou et al. 2018; Leskovec et al. 2009), but to the best of our knowledge, ours is the first study to use one dataset (NYT) as a source document for defining topics that are then assigned to another, distinct dataset (Reddit). Using this technique, we are able to observe significant alignment between data sets that are otherwise very different (community-run vs. institutional, comedic vs. informational). Our primary contribution to the literature is to study the influence of topicality on social media popularity more broadly. Informally, when a particular topic is widely prevailing, or increasingly published in the NYT, it is said to be topical. Following our formal definitions of topicality in Sect. 4, we show that memes about topics with growing topicality in NYT receive more upvotes on average on Reddit.
3 Data description and preparation
Data was collected from two sources: image-with-text memes from a popular social media site called Reddit, discussed in Sect. 3.1, and archived article metadata from the New York Times (NYT), see Sect. 3.2. Sections 3.3 and 3.4 discuss the engineering of content-based control features such as the color content of meme images and the sentiment of meme texts. Both data sets can be found on Github (HSDS 2024).
3.1 Reddit
Reddit, nicknamed “the front page of the Internet”, is the source of a large share of viral internet content, making it likely that the viral content analyzed here also circulates elsewhere on the internet (Sanderson and Rigby 2013). Data from r/Memes, the largest community dedicated to sharing memes on Reddit, was collected using the Pushshift API (Baumgartner et al. 2020; Boe 2016; Podolak 2021). We limited the collection to a maximum of 1000 randomly sampled posts per day. There were 37 missing dates from which we were unable to collect data due to issues with the API. After removing items with broken image links and gifs, the Reddit data set contained 899,766 memes posted between January 1, 2018 and November 14, 2022. The r/Memes subreddit is the 12th largest community on Reddit, with more than 26 million subscribers. Based on the above characteristics, we believe that the collected data is representative, and our conclusions can be applied to memes appearing elsewhere.
In addition to the meme images, the API provides several metadata features, including the title and caption posted with the meme, the publication date, an over-18 content indicator variable, the number of comments, and the score. On Reddit, the score of a post is calculated as the number of upvotes minus the number of downvotes it received. These attributes provide opportunities to engineer various other features.
As visualized in the memes’ score distribution in Fig. 2, few memes receive a lot of attention, while the vast majority go unnoticed. In general, memes with high scores also receive more comments, as reflected by the Spearman rank-order correlation coefficient of \(r=0.65\) (\(p<0.001\)). In many places on the internet, including Reddit, popular content, whether measured by likes, followers, or views, is sorted to the top of users’ news feeds (Figueiredo et al. 2014). Reddit’s default content sorting algorithm, used to curate the “hot” tab, prioritizes content based on a mixture of how recently it was posted and the logarithm of its score (Salihefendic 2015). Content ranking methods such as this create a feedback loop between what is popular and what is visible online. While unpopular memes are visible only briefly when first posted, popular memes are widely viewed. One study manipulated the scores received by Reddit posts, finding that providing 10 upvotes soon after a post was published greatly increased its chances of going viral (Carman et al. 2018). This can explain the heavy-tailed distribution of popularity measures on many sites, including in our Reddit data.
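The ranking rule just described can be sketched in a few lines. This is an illustrative approximation only: the decay constant and the exact functional form are assumptions for exposition, not Reddit’s production code.

```python
import math

def hot_score(score, age_seconds, decay=45000):
    """Toy Reddit-style 'hot' rank: the logarithm of the score plus a
    recency term that fades as the post ages (decay constant assumed)."""
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    # newer posts (smaller age) receive a larger recency bonus
    return order + sign * (-age_seconds) / decay
```

Because the score enters logarithmically, a tenfold score increase is needed to offset a fixed amount of aging, which is what lets fresh posts briefly outrank older, more popular ones.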
It is a common choice to model the heavy-tailed popularity distributions on social media as binary variables (Weng and Menczer 2015; Ling et al. 2021). Here, we define virality in terms of a percentile. If a given meme is in the top 5 percent of memes posted within a \(\pm 7\)-day period around it, it is labeled viral (1); otherwise, it is labeled non-viral (0). The choice of percentile was based on the Reddit score distribution shown in Fig. 2, where it can also be visualized. This percentile is further discussed and explored in Sect. 5, where the viral/not-viral target variable is used in a binary classification task. By measuring virality locally, over a two-week period, we avoid replicating seasonal popularity trends on Reddit. Additionally, as the Reddit userbase is constantly growing (Biachi 2023), this ensures we do not overstate the influence of memes from recent years.
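A minimal sketch of this labeling, assuming the memes fit in memory as (date, score) pairs; the function name and the tie handling at the cutoff are our own simplifications of the actual pipeline:

```python
from datetime import date, timedelta

def label_viral(memes, top_pct=0.05, window_days=7):
    """Label each (date, score) meme 1 ('viral') if its score reaches the
    top `top_pct` of memes posted within +/- `window_days` of it, else 0."""
    labels = []
    for d, score in memes:
        lo, hi = d - timedelta(days=window_days), d + timedelta(days=window_days)
        peers = sorted(s for dd, s in memes if lo <= dd <= hi)
        # local cutoff: the (1 - top_pct) quantile of the peer scores
        cutoff = peers[int((1 - top_pct) * (len(peers) - 1))]
        labels.append(1 if score >= cutoff else 0)
    return labels
```

Comparing each meme only against its two-week neighborhood is what makes the threshold drift with the platform’s overall activity level.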
Table 1 summarizes all features describing the Reddit meme data. Shaded rows indicate features used for training the predictive model, including topicality-related features (red) and control features (gray).
3.2 New York Times
Data from the same timeframe as the Reddit memes, January 1, 2018 to November 14, 2022, was collected from the New York Times Archives API. The API returns all article metadata for a given month, including the abstract, snippet, lead paragraph, headline, publication date, keywords, and type of material. Most (72%) of the collected data was from the news section, but other types of material such as crosswords, obituaries, sports, and book reviews were also included. The NYT published roughly 200 articles per day on weekdays and 100 articles per day on weekends, and publishing rates showed a slight decline over the 5-year period we examined. On a few dates, the archives contained more than 500 items on a single day due to the NYT updating the archived podcast episodes, or, during the first year of the COVID-19 pandemic and during US elections, due to state-by-state statistics reports. These statistics reports and podcasts were removed to prevent the topic models from forming topics specific to this type of post and to ensure an even daily distribution of articles. No other article types requiring similar handling were identified. After removing these entries, our data set contained 255,783 NYT articles in total.
3.3 Text features
We extracted text from the Reddit meme images using Optical Character Recognition (OCR), and generated image captions using Bootstrapping Language-Image Pre-training (BLIP) (Li et al. 2022), discussed in the next section. Examples of OCR and image caption results can be seen in Fig. 1. The OCR text and image captions were combined with the title and caption posted with the meme image on Reddit into one feature containing all text associated with the meme. This all-text feature was subsequently used for topic extraction and assignment in the Reddit data. In the NYT data set, the abstract, snippet, lead paragraph, headline, and keywords were all combined into one all-text feature for topic extraction.
The all-text features for the Reddit and NYT data were cleaned identically to ensure alignment when modeling the topics in both document corpora. We removed punctuation, made the text lowercase, removed stop words, and stemmed the words to their root forms using NLTK (Bird et al. 2009). Additionally, we made custom edits to the NYT text data based on observed differences between the NYT and Reddit vernaculars. For example, NYT referred to the COVID-19 pandemic with the word “coronavirus”, whereas posts on Reddit tended to use “covid”, “rona”, “corona”, and “pandemic”. This is important as the LDA algorithm identifies and assigns topics based on shared words. Therefore, if a post on Reddit referred to “covid”, it would not necessarily be associated with an NYT topic that only included the word “coronavirus”. To solve this issue, we added the words “covid”, “rona”, “corona”, and “pandemic” to the all-text feature of every NYT article that was tagged with the NYT keyword “Coronavirus”. Other examples include adding “RBG” to articles tagged with Ruth Bader Ginsburg and “BLM” to articles tagged with Black Lives Matter. Additionally, acronyms of the form “G.O.P.” and “N.F.L.” were edited to “GOP” and “NFL” in order to keep this information when punctuation was removed in the text cleaning process.
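The cleaning pipeline can be sketched as below. The stopword set and the suffix-stripping stemmer are tiny stand-ins for the NLTK resources actually used; only the dotted-acronym handling mirrors the custom edit described above.

```python
import re
import string

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # stand-in

def naive_stem(word):
    # crude suffix stripping in place of NLTK's stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_text(text):
    # collapse dotted acronyms ("G.O.P." -> "GOP") before stripping punctuation
    text = re.sub(r"(?<=[A-Z])\.(?=[A-Z])", "", text)
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(naive_stem(t) for t in text.split() if t not in STOPWORDS)
```

Note that the acronym collapse must run before punctuation removal; otherwise “G.O.P.” would dissolve into the single letters “g o p”.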
In addition to text cleaning, we extracted numerical features from the Reddit text data. These features were used as control features describing the Reddit memes for classification. As seen in Table 1, we recorded the number of characters in the title posted with the meme and the OCR text extracted from the meme image. Other studies show that on Twitter, the length of the post is strongly correlated with popularity (Tsur and Rappoport 2015). A binary variable indicating whether the all-text feature contained emojis was engineered using the Emoji 2.8.0 identification Python library (Kim and Wurster 2023). The valence of the Reddit title and OCR-extracted text was analyzed with the NLTK sentiment model (Shreyas 2019). Sentiment scores close to 1 indicate more positive sentiment, while scores close to 0 indicate more negative sentiment.
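A sketch of these text-derived control features follows. The emoji check here covers only the main emoji Unicode blocks and is a rough stand-in for the Emoji library; no sentiment model is included.

```python
def text_features(title, ocr_text):
    """Character counts and an emoji flag for a meme's associated text."""
    def has_emoji(text):
        # main emoji blocks only -- an approximation, not an exhaustive check
        return any(
            0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
            for ch in text
        )
    return {
        "title_len": len(title),
        "ocr_len": len(ocr_text),
        "contains_emoji": has_emoji(title + ocr_text),
    }
```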
3.4 Image features
In addition to text features, we extracted numerical features from the meme images following a similar procedure to our previous work (Barnes et al. 2021). Due to higher information density, images are more succinct than pure text content. They grab our attention and are able to rapidly communicate complicated messages (Nelson et al. 1976). Like many forms of online media, the topic of a meme is not only communicated through the text caption but also through the underlying image, so it was important to determine meme topics based on the image as well as the text content. In total, we used 19 image-related features, including high-level features such as BLIP image captions and face identification, and low-level features such as the color content of the images.
Bootstrapping Language-Image Pre-training (BLIP) (Li et al. 2022) is a pre-trained, transformer-based model for describing images. The model generates an image caption of usually 5–10 words, such as “a skeleton sitting on a cart in front of a house”, to quote the example in Fig. 1. This allowed us to identify objects in the meme images that may be relevant to their topics, such as “skeleton”, to use for topic assignment. The BLIP caption was added to the memes’ all-text feature and cleaned identically to the other text data.
The pre-trained Multi-Task Cascaded Convolutional Neural Network (MTCNN) was used to detect faces in the meme images (Zhang et al. 2016). The model returned the probability that a face was present and the number of faces present in the image. For simplicity, we elected to use a binary variable indicating whether the meme image contained a face. Using the OpenCV image segmentation technique to mask the meme images, we calculated what percentage of the image area contained each of 10 colors (Stone 2018). The average hue, saturation, and value components of the HSV representation of the images and the average red, green, and blue components of the RGB image were also used as control features.
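The per-channel color averages can be illustrated with the standard library alone. This is a stand-in for the OpenCV pipeline; here the HSV values are computed from the average color rather than averaged per pixel, a simplification for exposition.

```python
import colorsys

def color_features(pixels):
    """Mean R, G, B over an image given as (r, g, b) tuples, plus the
    HSV representation of that mean color."""
    n = len(pixels)
    r, g, b = (sum(p[i] for p in pixels) / n for i in range(3))
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return {"mean_rgb": (r, g, b), "mean_hsv": (h, s, v)}
```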
Previous work has shown that low-level image features such as those mentioned above can have a large effect on popularity (Russakovsky et al. 2015; Khosla et al. 2014). However, while an earlier study on the aesthetics of images shows that high-definition, bright-colored images are more appealing (Datta et al. 2006), in the case of memes, it appears to be the opposite. Popular memes generally contain dull colors (Barnes et al. 2021). For memes, these low-level features could encode the template image used to create memes, indicating that meme templates, rather than the low-level features themselves, have an impact on popularity (Coscia 2014).
4 Topic modeling
Using two topic models, LDA and BERTopic, we analyzed the topics of Reddit memes and NYT articles. For comparison, we used top models from both paradigms of topic modeling: probabilistic (LDA) and embedding (BERTopic) approaches (Egger and Yu 2022). Each topic is represented by a list of its most common words, as shown in the example topics in Tables 2 and 3. To identify “extrinsic topics” (NYT-based), we trained the topic models on the NYT data and performed inference on the Reddit data. To determine “intrinsic topics” (Reddit-based), the topic models were trained on and inference was performed on the Reddit text data. From the distributions of each topic over time, we defined topicality features for the Reddit data. In total, 12 topicality features were added for each topic model. This is twice the number listed in Table 1 because the features were calculated using both the extrinsic and intrinsic topic models.
In this section, we discuss the LDA topic model (Sect. 4.1), the BERTopic model (Sect. 4.2), and the formalization of topicality into numerical features (Sect. 4.3). At the end of Sect. 4.3 we report statistics describing the relation between topicality features and the popularity of the Reddit memes. We display plots comparing topicality trends from the past 5 years of NYT and Reddit memes data. The interested reader can see the complete sets of NYT-based topics identified by LDA and BERTopic on our GitHub page (HSDS 2024).
4.1 Latent Dirichlet allocation
LDA is a statistical language model used to distill topics from a corpus of documents using a bag-of-words approach in which word order is not considered. Words that commonly co-occur in documents are grouped into topics. Topics are then represented as a probability distribution over all of the words in the document corpus, with the words most highly associated with the given topic receiving the highest probabilities. In the inference step, the word distributions of the topics are compared with the word distribution of each document to estimate the probability with which each document discusses each topic. Documents are then represented as probability distributions over the set of topics (Blei et al. 2003).
One limitation of the LDA algorithm is the requirement to specify the number of topics in advance. We used the keywords metadata provided by the NYT to estimate an appropriate number of topics for the NYT data. The NYT indexed 1,583,868 unique keywords covering 92% of the articles, the most-used keywords being “US Politics and Government”, “Trump, Donald J.”, and “Coronavirus (2020-nCov)”. We filtered the keywords list based on three criteria: the keyword had to be used (1) at least 100 times in total, (2) on at least 50 distinct dates, and (3) at least 5 times on at least one day. Only 127 unique keywords fit these criteria, providing an estimate of the number of topics in the NYT corpus, which we rounded down to 120 for simplicity.
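The three filtering criteria can be expressed compactly. The function name and the (keyword, date) pair input are illustrative choices, not the paper’s actual code:

```python
from collections import Counter, defaultdict

def filter_keywords(tagged, min_total=100, min_dates=50, min_peak=5):
    """Keep keywords used >= min_total times overall, on >= min_dates
    distinct dates, and >= min_peak times on at least one single day.
    `tagged` is an iterable of (keyword, date) pairs."""
    total = Counter()
    per_day = defaultdict(Counter)
    for kw, d in tagged:
        total[kw] += 1
        per_day[kw][d] += 1
    return {
        kw for kw in total
        if total[kw] >= min_total
        and len(per_day[kw]) >= min_dates
        and max(per_day[kw].values()) >= min_peak
    }
```

The combination of a total-use floor, a distinct-date floor, and a single-day peak filters out both one-off keywords and those that are frequent but never concentrated enough to mark an event.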
The NYT all-text feature was used to train the LDA model, which was parameterized to find 120 topics in 1500 iterations over the corpus. We then labeled articles in the NYT corpus according to their highest probability topic. Table 2 shows that the most prevalent topics identified in the NYT data are about US politics & government and Coronavirus. This aligns with the most common NYT-applied keywords, lending confidence to our model. Using the trained LDA model, we performed inference on the Reddit all-text feature, resulting in a topic distribution assigned to each meme in the Reddit data, which described the probability that the meme is about each of the 120 NYT-based topics. Later, we will refer to this topic distribution as the vector, \(\underline{\tau }_{m}\).
We labeled the Reddit memes according to their highest probability topic, saving that topic’s probability of being assigned to the meme as well. Additionally, we calculated the Shannon entropy of the memes’ topic distributions. Memes uniformly assigned to all topics receive high entropy values, while memes with few dominant topics receive low entropy values. These are the “topic”, “prob”, and “topic entropy” features described in Table 1.
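The entropy feature is the standard Shannon entropy of the per-meme topic distribution. Log base 2 is our choice here for illustration; any fixed base preserves the ordering of memes.

```python
import math

def topic_entropy(probs):
    """Shannon entropy (in bits) of a document's topic distribution: high
    when probability is spread over many topics, zero when one topic
    receives all the mass."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```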
In the same manner, a set of 120 intrinsic topics (500 iterations over the Reddit corpus) was extracted by training the LDA algorithm on the Reddit all-text feature. In this case, we chose 120 topics to be consistent with the NYT-based topic set. Again, the highest probability intrinsic topic, its associated probability, and the Shannon entropy of the topic distribution were saved for the Reddit memes.
Both the intrinsic and extrinsic topic models created one miscellaneous topic in which the most important words in the topic were common words like “already”, “life”, “way”, “time” and “person”. This topic was the most prevalent in both the NYT and Reddit corpora, but it did not dominate the data sets. In both cases, the miscellaneous topic was assigned to less than 10% of the documents. Table 2 shows the top 2 most frequent topics in the Reddit and NYT data, excluding this miscellaneous topic.
4.2 BERTopic
BERTopic is a newer topic modeling technique that uses a transformer-based approach (Grootendorst 2022). The model generates high-dimensional document embeddings, reduces the dimensionality of these embeddings with UMAP, and then clusters the vectors into topic groups using HDBSCAN. Word order is important for the transformer-based model to understand the context of texts. Therefore, word order was not changed during the text cleaning steps described in Sect. 3.3. To ensure comparability, we used the same text data for both the BERTopic and LDA models.
A set of 120 extrinsic topics was extracted using BERTopic from the NYT all-text document corpus, and inference was run on the Reddit all-text corpus to assign these extrinsic topics to the Reddit memes. Although it is not necessary to specify the number of topics before fitting BERTopic, the number of topics can be reduced after training. Our BERTopic model was parameterized to cluster topics containing at least 100 documents, resulting in 194 NYT topics. This was reduced to a set of 120 topics, to match the LDA model, using automatic and manual approaches. We applied BERTopic’s default topic-reduction technique, which merges the most similar HDBSCAN clusters. Additionally, we qualitatively assessed the similarity of topics’ word distributions and manually merged those that were most similar. We trained a second BERTopic model to find 120 intrinsic topics, using only the Reddit all-text features, following the same procedure.
As with the LDA model, BERTopic initially generated miscellaneous topics with common words such as “like” and “also”. These topics were manually combined into one miscellaneous topic, which, again, was the most prevalent topic assigned to both the NYT and Reddit data. However, this topic again accounted for less than 10% of the data sets.
A disadvantage of BERTopic is the number of outliers produced. These are documents that are not assigned to any topic. To address this, we used BERTopic’s calculate_probabilities parameter to compute the probability that each document belonged to every topic. Then, we automatically assigned every document to its highest probability topic, resulting in no outlier documents. Furthermore, this parameter was used to match the LDA results, in which every document is represented by a probability distribution over the topics, \(\underline{\tau }_{m}\). As in the case of the LDA model, we supplemented the meme data with the highest probability topic, its probability, and the Shannon entropy of the topic distribution.
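The outlier-removal step reduces to an argmax over each document’s row of topic probabilities; a minimal sketch (the function name is ours):

```python
def assign_topics(prob_matrix):
    """Assign each document (including would-be BERTopic outliers) to the
    index of its highest-probability topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_matrix]
```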
The BERTopic and LDA topic models obtained similar results. As shown in Fig. 3, there was a considerable amount of overlap between the topics found by the LDA and BERTopic models, and their 5-year temporal distributions. Table 3 shows the top 2 most common topics identified by BERTopic in the NYT and Reddit data sets. In the NYT, the most prevalent topics were those that the NYT publishes about regularly, such as arts events, real estate, and sports. COVID-19 and the invasion of Ukraine were the only two current events that appeared in the top 10 NYT topics. Notably, topics about police brutality and sexual assault were among the top 10 most prevalent NYT-based topics in the memes data as identified by both models, but these topics were not as prevalent in the NYT data.
Although NYT and Reddit follow many of the same topicality trends, the proportions with which they publish about these topics differ. This can be seen in the stream charts in Fig. 4. The NYT publishes much more about the arts, economics, and real estate than Reddit, whereas Reddit posts are more likely to discuss animals and technology than the NYT. Notably, current events are discussed at length in the NYT, but in Reddit memes interest in these topics declines rapidly after the date of the event. The impact of COVID-19 on NYT publishing is also evident in Fig. 4 as live events such as theater performances, weddings, and fashion week were canceled due to the pandemic. Moreover, the effect of Reddit outages, when the website was down for 5–10 h in 2019 and 2021, can also be observed.
4.3 Topicality features
In addition to saving the topic, probability, and topic entropy for each Reddit meme, we developed variables to assess whether trending, “topical” topics were more popular than non-trending topics. These variables were calculated for both the LDA and BERTopic models and for both the NYT-based, extrinsic topics and the Reddit-based, intrinsic topics.
4.3.1 Defining trending topics
First, we define trending topics in terms of how frequently they were published about on given dates. This topicality variable, \(\tau _{t}(d)\), was calculated as the normalized sum of the probabilities with which documents were assigned to a given topic t on a given date d. Formally,

\(\tau _{t}(d) = \frac{1}{|A_d|} \sum _{a \in A_d} p_a(t), \quad (1)\)
where \(A_d\) is the set of articles published on date d and \(p_a(t)\) is the probability that document a is about topic t. Note that \(\tau _{t}(d)\) can be calculated for either daily or monthly granularity. Figures 3 and 4 show monthly topic trends, meaning d in Eq. (1) was the month and year in which the document was posted. Figures 5 and 6, on the other hand, show daily topic trends, meaning d was the date on which the document was posted. Furthermore, \(\tau _{t}(d)\) was calculated for both NYT articles and Reddit memes. We use “document” to refer to either an article or a meme. The blue lines in the above-mentioned figures show NYT distributions in which a in Eq. (1) represents NYT articles, whereas, for the red lines, a represents Reddit memes. The plots serve to compare topical trends in the NYT and Reddit and are thus only based on the sets of extrinsic topics.
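Eq. (1) can be sketched with a toy corpus. This is an illustrative implementation, not the paper's code: documents are (date, topic-distribution) pairs, and the topicality of a topic on a date is the mean probability mass that documents published on that date assign to it.

```python
from collections import defaultdict

# Toy corpus: (publication date, topic distribution) pairs.
# With two topics, each distribution sums to 1. Values are illustrative.
docs = [
    ("2022-02-24", [0.1, 0.9]),
    ("2022-02-24", [0.3, 0.7]),
    ("2022-02-25", [0.8, 0.2]),
]

by_date = defaultdict(list)
for date, dist in docs:
    by_date[date].append(dist)

def tau(topic, date):
    """Eq. (1): mean probability of `topic` over documents posted on `date`."""
    dists = by_date[date]
    return sum(d[topic] for d in dists) / len(dists)

print(tau(1, "2022-02-24"))  # 0.8 — topic 1 dominates on this date
```

Dividing by the number of documents ensures the per-date topicalities sum to 1 across topics, so dates with many and few articles are comparable.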
Topics about current events typically exhibit a rapid spike followed by a slower decline. Oscillating patterns appeared for topics about sporting events, holidays, and other regular celebrations, such as Pride. For many topics, like the invasion of Ukraine and the Christmas holidays shown in Fig. 3, the distribution of Reddit memes posted about the given topic aligns very well with the distribution of NYT articles. For other topics, such as the marriage and romance topic in Fig. 3, the two sources differ. We can also see that while the topic models identify the documents associated with each topic well, they are not perfect. For example, some NYT articles about Russian interference in the 2020 US presidential election are associated with the invasion of Ukraine topic by the LDA model. This can be seen in the 2020 spike in the LDA Ukraine invasion topic in Fig. 3 and is likely due to words about Russia being prevalent in both topics. Despite this shortcoming, the identified topics describe the data sets well for both algorithms.
4.3.2 Defining topical memes
Equation (1) and the corresponding charts describing trending topics can be used to identify which memes are topical and which are not. Here, we consider Reddit memes topical if they are about events that are currently on the rise in NYT publishing. Visually, memes published about a given topic when there is a spike in the corresponding NYT distribution (e.g., on the date of the current event) are considered topical.
Applying Eq. (1) for each topic on the NYT document corpus results in topicality vectors with a length of 120 for each date d, \(\underline{\tau }_{d}\), describing how frequently each topic was being discussed on a given month or day in the 5 years that our data span. This vector can be formalized as,

\(\underline{\tau }_{d} = \sum _{t \in T} \tau _{t}(d)\, \underline{i}_{t}, \quad (2)\)
where T is the set of topics, and \(\underline{i}_{t}\) is the unit vector on |T| coordinates with the coordinate corresponding to topic t being 1.
Using the topicality vector \(\underline{\tau }_{d}\) and the topic distribution \(\underline{\tau }_{m}\) assigned to a meme m in the model inference step, we define the topicality of a Reddit meme m as,

\(\tau (m) = \underline{\tau }_{d_m} \cdot \underline{\tau }_{m}, \quad (3)\)
where \(\underline{\tau }_{d_m}\) is the topicality distribution of the reference data (NYT for extrinsic topics, Reddit for intrinsic topics) on the given date \(d_m\) that the meme m was posted, and \(\underline{\tau }_{m}\) is the topic distribution assigned to the particular meme instance.
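The dot product in Eq. (3) can be sketched in a few lines. The vectors below are illustrative 3-topic examples, not values from the data: a meme scores high when its topic mass sits on topics that are prominent in the reference distribution on its posting date.

```python
import numpy as np

# Eq. (3) as a dot product (toy 3-topic example; values are illustrative).
tau_d = np.array([0.5, 0.3, 0.2])   # reference topicality on the posting date
tau_m = np.array([0.1, 0.8, 0.1])   # topic distribution of the meme

topicality = float(tau_d @ tau_m)
print(round(topicality, 2))  # 0.31
```

A meme concentrated on the date's hottest topic (e.g., distribution [1, 0, 0] here) would score 0.5, the maximum achievable on this date.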
Note that, again, this can be calculated at either a daily or monthly granularity, meaning \(d_{m}\) is either the day or the month in which the meme m was posted. Thus, Eq. (3) was used to define the "daily topicality" and "monthly topicality" features listed in Table 1. We expect the monthly topicality variable to be less noisy. For example, the monthly average topicality for the holidays topic, seen in Fig. 3, is higher for all memes posted in December due to the effect of Christmas, whereas the daily average topicality variable has high values only on the few days around Christmas and the other holidays.
In addition to the daily and monthly topicality features, reflecting how frequently a given topic was being discussed in the NYT or Reddit, we designed a feature describing the slope of the daily topicality distributions (e.g., in Fig. 6). This was calculated as the average slope of the reference data distribution for the \(\pm 2\) days around when the meme was posted. Given that topics about current events tend to spike rapidly, we used a small time frame. This is the “slope topicality” feature in Table 1. We suspect that a topic will be more popular on Reddit when the topicality distribution has a positive slope, indicating increasing topicality trends.
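The slope feature can be sketched as the average day-over-day change of the reference topicality series in a ±2-day window around the posting date. The series and window below are illustrative, not the paper's data:

```python
import numpy as np

# Daily tau_t(d) for one topic, indexed by day offset (illustrative values
# showing a topic spiking upward).
series = {0: 0.01, 1: 0.02, 2: 0.05, 3: 0.09, 4: 0.12}

post_day = 2  # day the meme was posted
window = [series[d] for d in range(post_day - 2, post_day + 3)]  # ±2 days

# Average of consecutive differences = average slope over the window.
slope = float(np.mean(np.diff(window)))
print(round(slope, 4))  # 0.0275 — positive slope: the topic is rising
```

A positive value indicates a rising topic; the small ±2-day window matches the rapid spikes that current-event topics exhibit.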
Two binary topicality features were engineered based on the slope and daily topicality features. These binary variables were used for statistical tests showing that topical memes are more popular than non-topical memes. The first binary topicality variable, visualized in Fig. 7a, indicates whether the meme's topic is being posted about more (labeled 1) or less (labeled 0) than average in the NYT on the date the meme was posted. The second binary variable, visualized in Fig. 7b, was calculated based on the slope topicality. This variable takes the value 1 when the slope of the reference topicality distribution is positive, and 0 when it is negative.
The cross-tabulation of these two binary topicality variables yields a categorical topicality feature with 4 categories. This categorical feature indicates both whether the slope of the reference distribution was positive or negative on the date the meme was posted and whether the topic was being published about more or less than average on that date. Figure 7c highlights the case when the slope of the reference distribution is positive but the topic is not yet being posted about more than average, which we call "combined".
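The two flags and their cross-tabulation can be sketched as a small helper. The function name and threshold values are illustrative, not from the paper's code:

```python
# Two binary topicality flags:
#   above_avg — the topic is posted about more than its long-run average
#   rising    — the slope of the reference distribution is positive
# Their pair gives the 4-category cross-tabulated feature.
def topicality_category(daily_tau, avg_tau, slope):
    above_avg = int(daily_tau > avg_tau)
    rising = int(slope > 0)
    return above_avg, rising

# The "combined" (innovator) case: rising, but not yet above average.
print(topicality_category(daily_tau=0.02, avg_tau=0.03, slope=0.01))  # (0, 1)
```

The (0, 1) cell corresponds to memes posted just as a topic begins gaining traction in the NYT.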
Table 4 reports t-tests comparing the number of upvotes received by topical and non-topical memes according to these binary variables. On average, memes about topics with increasing (positive-slope) topicality trends receive a higher score than memes about topics with decreasing topicality trends. There is an additional innovators' advantage for memes posted right when a topic is beginning to gain popularity in the NYT (Fig. 7c), as shown by the results for the combined topicality variable. Differences between the other groups, for which t-test results were not reported, were not significant. Taken together, these results indicate that memes posted about a topic that has recently hit the news receive more upvotes on average. Topicality has a positive effect on Reddit popularity; however, this effect does not endure long after the date of the event, as indicated by the non-significant results for the Fig. 7a variable.
5 Identifying viral memes
In order to gain a better understanding of how topicality influences the virality of a meme, we trained a CatBoost classifier. CatBoost, an ensemble learning algorithm for gradient boosting on decision trees, has been shown to outperform other algorithms on a variety of problems (Ibrahim et al. 2020). The classifier handles categorical features natively, so the 120-category topic features did not need to be one-hot encoded before training. We used the image, text, and topicality features summarized in the shaded rows of Table 1 for the binary classification task of identifying viral memes in our Reddit data. Figure 8 shows the algorithm's performance in identifying viral memes.
The binary target variable was highly unbalanced because it was defined as a percentile: only the top 5 percent of Reddit memes were considered viral. We used the CatBoost weighting parameter to improve performance on the imbalanced class prediction. Additionally, using the feature set of the top-performing model, we experimented with defining the viral target variable based on different percentiles, where the top 1%, 5%, 10%, 15%, 20%, or 25% of memes were marked as viral. These cut-off values resulted in AUC scores of 0.73, 0.73, 0.72, 0.71, 0.71, and 0.71, respectively. Interestingly, these experiments indicate a slight advantage to defining viral memes as those in the top 1 or 5% of the data, but overall the choice has a small impact on the outcome. The fact that the AUC is largely independent of the threshold used to define the viral class indicates that the machine learning model can effectively rank memes based on their scores. This suggests that approaching the problem through quantile regression could be a viable alternative.
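The percentile-based target and the class weight it implies can be sketched as follows. The synthetic scores and the weighting rule (negative/positive ratio) are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Synthetic heavy-tailed upvote scores standing in for Reddit data.
rng = np.random.default_rng(0)
scores = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

# Label the top 5% of memes as viral (class 1).
threshold = np.percentile(scores, 95)
viral = (scores > threshold).astype(int)

# Weight the rare positive class by the negative/positive ratio; a value
# like this could be passed to CatBoost's class-weighting parameter.
w_pos = (viral == 0).sum() / (viral == 1).sum()
print(viral.mean(), round(w_pos))  # 0.05 19
```

Re-running the labeling with a different percentile (1%, 10%, ...) only moves the threshold, which is why the ranking quality (AUC) changes so little across cut-offs.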
The CatBoost classifier was trained using data from 763,388 memes and tested using 84,821 memes. The model was tuned with 5-fold cross-validation, using sklearn's GridSearchCV for parameter selection. Notably, the title length feature is the most important for classifying viral memes. Memes posted with short titles perform better than memes accompanied by a lot of text. This mirrors earlier work showing that brevity is the most important feature for viral tweets (Tsur and Rappoport 2015). The reason could be that messages that are easier to convey are understood better. Given the importance of the text length variable, we used CatBoost's per_float_feature_quantization parameter to increase the number of decision boundaries allowed for this feature, taking advantage of its predictive power.
The classifier was trained separately for the LDA-based and BERTopic-based topicality feature sets. In order to estimate the importance of the topic-based features for predicting meme virality, we trained the CatBoost model with different subsets of features. The first models were trained using only topicality features, highlighted in red in Table 1. Alone, the topicality features (AUCs: LDA=0.64, BERTopic=0.57) are already able to predict meme virality better than a random guess (AUC=0.5).
It could be the case that the explanatory power of the topicality features is already contained in the control feature set. To test this, we trained the model with only the control features described in Sects. 3.3 and 3.4. Then, we assessed the incremental predictive power of topicality features over control features. Models trained with the entire feature set, all shaded rows in Table 1, were better able to identify viral memes than the model trained with only the control feature set. We tested the statistical significance of these AUC differences using a procedure from previous research (Hanley and McNeil 1982), finding that the engineered topicality features have significant predictive power over the control features for both BERTopic (\(AUC=0.71\), \(p<0.01\)) and LDA (\(AUC=0.73\), \(p<0.001\)). Results were more significant for the LDA topic model, and topicality features ranked higher in the LDA feature importances than in those of BERTopic, as seen in Fig. 8.
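The AUC underlying these comparisons can be computed with the rank-based (Mann-Whitney) formulation; the Hanley-McNeil procedure cited above then compares two such AUCs via their standard errors. The implementation below is a minimal sketch (it ignores tied scores), with synthetic labels and predictions:

```python
import numpy as np

# Rank-based AUC: the probability that a random positive example is scored
# above a random negative example. Assumes no tied scores.
def auc(y_true, y_score):
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([0, 0, 1, 1, 0, 1])              # toy viral labels
s = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # toy model scores
print(auc(y, s))  # 8/9: 8 of the 9 positive-negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking, which is the baseline against which the topicality-only models (LDA=0.64, BERTopic=0.57) are judged.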
Figure 9 shows the SHapley Additive exPlanations (SHAP) values for the extrinsic topicality features (Lundberg and Lee 2017). The text length feature, which is negatively correlated with meme virality, is provided for reference: memes with fewer words have a higher probability of being viral. The topicality features, on the other hand, are positively associated with CatBoost's estimate of meme virality. Memes with higher daily and monthly topicality values are labeled viral with a higher probability by the CatBoost model than memes with lower topicality values. Interestingly, higher values of topic entropy and topic probability also have a positive effect on the CatBoost outcome. This suggests that memes with a strong dominant topic (high probability) but otherwise mixed topic distributions (high entropy) are more likely to go viral.
6 Discussion
The findings of this paper can be interpreted from the perspective of agenda-setting theory, which states that media sources shape the interests of society by selecting which topics to publish about. Our finding that news topicality is positively correlated with social media virality indicates that it may be harder to attract attention to, and spread, non-topical information. This study does not unveil a causal link between topics in the NYT and on Reddit; however, the substantial overlap between the sources indicates that they exert similar influence on their audiences.
Previous work performing topic analysis on social media data has focused on user topical interests (Grabowicz et al. 2016; Weng et al. 2012; Ferrara et al. 2014). Our analysis of general topics across a social media platform, and the use of a reference corpus to assess their topicality, is a unique contribution. We furthermore show that social media sites may align significantly with the news and should be considered part of the news cycle. Earlier findings from research on the news cycle suggest that early adoption is of crucial importance (Leskovec et al. 2009; Yang and Leskovec 2011), which we confirm here in our analysis of Reddit data. Our finding that topicality plays a role in meme popularity suggests that topics may influence a more complex diffusion process in online media. This complicates the existing account of meme diffusion as a simple contagion process (Wang and Wood 2011).
Finally, the results presented here contribute to the growing body of research modeling multi-modal data. With the rise of internet communication, multi-modality represents a shift in the forms discourse takes. Predicting the popularity of multimedia content is more difficult than for text-only data. For example, impressive results have been achieved predicting viral tweets (Weng et al. 2012; Tsur and Rappoport 2015), but we have not yet seen such success predicting the popularity of image-with-text memes. Apart from one study that used a small, hand-labeled data set (Ling et al. 2021), our AUC of 0.73 is the best result we are aware of in the literature. Recent improvements in generative AI will soon make modeling multi-modal content more tractable.
With the increasing availability of internet data, social media has become a popular source of data for academics. Reddit is an especially popular data source, and the ethical guidelines around using this data are not yet firmly established. In light of this, we heed recommendations from studies surveying social media users about research data use. Namely, we collected data from a large, public community, ensured no personally identifying information was associated with the data, and plan to share our results back with Reddit upon completion (Proferes et al. 2021).
7 Conclusion
While NYT and Reddit are qualitatively very different sources of information, they do show significant alignment. NYT-based topics are good descriptors and predictors of popular Reddit memes. Here we showed:
- Many topics on Reddit and in the NYT exhibit similar topicality distributions.
- Memes about topics with increasing topicality trends in the NYT receive more upvotes on average.
- "Innovator" memes posted when a topic is just beginning to be prevalent in the news receive the most upvotes on average.
- Interest in topics from the news cycle declines more rapidly on Reddit than in the NYT.
- A set of engineered features can identify viral memes with a leading AUC of 0.73.
The scope of topicality in this study is shaped by its reliance on NYT data. Although the NYT covers a range of global events and subjects beyond news, the choice of this source may lead to a focus that is somewhat more centered on U.S. and political themes. The USA-skewed results were appropriate for our purposes, matching the large proportion of Reddit users based in the USA, 48% of all users (Biachi 2023). However, this means our results are less relevant to other parts of the globe.
Our use of BLIP (Li et al. 2022) image captioning ensured that the topics assigned to Reddit memes were based on image as well as text content. Given the importance of the image to a meme, this was a crucial step. One drawback to this method is that BLIP cannot recognize celebrities. Identifying politicians in meme images would doubtless have improved our topic assignment. Furthermore, our research did not take community structure into account, although previous research has shown that accounting for community structure can improve topic assignment (Kalyanam et al. 2015), especially when modeling a social media phenomenon such as memes.
Many interesting implications of this research could be expanded on in future studies. Our work highlighted significant topical alignment between Reddit and NYT but also showed that the most common topics differ between the two sources. Further work could identify which topics from the news cycle are most likely to appear and gain popularity on Reddit. Similarly, our work indicated that interest in news topics declines more rapidly on social media than in the news, but future research could quantify these differences in topic dynamics.
Data availability
The Reddit data was collected using the Pushshift API, and it consists of all posts from January 1, 2018 to November 14, 2022 from the r/memes subreddit. The New York Times data was collected using the NYT Archive API, and it consists of all articles published in the same time frame.
Notes
https://developer.nytimes.com/apis.
OCR.space Free OCR API and Online OCR, https://ocr.space/
References
Salihefendic A (2015) How Reddit ranking algorithms work. Accessed June 1, 2023
Barnes K, Riesenmy T, Trinh MD, Lleshi E, Balogh N, Molontay R (2021) Dank or not? Analyzing and predicting the popularity of memes on Reddit. Appl Netw Sci 6(1):1–24
Baumgartner J, Zannettou S, Keegan B, Squire M, Blackburn J (2020) The pushshift Reddit dataset. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, pp. 830–839
Bird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media Inc, Sebastopol
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
Boe B (2016) PRAW: the Python Reddit API Wrapper. https://github.com/praw-dev/praw. Accessed: 2022-12-15
Brodie R (2009). Virus of the mind: the new science of the meme. Hay House, Carlsbad, CA
Carman M, Koerber M, Li J, Choo KKR, Ashman H (2018) Manipulating visibility of political and apolitical threads on Reddit via score boosting. In: 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pp. 184–190. IEEE
Cialdini RB, Goldstein NJ (2004) Social influence: compliance and conformity. Ann Rev Psychol 55:591–621
Coscia M (2014) Average is boring: How similarity kills a meme’s success. Sci Rep 4(1):6477
Datta R, Joshi D, Li J, Wang JZ (2006) Studying aesthetics in photographic images using a computational approach. In: Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Springer, Graz, Austria, May 7–13, Proceedings, Part III 9, pp. 288–301.
Du Y, Masood MA, Joseph K (2020) Understanding visual memes: An empirical analysis of text superimposed on memes shared on Twitter. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, pp. 153–164
Dynel M, Messerli TC (2020) On a cross-cultural memescape: Switzerland through nation memes from within and from the outside. Contrast Pragmat 1(2):210–241
Egger R, Yu J (2022) A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front Sociol 7:886498
Ferrara E, Interdonato R, Tagarelli A (2014) Online popularity and topical interests through the lens of instagram. In: Proceedings of the 25th ACM Conference on Hypertext and Social Media, pp. 24–34
Figueiredo F, Almeida JM, Gonçalves MA, Benevenuto F (2014) On the dynamics of social media popularity: a youtube case study. ACM Trans Int Technol (TOIT) 14(4):1–23
Grabowicz P, Ganguly N, Gummadi K (2016) Distinguishing between topical and non-topical information diffusion mechanisms in social media. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 10, pp. 151–160
Grootendorst M (2022) BERTopic: neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794
HSDS (2024) Topical alignments between the NYT and Reddit. https://github.com/hsdslab/topicality-online Accessed Feb 1, 2024
Halevy A, Canton-Ferrer C, Ma H, Ozertem U, Pantel P, Saeidi M, Silvestri F, Stoyanov V (2022) Preserving integrity in online social networks. Commun ACM 65(2):92–98
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
Ibrahim AA, Ridwan RL, Muhammed MM, Abdulaziz RO, Saheed GA (2020) Comparison of the CatBoost classifier with other machine learning methods. Int J Adv Comp Sci Appl 11(11):738–748
Liedke J, Wang L (2023) Social media and news fact sheet. Pew Research Center. https://www.pewresearch.org/journalism/fact-sheet/social-media-and-news-fact-sheet/ Accessed Dec 1, 2023
Kalyanam J, Mantrach A, Saez-Trumper D, Vahabi H, Lanckriet G (2015) Leveraging social context for modeling topic evolution. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517–526
Khosla A, Das Sarma A, Hamid R (2014) What makes an image popular? In: Proceedings of the 23rd International Conference on World Wide Web, pp. 867–876
Kim T, Wurster K (2022) Emoji 2.8.0 for Python. https://pypi.org/project/emoji/ Accessed May 1, 2023
Leach CW, Allen AM (2017) The social psychology of the black lives matter meme and movement. Curr Direct Psychol Sci 26(6):543–547
Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 497–506
Li J, Li D, Xiong C, Hoi S (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR
Ling C, AbuHilal I, Blackburn J, De Cristofaro E, Zannettou S, Stringhini G (2021) Dissecting the meme magic: understanding indicators of virality in image memes. Proc ACM Human Comp Interact 5(CSCW1):1–24
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
Mascarenhas M, Friedman DA, Cordes RJ (2024) Bridging gaps in image meme research: a multidisciplinary paradigm for scaling up qualitative analyses. J Assoc Inf Sci Technol. https://doi.org/10.1002/asi.24900
McClure B (2016) Discovering the discourse of internet political memes. In: Adult education research conference. https://newprairiepress.org/aerc/2016/roundtables/12
Nelson DL, Reed VS, Walling JR (1976) Pictorial superiority effect. J Exp Psychol Human Learn Mem 2(5):523
Podolak M (2021) PMAW: pushshift multithread API wrapper. https://github.com/mattpodolak/pmaw. Accessed: 2022-12-15
Poux-Médard G, Velcin J, Loudcher S (2022) Properties of Reddit news topical interactions, pp. 16–28. Springer
Proferes N, Jones N, Zimmer M (2021) Studying reddit: a systematic overview of disciplines, approaches, methods, and ethics. Soc Med Soc. https://doi.org/10.1177/20563051211019004
Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A (2018) Catboost: unbiased boosting with categorical features. Adv Neural Inf Process Syst 31
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comp Vis 115:211–252
Sanderson B, Rigby M (2013) We’ve Reddit, have you?: What librarians can learn from a site full of memes. Coll Res Lib News 74(10):518–521
Shreyas P (2019) Sentiment analysis for text with Deep learning. Medium. https://towardsdatascience.com/sentiment-analysis-for-text-with-deep-learning-2f0a0c6472b5 Accessed 2020-10-01
Simmons M, Adamic L, Adar E (2011) Memes online: extracted, subtracted, injected, and recollected. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 5, pp. 353–360
Spitzberg BH (2014) Toward a model of meme diffusion (m3d). Commun Theory 24(3):311–339
Stone R (2018) Image segmentation using color spaces in OpenCV+Python. https://realpython.com/python-opencv-color-spaces/ Accessed 2020-10-01
Biachi T (2023) Reddit—statistics & facts. Statista. https://www.statista.com/topics/5672/reddit/#topicOverview Accessed August 1, 2023
Tsur O, Rappoport A (2015) Don’t let me be# misunderstood: Linguistically motivated algorithm for predicting the popularity of textual memes. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 9, pp. 426–435
Wang L, Wood BC (2011) An epidemiological approach to model the viral propagation of memes. Appl Math Modell 35(11):5442–5447
Weng L, Menczer F (2015) Topicality and impact in social media: diverse messages, focused messengers. PloS One 10(2):0118410
Weng L, Flammini A, Vespignani A, Menczer F (2012) Competition among memes in a world with limited attention. Sci Rep 2:335. https://doi.org/10.1038/srep00335
Yang J, Leskovec J (2011) Patterns of temporal variation in online media. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 177–186
Yarchi M, Boxman-Shabtai L (2023) The image war moves to TikTok: evidence from the May 2021 round of the Israeli-Palestinian conflict. Digital Journalism, pp. 1–21
Zannettou S, Caulfield T, Blackburn J, De Cristofaro E, Sirivianos M, Stringhini G, Suarez-Tangil G (2018) On the origins of memes by means of fringe web communities. In: Proceedings of the Internet Measurement Conference 2018, pp. 188–202
Zhang K, Zhang Z, Li Z, Qiao Y (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Sig Process Lett 23(10):1499–1503
Acknowledgements
We thank József Pintér for collecting Reddit data, as well as Donát Köller and Levente Murgás for their research insights. The authors declare they have no competing interests. The research was supported by Fulbright Hungary and by the European Union project RRF-2.3.1-21-2022-00004 within the AI National Laboratory. Roland Molontay is also supported by the National Research, Development, and Innovation Fund through OTKA Grant PD-142585.
Funding
Open access funding provided by Budapest University of Technology and Economics.
Author information
Contributions
R.M. and J.P. conceived the study. K.B. reviewed the literature, collected the data, and performed the analysis. K.B., R.M., and N.M. interpreted the results. K.B. and N.M. prepared the figures. R.M. supervised the research activities. K.B. wrote the main manuscript with input from all authors. All authors critically reviewed and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Barnes, K., Juhász, P., Nagy, M. et al. Topicality boosts popularity: a comparative analysis of NYT articles and Reddit memes. Soc. Netw. Anal. Min. 14, 119 (2024). https://doi.org/10.1007/s13278-024-01272-3