SOCIAL MEDIA MINING LAB

EXP 1

To collect and visualize complex social networks from Twitter and Wikipedia using NodeXL and Python, you can follow these steps:

### 1. Collect Data:


#### From Twitter:
- Use the Tweepy library to access the Twitter API and fetch relevant data such as tweets, user
information, and relationships.
- Collect tweets based on keywords, hashtags, or user handles to build your social network.

#### From Wikipedia:


- Utilize web scraping libraries like BeautifulSoup or Scrapy to extract information from
Wikipedia pages.
- Gather data such as page links, categories, and content related to your topic of interest.

### 2. Preprocess Data:


- Clean and preprocess the collected data to remove noise, handle missing values, and format it
appropriately for analysis.
- Extract relevant features such as user mentions, hashtags, URLs, and user interactions from Twitter data (a small extraction sketch follows this list).
- Extract relevant information such as page titles, links, and categories from Wikipedia data.
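
As a rough illustration of the Twitter preprocessing step above, the following sketch pulls mentions, hashtags, and URLs out of raw tweet text with regular expressions. The `preprocess_tweet` helper and the sample tweet are made up for this example and not part of any library.

```python
import re

def preprocess_tweet(text):
    """A minimal cleaning sketch: pull out mentions, hashtags, and URLs, then strip them."""
    mentions = re.findall(r'@\w+', text)
    hashtags = re.findall(r'#\w+', text)
    urls = re.findall(r'https?://\S+', text)
    # Remove the extracted tokens and collapse whitespace to get clean text
    cleaned = re.sub(r'@\w+|#\w+|https?://\S+', '', text)
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return {'mentions': mentions, 'hashtags': hashtags, 'urls': urls, 'text': cleaned}

print(preprocess_tweet("Loving #DataScience tips from @alice https://example.com"))
```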

### 3. Create Social Network Graphs:


- Use NodeXL, or the NetworkX library in Python (as in the example code below), to create social network graphs.
- For Twitter data, nodes can represent users, and edges can represent interactions such as
retweets, mentions, or follows.
- For Wikipedia data, nodes can represent Wikipedia pages, and edges can represent links between
pages or shared categories.

### 4. Visualize Networks:


- Once you have created the social network graphs, use NodeXL's built-in visualization features to
visualize the networks.
- Customize the visualization settings to highlight important nodes, edges, or clusters within the
networks.
- Experiment with different layout algorithms to find the most suitable layout for your data (a quick layout-comparison sketch follows this list).
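
As a quick way to compare layouts, the sketch below draws the same graph with three common NetworkX layout functions side by side. The built-in Les Misérables co-occurrence graph stands in for your own network, and `kamada_kawai_layout` additionally requires SciPy.

```python
import networkx as nx
import matplotlib.pyplot as plt

# A small demo graph stands in for your Twitter or Wikipedia network
G = nx.les_miserables_graph()

# Three common NetworkX layouts to compare side by side
layouts = {
    'spring_layout': nx.spring_layout(G, seed=42),
    'circular_layout': nx.circular_layout(G),
    'kamada_kawai_layout': nx.kamada_kawai_layout(G),  # requires SciPy
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (name, pos) in zip(axes, layouts.items()):
    nx.draw(G, pos, ax=ax, node_size=30, width=0.3)
    ax.set_title(name)
plt.show()
```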

### Example Code:


```python
import tweepy
import networkx as nx
import matplotlib.pyplot as plt
import wikipedia

# Twitter API credentials
consumer_key = 'your_consumer_key'
consumer_secret = 'your_consumer_secret'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'

# Authenticate with the Twitter API
auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret,
                                access_token, access_token_secret)
api = tweepy.API(auth)

# Fetch tweets based on a keyword (this method is named `search` in Tweepy 3.x)
tweets = api.search_tweets(q='data science', count=100)

# Create an empty graph
twitter_graph = nx.Graph()

# Add nodes for users
for tweet in tweets:
    user_id = tweet.user.id
    user_name = tweet.user.screen_name
    twitter_graph.add_node(user_id, label=user_name)

# Add edges for user mentions
for tweet in tweets:
    user_id = tweet.user.id
    for mentioned_user in tweet.entities['user_mentions']:
        mentioned_user_id = mentioned_user['id']
        twitter_graph.add_edge(user_id, mentioned_user_id)

# Visualize the Twitter network
nx.draw(twitter_graph, with_labels=True)
plt.show()

# Fetch Wikipedia page links
page = wikipedia.page("Data science")
links = page.links

# Create an empty graph
wiki_graph = nx.Graph()

# Add nodes for Wikipedia pages
for link in links:
    wiki_graph.add_node(link)

# Add edges for page links (capped at 50 pages so the crawl finishes in reasonable time)
for link in links[:50]:
    try:
        linked_pages = wikipedia.page(link, auto_suggest=False).links
    except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
        continue  # skip ambiguous or missing pages
    for linked_page in linked_pages:
        if linked_page in links:
            wiki_graph.add_edge(link, linked_page)

# Visualize the Wikipedia network
nx.draw(wiki_graph, with_labels=True)
plt.show()
```

Make sure to replace `'your_consumer_key'`, `'your_consumer_secret'`, `'your_access_token'`, and `'your_access_token_secret'` with your actual Twitter API credentials. Also, ensure you have installed the required libraries (`tweepy`, `networkx`, `matplotlib`, `wikipedia`) using pip.

EXP 2

To compute various vertex and network metrics for social graphs using NodeXL and Python, you
can utilize the NetworkX library, which provides implementations for these metrics. Below is an
example code demonstrating how to compute each of the specified metrics:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# For demonstration, Zachary's karate club graph is used here:
G = nx.karate_club_graph()

# (i) Degree Centrality
degree_centrality = nx.degree_centrality(G)

# (ii) Eigenvector Centrality
eigenvector_centrality = nx.eigenvector_centrality(G)

# (iii) Betweenness Centrality
betweenness_centrality = nx.betweenness_centrality(G)

# (iv) PageRank
pagerank = nx.pagerank(G)

# (v) Closeness Centrality
closeness_centrality = nx.closeness_centrality(G)

# (vi) Group Centrality (group degree centrality for a chosen set of nodes)
group = list(G.nodes())[:5]  # replace with the group of nodes you care about
group_centrality = nx.group_degree_centrality(G, group)

# (vii) Clustering Coefficient
clustering_coefficient = nx.clustering(G)

# Print or visualize the computed metrics
print("Degree Centrality:", degree_centrality)
print("Eigenvector Centrality:", eigenvector_centrality)
print("Betweenness Centrality:", betweenness_centrality)
print("PageRank:", pagerank)
print("Closeness Centrality:", closeness_centrality)
print("Group Centrality:", group_centrality)
print("Clustering Coefficient:", clustering_coefficient)

# Example: Visualize the network with node sizes representing degree centrality
# You can customize the visualization as per your preference
node_sizes = [degree_centrality[node] * 1000 for node in G.nodes()]
nx.draw(G, with_labels=True, node_size=node_sizes)
plt.show()
```

Make sure to replace `'your_graph_file.txt'` with the path to your social graph file if you're loading
it from a file. Also, ensure you have installed the required libraries (`networkx`, `matplotlib`) using
pip.

This code snippet computes the specified metrics for the given social graph and prints them out.
You can also visualize the network with customized node sizes based on degree centrality as shown
in the example. Adjustments and customizations can be made according to your specific
requirements.
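
Once the metric dictionaries are computed, a common follow-up is to rank nodes by a metric. Below is a minimal sketch that reports the top five nodes by betweenness centrality, using the built-in karate club graph as a stand-in for your own data.

```python
import networkx as nx

G = nx.karate_club_graph()  # demo graph; substitute your own

betweenness = nx.betweenness_centrality(G)

# Sort nodes by betweenness centrality, highest first, and report the top 5
top_nodes = sorted(betweenness.items(), key=lambda item: item[1], reverse=True)[:5]
print("Top 5 nodes by betweenness centrality:")
for node, score in top_nodes:
    print(f"  node {node}: {score:.3f}")
```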

EXP 3

To visualize social graphs reflecting various metrics using NodeXL and Python, you can use
NetworkX for graph manipulation and Matplotlib for visualization. Below is a sample code
demonstrating how to visualize a social graph while reflecting the computed metrics:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# For demonstration, Zachary's karate club graph is used here:
G = nx.karate_club_graph()

# For demonstration purposes, let's compute Degree Centrality and PageRank

# (i) Degree Centrality
degree_centrality = nx.degree_centrality(G)

# (iv) PageRank
pagerank = nx.pagerank(G)

# Create a new figure for plotting
plt.figure(figsize=(10, 8))

# Define a layout for the graph
pos = nx.spring_layout(G)

# Draw nodes colored by degree centrality
nodes = nx.draw_networkx_nodes(G, pos, node_color=list(degree_centrality.values()),
                               cmap=plt.cm.Blues, node_size=300)

# Draw edges with a semi-transparent gray color
nx.draw_networkx_edges(G, pos, alpha=0.5, edge_color='gray')

# Add a color bar for degree centrality (pass the node collection as the mappable)
plt.colorbar(nodes, label='Degree Centrality')

# Add labels for nodes with their PageRank values
labels = {node: f"{node}\nPR: {pagerank[node]:.2f}" for node in G.nodes()}
nx.draw_networkx_labels(G, pos, labels=labels, font_size=8)

# Add title
plt.title("Social Network Graph with Degree Centrality and PageRank")

# Show plot
plt.axis('off')  # Turn off axis
plt.show()
```

In this code:
- We compute Degree Centrality and PageRank for the given social graph.
- The graph is visualized using a spring layout.
- Nodes are colored based on their degree centrality, and node size remains constant.
- Edges are drawn with a semi-transparent gray color.
- Node labels are added, displaying both the node ID and its corresponding PageRank value.
- A color bar is added to indicate the degree centrality of nodes.

You can customize the visualization further according to your specific requirements or include
additional metrics to reflect in the visualization.
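
For example, one simple variation is to size nodes by PageRank instead of keeping the size constant. The sketch below uses the built-in karate club graph as a stand-in for your own data.

```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()  # demo graph; substitute your own
pagerank = nx.pagerank(G)

pos = nx.spring_layout(G, seed=42)
# Scale PageRank scores up so they are visible as node areas
node_sizes = [pagerank[node] * 10000 for node in G.nodes()]
nx.draw(G, pos, with_labels=True, node_size=node_sizes, node_color='lightsteelblue')
plt.title("Node size proportional to PageRank")
plt.show()
```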

EXP 4

Detecting bridges in a social graph helps identify edges whose removal would disconnect the graph.
You can accomplish this using NetworkX in Python. Below is how you can implement it:

```python
import networkx as nx

# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# For demonstration, Zachary's karate club graph is used here:
G = nx.karate_club_graph()

# Detect bridges in the graph
bridges = list(nx.bridges(G))

# Print the bridges
if bridges:
    print("Bridges found in the graph:")
    for bridge in bridges:
        print(bridge)
else:
    print("No bridges found in the graph.")
```

In this code:
- We utilize NetworkX's `nx.bridges(G)` function to detect bridges in the graph `G`.
- Bridges are edges whose removal increases the number of connected components, disconnecting part of the graph.
- `nx.bridges` yields the bridge edges; wrapping it in `list()` gives a list of tuples, where each tuple is a bridge edge `(u, v)`.

You can adapt this code to your specific graph data and further analyze or visualize the detected
bridges as required.
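
One natural follow-up is to visualize the detected bridges. The sketch below colors bridge edges red, again using the built-in karate club graph as a stand-in for your own data.

```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()  # demo graph; substitute your own
bridges = set(nx.bridges(G))

pos = nx.spring_layout(G, seed=42)
# Color each edge red if it is a bridge, gray otherwise
edge_colors = ['red' if (u, v) in bridges or (v, u) in bridges else 'gray'
               for u, v in G.edges()]
nx.draw(G, pos, with_labels=True, edge_color=edge_colors, node_color='lightblue')
plt.title("Bridges highlighted in red")
plt.show()
```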

EXP 5

Detecting communities and influencers in a social graph can provide insights into the structure and
key players within the network. One approach to identifying communities is through clique
identification, where cliques represent densely connected subgraphs. Below is a basic
implementation of brute-force clique identification on Enron email data using NetworkX in Python:

```python
import networkx as nx

# Load Enron email data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('enron_emails.txt')
# A small built-in graph is used here as a stand-in so the snippet runs end to end:
G = nx.karate_club_graph()

# Brute-force approach: enumerate all maximal cliques in the graph
cliques = list(nx.find_cliques(G))

# Print the identified cliques
if cliques:
    print("Cliques found in the graph:")
    for clique in cliques:
        print(clique)
else:
    print("No cliques found in the graph.")
```

In this code:
- We utilize NetworkX's `nx.find_cliques(G)` function to identify all maximal cliques in the Enron email graph `G`.
- A clique is a complete subgraph in which every node is connected to every other node; a maximal clique is one that cannot be extended by adding another adjacent node.
- `nx.find_cliques` yields the cliques; wrapping it in `list()` gives a list of lists, where each inner list represents one maximal clique.

This approach is a brute-force method and may not be efficient for large graphs. You can explore
more advanced community detection algorithms such as Louvain or Girvan-Newman for more
scalable solutions. Additionally, you can analyze the identified cliques further to identify influencers
within each community based on their centrality metrics or other criteria.
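
As a hedged sketch of that suggestion: recent NetworkX releases (2.8 and later) ship `nx.community.louvain_communities`, and influencers can be picked per community by degree centrality. The karate club graph stands in for the Enron data here, and "top node by degree centrality" is just one possible influencer criterion.

```python
import networkx as nx

# Demo graph standing in for the Enron email network
G = nx.karate_club_graph()

# Louvain community detection (NetworkX 2.8+); a fixed seed keeps the partition reproducible
communities = nx.community.louvain_communities(G, seed=42)

# Within each community, rank members by degree centrality and treat the top node
# as that community's influencer
degree_centrality = nx.degree_centrality(G)
for i, community in enumerate(communities):
    influencer = max(community, key=lambda node: degree_centrality[node])
    print(f"Community {i}: {sorted(community)}")
    print(f"  Influencer (highest degree centrality): {influencer}")
```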

EXP 6

Implementing the Girvan-Newman algorithm for community detection involves iteratively removing edges from the graph in order of edge betweenness centrality until the graph splits into separate communities. Below is how you can implement the Girvan-Newman algorithm using NetworkX in Python:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Load your social graph data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('your_graph_file.txt')
# For demonstration, Zachary's karate club graph is used here:
G = nx.karate_club_graph()

# Function to find communities using the Girvan-Newman algorithm
def girvan_newman(G):
    # nx.community.girvan_newman yields successively finer partitions;
    # take the first split and return its communities as lists of nodes
    first_partition = next(nx.community.girvan_newman(G))
    return [list(community) for community in first_partition]

# Detect communities using the Girvan-Newman algorithm
communities = girvan_newman(G)

# Visualize the communities
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G)
nx.draw_networkx_edges(G, pos, alpha=0.5)
nx.draw_networkx_labels(G, pos)

# Draw each community's nodes with a different color
for i, community in enumerate(communities):
    color = plt.cm.jet(i / len(communities))
    nx.draw_networkx_nodes(G, pos, nodelist=community,
                           node_color=[color] * len(community))

plt.title("Graph with Communities Identified by Girvan-Newman Algorithm")
plt.axis('off')
plt.show()
```

In this code:
- We define a function `girvan_newman()` that wraps NetworkX's built-in implementation of the Girvan-Newman algorithm.
- `nx.community.girvan_newman(G)` returns a generator of tuples, where each tuple is a successively finer partition of the graph into communities; here we keep only the first split.
- We visualize the graph with nodes colored according to their communities using Matplotlib.

You can customize this code according to your specific graph data and further analyze or visualize
the detected communities as required.
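
One way to judge the quality of a detected partition is its modularity. The sketch below scores the first Girvan-Newman split of a demo graph; substitute your own graph for the karate club stand-in.

```python
import networkx as nx

G = nx.karate_club_graph()  # demo graph; substitute your own

# First split produced by Girvan-Newman
partition = next(nx.community.girvan_newman(G))

# Modularity close to 1 indicates strong community structure; values near 0
# suggest the split is no better than random
score = nx.community.modularity(G, partition)
print(f"Modularity of the first Girvan-Newman split: {score:.3f}")
```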

EXP 7

Performing classification with network information involves leveraging features derived from the network structure to classify nodes. One well-known approach is the Weighted Vote Relational Neighbor (WV-RN) classifier, which predicts a node's label from the labels of its neighbors. The example below is a simplified stand-in: it trains a k-nearest-neighbors classifier on graph-derived features (node degree and average neighbor degree) rather than applying the weighted-vote rule itself; a sketch of the actual weighted-vote rule follows the explanation further down.

```python
import networkx as nx
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load your Twitter data into a NetworkX graph object
# Example:
# G = nx.read_edgelist('twitter_data.txt')
# A small built-in graph is used here as a stand-in so the snippet runs end to end:
G = nx.karate_club_graph()

# Set of nodes with a positive label (replace with your own labeled nodes)
positive_nodes = {n for n, data in G.nodes(data=True) if data.get('club') == 'Mr. Hi'}

# Function to extract features from the graph for classification
def extract_features(G, node):
    # Degree of the node
    degree = G.degree(node)
    # Average neighbor degree (the +1 guards against isolated nodes with degree 0)
    avg_neighbor_degree = sum(G.degree(neighbor) for neighbor in G.neighbors(node)) / (degree + 1)
    return [degree, avg_neighbor_degree]

# Create the feature matrix and target labels
X = []
y = []
for node in G.nodes():
    X.append(extract_features(G, node))
    # Each node is labeled 1 (positive) or 0 (negative)
    y.append(1 if node in positive_nodes else 0)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the classifier
classifier = KNeighborsClassifier(n_neighbors=5, weights='distance')
classifier.fit(X_train, y_train)

# Predict the labels for test data
y_pred = classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

In this code:
- We define a function `extract_features()` to extract features from the graph for classification. In
this example, we extract the node's degree and average neighbor degree.
- We iterate over all nodes in the graph and extract features along with their labels for classification.
- We split the data into training and test sets.
- We train a K-nearest neighbors classifier using the extracted features.
- We evaluate the classifier's performance by predicting labels for the test data and calculating
accuracy.

You can customize this code according to your specific Twitter data and classification requirements.
Additionally, you can explore more sophisticated classifiers and feature extraction techniques for
better classification performance.
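
If you want the weighted-vote relational neighbor rule itself rather than the feature-based stand-in above, a minimal single-pass sketch could look like the following. The demo graph, the handful of `known_labels`, and the 0.5 decision threshold are all illustrative assumptions rather than part of any library.

```python
import networkx as nx

# Demo graph and a handful of "known" labels; substitute your own data
G = nx.karate_club_graph()
known_labels = {0: 1, 1: 1, 33: 0, 32: 0}  # node -> class (1 = positive, 0 = negative)

def wvrn_predict(G, known_labels, node):
    """Weighted-vote relational neighbor: score = weighted mean of labeled neighbors' labels."""
    votes, total_weight = 0.0, 0.0
    for neighbor in G.neighbors(node):
        if neighbor in known_labels:
            weight = G[node][neighbor].get('weight', 1.0)
            votes += weight * known_labels[neighbor]
            total_weight += weight
    if total_weight == 0:
        return None  # no labeled neighbors, so abstain
    score = votes / total_weight
    return 1 if score >= 0.5 else 0

# Predict labels for all nodes that are not already labeled
predictions = {node: wvrn_predict(G, known_labels, node)
               for node in G.nodes() if node not in known_labels}
print(predictions)
```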

EXP 8

Performing sentiment analysis on an IMDb dataset involves analyzing the sentiment (positive or
negative) associated with movie reviews. Below is a basic implementation of sentiment analysis on
an IMDb dataset using Python with the NLTK library:

```python
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load your IMDb dataset (CSV file) into a pandas DataFrame
# Example:
# imdb_data = pd.read_csv('imdb_dataset.csv')

# Assuming you have loaded your IMDb dataset, let's perform sentiment analysis

# Initialize the SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

# Function to assign a sentiment label to text
def get_sentiment(text):
    # Calculate sentiment polarity scores using VADER
    scores = sid.polarity_scores(text)
    # Classify sentiment based on the compound score
    if scores['compound'] >= 0.05:
        return 'Positive'
    elif scores['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply sentiment analysis to the IMDb dataset
imdb_data['Sentiment'] = imdb_data['Review'].apply(get_sentiment)

# Print the sentiment analysis results
print("Sentiment Analysis Results:")
print(imdb_data['Sentiment'].value_counts())
```

In this code:
- We use the NLTK library's VADER (Valence Aware Dictionary and sEntiment Reasoner)
sentiment analysis tool for sentiment analysis.
- We load the IMDb dataset into a pandas DataFrame.
- We define a function `get_sentiment()` to calculate sentiment polarity scores using VADER and
classify the sentiment as positive, negative, or neutral based on the compound score.
- We apply sentiment analysis to the 'Review' column of the IMDb dataset and add a new column
'Sentiment' to store the sentiment labels.
- Finally, we print the sentiment analysis results, showing the counts of positive, negative, and
neutral sentiments in the dataset.

You need to ensure you have the NLTK library installed (`pip install nltk`) and have downloaded
the VADER lexicon (`nltk.download('vader_lexicon')`). Additionally, replace `'imdb_dataset.csv'`
with the path to your IMDb dataset CSV file.
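
For completeness, that one-time setup can be run as a short snippet:

```python
import nltk

# One-time download of the VADER lexicon used by SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
```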

EXP 9

To apply the k-means clustering algorithm on an IMDb dataset using Python, you can use libraries
such as scikit-learn. Below is a basic implementation:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Load your IMDb dataset (CSV file) into a pandas DataFrame
# Example:
# imdb_data = pd.read_csv('imdb_dataset.csv')

# Assuming you have loaded your IMDb dataset, let's apply k-means clustering

# Extract text data for clustering (e.g., movie reviews)
text_data = imdb_data['Review']

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')

# Fit and transform the text data to TF-IDF features
tfidf_features = tfidf_vectorizer.fit_transform(text_data)

# Apply k-means clustering
k = 3  # Number of clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducible clusters
kmeans.fit(tfidf_features)

# Get cluster labels for each data point
cluster_labels = kmeans.labels_

# Add cluster labels to the DataFrame
imdb_data['Cluster'] = cluster_labels

# Print the count of movies in each cluster
print("Number of movies in each cluster:")
print(imdb_data['Cluster'].value_counts())
```

In this code:
- We use the scikit-learn library to perform k-means clustering.
- We extract text data (e.g., movie reviews) from the IMDb dataset.
- We initialize a TF-IDF vectorizer to convert text data into numerical features.
- We fit and transform the text data into TF-IDF features.
- We apply k-means clustering with a specified number of clusters (k).
- We add cluster labels to the IMDb dataset.
- Finally, we print the count of movies in each cluster.

You need to ensure you have scikit-learn and pandas installed (`pip install scikit-learn pandas`).
Additionally, replace `'imdb_dataset.csv'` with the path to your IMDb dataset CSV file. Adjust the
parameters of the TF-IDF vectorizer and the number of clusters (k) according to your specific
dataset and requirements.
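
To make the clusters interpretable, a common follow-up is to list the highest-weighted TF-IDF terms in each cluster center. The sketch below is self-contained, with a tiny made-up corpus standing in for the IMDb reviews; `get_feature_names_out` assumes scikit-learn 1.0 or newer.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus standing in for the IMDb reviews
docs = [
    "great acting and a wonderful story",
    "wonderful story with great characters",
    "boring plot and terrible acting",
    "terrible pacing, boring and dull",
    "stunning visuals and a great soundtrack",
    "the soundtrack and visuals were stunning",
]

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_features = tfidf_vectorizer.fit_transform(docs)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(tfidf_features)

# Each cluster center is a vector of TF-IDF weights; its largest entries are the
# terms most associated with that cluster
terms = tfidf_vectorizer.get_feature_names_out()
order = kmeans.cluster_centers_.argsort()[:, ::-1]
for cluster_id in range(kmeans.n_clusters):
    top_terms = [terms[i] for i in order[cluster_id, :5]]
    print(f"Cluster {cluster_id}: {', '.join(top_terms)}")
```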

EXP 10

To apply user-based collaborative filtering on Amazon review data using Python, you can use
libraries such as Surprise. Below is a basic implementation:
```python
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse

# Load your Amazon review data into a Surprise dataset object
# Example:
# reader = Reader(line_format='user item rating', sep=',')
# amazon_data = Dataset.load_from_file('amazon_reviews.csv', reader)

# Assuming you have loaded your Amazon review data, let's apply user-based collaborative filtering

# Split the data into train and test sets
trainset, testset = train_test_split(amazon_data, test_size=0.2)

# Define the user-based collaborative filtering model (KNN)
sim_options = {'name': 'cosine', 'user_based': True}  # Use cosine similarity
knn_model = KNNBasic(sim_options=sim_options)

# Train the model on the training set
knn_model.fit(trainset)

# Make predictions on the test set
predictions = knn_model.test(testset)

# Compute RMSE (Root Mean Squared Error) to evaluate the model performance
accuracy = rmse(predictions)
print("RMSE:", accuracy)
```

In this code:
- We use the Surprise library, which provides collaborative filtering algorithms and evaluation
metrics.
- We load the Amazon review data into a Surprise dataset object.
- We split the data into training and test sets.
- We define the user-based collaborative filtering model using the KNNBasic algorithm with cosine
similarity.
- We train the model on the training set.
- We make predictions on the test set.
- We compute RMSE to evaluate the model's performance.

You need to ensure you have Surprise installed (`pip install scikit-surprise`). Additionally, replace
`'amazon_reviews.csv'` with the path to your Amazon review data CSV file. Adjust the parameters
of the model and evaluation metrics according to your specific dataset and requirements.
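
Beyond RMSE, you typically also want individual predictions. The self-contained sketch below builds a tiny made-up ratings table (the user and item IDs are placeholders), trains the same kind of user-based KNN model on the full data, and queries a single estimated rating with `predict`.

```python
import pandas as pd
from surprise import Dataset, Reader, KNNBasic

# Tiny illustrative ratings frame standing in for the Amazon review file
ratings = pd.DataFrame({
    'user':   ['u1', 'u1', 'u2', 'u2', 'u3', 'u3'],
    'item':   ['i1', 'i2', 'i1', 'i3', 'i2', 'i3'],
    'rating': [5, 3, 4, 2, 5, 1],
})

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user', 'item', 'rating']], reader)

# Train a user-based KNN model on the full dataset and query one prediction
trainset = data.build_full_trainset()
model = KNNBasic(sim_options={'name': 'cosine', 'user_based': True})
model.fit(trainset)

prediction = model.predict(uid='u1', iid='i3')  # estimate u1's rating of i3
print(prediction.est)
```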

EXP 11
To apply item-based collaborative filtering on Amazon review data using Python, you can still use the Surprise library. However, you'll need to set the `user_based` parameter to `False` to perform item-based collaborative filtering. Below is how you can implement it:

```python
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse

# Load your Amazon review data into a Surprise dataset object
# Example:
# reader = Reader(line_format='user item rating', sep=',')
# amazon_data = Dataset.load_from_file('amazon_reviews.csv', reader)

# Assuming you have loaded your Amazon review data, let's apply item-based collaborative filtering

# Split the data into train and test sets
trainset, testset = train_test_split(amazon_data, test_size=0.2)

# Define the item-based collaborative filtering model (KNN)
sim_options = {'name': 'cosine', 'user_based': False}  # Use cosine similarity
knn_model = KNNBasic(sim_options=sim_options)

# Train the model on the training set
knn_model.fit(trainset)

# Make predictions on the test set
predictions = knn_model.test(testset)

# Compute RMSE (Root Mean Squared Error) to evaluate the model performance
accuracy = rmse(predictions)
print("RMSE:", accuracy)
```

In this code:
- We still use the Surprise library for collaborative filtering.
- We load the Amazon review data into a Surprise dataset object.
- We split the data into training and test sets.
- We define the item-based collaborative filtering model using the KNNBasic algorithm with cosine
similarity and `user_based` parameter set to `False`.
- We train the model on the training set.
- We make predictions on the test set.
- We compute RMSE to evaluate the model's performance.

You need to ensure you have Surprise installed (`pip install scikit-surprise`). Additionally, replace
`'amazon_reviews.csv'` with the path to your Amazon review data CSV file. Adjust the parameters
of the model and evaluation metrics according to your specific dataset and requirements.
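
Instead of a single train/test split, you can also estimate performance with k-fold cross-validation via Surprise's `cross_validate`. The sketch below uses a tiny made-up ratings table as a stand-in for the Amazon data.

```python
import pandas as pd
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import cross_validate

# Tiny illustrative ratings frame standing in for the Amazon review file
ratings = pd.DataFrame({
    'user':   ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u4', 'u4'],
    'item':   ['i1', 'i2', 'i1', 'i3', 'i2', 'i3', 'i1', 'i2'],
    'rating': [5, 3, 4, 2, 5, 1, 4, 4],
})
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user', 'item', 'rating']], reader)

# 3-fold cross-validation of the item-based model, reporting RMSE and MAE per fold
model = KNNBasic(sim_options={'name': 'cosine', 'user_based': False})
results = cross_validate(model, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
print(results['test_rmse'].mean())
```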

EXP 12
Predicting individual behavior of users in social media can involve various techniques depending on
the specific behavior you're interested in. One common approach is to use machine learning
algorithms to predict user actions or preferences based on historical data and user features. Below is
a basic example of how you can predict user behavior in social media using Python with scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load your social media data into a pandas DataFrame
# Example:
# social_media_data = pd.read_csv('social_media_data.csv')

# Assuming you have loaded your social media data, let's predict user behavior

# Define features and target variable
X = social_media_data.drop(columns=['target_column'])  # Features
y = social_media_data['target_column']  # Target variable

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the machine learning model (e.g., RandomForestClassifier)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

In this code:
- We load social media data into a pandas DataFrame. This data should contain features related to
user behavior and a target column representing the behavior you want to predict.
- We define features (X) and the target variable (y).
- We split the data into training and test sets.
- We define a machine learning model (e.g., RandomForestClassifier) to predict user behavior.
- We train the model on the training set.
- We make predictions on the test set.
- We evaluate the model's performance using accuracy score.

You need to ensure you have pandas and scikit-learn installed (`pip install pandas scikit-learn`).
Additionally, replace `'social_media_data.csv'` with the path to your social media data CSV file and
`'target_column'` with the name of the target column representing the behavior you want to predict.
Adjust the machine learning model and parameters according to your specific dataset and prediction
task.
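
A useful follow-up is to check which features the model relies on via the random forest's feature importances. The sketch below is self-contained with a tiny made-up table; the column names (`posts_per_day`, `followers`, `avg_likes`, `target_column`) are placeholders for your own features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative frame standing in for the social media data
social_media_data = pd.DataFrame({
    'posts_per_day': [1, 5, 2, 8, 0, 7, 3, 6],
    'followers':     [100, 900, 150, 1200, 50, 800, 300, 950],
    'avg_likes':     [2, 40, 5, 60, 1, 35, 10, 45],
    'target_column': [0, 1, 0, 1, 0, 1, 0, 1],
})

X = social_media_data.drop(columns=['target_column'])
y = social_media_data['target_column']

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Rank features by how much they contribute to the forest's splits
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```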
