Etsy Art Analysis: Enhancing Consumer Traffic
Project Summary
This project analyzes Etsy listings to determine best practices for listing art-related items on Etsy. Data was obtained via the Etsy API and stored in an SQLite database (etsy_data.db). Although data for Etsy shops and reviews was also extracted and saved, this analysis centers exclusively on optimizing listings to enhance consumer traffic through data-driven insights. It combines exploratory data analysis with statistical testing, including t-tests and chi-square tests. The findings will inform strategies for maximizing the visibility and engagement of Etsy listings.
Project files available on GitHub
Key Skills Demonstrated
- Data Extraction: Utilized Etsy API for data retrieval. Etsy API Documentation
- Data Cleaning and Transformation: Employed Python and Jupyter Notebooks for data preparation, including cleaning, filtering, and transforming data for analysis.
- Data Storage: Managed and organized data using SQLite database (etsy_data.db).
- SQL Queries: Leveraged SQL queries for data retrieval and analysis, integrated within Python (sqlite3) and Jupyter Notebooks.
- Statistical Analysis: Conducted t-tests, chi-square tests, and word frequency analysis to validate significant differences in listing characteristics and derive optimization insights.
- Exploratory Data Analysis and Visualization: Performed exploratory data analysis (EDA) and created visualizations using Python (Matplotlib, WordCloud).
- Data-Driven Market Optimization: Conducted performance analysis and extracted marketing insights for optimized Etsy listings.
- Presentation Skills: Compiled project insights into a visual presentation using Canva Slides to effectively communicate findings.
Data Extraction, Cleaning, Transformation and Storage
I accessed the Etsy API using a keystring provided by Etsy after requesting and receiving developer approval. (Note: The API key provided by Etsy is not shown in any code snippets below for security reasons.)
To obtain the necessary data for marketing insights, I created three tables: etsy_listings, etsy_shops, and etsy_reviews. However, this analysis focuses exclusively on the etsy_listings table to derive insights for optimizing listing traffic. Data for etsy_shops and etsy_reviews was saved for potential use in future research or more in-depth analysis of shop performance and customer feedback. These cleaned and transformed datasets were stored in an SQLite database (etsy_data.db).
Data available via the GitHub Data Folder.
Etsy Listings: This table serves as the primary focus of this project. To ensure sufficient and relevant data, I applied keyword filtering and pagination to collect 200 listings related to "art," "painting," and "prints."
- Transformation and Cleaning:
  - Flattened Data: To prepare the data for SQLite storage, I flattened all list-type fields into comma-separated strings.
  - Price Column Transformation: The price column contained nested dictionaries with fields for amount, divisor, and currency_code. I extracted these values to create two new columns: price_amount (the numeric price value) and price_currency (the currency type).
  - Removal of Nested Data: The original price column was dropped to avoid issues during analysis.
Click to Show Code for Etsy Listings Extraction and Cleaning
import requests
API_KEY = '*******************'
BASE_URL = 'https://openapi.etsy.com/v3/application'
url = f"{BASE_URL}/listings/active"
headers = {
    'x-api-key': API_KEY  # API key for authentication
}
# Initialize an empty list to store all listings
all_listings = []
# Loop through 4 pages of 50 listings each (200 total listings)
for offset in range(0, 200, 50):
    params = {
        'keywords': 'painting, art, print',
        'limit': 50,
        'offset': offset,
        'language': 'en-US'  # Filter to English listings
    }
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:
        listings = response.json()['results']
        all_listings.extend(listings)
    else:
        print(f"Error at offset {offset}: {response.status_code} - {response.text}")
import pandas as pd
# Convert listings to a DataFrame
df = pd.DataFrame(all_listings)
# Flatten all list fields to comma-separated strings
df['tags'] = df['tags'].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else '')
df['materials'] = df['materials'].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else '')
df['production_partners'] = df['production_partners'].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else '')
df['skus'] = df['skus'].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else '')
df['style'] = df['style'].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else '')
# Flatten the price column into individual fields
df['price_amount'] = df['price'].apply(lambda x: x['amount'] / x['divisor'] if isinstance(x, dict) else None)
df['price_currency'] = df['price'].apply(lambda x: x['currency_code'] if isinstance(x, dict) else None)
# Drop the original price column to avoid issues
df = df.drop('price', axis=1)
After viewing the data, I identified some irrelevant listings, such as art supplies (e.g., Pentart), TV art, and nail art, which were removed from the dataset. Additionally, I excluded listings where the seller was not also the creator (i.e., who_made was 'someone_else' or 'collective'), focusing specifically on artists selling their own work. After cleaning, 167 listings remained for analysis, saved as the etsy_listings table in the etsy_data.db SQLite database.
Click to Show Code for Filtering and Saving Listings Data
# Remove rows where who_made is 'someone_else' or 'collective'
df = df[~df['who_made'].str.strip().str.lower().isin(['someone_else', 'collective'])]
# Apply exclusions based on discoveries made after previously retrieving data
exclusions = ['Nail Art', 'TV Art', 'pentart']
# Filter out unwanted titles
filtered_listings_df = df[
    ~df['title'].str.contains('|'.join(exclusions), case=False, na=False)
]
import sqlite3
# SQLite database
conn = sqlite3.connect('etsy_data.db')
# Save the filtered DataFrame to a SQL table
filtered_listings_df.to_sql('etsy_listings', conn, if_exists='replace', index=False)
conn.close()
The other tables created (etsy_shops and etsy_reviews) were saved in SQLite but are not directly analyzed in this project. They remain available for future research or more in-depth analysis of shop performance and customer feedback.
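For reference, here is a minimal sketch (not part of the original analysis) of how these stored tables could be pulled back into Pandas later; it assumes only the table names described above.
Click to Show Code for Previewing the Stored Tables
import sqlite3
import pandas as pd
# Open the same SQLite database used throughout the project
conn = sqlite3.connect('etsy_data.db')
# Preview the first few rows of the shop and review tables saved for future work
shops_preview = pd.read_sql("SELECT * FROM etsy_shops LIMIT 5;", conn)
reviews_preview = pd.read_sql("SELECT * FROM etsy_reviews LIMIT 5;", conn)
conn.close()
print(shops_preview)
print(reviews_preview)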
Coding available in Notebooks Folder on GitHub.
Exploratory Data Analysis and Insights
Once the data was cleaned, transformed, and stored in SQLite, I conducted a series of analyses to explore factors that contribute to the success of art-related listings on Etsy. The insights gained help guide how to optimize listings for increased visibility and engagement.
Key Areas of Focus:
- Title Length and Keyword Analysis
- Tag Usage and Keyword Analysis
- Listing Description Insights
Each of these areas was analyzed to determine the characteristics that differentiate the most successful listings from the rest, with the goal of providing actionable recommendations for artists looking to enhance consumer traffic to their Etsy pages.
Optimizing Etsy Listings to Drive Traffic
To identify the most popular listings, I queried the database for the top 50 listings by number of views. I then explored key characteristics such as title length, description details, tags, and listing types.
Title Length Analysis:
Approach:
- I began by querying the SQLite database to extract all listings and the top 50 listings by views.
- The titles were then cleaned to remove any special characters and extra spaces, allowing for accurate word counts.
- To determine if there was a significant difference between the title length of the top 50 listings and the rest of the listings, I conducted a two-sample t-test.
- The mean title word count was calculated for the top 50 listings, for all listings, and for listings excluding the top 50.
Click to Show Code for Data Querying and Statistical Analysis
import sqlite3
import pandas as pd
from scipy import stats
import re
# Connect to the SQLite database
conn = sqlite3.connect('etsy_data.db')
# Query for all listings
all_listings_query = """
SELECT
listing_id,
views,
title
FROM etsy_listings;
"""
# Load all listings into a DataFrame
all_listings_df = pd.read_sql(all_listings_query, conn)
# Query for top 50 listings by views
top_50_query = """
SELECT
listing_id,
views,
title
FROM etsy_listings
ORDER BY views DESC
LIMIT 50;
"""
# Load the top 50 listings into a DataFrame
top_50_listings_df = pd.read_sql(top_50_query, conn)
# Close the database connection
conn.close()
# Clean 'title' column and add word count for all listings
all_listings_df['cleaned_title'] = all_listings_df['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))
all_listings_df['cleaned_title'] = all_listings_df['cleaned_title'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())
all_listings_df['word_count'] = all_listings_df['cleaned_title'].apply(lambda x: len(x.split()))
# Repeat for the top 50 listings
top_50_listings_df['cleaned_title'] = top_50_listings_df['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))
top_50_listings_df['cleaned_title'] = top_50_listings_df['cleaned_title'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())
top_50_listings_df['word_count'] = top_50_listings_df['cleaned_title'].apply(lambda x: len(x.split()))
# Mean word count for all listings
mean_all_listings = all_listings_df['word_count'].mean()
# Mean word count for top 50 listings
mean_top_50 = top_50_listings_df['word_count'].mean()
# Perform a one-sample t-test comparing the top 50 word counts to the mean across all listings
t_stat_1samp, p_value_1samp = stats.ttest_1samp(top_50_listings_df['word_count'], mean_all_listings)
# Get the listing_ids of the top 50 listings
top_50_ids = top_50_listings_df['listing_id']
# Create a DataFrame for all listings excluding the top 50
other_listings_df = all_listings_df[~all_listings_df['listing_id'].isin(top_50_ids)]
# Mean word count for listings not in the top 50 by views
mean_other_listings = other_listings_df['word_count'].mean()
# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(top_50_listings_df['word_count'], other_listings_df['word_count'], equal_var=False)
Title Length Insights:
- The mean word count for the top 50 listings was 18.48 words, while the mean word count for all listings was 15.67 words.
- The two-sample t-test confirmed that the difference in title lengths between the top 50 listings and the rest of the dataset was statistically significant (p-value: 0.0002).
- Takeaway: Longer titles of roughly 18-20 words tend to attract more views, suggesting that Etsy sellers should focus on crafting detailed and engaging titles for their listings.
Visualization of Title Word Count:
The distribution of title word count for the top 50 listings was visualized using a bar chart with additional notation to show the difference between the top 50 listings and the rest of the dataset.
Click to Show Code for Graph Creation
import matplotlib.pyplot as plt
# Create a bar chart for title word count distribution in the top 50 listings
word_count_distribution = top_50_listings_df['word_count'].value_counts().sort_index()
# Create the plot
plt.figure(figsize=(12, 6))
plt.bar(word_count_distribution.index, word_count_distribution.values, color='skyblue', edgecolor='black')
plt.xlabel('Number of Words in Title')
plt.ylabel('Number of Listings')
plt.title('Distribution of Title Word Count for Top 50 Etsy Listings by Views')
plt.xticks(range(min(word_count_distribution.index), max(word_count_distribution.index) + 1))
# Add lines for mean of the top 50 listings
mean_word_count = top_50_listings_df['word_count'].mean()
# Add the line for the mean of the top 50 listings
plt.axvline(mean_word_count, color='red', linestyle='dashed', linewidth=1.5, label=f'Mean (Top 50): {mean_word_count:.2f}')
# Add a line for the mean of the rest of the listings
mean_word_count_rest = other_listings_df['word_count'].mean()
plt.axvline(mean_word_count_rest, color='orange', linestyle='dashed', linewidth=1.5, label=f'Mean (Rest): {mean_word_count_rest:.2f}')
# Add an arrow pointing to the mean of the rest of the listings with text on the left
plt.annotate(
    'Mean for All Other Listings: {:.2f} words'.format(mean_word_count_rest),
    xy=(mean_word_count_rest, max(word_count_distribution.values) * 0.8),
    xytext=(mean_word_count_rest - 8, max(word_count_distribution.values)),
    arrowprops=dict(facecolor='black', arrowstyle='->', lw=1.5),
    fontsize=10,
    color='black'
)
# Add an arrow pointing to the mean of the top 50 listings with text on the right
plt.annotate(
    'Mean for Top 50 Listings: {:.2f} words'.format(mean_word_count),
    xy=(mean_word_count, max(word_count_distribution.values) * 0.6),
    xytext=(mean_word_count + 4, max(word_count_distribution.values) * 0.5),
    arrowprops=dict(facecolor='black', arrowstyle='->', lw=1.5),
    fontsize=10,
    color='black'
)
# Add a legend to explain the lines
plt.legend()
# Save and show the plot
plt.savefig('top_50_title_word_count_distribution_without_median_mode.png', dpi=300, bbox_inches='tight')
plt.show()
Title Keyword Analysis:
Approach:
- The analysis focused on identifying and comparing the most common single words used in the titles of the top 50 listings versus the bottom 50 listings (based on views).
- After cleaning the data, a chi-square test was conducted to determine if there was a statistically significant difference in word usage between the two groups.
Click to Show Code for Data Querying and Statistical Analysis
import sqlite3
import pandas as pd
from collections import Counter
import re
import html # Library to unescape HTML entities
from scipy.stats import chi2_contingency
# Connect to the SQLite database
conn = sqlite3.connect('etsy_data.db')
# Retrieve the top 50 listings by views
top_50_query = """
SELECT
title
FROM etsy_listings
ORDER BY views DESC
LIMIT 50;
"""
# Retrieve the bottom 50 listings by views
bottom_50_query = """
SELECT
title
FROM etsy_listings
ORDER BY views ASC
LIMIT 50;
"""
# Execute the queries and load the results into DataFrames
top_50_listings_df = pd.read_sql(top_50_query, conn)
bottom_50_listings_df = pd.read_sql(bottom_50_query, conn)
# Close the database connection
conn.close()
# Concatenate all titles into a single string for both top 50 and bottom 50 listings
all_titles_top_50 = ' '.join(top_50_listings_df['title'].tolist())
all_titles_bottom_50 = ' '.join(bottom_50_listings_df['title'].tolist())
# Unescape HTML entities to deal with things like &quot;
all_titles_top_50 = html.unescape(all_titles_top_50)
all_titles_bottom_50 = html.unescape(all_titles_bottom_50)
# Remove unwanted characters, leaving only letters, numbers, spaces, and hyphens
all_titles_top_50 = re.sub(r'[^a-zA-Z0-9\s\-]', '', all_titles_top_50)
all_titles_bottom_50 = re.sub(r'[^a-zA-Z0-9\s\-]', '', all_titles_bottom_50)
# Extract single words
single_words_top_50 = re.findall(r'\b[a-zA-Z0-9]+\b', all_titles_top_50.lower())
single_words_bottom_50 = re.findall(r'\b[a-zA-Z0-9]+\b', all_titles_bottom_50.lower())
# Count the frequency of each word
top_50_word_counts = Counter(single_words_top_50)
bottom_50_word_counts = Counter(single_words_bottom_50)
# Get the top 10 single words and their frequencies for both top 50 and bottom 50 listings
top_10_single_words_top_50 = top_50_word_counts.most_common(10)
top_10_single_words_bottom_50 = bottom_50_word_counts.most_common(10)
# Convert to DataFrames for easier viewing
top_10_top_50_df = pd.DataFrame(top_10_single_words_top_50, columns=['Word', 'Frequency'])
top_10_bottom_50_df = pd.DataFrame(top_10_single_words_bottom_50, columns=['Word', 'Frequency'])
# Print the top 10 words for the top 50 listings
print("Top 10 Single Words in Top 50 Listings:")
print(top_10_top_50_df)
# Print the top 10 words for the bottom 50 listings
print("\nTop 10 Single Words in Bottom 50 Listings:")
print(top_10_bottom_50_df)
# Chi-Square Analysis for Word Usage Between Top 50 and Bottom 50 Listings
contingency_table = []
for word, _ in top_10_single_words_top_50:
    top_50_count = top_50_word_counts.get(word, 0)
    bottom_50_count = bottom_50_word_counts.get(word, 0)
    contingency_table.append([top_50_count, bottom_50_count])
# Perform chi-square test for single words
chi2, p_value, _, _ = chi2_contingency(contingency_table)
print("\nChi-Square Test Results for Single Words:")
print(f"Chi-Square Statistic: {chi2:.4f}")
print(f"P-Value: {p_value:.4f}")
if p_value < 0.05:
    print("The difference in word usage between the top 50 and bottom 50 listings is statistically significant.")
else:
    print("There is no statistically significant difference in word usage between the top 50 and bottom 50 listings.")
Title Keyword Insights:
- The chi-square test confirmed a statistically significant difference in word usage between the top 50 and bottom 50 listings (p-value < 0.05).
- Certain keywords were more commonly used in the top listings, highlighting the effectiveness of particular words in attracting more views.
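As a possible follow-up (not part of the original analysis), the two word-frequency Counters built in the block above could be compared directly to rank which words are over-represented in the top 50 titles; a minimal sketch, reusing top_50_word_counts and bottom_50_word_counts:
Click to Show Code for Ranking Over-Represented Title Words
# Rank words by how much more often they appear in top-50 titles
# than in bottom-50 titles (add-one smoothing avoids division by zero)
ratios = {
    word: (top_50_word_counts[word] + 1) / (bottom_50_word_counts.get(word, 0) + 1)
    for word in top_50_word_counts
}
overrepresented = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:10]
for word, ratio in overrepresented:
    print(f"{word}: {ratio:.1f}x more frequent in top 50 titles")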
Visualization of Title Keyword Usage:
A word cloud was generated to visually represent the keywords that were most commonly used in the titles of the top 50 listings by views. This visualization helps to identify which keywords Etsy sellers can leverage to boost their listing visibility.
Click to Show Code for Word Cloud Generation
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Generate word cloud for the top 50 listings
wordcloud_top_50 = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(top_50_word_counts)
# Display the word cloud for the top 50 listings
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_top_50, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Top 50 Etsy Listings by Views', fontsize=20)
plt.savefig('top_50_title_word_cloud.png', format='png', bbox_inches='tight')
plt.show()
Tag Number Analysis:
Tags are a key feature in Etsy listings that help improve their discoverability by associating the listing with relevant search terms. Etsy allows up to 13 tags per listing, and it is commonly believed that maximizing the number of tags helps listings rank better in search results.
Approach:
- To explore whether there is a difference in how many tags are used by high-performing listings, I analyzed tag usage in the top 50 listings by views and compared it to the bottom 50 listings by views.
- The number of tags per listing was counted, and a statistical analysis (using a two-proportion z-test) was conducted to determine if there was a statistically significant difference between the two groups in terms of utilizing the maximum number of 13 tags.
Click to Show Code for Data Querying and Statistical Analysis
import sqlite3
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
# Connect to the SQLite database
conn = sqlite3.connect('etsy_data.db')
# Retrieve the top 50 listings by views
top_50_query = """
SELECT
listing_id,
tags
FROM etsy_listings
ORDER BY views DESC
LIMIT 50;
"""
# Retrieve the bottom 50 listings by views
bottom_50_query = """
SELECT
listing_id,
tags
FROM etsy_listings
ORDER BY views ASC
LIMIT 50;
"""
# Execute the queries and load the results into DataFrames
top_50_listings_df = pd.read_sql(top_50_query, conn)
bottom_50_listings_df = pd.read_sql(bottom_50_query, conn)
# Close the database connection
conn.close()
# Count the number of tags for each listing
top_50_listings_df['num_tags'] = top_50_listings_df['tags'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)
bottom_50_listings_df['num_tags'] = bottom_50_listings_df['tags'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)
# Count of listings with 13 tags
count_13_tags_top_50 = (top_50_listings_df['num_tags'] == 13).sum()
count_13_tags_bottom_50 = (bottom_50_listings_df['num_tags'] == 13).sum()
# Total number of listings in each group
total_top_50 = len(top_50_listings_df)
total_bottom_50 = len(bottom_50_listings_df)
# Perform two-proportion z-test
count_13_tags = [count_13_tags_top_50, count_13_tags_bottom_50]
n_obs = [total_top_50, total_bottom_50]
z_stat, p_value = proportions_ztest(count_13_tags, n_obs)
print("Two-Proportion Z-Test Results for Listings with 13 Tags:")
print(f"Z-Statistic: {z_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
if p_value < 0.05:
    print("The difference in the proportion of listings with 13 tags between the top 50 and bottom 50 listings is statistically significant.")
else:
    print("There is no statistically significant difference in the proportion of listings with 13 tags between the top 50 and bottom 50 listings.")
Tag Number Insights:
- The two-proportion z-test results indicated a statistically significant difference between the top 50 listings and the bottom 50 listings in terms of tag usage.
- Listings in the top 50 had a higher proportion of listings using all 13 tags compared to the bottom 50 listings, suggesting that maximizing tag usage could help improve discoverability.
Visualization of Tag Number:
To provide a clearer picture of the distribution, I created side-by-side pie charts showing the proportion of listings that use all 13 tags in both the top 50 and bottom 50 groups. This visualization helps sellers understand the potential benefit of fully utilizing tags.
Click to Show Code for Generating Pie Charts for Tag Usage Comparison
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
# Connect to the SQLite database
conn = sqlite3.connect('etsy_data.db')
# Retrieve the top 50 listings by views
top_50_query = """
SELECT
listing_id,
tags
FROM etsy_listings
ORDER BY views DESC
LIMIT 50;
"""
# Retrieve the bottom 50 listings by views
bottom_50_query = """
SELECT
listing_id,
tags
FROM etsy_listings
ORDER BY views ASC
LIMIT 50;
"""
# Execute the queries and load the results into DataFrames
top_50_listings_df = pd.read_sql(top_50_query, conn)
bottom_50_listings_df = pd.read_sql(bottom_50_query, conn)
# Close the database connection
conn.close()
# Count the number of tags for each listing
top_50_listings_df['num_tags'] = top_50_listings_df['tags'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)
bottom_50_listings_df['num_tags'] = bottom_50_listings_df['tags'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)
# Count the number of listings with 13 tags vs. those with fewer than 13 tags in each group
top_50_count_13_tags = (top_50_listings_df['num_tags'] == 13).sum()
top_50_count_less_than_13_tags = len(top_50_listings_df) - top_50_count_13_tags
bottom_50_count_13_tags = (bottom_50_listings_df['num_tags'] == 13).sum()
bottom_50_count_less_than_13_tags = len(bottom_50_listings_df) - bottom_50_count_13_tags
# Data for pie charts
top_50_data = [top_50_count_13_tags, top_50_count_less_than_13_tags]
bottom_50_data = [bottom_50_count_13_tags, bottom_50_count_less_than_13_tags]
# Labels for the slices
labels = ['13 Tags', 'Fewer than 13 Tags']
# Colors: using shades consistent with previous figures (e.g., greens for positive actions)
colors = ['#66c2a5', '#e5f5f9'] # Green shade for 13 Tags, light color for fewer tags
# Create pie charts
fig, axs = plt.subplots(1, 2, figsize=(14, 7))
# Top 50 Listings Pie Chart
axs[0].pie(
    top_50_data,
    labels=labels,
    colors=colors,
    autopct='%1.1f%%',  # Label each slice with its percentage
    startangle=90,
    wedgeprops={'edgecolor': 'black'}
)
axs[0].set_title('Tag Usage in Top 50 Listings by Views')
# Bottom 50 Listings Pie Chart
axs[1].pie(
    bottom_50_data,
    labels=labels,
    colors=colors,
    autopct='%1.1f%%',  # Label each slice with its percentage
    startangle=90,
    wedgeprops={'edgecolor': 'black'}
)
axs[1].set_title('Tag Usage in Bottom 50 Listings by Views')
# Set equal aspect ratio to ensure pie is a circle
for ax in axs:
    ax.axis('equal')
# Add a main title to the figure
plt.suptitle('Comparison of Tag Usage in Top 50 vs Bottom 50 Etsy Listings', fontsize=16)
# Save the figure
plt.savefig('tag_usage_comparison_top_vs_bottom_50.png', format='png', bbox_inches='tight', dpi=300)
# Show the plot
plt.show()
Tag Keyword Analysis:
Tags are vital for Etsy listings as they directly influence the listing's visibility by associating the item with relevant search terms. Effective tag keywords can significantly increase exposure by matching with the phrases potential customers use when searching on Etsy.
Approach:
- Queried the SQLite database to retrieve tags for the top 50 and bottom 50 listings based on views.
- Cleaned the data to count the frequency of each tag word.
- Conducted a chi-square analysis to test if there were statistically significant differences in tag keyword usage between the top 50 and bottom 50 listings.
- Created a word cloud visualization for the top 50 listings to highlight the most frequently used tags.
Click to Show Code for Data Querying and Statistical Analysis
import sqlite3
import pandas as pd
from collections import Counter
from scipy.stats import chi2_contingency
# Connect to the SQLite database
conn = sqlite3.connect('etsy_data.db')
# Retrieve the top 50 listings by views
top_50_query = """
SELECT
tags
FROM etsy_listings
ORDER BY views DESC
LIMIT 50;
"""
# Retrieve the bottom 50 listings by views
bottom_50_query = """
SELECT
tags
FROM etsy_listings
ORDER BY views ASC
LIMIT 50;
"""
# Execute the queries and load the results into DataFrames
top_50_tags_df = pd.read_sql(top_50_query, conn)
bottom_50_tags_df = pd.read_sql(bottom_50_query, conn)
# Close the database connection
conn.close()
# Split the tags by comma and flatten them into a single list for counting
top_50_tags = ','.join(top_50_tags_df['tags']).split(',')
bottom_50_tags = ','.join(bottom_50_tags_df['tags']).split(',')
# Strip whitespace from each tag and convert to lowercase
top_50_tags = [tag.strip().lower() for tag in top_50_tags if tag.strip()]
bottom_50_tags = [tag.strip().lower() for tag in bottom_50_tags if tag.strip()]
# Count the frequency of each tag for both top 50 and bottom 50 listings
top_50_tag_counts = Counter(top_50_tags)
bottom_50_tag_counts = Counter(bottom_50_tags)
# Get the top 10 tags from both groups
top_tags_top_50 = [tag for tag, _ in top_50_tag_counts.most_common(10)]
# Create the contingency table for chi-square test
contingency_table = []
for tag in top_tags_top_50:
    top_50_count = top_50_tag_counts.get(tag, 0)
    bottom_50_count = bottom_50_tag_counts.get(tag, 0)
    contingency_table.append([top_50_count, bottom_50_count])
# Convert to a DataFrame for better viewing (optional)
contingency_df = pd.DataFrame(contingency_table, index=top_tags_top_50, columns=['Top 50 Listings', 'Bottom 50 Listings'])
print(contingency_df)
# Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-Square Statistic: {chi2:.4f}")
print(f"P-Value: {p:.4f}")
# Interpretation
if p < 0.05:
    print("The difference in tag usage between the top 50 listings and the bottom 50 listings is statistically significant.")
else:
    print("There is no statistically significant difference in tag usage between the top 50 listings and the bottom 50 listings.")
Tag Keyword Insights:
- The chi-square test showed that there was a statistically significant difference in the use of tags between the top 50 and bottom 50 listings by views (p-value < 0.05).
- Specific tags were more common in the top listings, indicating that these keywords effectively attract more views.
Visualization of Tag Keywords:
A word cloud was generated to visualize the most commonly used tags among the top 50 listings by views, helping sellers identify keywords that might be effective in improving their own listings' visibility.
Click to Show Code for Word Cloud Generation
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Generate the word cloud from the tag frequencies of the top 50 listings
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(top_50_tag_counts)
# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Tag Word Cloud for Top 50 Etsy Listings by Views', fontsize=20)
# Save the figure
plt.savefig('tag_keyword_wordcloud.png', format='png', bbox_inches='tight', dpi=300)
plt.show()
Description Keyword Analysis:
In this section, I analyzed the keywords in the descriptions of the top 50 and bottom 50 listings to understand what differentiates more successful descriptions from less successful ones.
Approach:
- I queried the SQLite database to extract the descriptions of the top 50 and bottom 50 listings by views.
- After extracting the descriptions, I isolated the first sentence from each listing to capture the keywords most likely to affect search visibility. SEO practices often emphasize the importance of the first sentence, which impacts how listings are indexed and discovered by potential buyers.
- I tokenized the first sentences into individual words and removed common stopwords (e.g., "the", "and", "in") to ensure the analysis focused on meaningful content.
- I performed a chi-square analysis to determine if there was a statistically significant difference in keyword usage between the top and bottom 50 listings.
Click to Show Code for Data Querying and Statistical Analysis
import sqlite3
import pandas as pd
from collections import Counter
import re
# Connect to the SQLite database
conn = sqlite3.connect('etsy_data.db')
# Retrieve the top 50 listings by views and the bottom 50 listings by views
query_top_50 = """
SELECT
title,
description
FROM etsy_listings
ORDER BY views DESC
LIMIT 50;
"""
query_bottom_50 = """
SELECT
title,
description
FROM etsy_listings
ORDER BY views ASC
LIMIT 50;
"""
# Execute the queries and load the results into DataFrames
top_50_listings_df = pd.read_sql(query_top_50, conn)
bottom_50_listings_df = pd.read_sql(query_bottom_50, conn)
# Close the database connection
conn.close()
# Function to extract the first sentence from a description
def extract_first_sentence(description):
    if pd.isna(description) or description.strip() == "":
        return ""
    # Split by punctuation that could end a sentence (., !, ?)
    sentences = re.split(r'[.!?]', description)
    return sentences[0].strip() if sentences else ""
# Apply the function to extract the first sentence for both top and bottom listings
top_50_listings_df['first_sentence'] = top_50_listings_df['description'].apply(extract_first_sentence)
bottom_50_listings_df['first_sentence'] = bottom_50_listings_df['description'].apply(extract_first_sentence)
# Tokenize the first sentence into individual words and count frequencies
def tokenize_and_count(sentences, stopwords):
    words = []
    for sentence in sentences:
        words.extend([word for word in re.findall(r'\b\w+\b', sentence.lower()) if word not in stopwords])
    return Counter(words)
# Define a set of stopwords to filter out
stopwords = set(["the", "of", "a", "in", "and", "this", "with", "is", "on", "for", "to", "it", "an", "as", "that", "or", "by", "be", "are", "at", "from"])
# Get word counts for the first sentences of top 50 and bottom 50 listings
top_50_word_counts = tokenize_and_count(top_50_listings_df['first_sentence'], stopwords)
bottom_50_word_counts = tokenize_and_count(bottom_50_listings_df['first_sentence'], stopwords)
# Get the top 10 words for both groups
top_10_words_top_50 = top_50_word_counts.most_common(10)
top_10_words_bottom_50 = bottom_50_word_counts.most_common(10)
# Convert to DataFrames for better readability
top_10_top_50_df = pd.DataFrame(top_10_words_top_50, columns=['Word', 'Frequency'])
top_10_bottom_50_df = pd.DataFrame(top_10_words_bottom_50, columns=['Word', 'Frequency'])
# Chi-square analysis
from scipy.stats import chi2_contingency
# Create contingency table for top 10 words in both groups
top_10_words = list(set(top_10_top_50_df['Word']).union(set(top_10_bottom_50_df['Word'])))
contingency_table = []
for word in top_10_words:
    count_top = top_50_word_counts.get(word, 0)
    count_bottom = bottom_50_word_counts.get(word, 0)
    contingency_table.append([count_top, count_bottom])
# Perform chi-square test
chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
Description Keyword Insights:
- The chi-square test showed a statistically significant difference in word usage between the top 50 listings and the bottom 50 listings (p-value < 0.05).
- Certain keywords were used more frequently in the descriptions of the top listings, suggesting their effectiveness in driving visibility.
The most frequent keywords in the top 50 listings often overlap with those used in the listing titles, emphasizing the importance of repetition for SEO purposes. This reinforces the idea that repeating important keywords across different fields of a listing (title, tags, and description) can improve visibility.
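To make this overlap measurable, here is a minimal sketch (not part of the original analysis) of how the title and description keyword sets could be intersected; it assumes the title and first-sentence Counters from the earlier blocks were kept under the distinct (hypothetical) names title_word_counts and description_word_counts:
Click to Show Code for Checking Title/Description Keyword Overlap
# Intersect the most common title words with the most common
# first-sentence words for the top 50 listings.
# Assumes the Counters from the title and description analyses
# were saved under distinct names (both were called
# top_50_word_counts in their own blocks above).
title_top = {word for word, _ in title_word_counts.most_common(20)}
description_top = {word for word, _ in description_word_counts.most_common(20)}
shared_keywords = title_top & description_top
print("Keywords repeated across titles and descriptions:")
print(sorted(shared_keywords))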
Word Cloud Visualization: A word cloud was generated to visually represent the most common keywords found in the descriptions of the top 50 listings.
Click to Show Code for Word Cloud Generation
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Generate word cloud for top 50 listings by views
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(top_50_word_counts)
# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off') # Turn off axis
# Add a title to the figure
plt.title('Keyword Usage in First Sentences of Top 50 Listings by Views', fontsize=20)
# Save the figure
plt.savefig('description_word_cloud.png', format='png', bbox_inches='tight')
# Show the word cloud
plt.show()
Key Client Takeaways (Sample Method for Sharing Insights with Clients and Interested Parties):
Etsy Art Analysis: Enhancing Consumer Traffic by Melissa Fisher
Methodology Insights
Throughout this project, leveraging SQL allowed me to work more effectively by enabling targeted filtering and querying, which significantly reduced the need for extensive data cleaning. Unlike handling large datasets directly in Pandas, where everything is imported and often needs cleaning, SQL enabled me to interact only with the relevant subsets of the data right from the start.
I found it particularly beneficial to play with different parts of the data, assessing which aspects were most relevant without having to recreate entire datasets or tables. This ability to run different SQL queries on the original data helped refine the analysis process. For example, as I determined which listing attributes, such as tags and views, were most pertinent, I could easily focus only on those parts without distractions from irrelevant fields. This flexibility ultimately led to more efficient data exploration and insight generation.
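For example, a query along these lines (a representative sketch rather than a specific query from the project) pulls only the fields and rows needed for one line of inquiry, instead of loading the full table into Pandas:
Click to Show Code for a Targeted SQL Query
import sqlite3
import pandas as pd
conn = sqlite3.connect('etsy_data.db')
# Retrieve only the columns relevant to tag analysis,
# pre-filtered and pre-sorted in SQL rather than in Pandas
query = """
SELECT
    listing_id,
    tags,
    views
FROM etsy_listings
WHERE tags IS NOT NULL AND tags != ''
ORDER BY views DESC
LIMIT 50;
"""
focused_df = pd.read_sql(query, conn)
conn.close()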
In the initial stages of the analysis, I considered comparing the top 50 listings to all listings, as well as comparing the top 50 listings to the rest of the dataset. Ultimately, I chose the latter, as it provided more relevant insights into what differentiates high-performing listings from the rest, giving artists actionable strategies to improve visibility.
To gain insights into the factors contributing to the success of Etsy listings, I used two complementary approaches. In analyzing title word count, I compared the top 50 listings by views with all other listings to capture general patterns in title length across a broad dataset. However, when analyzing single-word and two-word phrases, I compared the top 50 with the bottom 50 listings to directly identify differences between the most and least successful listings. This allowed me to provide clear and actionable keyword recommendations for sellers, highlighting not only effective practices but also potential pitfalls to avoid.
Conclusion
The goal of this project was to identify key factors that contribute to successful Etsy listings, particularly for art-related items. As an artist myself, I designed this project not only to showcase my skills but also to generate actionable insights that I could directly apply to my own practice and share with fellow artists. The findings highlight several best practices for driving traffic, including the importance of descriptive titles, keyword optimization, and effective tag usage.
By focusing on consumer traffic strategies, I aimed to offer practical solutions that Etsy sellers could implement to enhance visibility. This project demonstrated the value of data-driven insights in an artist's business and illustrated my ability to apply both statistical analysis and practical data insights to real-world problems.
For future studies, I plan to explore conversion optimization—delving into the elements that influence not only listing visibility but also purchasing behavior. This will help further uncover the full journey from visibility to sales success on platforms like Etsy.