5 Comments
Neela 🌶️

I love your growth story, Finn.

Especially how you changed your focus from writing to diving deep into data and building tools for fellow creators. I admit, I initially sucked at checking my stats on Substack, but I got better at it thanks to a brother from another mother, Mack Collier. Now I have you.

I hope you have a good week ahead.

Jenny Ouyang

There are so many brilliant ideas you are bringing to us. Thank you for sharing this work, Finn!

I love what you are doing with the Chrome extension and your data analysis skills.

I am curious about what your comprehensive analysis report would look like 🤩

Finn Tropy

I wrote this article https://finntropy.substack.com/p/how-often-should-you-publish-notes based on a dataset of 1,360,173 Notes published by 13,534 newsletter authors. I built a data pipeline using Substack APIs and Python scripts to pull the data.

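In rough sketch form, the pulling side looks something like this. The archive endpoint below is Substack's public JSON feed for a publication's posts, shown purely for illustration; the Notes endpoints are different, and the real pipeline handles pagination and errors more carefully.

```python
import requests

def fetch_posts(publication: str, limit: int = 12) -> list[dict]:
    """Page through a publication's public post archive (illustrative endpoint)."""
    posts, offset = [], 0
    while True:
        resp = requests.get(
            f"https://{publication}.substack.com/api/v1/archive",
            params={"sort": "new", "offset": offset, "limit": limit},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        posts.extend(batch)
        offset += limit
    return posts

posts = fetch_posts("finntropy")
print(f"Fetched {len(posts)} posts")
```
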
The data is organized into SQLite3 and PostgreSQL databases, so creating new views and graphs is relatively easy. I'm using Grafana to develop the charts, along with Python libraries like Seaborn and Matplotlib.

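The charting side is basically: query the database into a pandas DataFrame, then hand it to Seaborn. A minimal sketch with made-up table and column names:

```python
import sqlite3

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Table and column names here are placeholders, not the actual schema.
con = sqlite3.connect("substack.db")
df = pd.read_sql_query(
    "SELECT strftime('%w', published_at) AS weekday, COUNT(*) AS notes "
    "FROM notes GROUP BY weekday",
    con,
)

sns.barplot(data=df, x="weekday", y="notes")
plt.title("Notes published by day of week")
plt.tight_layout()
plt.savefig("notes_by_weekday.png")
```
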
Do you have an interest in this type of dataset?

Jenny Ouyang

Wow, thank you for the detailed response!

I really want to say yes, I'm interested in the data. But I'm even more interested in the questions you inspired:

1. Are you updating the db using your pipeline regularly?

2. Are you hosting the data somewhere, or is it just kept locally?

3. Are you storing the unfiltered responses (which might be huge)?

4. I assume that number is a subset of “all” the notes, so how do you decide which notes not to add to the db?

So glad I encountered your newsletter. The work is genuinely great!

Finn Tropy

1. I did run it for a few weeks back in November. I should probably do another run.

2. I'm just running it on my local Mac Mini for now. I did test running the pipeline in AWS using Lambda and Step Functions, but I didn't finish that project. I did build in exponential backoff for when the APIs return 429 errors, to protect Substack's servers from overload.

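The retry logic is roughly this (simplified sketch; the retry limit and delays are tuning choices):

```python
import time
import requests

def get_with_backoff(url, params=None, max_retries=5, base_delay=1.0):
    """GET a URL, backing off exponentially whenever the server answers 429."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Rate limited: wait 1s, 2s, 4s, 8s, ... before trying again.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```
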
3. I built caching of the JSON responses to the local file system as an option, but I didn't keep all the JSON data after initial testing. It consumes a lot of disk space, so storing the data in an AWS S3 bucket for later analysis might make sense. In my previous roles, I've used tools like AWS Glue for ETL jobs.

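The caching is simple in principle: write each raw JSON response to a file keyed by the request URL and read it back instead of re-fetching; pushing the same files to an S3 bucket would be the cloud variant. A sketch, with an illustrative directory layout:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_path(url: str) -> Path:
    # One file per request URL, named by a hash so the path stays filesystem-safe.
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")

def load_cached(url: str):
    path = cache_path(url)
    return json.loads(path.read_text()) if path.exists() else None

def save_cached(url: str, data) -> None:
    cache_path(url).write_text(json.dumps(data))
    # Optional cloud copy for later analysis, e.g. with boto3:
    # boto3.client("s3").upload_file(str(cache_path(url)), "my-bucket", cache_path(url).name)
```
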
4. For each newsletter author, I pulled all the posts and notes that were available back in November. I created a simple relational database model using the Python SQLAlchemy ORM, which writes the objects into the SQL database while preserving their relationships.

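A trimmed-down sketch of what the model looks like (just two tables, with illustrative column names rather than the full schema):

```python
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Author(Base):
    __tablename__ = "authors"
    id = Column(Integer, primary_key=True)
    handle = Column(String, unique=True)
    notes = relationship("Note", back_populates="author")

class Note(Base):
    __tablename__ = "notes"
    id = Column(Integer, primary_key=True)
    author_id = Column(Integer, ForeignKey("authors.id"))
    body = Column(String)
    published_at = Column(DateTime)
    author = relationship("Author", back_populates="notes")

# Works the same against SQLite or Postgres; only the connection URL changes.
engine = create_engine("sqlite:///substack.db")
Base.metadata.create_all(engine)
```
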
This allows me to build views across the 13K newsletters: categories, bestseller tiers, posting frequency, word counts, time of day of posts and notes, and other topics of interest. I also stored the URLs of the notes and posts in case a more thorough content analysis is needed.

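For example, a posting-frequency view is just a GROUP BY over the notes table (column names illustrative again), and Grafana or a notebook can then query it directly:

```python
import sqlite3

con = sqlite3.connect("substack.db")
con.execute("""
    CREATE VIEW IF NOT EXISTS weekly_note_counts AS
    SELECT author_id,
           strftime('%Y-%W', published_at) AS week,
           COUNT(*)                        AS notes
    FROM notes
    GROUP BY author_id, week
""")
con.commit()
```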