Tracking Sustainable IT: Introducing my Arxiv Frontpage
In Short
Keeping up with the ever-evolving world of scientific research can be a bit like chasing your hyperactive cat through a maze. And when it comes to the intersection of sustainability and IT, things seem to be moving faster and faster. But hey, it’s exciting! And doing things more sustainably while keeping our standard of living might be the most difficult thing our generation has to solve!
One day, while scrolling through my LinkedIn feed, I stumbled upon a post from Vincent Warmerdam. He had this brilliant idea: a personal Arxiv Frontpage. Genius! I absolutely love it! So I thought, “Why not do something similar, but with a sustainable twist?”
Long story short, here’s the result:
Some of the Code Behind It
To bring this idea to life, I turned to a nifty little Python package, arxiv. This gem handles all the heavy lifting, from API requests to object parsing, making life a whole lot easier.
Here’s a snippet of the code that makes that happen:
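import arxiv

# Placeholder keywords and category; swap in your own.
SEARCH_WORDS = ['hello', 'world']
cls = "cs.AI"  # arXiv writes category identifiers with a lowercase archive prefix

# Match any keyword in the title (ti:) or abstract (abs:), within one category (cat:).
search = arxiv.Search(
    query="(" + ' OR '.join(f"ti:{word} OR abs:{word}" for word in SEARCH_WORDS) + f') AND cat:{cls}',
    max_results=100,
    sort_by=arxiv.SortCriterion.SubmittedDate
)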
In a nutshell, I set it up to:
- Look for articles that mention words like “Sustainable,” “Sustainability,” “Carbon,” or “Emissions” in their titles or abstracts.
- Focus on Computer Science subcategories in Arxiv and a few other intriguing ones.
- Scope out the last 30 days of research.
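The criteria above could be turned into a query string along these lines. This is a minimal sketch rather than the exact code behind the feed: `build_query` is a hypothetical helper, and the two categories are only examples, since the full category list isn't given here.

```python
def build_query(words, categories):
    """Build an arXiv query: any word in the title or abstract, in any listed category."""
    word_clause = " OR ".join(f"ti:{word} OR abs:{word}" for word in words)
    category_clause = " OR ".join(f"cat:{category}" for category in categories)
    return f"({word_clause}) AND ({category_clause})"


KEYWORDS = ["Sustainable", "Sustainability", "Carbon", "Emissions"]
CATEGORIES = ["cs.AI", "cs.LG"]  # example categories only (assumption)

query = build_query(KEYWORDS, CATEGORIES)
print(query)
```

The resulting string can be passed as the `query` argument of `arxiv.Search`, just like in the snippet above.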
To avoid clutter, I filter out duplicates, sort the results by date, and keep around 30 or so of the latest papers. I store these gems in a Git repo because, hey, sometimes simplicity is key, and online storage or databases can be overkill.
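That clean-up step might look roughly like this. It's a sketch under assumptions, not the code that runs the feed: each paper is represented as a plain dict with hypothetical `id` and `published` keys, and `latest_unique` is a name made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def latest_unique(papers, now, days=30, limit=30):
    """Keep papers from the last `days` days, drop repeated ids, newest first."""
    cutoff = now - timedelta(days=days)
    seen_ids = set()
    unique = []
    for paper in papers:
        # Skip anything older than the cutoff or already collected.
        if paper["published"] < cutoff or paper["id"] in seen_ids:
            continue
        seen_ids.add(paper["id"])
        unique.append(paper)
    # Newest papers first, then cap the list.
    unique.sort(key=lambda paper: paper["published"], reverse=True)
    return unique[:limit]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
papers = [
    {"id": "a", "published": now - timedelta(days=1)},
    {"id": "a", "published": now - timedelta(days=1)},   # duplicate
    {"id": "b", "published": now - timedelta(days=40)},  # too old
    {"id": "c", "published": now},
]
print([paper["id"] for paper in latest_unique(papers, now)])
```

Passing `now` in explicitly keeps the function easy to test; in a scheduled run it would simply be the current UTC time.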
Now, every morning, a GitHub action kicks in, like clockwork. Voilà, mission accomplished!
When a Personal Project Takes a Life of Its Own
Then you revisit your solution, decide that you should track more than is strictly necessary, do some D3 graphing because that seems easy, and end up with a bunch of additional JSON files to monitor categories over time. Personal projects tend to evolve and suck you in, but now I'm really stopping.
Then you decide not to stop and add a count of the total number of articles over time. You keep only details that can't violate GDPR (the information is all public, but you're doing this solo and want to stay far away from any potential legal conflict), you aggregate those details so the repository isn't overrun by them, and then you actually stop.
And then, I finally stopped. But that’s the magic of personal projects, right? They have a way of pulling you back in. Oh, but don’t worry, I won’t be setting up databases, load balancing, or AKS clusters. Not this time, at least. Maybe a little refactoring, though. A little refactoring never hurt nobody!
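For the curious, counting articles per day could be sketched like this. It's an illustration under assumptions, not my actual tracking code: `daily_counts` is a hypothetical helper, and the `date` and `count` field names are chosen to match what the chart code further down reads.

```python
import json
from collections import Counter
from datetime import date


def daily_counts(published_dates):
    """Aggregate publication dates into one {date, count} record per day."""
    counts = Counter(day.isoformat() for day in published_dates)
    # ISO dates sort chronologically when sorted as strings.
    return [{"date": day, "count": counts[day]} for day in sorted(counts)]


records = daily_counts([date(2024, 1, 2), date(2024, 1, 1), date(2024, 1, 2)])
print(json.dumps(records, indent=2))
```

Only the aggregated counts end up in the JSON, which keeps the files small and free of per-article details.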
The Automation Behind It
I just love the simplicity of GitHub Actions. Plus, you can reuse open-source actions to outsource some of the heavy lifting! Thanks, Stefan Zweifel, for creating the auto-commit action!
The result:
name: Run Arxiv.py
on:
  schedule:
    - cron: "30 1 * * *"
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout Repo Content
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install Python Packages
        run: |
          python -m pip install --upgrade pip
          pip install -r ./arxiv/requirements.txt
      - name: Execute Python Script
        run: python ./arxiv/main.py
      - uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: Automated Arxiv Update
And a bit of hacky D3
It would be interesting to keep track of how many articles appear each day. Not only to check whether I need to adjust the Python code, but also to see whether the field is accelerating!
So, here is the result! And since it’s essentially the same graph you’ll find on the Sustainable IT Arxiv Feed, it’s automatically updated too!
Note that the search is quite broad and depends heavily on how eager authors are to use my keywords for unexpected purposes. So treat the results with some caution.
And here is some of the code that’s responsible for that. I can share more in case you’re interested!
// Note: this callback form of d3.json matches D3 v4; from v5 on, d3.json returns a promise.
// `svg`, `width`, `height`, and `days` are assumed to be defined earlier in the page script.
d3.json("/json/my-json.json",

  // Now I can use this dataset:
  function(data) {

    // Parse the date strings into Date objects
    data = data.map(function(d){ return {date: d3.timeParse("%Y-%m-%d")(d.date), count: d.count} });

    // Keep only the last `days` days
    var cutoffDate = new Date();
    cutoffDate.setDate(cutoffDate.getDate() - days);
    data = data.filter(function(d){ return d.date > cutoffDate })

    // Add X axis --> it is a date format
    var x = d3.scaleTime()
      .domain(d3.extent(data, function(d) { return d.date }))
      .range([ 0, width ]);
    svg.append("g")
      .attr("transform", "translate(0," + height + ")")
      .call(d3.axisBottom(x));

    // Add Y axis
    var y = d3.scaleLinear()
      .domain([0, d3.max(data, function(d) { return +d.count; })])
      .range([ height, 0 ]);
    svg.append("g")
      .call(d3.axisLeft(y));

    // Add the line
    svg.append("path")
      .datum(data)
      .attr("fill", "none")
      .attr("stroke", "steelblue")
      .attr("stroke-width", 1.5)
      .attr("d", d3.line()
        .x(function(d) { return x(d.date) })
        .y(function(d) { return y(d.count) })
      )
  }
)
The Python code
For those who want to dive into the hacky Python code, shoot me a message and I’ll share it with you!