Tracking Sustainable IT: Introducing my Arxiv Frontpage | Tom Kennes

In Short

Keeping up with the ever-evolving world of scientific research can be a bit like chasing your hyperactive cat through a maze. And when it comes to the intersection of sustainability and IT, things seem to be moving faster and faster. But hey, it’s exciting! And doing things more sustainably while keeping our standard of living might be the most difficult thing our generation has to solve!

One day, while scrolling through my LinkedIn feed, I stumbled upon a post from Vincent Warmerdam. He had this brilliant idea: a personal Arxiv Frontpage. Genius! I absolutely love it! So I thought, “Why not do something similar, but with a sustainable twist?”

Long story short, here’s the result:



Some of the Code Behind It

To bring this idea to life, I turned to a nifty little Python package, arxiv. This gem handles all the heavy lifting, from API requests to object parsing, making life a whole lot easier.

Here’s a snippet of the code that makes that happen:

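import arxiv

# Placeholder keywords; the real feed uses sustainability-related terms.
SEARCH_WORDS = ['hello', 'world']
# arXiv category identifiers use a lowercase archive prefix, e.g. cs.AI.
cls = "cs.AI"

# Match any keyword in the title (ti:) or abstract (abs:), within one category.
search = arxiv.Search(
    query="(" + ' OR '.join(f"ti:{word} OR abs:{word}" for word in SEARCH_WORDS) + f') AND cat:{cls}',
    max_results=100,
    # Order the results by submission date.
    sort_by=arxiv.SortCriterion.SubmittedDate
)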

In a nutshell, I set it up to:

  • Look for articles that mention words like “Sustainable,” “Sustainability,” “Carbon,” or “Emissions” in their titles or abstracts.
  • Focus on Computer Science subcategories in Arxiv and a few other intriguing ones.
  • Scope out the last 30 days of research.
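
Covering several categories at once only changes the query string. Here is a minimal sketch of how that could look; the `build_query` helper and the category list are my own illustrative assumptions, not necessarily what runs behind the feed:

```python
def build_query(words, categories):
    """Combine keywords (title or abstract) with a set of arXiv categories."""
    # Each keyword may match either the title (ti:) or the abstract (abs:).
    keyword_part = " OR ".join(f"ti:{w} OR abs:{w}" for w in words)
    # A paper qualifies if it is listed in any of the given categories.
    category_part = " OR ".join(f"cat:{c}" for c in categories)
    return f"({keyword_part}) AND ({category_part})"


# Illustrative values only: the category list is an assumption.
KEYWORDS = ["sustainable", "sustainability", "carbon", "emissions"]
CATEGORIES = ["cs.AI", "cs.LG"]

query = build_query(KEYWORDS, CATEGORIES)
print(query)
```

The resulting string can be passed as the `query` argument of `arxiv.Search`, just like in the snippet above.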

To avoid clutter, I filter out duplicates, sort the results by date, and keep around 30 or so of the latest papers. I store these gems in a Git repo because, hey, sometimes simplicity is key, and online storage or databases can be overkill.
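
The clean-up step can be sketched roughly as follows. This is a simplified stand-in rather than the real script: the `Paper` class and the `select_latest` helper are hypothetical, and the field names merely mirror what I would expect a search result to carry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Paper:
    # Hypothetical, trimmed-down stand-in for a search result.
    entry_id: str
    title: str
    published: datetime


def select_latest(papers, now, days=30, limit=30):
    """Drop duplicates and old papers, then keep the newest `limit` entries."""
    cutoff = now - timedelta(days=days)
    unique = {}
    for paper in papers:
        # The same paper can come back from several queries; keep one copy.
        if paper.published >= cutoff:
            unique.setdefault(paper.entry_id, paper)
    newest_first = sorted(unique.values(), key=lambda p: p.published, reverse=True)
    return newest_first[:limit]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
papers = [
    Paper("a", "Green GPUs", now - timedelta(days=2)),
    Paper("b", "Carbon-aware scheduling", now - timedelta(days=1)),
    Paper("a", "Green GPUs", now - timedelta(days=2)),  # duplicate
    Paper("c", "Old news", now - timedelta(days=45)),   # outside the window
]
print([p.entry_id for p in select_latest(papers, now)])  # ['b', 'a']
```

Passing `now` in explicitly keeps the function easy to test; in a scheduled run it would simply be the current time.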

Now, every morning, a GitHub action kicks in, like clockwork. Voilà, mission accomplished!

When a Personal Project Takes a Life of Its Own

Then you revisit your solution, decide that you should track more than strictly necessary, do some D3 graphing because that seems easy, and end up with a bunch of additional JSON files to monitor categories over time. Personal projects tend to evolve and suck you in, but now I'll really stop.

Then you decide not to stop and add a count of the total number of articles over time. You include only details that cannot violate GDPR (this information is all public, but you’re doing this solo and want to stay far away from any potential legal conflicts), you aggregate those details because you don’t want your repository to be flooded by them, and then you actually stop.

And then, I finally stopped. But that’s the magic of personal projects, right? They have a way of pulling you back in. Oh, but don’t worry, I won’t be setting up databases, load balancing, or AKS clusters. Not this time, at least. Maybe a little refactoring, though. A little refactoring never hurt nobody!
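
For the curious, the counting part could look something like the sketch below. It is an assumption-laden illustration, not the actual script: the `update_counts` helper and the file name are made up, and the only thing taken from the real setup is the shape of the records, a `date` in `YYYY-MM-DD` form plus a `count`, which is what the D3 code further down reads. Only aggregates are stored, no per-author details.

```python
import json
import tempfile
from collections import Counter
from datetime import date
from pathlib import Path


def update_counts(path, published_dates):
    """Merge per-day article counts into a JSON list of {date, count} records."""
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else []
    counts = {row["date"]: row["count"] for row in history}
    # Days seen in this run are (re)counted; older days in the file are kept.
    for day, n in Counter(d.isoformat() for d in published_dates).items():
        counts[day] = n
    # ISO dates sort chronologically as plain strings.
    rows = [{"date": day, "count": counts[day]} for day in sorted(counts)]
    path.write_text(json.dumps(rows, indent=2))
    return rows


# Demo in a temporary directory so nothing is written to the repository.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "counts.json"  # hypothetical file name
    update_counts(target, [date(2024, 1, 2), date(2024, 1, 1), date(2024, 1, 2)])
    print(update_counts(target, [date(2024, 1, 3)]))
```

Because the file is plain JSON in the repository, the daily GitHub Action can commit it like any other change.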

The Automation Behind It

I just love the simplicity of GitHub Actions. Plus, you can re-use open-source actions to outsource some of the heavy lifting! Thanks, Stefan Zweifel, for creating the auto commit action!

The result:

name: Run Arxiv.py

on:
  schedule:
    - cron: "30 1 * * *"

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout Repo Content
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install Python Packages
        run: |
          python -m pip install --upgrade pip
          pip install -r ./arxiv/requirements.txt

      - name: Execute Python Script
        run: python ./arxiv/main.py

      - uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: Automated Arxiv Update

And a bit of hacky D3

It would be interesting to keep track of how many articles appear each day. Not only to check whether I need to adjust the Python code, but also to see whether the field is accelerating!

So, here is the result! And since it’s essentially the same graph you find on the Sustainable IT Arxiv Feed, it’s automatically updated too!

Note that the search is quite broad and depends heavily on how eagerly authors use my keywords for unexpected purposes. So the results should be treated with some scrutiny.

And here is some of the code that’s responsible for that. I can share more in case you’re interested!

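    // Note: svg, width, height and days are assumed to be defined elsewhere on the page.
    d3.json("/json/my-json.json",

      // Now I can use this dataset:
      function(data) {
        // Parse the date strings into Date objects
        data = data.map(function(d){ return {date: d3.timeParse("%Y-%m-%d")(d.date), count: d.count} });

        // Keep only the last `days` days
        var cutoffDate = new Date();
        cutoffDate.setDate(cutoffDate.getDate() - days);

        data = data.filter(function(d){ return d.date > cutoffDate });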

        // Add X axis --> it is a date format
        var x = d3.scaleTime()
            .domain(d3.extent(data, function(d) { return d.date }))
            .range([ 0, width ]);

        svg.append("g")
          .attr("transform", "translate(0," + height + ")")
          .call(d3.axisBottom(x));

        // Add Y axis
        var y = d3.scaleLinear()
            .domain([0, d3.max(data, function(d) { return +d.count; })])
            .range([ height, 0 ]);

        svg.append("g")
          .call(d3.axisLeft(y));

        // Add the line
        svg.append("path")
          .datum(data)
          .attr("fill", "none")
          .attr("stroke", "steelblue")
          .attr("stroke-width", 1.5)
          .attr("d", d3.line()
              .x(function(d) { return x(d.date) })
              .y(function(d) { return y(d.count) })
              )
      }
    )

The Python code

For those who want to dive into the hacky Python code, shoot me a message and I’ll share it with you!