
Writing

Software, technology, sysadmin war stories, and more.

Monday, May 27, 2024

So many feed readers, so many bizarre behaviors

It's been well over a year since I started serving 429s to clients which are hitting the feed too often. Since then, much has happened, and most of it is generally good news.

I've heard from both users and authors of feed software. Sometimes the users have filed bug reports and/or feature requests and have gotten positive results from the project (or vendor). Other times, the authors of such software have gotten in touch, done some digging, found a few nuances of how their libraries work, and improved the situation.

Some of them are trying but are still not quite making it right.

Here's some of what's been going on.

...

At least one reader improved to not send a date from 1800. Unfortunately, it's now sending the wrong value in its If-Modified-Since headers. Instead of sending the value it obtained from the Last-Modified header on the past fetch, it's using the value from the "updated" element in the feed itself.

These are different layers of the system, and you can't mix their values together. They're close, but not exactly the same. This is how it works.
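Here is a minimal Python sketch of the HTTP-layer side of that. The function names and the shape of the saved state are made up for illustration. The point is only that the validators come from the previous response's headers and get echoed back unchanged, and nothing from inside the feed document is involved.

```python
from typing import Dict, Mapping, Optional


def save_validators(response_headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Keep the validators exactly as the server sent them on the last fetch."""
    # Header names are case-insensitive, so normalize the keys before lookup.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return {
        "last_modified": lowered.get("last-modified"),
        "etag": lowered.get("etag"),
    }


def conditional_headers(saved: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Build request headers for the next poll from the saved validators."""
    headers: Dict[str, str] = {}
    if saved.get("last_modified"):
        # Echo the string byte for byte; do not reparse or reformat it,
        # and do not substitute a timestamp taken from the feed body.
        headers["If-Modified-Since"] = saved["last_modified"]
    if saved.get("etag"):
        headers["If-None-Match"] = saved["etag"]
    return headers
```

A reader with nothing saved yet gets an empty dict back and simply makes an unconditional request the first time.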

I get the impression from stalking their issues that they don't really control their HTTP requests very well because it's done by some other library. That's where the 1800 thing came from in the first place. It sounds like yet another case of using libraries that really don't do their jobs properly.

...

There's another one which hits every 2 minutes without fail, and there's no way to change it. I've even installed that app on a test account and verified this myself - it's hard-coded. As a result, it's the one feed user-agent I've had to block outright. It doesn't stop requesting things even when it hits a brick wall of 403s. Clearly, I need to use a bigger hammer.

...

A fair number of people are sending conditional requests, but are doing it every 5 or 10 minutes. This is ridiculous. I don't write that often, and never have. Polling more is not going to get you anywhere, and indeed, will now get you delayed so you get your updates much later than the well-behaved people. Knock it off.

It seems like most of these come from things which appear to just be jammed into web browsers as some kind of extension. From what at least one developer has told me, it seems like they don't do conditional requests as a matter of course. This, despite being part of a web browser ecosystem which has understood the notion of a conditional request and caching things locally for nearly three decades. Amazing.

...

A while back, I added a "Retry-After: " header to the feed. Anyone who gets a 429 will also get intel on when they should try back. It's in seconds, so it'll be something like 3600 or 86400 depending on which kind of request was sent in the first place.

There are feed services which will actually reset their countdowns every time someone trips a 429. I'm not doing that. Yet.

This is why noticing and honoring that header matters.
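As a rough sketch of what honoring it could look like on the client side: the function name and the one-hour fallback below are assumptions for illustration. HTTP also allows Retry-After to carry a date instead of a number of seconds, so the sketch handles both.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def next_allowed_fetch(retry_after: Optional[str], now: float,
                       default_delay: float = 3600.0) -> float:
    """Return the earliest Unix time at which the client should poll again."""
    if retry_after is None:
        # No hint from the server: fall back to an assumed default delay.
        return now + default_delay
    value = retry_after.strip()
    if value.isascii() and value.isdigit():
        # Delay-seconds form, e.g. "3600" or "86400".
        return now + int(value)
    try:
        # HTTP-date form, e.g. "Mon, 27 May 2024 07:00:00 GMT".
        return max(now, parsedate_to_datetime(value).timestamp())
    except (TypeError, ValueError):
        # Unparseable value: back off by the default rather than retry at once.
        return now + default_delay
```

The scheduler then just refuses to touch the feed until the clock passes the returned time.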

...

Oh, here's a new thing: goofy programs that try to "guess" the feed URL. I see all kinds of stupid requests to paths that might have a feed on them. This is a new level of density on the part of the authors of those programs.

Here's the thing. I've had metadata in the top of every single /w/ post *and* its index since some time in 2012. It looks like this:

<link rel="alternate" type="application/atom+xml" href="/w/atom.xml">

If you view source on this post or any other on the web, you'll see it up there, just hanging out.

I did that way back then because browsers used to care about RSS and Atom, and they'd put that little yellow feed icon somewhere in the top bar when they spotted this sort of thing in a page. At least in the case of Firefox, you could click on it, and it would throw the target URL to a helper of your choice.

I wrote a feed reader system at the time (remember fred?), and indeed, I could click on that icon and it would flip the feed URL over to my "subscribe to new feed" handler. It was easy.

Then, something happened, and browsers gave up on feeds, and the icon disappeared. I kept it there anyway, figuring people would make use of it. It's still the right way to programmatically find out where to get an Atom feed for the content you're looking at.

So what's with all of the groping around in the dark with made-up URLs?
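For comparison, finding the feed from that link element takes very little code. This is a sketch using only the Python standard library. The class and function names are invented here, and blog.example.org is a placeholder host.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        kind = attr.get("type", "").strip().lower()
        if "alternate" in rels and kind in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs such as "/w/atom.xml" against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(page_url: str, html: str) -> List[str]:
    finder = FeedLinkFinder(page_url)
    finder.feed(html)
    return finder.feeds
```

Fetch the page the user gave you once, run it through something like this, and there's no need to guess at paths.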

...

This one blows my mind. I put together a page which has the feed URL on it as just plain text, not a link. I've seen people paste it into their feed reader and include spaces and even newlines. Seriously!

I know this because I get requests for things like "/w/atom.xml%20" over and over from feed readers which obviously don't notice they get a 404 every time.
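A reader could guard against this when the URL is first entered. The helper below is a hypothetical sketch. It assumes any whitespace in a pasted URL is a copy-and-paste accident, since a valid URL can't contain raw spaces or newlines anyway.

```python
def clean_feed_url(pasted: str) -> str:
    """Drop spaces, tabs, and newlines picked up while copying a URL."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes leading, trailing, and interior whitespace.
    return "".join(pasted.split())
```

That handles the input side. Noticing that the same URL has returned a 404 on every single fetch is a separate check the reader still needs.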

...

Now we get to the part where I pitch a way forward, and nobody takes me up on the offer. The idea is basically this: I get some kind of commitment and support from the people who do feed reader stuff, and in turn, I build a new kind of web site which amounts to a "feed reader correctness score".

It would probably work like this: you load up a page and it hands you a special (fake) feed URL that is keyed to you and you alone. You plug it into your feed reader program through whatever flow and it will keep track of every single request to that keyed URL.

Then, after it had collected data for a while, a report would eventually become available. Just off the top of my head, the kinds of things it might say could look like this:

* Poll history: 46 checks in the past 48 hours (average 62 minutes)

* Request types: (1) unconditional (45) conditional

* If-Modified-Since timestamps: (45) matches (0) made up from whole cloth

* ETag hashes: (45) matches (0) made up from whole cloth

* Useless cookies sent: none!

* Useless referrers sent: none!

* Useless CGI arguments sent: none!

* User-agents: (40) FooGronk/1.0 +http://fg.example.org/ (6) FooGronk/1.01 +http://fg.example.org/

That's the kind of stuff I'd expect to see from a nigh-perfect reader. It connects at a reasonable pace, it sends headers with correct values, and it doesn't send along stuff like cookies that I never set in the first place.
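To make the first two report lines concrete, here is a sketch of how they might be computed from a request log. The Hit record and its field names are hypothetical. No such service exists yet, so this shows only the arithmetic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    when: float                       # Unix time of the request
    if_modified_since: Optional[str]  # header value as received, if any
    if_none_match: Optional[str]      # header value as received, if any


def summarize(hits: List[Hit]) -> dict:
    """Count polls, average the gap between them, and split by request type."""
    times = sorted(hit.when for hit in hits)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    conditional = sum(
        1 for hit in hits if hit.if_modified_since or hit.if_none_match
    )
    return {
        "checks": len(hits),
        # None when there are fewer than two requests to compare.
        "average_gap_minutes": (sum(gaps) / len(gaps) / 60) if gaps else None,
        "conditional": conditional,
        "unconditional": len(hits) - conditional,
    }
```

Checking whether the timestamps and ETags actually match what was served would need the server's own record of what it sent to that keyed URL, which this sketch leaves out.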

But, okay, this is nothing but vaporware unless someone actually wants it, is willing to support it, and will commit to take actions based on what it says.

There's a bigger lesson here: don't measure stuff if nobody's going to take actions based on the results. It only ever ends in misery. I wanted to write a separate post about this very topic, but figured I'd give a preview of it right here.

Okay world, surprise me. Do the right thing.