I rebuilt my blog's cache. Bots are the audience now

(hoeijmakers.net)

21 points | by robhoeijmakers 3 hours ago

6 comments

  • chrismorgan 1 minute ago
    I’m very confused about why you’d have such a complex cache arrangement. It sounds like you’re using Cloudflare and Fastly to do roughly the same thing, which sounds like a recipe for more expense and more problems.

    Also the diagram near the end is pretty much incoherent. GenAI, I presume.

  • pavel_lishin 48 minutes ago
    > Not because I expect a person in Singapore to shave 200ms off their pageload, but because the next request for that page is more likely to come from a retrieval system than a browser, and the request after that, and the one after that.

    Why do I care if I shave off 200ms from a crawler's request, instead of a human's?

    • Brybry 18 minutes ago
      The graphic in the article seems to be the only significant content.

      Based on that, I think it's more about giving requests from bots/scrapers the greatest possible chance of hitting a cache before they reach the blog's origin/real host. Bots hit some layer of Cloudflare first, then Fastly, and only if the page isn't in Fastly do they reach the Ghost blog's server.

      To me, this makes a lot of sense if it's self-hosted, but I also thought it was already the standard to shove your self-hosted blog behind a reverse proxy and cache as much as possible.
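
      As a rough sketch of that standard setup (not what the article describes; Ghost's default port 2368 and the names and TTLs here are all illustrative), a single nginx layer in front of Ghost already does most of the work:

          proxy_cache_path /var/cache/nginx/blog levels=1:2
                           keys_zone=blog:10m max_size=1g inactive=7d;

          server {
              listen 80;
              server_name blog.example.com;

              location / {
                  proxy_cache blog;
                  proxy_cache_valid 200 301 1h;                      # cache good responses for an hour
                  proxy_cache_use_stale error timeout updating;      # serve stale pages if Ghost is down
                  add_header X-Cache-Status $upstream_cache_status;  # HIT/MISS/STALE, handy for debugging
                  proxy_set_header Host $host;
                  proxy_pass http://127.0.0.1:2368;                  # Ghost's default port
              }
          }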

      And I'm not a professional web developer, but all the extra caching layers for a static personal blog seem a bit overkill.

      Aside from the graphic, the article is a lot of words about engaging with an LLM to get a full understanding of how caching works for their blog hosting and how it enabled them to change their setup for the better.

      It's kind of hard to evaluate because there are no words about what they actually changed, or why the change was better.

    • m0rde 27 minutes ago
      From the post:

      > If you care about how your content moves through the world now, including through AI systems, you have to care about caching. Not as a performance optimisation for human browsers, but as infrastructure for machine readership.

    • rodw 34 minutes ago
      Page load time can impact index coverage (depth of crawl), freshness (revisit rate), and ranking.
  • steve_adams_86 4 minutes ago
    I went through a similar process recently. For a while I saw readership of my site gradually increasing, and eventually it became clear that the readers weren't human beings.

    I also used Claude to help me drill into what was going on. Bizarrely, about 80% of my traffic comes from Singapore, which the author also mentioned. I don't know why. A lot of the traffic looks real; it stays for a while and clicks different links in different orders. But as far as I can tell, no one in Singapore has ever read a thing I've written on my site.

    I thought Cloudflare would help protect my site from bots, but it utterly fails. I'm not sure if the bots are too sophisticated or if people overestimate how well CF works for these things. I paid for advanced features for a while and reverted to the free plan once I realized it made no difference. It's a great platform in general, but it hasn't been great at showing me how many humans actually read my content.

    I know some do because they email me occasionally. If I had to guess, of the ~200 visits per week reported in analytics, around 15 are real.

  • jdw64 2 hours ago
    Personally, I think this is a good idea. But the core problem is this: how is a newcomer supposed to build reputation now? Without exaggerated business promises or capital, basic online reputation usually depends on writing. In fact, my own first step into freelancing came because someone found the articles on my Korean blog interesting.

    So the question is: if the subscribers are bots, what benefit do they actually give me? If bots become the readers, then what matters is whether they can provide any kind of symbolic capital or real capital. I can build caching with Redis without much difficulty, but I worry that if this continues, the result may simply be that LLMs learn from my writing while no benefit returns to me.

    People write partly to organize their thoughts, but also partly to gain symbolic capital. That is one reason why I write my own posts instead of using an LLM to write them for me.
    • nilirl 4 minutes ago
      I feel that pressure of not knowing how to compete on the internet, especially when there's so much AI-created noise.

      I'm a copywriter and I used to get hired to write posts on behalf of founders on LinkedIn or for their company blog.

      Now, the last three jobs I had were all focused on sending cold email.

    • pixl97 52 minutes ago
      >How is a newcomer supposed to build reputation now

      The dead internet theory, made manifest.

    • johng 56 minutes ago
      What's worse is that they train on your content, and very often you don't even get an attribution link. So the end user never knows it was your site that provided the information, and you never get a single clickthrough. It's not like the SERPs, where someone would click through, read your site, hopefully find it interesting and useful, and come back.

      It's going to be a serious problem, and I've already seen sites that are down 90% in traffic simply because AI is scraping them, answering the questions itself, and never providing a linkback.

      • 01284a7e 40 minutes ago
        I pulled all the websites I had - some existed for a decade plus and made me hundreds of thousands of dollars. All that is left is bots that steal the value of my work. Until something changes, goodbye.
        • gbgarbeb 15 minutes ago
          This is like choosing to be an elementary school teacher and then quitting because it turns out your students for the year aren't your pets in perpetuity.
          • diatone 8 minutes ago
            If your students were growing up to subvert your line of work, sure. Pretty sure that’s not the case though!
    • robhoeijmakers 38 minutes ago
      [flagged]
  • cullumsmith 43 minutes ago
    I simply block all AI crawlers with a user-agent check in nginx.conf.
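
    Something like the following (a sketch, and the user-agent list is illustrative; it needs regular updating as new crawlers appear):

        # Flag known AI crawler user agents, case-insensitively.
        map $http_user_agent $is_ai_bot {
            default 0;
            ~*(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|meta-externalagent) 1;
        }

        server {
            listen 80;
            server_name blog.example.com;

            if ($is_ai_bot) {
                return 403;  # or 444 to close the connection without a response
            }

            location / {
                proxy_pass http://127.0.0.1:2368;  # wherever the blog actually lives
            }
        }
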
    • microtonal 29 minutes ago
      I also block all AI crawlers. I'm not sure why I should give them my content just for them to rip it off and make money from it through training or agents. Sadly, a lot of AI companies are trying to make their requests indistinguishable from regular browsers on residential connections, so unfortunately I have to use Cloudflare to block them.

      Ideally I'd make the content available to crawlers that train open models, but that seems to be nearly impossible. It would only be possible if the other AI companies behaved.
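
      In principle that selective policy is just an allowlist in front of the same kind of user-agent block, e.g. in nginx (a sketch; which crawlers count as "open" is a judgment call, and it only works against crawlers that send a truthful user agent, which is exactly the problem):

          # nginx checks map regexes in the order they appear,
          # so trusted crawlers are exempted before the catch-all.
          map $http_user_agent $blocked_crawler {
              default 0;
              ~*CCBot 0;               # exempt: Common Crawl feeds open datasets (illustrative choice)
              ~*(bot|spider|crawl) 1;  # aggressive catch-all; also hits search engines
          }

      with the same if ($blocked_crawler) { return 403; } in the server block as above.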

      • Barbing 14 minutes ago
        >so unfortunately I have to use Cloudflare to block them.

        That can’t block Grok, can it?

        (If you ask Grok to retrieve information from your site, you might see a fake iPhone or something similar show up in your logs.)

    • orf 24 minutes ago
      *some AI crawlers. Not many
    • robhoeijmakers 34 minutes ago
      I started blocking some of them. But for now I want better visibility into the traffic before blocking or optimising further. The dashboard helps with this.
  • Hackbraten 2 hours ago
    Why do I get just an empty page?
    • robhoeijmakers 36 minutes ago
      Thanks. It seems to be very local/incidental. The page works from the locations I can test, but I’ll check whether one edge cache or request path served a bad response.
    • consumer451 42 minutes ago
      Same here via VPN. No VPN, and I get the actual content.
    • ksk23 46 minutes ago
      Caching gone wrong... (Works for me)