We had a problem today with a third-party aggregator, News Now, who have been using wget to scrape the Daily Express site to gather content. All of a sudden their files were being cut off around 100 characters before the end.
I tried it myself with curl and reproduced the same problem, though there were a few odd things about it.
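If you want to see this kind of mismatch for yourself, a check along these lines does the trick. The URL is just a stand-in for whichever page you're testing, and curl's --ignore-content-length flag tells it to read until the connection closes rather than stopping at the advertised length:

    # Advertised length, straight from the response headers
    curl -sI http://www.express.co.uk/ | grep -i '^content-length'

    # Bytes curl delivers when it trusts the header: it stops reading at
    # exactly the advertised Content-Length, so this comes up short
    curl -s http://www.express.co.uk/ | wc -c

    # Bytes the server actually sends: curl reads to the end of the
    # response instead of trusting the header
    curl -s --ignore-content-length http://www.express.co.uk/ | wc -c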
So, after eliminating the impossible (which I won’t bore you with), we were left with a problem that looked very improbable: New Relic were inserting JS into the <head> and just before the closing </html> tag to monitor users, but were not updating the HTTP Content-Length header to account for the extra bytes.
Browsers are smart enough to carry on past a missing or incorrect Content-Length, but wget and curl by default adhere strictly to it, reading exactly the number of bytes the header advertises, hence the truncation. That left us with two options:
1. Add the ‘--ignore-length’ option to wget (as shown below).
2. Take New Relic off the live servers.
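In practice, option 1 is a one-flag change: --ignore-length makes wget keep reading until the server closes the connection instead of stopping at the advertised length (the URL here is again a stand-in):

    # Ignore the stale Content-Length and read until the connection closes
    wget --ignore-length http://www.express.co.uk/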
We spoke to New Relic, who told us we could switch off the automated JS injection and instead insert the snippet ourselves on every page. Doesn’t sound like much fun.
The long-term solution would be for New Relic to update the Content-Length after it has messed around with the HTML, or to drop the header entirely so clients simply read to the end of the response, but it doesn’t look like this is going to happen.
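In miniature, that fix looks something like this. The file names are made up for illustration; the point is only that the length has to be computed from the rewritten document, never carried over from the original:

    # Inject the monitoring snippet, then derive Content-Length from the
    # rewritten file (page.html and monitor.js are hypothetical names)
    sed 's#</head>#<script src="monitor.js"></script></head>#' page.html > injected.html
    printf 'Content-Length: %d\r\n' "$(wc -c < injected.html)"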