Initial Forays Into Scraping Hacker News

Like many other technically minded programmers, I've spent some time on Paul Graham's Hacker News[0] web site. There are some gems there, although there is also some dross. I've decided that it's important for me to be more aggressive in my filtering.

So I've done what I always do and indulged in some Yak Shaving[1][2]. I've been thinking about automatic and semi-automatic filtering, writing a system that automatically finds stories it thinks I'll find interesting, and then leavening that with some other material to try to help prevent the "echo chamber" effect[3].

I've done a little reading around to see what might be tolerated. An old comment by Paul Graham[4] seems to indicate that a pull every 30 seconds or so would be OK, and that's supported by the robots file[5]. So I've written a script that pulls an item with its discussion, extracts the hierarchy and saves it, then waits a minute or so and goes again.
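As a rough illustration of that loop, here is a minimal sketch in Python. It is not the script I actually run: the names polite_delay and crawl are made up for this example, and fetch and save are stand-ins for whatever does the real downloading and storing. It assumes the robots file declares its limit with a Crawl-delay line, and it never waits less than a floor of a minute.

```python
import time
import urllib.robotparser


def polite_delay(robots_txt, user_agent="*", floor_seconds=60.0):
    """Return the wait between pulls: the declared Crawl-delay, but never
    less than floor_seconds (an assumed safety margin of one minute)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    if declared is None:
        return floor_seconds
    return max(float(declared), floor_seconds)


def crawl(item_ids, fetch, save, delay, sleep=time.sleep):
    """Pull each item in turn, save it, then wait before the next pull.

    fetch(item_id) and save(item_id, page) are supplied by the caller;
    both are hypothetical placeholders here.
    """
    for item_id in item_ids:
        save(item_id, fetch(item_id))
        sleep(delay)
```

Passing sleep in as a parameter makes it possible to try the loop out without actually waiting.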

Actually, I started by looking at HNSearch[6]. Another comment suggested that we had been asked to use that API rather than scraping HN directly. Well, I've done that, and it seems not to have the first million or so items, and the results of the searches I've done are just full of holes. It seems a reasonable first step, but the database I've pulled from it is just woeful, with only about 2% to 3% of items being present. I've randomly chosen items that should be covered by a search query, and they're just not there.
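The spot check can be sketched like this. It is only an illustration under two assumptions: that item ids are consecutive integers starting at 1, and that the ids already pulled are held in a Python set. The function name estimate_coverage is invented for the example.

```python
import random


def estimate_coverage(present_ids, max_id, sample_size=1000, seed=None):
    """Estimate the fraction of ids 1..max_id that are in present_ids by
    checking a random sample (without replacement) of sample_size ids."""
    rng = random.Random(seed)
    sample = rng.sample(range(1, max_id + 1), sample_size)
    hits = sum(1 for item_id in sample if item_id in present_ids)
    return hits / sample_size
```

The sample size must not exceed max_id, and a larger sample gives a steadier estimate.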

So I'm intending to rethink that, to look for other sources or for better ways of interrogating that source, and in the meantime I've started up the direct scraping.

And been IP blocked.

OK, I've backed off, reset my modem to change my IP (just this time - I don't do that very often), changed to pulling only every five minutes, and been blocked again. At a rate of one pull every 5 minutes I expect to get the first million or so entries by late 2017.

So it's time to reconsider. Do I check for failed requests, back off, try later, and hope to get my IP unblocked? Do I expect to find a sustainable rate? When the robots.txt file says 30 seconds, PG's comment says 30 seconds, and yet I get IP blocked for querying no more than once every 5 minutes, it seems that there's more going on.
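The "check for failed requests, back off, try later" option might look something like the following sketch. It is one possible approach, not something I know will get an IP unblocked. The Blocked exception, the function name, and all the numbers (start at five minutes, double each time, cap at a day, give up after six attempts) are assumptions chosen for illustration.

```python
import time


class Blocked(Exception):
    """Raised by the caller's fetch function when a request is refused."""


def fetch_with_backoff(fetch, base_delay=300.0, factor=2.0,
                       max_delay=86400.0, max_attempts=6, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each refusal.

    The wait starts at base_delay seconds and is multiplied by factor after
    every refusal, up to max_delay. The last refusal is re-raised.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Blocked:
            if attempt == max_attempts - 1:
                raise
            sleep(delay)
            delay = min(delay * factor, max_delay)
```

Whether any rate found this way is sustainable is exactly the open question.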

And when the officially sanctioned source of material gives only about 3% of the results, you start to think that there has to be a better way.

I think I'd better think it out again.

[0] https://news.ycombinator.com/news

[1] http://en.wiktionary.org/wiki/yak_shaving

[2] http://sethgodin.typepad.com/seths_blog/2005/03/dont_shave_that.html

[3] http://en.wikipedia.org/wiki/Echo_chamber_%28media%29

[4] https://news.ycombinator.com/item?id=1721997

[5] https://news.ycombinator.com/robots.txt

[6] https://www.hnsearch.com/api
