Data collection methods

Each comment on an Ask HN thread is parsed for book names or popular acronyms (like GEB or SICP).
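As a rough illustration of this kind of parsing, here is a minimal sketch, not the code the site actually runs. The `BOOK_ALIASES` table and the `find_books` function are hypothetical names invented for this example, and the matching rule (acronyms matched case-sensitively as whole words, full titles matched case-insensitively) is an assumption.

```python
import re

# Hypothetical alias table for illustration only; the site's real catalogue is not shown here.
BOOK_ALIASES = {
    "Gödel, Escher, Bach": ["Gödel, Escher, Bach", "Godel, Escher, Bach", "GEB"],
    "Structure and Interpretation of Computer Programs": [
        "Structure and Interpretation of Computer Programs",
        "SICP",
    ],
}


def find_books(comment: str) -> set:
    """Return the canonical titles of every known book named in the comment."""
    found = set()
    for title, aliases in BOOK_ALIASES.items():
        for alias in aliases:
            # Assumption: all-caps aliases are acronyms and must match exactly,
            # while full titles may appear in any letter case.
            flags = 0 if alias.isupper() else re.IGNORECASE
            # Lookarounds keep "GEB" from matching inside a longer word.
            pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
            if re.search(pattern, comment, flags):
                found.add(title)
                break
    return found
```

A plain lookup like this would still produce false positives, which is one reason a manual approval step is useful.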

A book's name appearing in a comment counts as a "hit" as long as the comment about the book is positive or lacks context (i.e. the comment is just a list of books).

A hit is unique per combination of book name, HN username, and Ask HN thread. The same user's recommendation for a single book therefore counts again each time it appears on a new HN thread. I wanted to capture how users recommend a book across categories and time.
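The uniqueness rule can be sketched as a set of (book, username, thread) triples. This is only an illustration under assumptions: the `Hit` type, its field names, and `unique_hits` are hypothetical, and the real data model may differ.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical field names; together they form the uniqueness key.
    book: str
    username: str
    thread_id: int


def unique_hits(raw_hits):
    """Drop repeated (book, username, thread) triples, keeping first-seen order."""
    seen = set()
    result = []
    for hit in raw_hits:
        if hit not in seen:
            seen.add(hit)
            result.append(hit)
    return result
```

With this key, a user naming the same book twice in one thread yields one hit, while naming it in two different threads yields two.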

Each hit is worth just one "point" for a book; no karma or other data was used in the book rankings.
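Scoring of this kind amounts to counting hits per book. A minimal sketch, assuming hits arrive as already-deduplicated (book, username, thread) tuples; `rank_books` is a hypothetical name, and how the site breaks ties is not stated, so this example simply keeps first-seen order for equal counts.

```python
from collections import Counter


def rank_books(hits):
    """Rank books by hit count; hits are (book, username, thread_id) tuples.

    Every hit adds exactly one point. Karma, comment score, and similar
    signals are deliberately ignored.
    """
    points = Counter(book for book, _username, _thread_id in hits)
    # most_common() sorts by count, highest first.
    return points.most_common()
```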

Full comments are truncated to just the part about the book in question. On an individual book's page, you can click the HN username next to a comment to see the full comment on HN. This process grew over time and I'm still trying to improve it.

If a comment only listed the name of the book, or distilling the parts of a comment that are about a book proved too difficult, the comment appears as a "mention" on an individual book's page.

These methods are just guidelines I tried to adhere to. There's still a manual component to approving each parsed hit and I'm sure mistakes were made.


Some books have editions freely available online. I didn't set aside the time to track down the rights for these links, so none are included in this first pass, but I would like to include links to proper free editions in the future.

Currently, around 100 top Ask HN threads have been parsed for book data. I'll continue to parse older threads and new ones as they come up.


For corrections, friendly criticism, or a chat: me {at} matthewodette {dot} com