- Data Structure
- Crawler Design
- Parsing HTML
- Using RabbitMQ
- Handling Sparse Data For Ranking
FindLectures.com is a discovery engine for tech talks, historic speeches, and academic lectures. The site rates audio and video content for quality, showing different recommended talks each day on a variety of topics.
FindLectures.com crawls conference sites to get talk metadata, such as speaker names and bios, descriptions, and the date a video was recorded. Often these attributes are sparsely populated, or spread across multiple websites. Additional attributes are inferred from audio and video content, but require more sophisticated data extraction to be useful in a text-oriented search engine like Solr.
This essay will discuss interesting lessons learned from crawling historical videos, demonstrate information extraction with machine learning, and show how to map real world problems to search engine functionality.
In 2015, VentureBeat announced that WordPress powered 25% of the web 6. The rise of content management applications means sites are typically structured in a formulaic way, which makes them easy to scrape.
Building a custom crawler is not the most efficient way to obtain data, but few alternatives exist, and obtaining data directly from websites is getting easier. Well-funded sites often offer APIs to deter scraping, but these tend to impose severe restrictions on what data you can obtain. Very few sites have APIs that are easy to use (notable exceptions are the DPLA 9 and Confreaks.tv 10).
In the future, more sites will incorporate structured data 7, which tags information within a page. Major search engines are the driver for this change - Google offers custom search result rendering for recipes and reviews, for instance, and seems to be adding additional search integrations as sites adopt the technology.
FindLectures.com crawls a site at a time, which limits the amount of data it can collect, but this allows obtaining high quality metadata and prevents the inclusion of spam videos. The crawler extracts information from different types of content: HTML, text, video, and audio. Additionally, talks are the central concept, so if a talk is mentioned by multiple sites, new metadata is merged as it is found.
Search engines maintain many attributes that affect ranking, but which are hidden from end users. Google claims to consider over 200 ranking factors, for instance. 5 YouTube weights popularity heavily, which allows manifestos and conspiracy theories to dominate legitimate content. A key decision in FindLectures.com was to encourage quality content in search results, while also discouraging homogeneity of topics (i.e. ideological diversity for political videos, a range of difficulties for science videos).
Each video is stored in a single JSON file. All files are checked into a git repository.
This makes it easy to test a change by inspecting file diffs:
This also gives you a simple way to track changes that need to be published to your Solr index:
JSON files are easy to handle in most programming languages (except Scala Play, which does not follow conventional JSON formatting). Multi-language support is important to me - this allows using the best libraries for any language. There are many good machine learning tools for Python, but the search engine runs on the JVM.
In the future, Apache Arrow 11 may be a compelling alternative to JSON.
Some miscellaneous notes on data types:
- FindLectures.com uses Solr’s auto-field syntax - by using a special naming scheme, Solr knows the type of every attribute, so you don’t have to change Solr configs.
- A field ending in “_s” is a string; “_i” is an integer.
- A field ending in “_ss” is an array (e.g. speaker list, topic list).
- Fields can be hierarchical (e.g. collection, topic).
- Fields can be integers or floating point (length, year given).
- Many fields are sparsely populated.
- Fields can have uncertainty associated with them. When the search engine is populated, this is resolved in a step that computes a ranking boost.
- Attributes can correspond to features in the product (e.g. “can_play_inline”, “speaker_has_books_for_sale”). This is common - e-commerce sites include fields like gross margin in ranking 4.
- Descriptions, transcripts, and closed captions are run through the IBM Watson API, to obtain more attributes.
- Speakers typically write their own bios for conferences, so you get high quality biographical information (e.g. preferred gender pronouns).
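As a sketch, a talk document using this naming scheme might look like the following. The field names and values are illustrative, not taken from the real index; the suffixes beyond those described above (“_f” for float, “_b” for boolean) are standard Solr dynamic fields.

```javascript
// A hypothetical talk document: the suffix of each field name tells Solr
// its type, so no schema changes are needed when new attributes appear.
const talk = {
  id: "example-talk-1",                   // hypothetical id
  title_s: "A Sample Talk",               // _s  -> string
  speaker_ss: ["Jane Doe"],               // _ss -> array of strings
  topic_ss: ["History", "History/Rome"],  // hierarchical topics as paths
  length_i: 3600,                         // _i  -> integer (seconds)
  quality_boost_f: 0.85,                  // _f  -> float, computed ranking boost
  can_play_inline_b: true,                // _b  -> boolean product feature
};

// A small helper that infers the expected JS type from the suffix.
function expectedType(field) {
  if (field.endsWith("_ss")) return "array";
  if (field.endsWith("_s")) return "string";
  if (field.endsWith("_i")) return "integer";
  if (field.endsWith("_f")) return "float";
  if (field.endsWith("_b")) return "boolean";
  return "unknown";
}
```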
Conceptually, building a crawler is simple:
- Load the site’s robots.txt file, parse it, and test future URLs against it
- Load a starting page, or a sitemap to obtain a list of URLs
- Parse any desired metadata from the page
- For each URL on the page, note any links, filtering out previously seen URLs
- Load remaining pages, in sequence
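The steps above can be sketched as a single-threaded loop. Here `fetchPage` and `isAllowedByRobots` are stand-ins for real HTTP fetching and robots.txt handling:

```javascript
// A minimal sketch of the crawl loop: track seen URLs, fetch allowed
// pages, record metadata, and enqueue newly discovered links.
function crawl(startUrl, fetchPage, isAllowedByRobots) {
  const seen = new Set();
  const queue = [startUrl];
  const pages = [];
  while (queue.length > 0) {
    const url = queue.shift();
    if (seen.has(url) || !isAllowedByRobots(url)) continue;
    seen.add(url);
    const page = fetchPage(url);
    if (!page) continue;
    pages.push({ url, metadata: page.metadata }); // parse step
    for (const link of page.links) {              // note new links
      if (!seen.has(link)) queue.push(link);
    }
  }
  return pages;
}

// Usage against a tiny in-memory "site":
const site = {
  "/": { metadata: { title: "Home" }, links: ["/talk1", "/private"] },
  "/talk1": { metadata: { title: "Talk 1" }, links: ["/"] },
};
const results = crawl("/", url => site[url], url => url !== "/private");
```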
Regardless of language, crawling involves hard problems: handling HTTP (especially HTTPS), parsing HTML, and working with unknown video/audio formats. Any of these “standards” could take years to handle in a robust way, so it’s important to choose strong libraries.
Currently, I’m using several fantastic and well-maintained command line tools, which are available for basically any OS: curl (HTTP), sox (audio), ffmpeg (video decoding), and youtube-dl (obtaining video titles, descriptions, and subtitles).
To scale the process, I store the list of URLs to crawl in RabbitMQ, which allows new processes to pull work from the queue. This also aids reliability - if a process crashes or runs out of RAM, it can simply be restarted.
The crawler is designed around the principle of progressive enhancement. Each content type runs at a different pace (crawling, video metadata from YouTube, analysis from Watson, audio and video processing). Each successive stage is substantially slower, so each gets its own script, allowing intermediate results to be uploaded and useful while later stages run. Additionally, metadata on the same video found by crawling multiple sites is merged.
To support the principle of progressive enhancement, the crawler implements an upsert routine, which detects many different forms of URL for the same video.
Upon completion, results are pushed to a Solr index. This is re-built from scratch periodically, for instance each time the ranking algorithm is changed.
Recently, a security firm uncovered a major botnet that automated fake page views on video ads to generate ad revenue. This botnet was written in Node.js, and used many of the libraries you might consider in a crawler. 2 3
In this application, there are two types of URLs we care about: links to video or audio files, and links to other pages on the current site.
Video links can be iframes for embedded content, hrefs, or pure text (this is sometimes true in forums).
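A simplified sketch of pulling those three link forms out of raw HTML using regular expressions (the real crawler uses a proper HTML parser; the video-host filter at the end is illustrative):

```javascript
// Extract candidate video URLs from embedded iframes, anchor hrefs,
// and bare URLs in text, then de-duplicate and keep likely video links.
function extractVideoUrls(html) {
  const urls = [];
  // 1. Embedded players: <iframe src="...">
  for (const m of html.matchAll(/<iframe[^>]+src="([^"]+)"/gi)) urls.push(m[1]);
  // 2. Plain links: <a href="...">
  for (const m of html.matchAll(/<a[^>]+href="([^"]+)"/gi)) urls.push(m[1]);
  // 3. Bare URLs in text (common in forum posts)
  for (const m of html.matchAll(/https?:\/\/[^\s"<]+/g)) urls.push(m[0]);
  return [...new Set(urls)].filter(u => /youtube\.com|vimeo\.com|\.mp4/i.test(u));
}

const sampleHtml =
  '<iframe src="https://www.youtube.com/embed/abc"></iframe>' +
  '<a href="https://example.com/about">About</a>' +
  'Watch https://vimeo.com/123';
```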
From there, we can apply a filter to establish whether the URL is one we want to follow or not. If a site has speaker pages, you may wish to treat this differently from video pages, for instance.
We need to do a second pass on the page to get metadata. To do this, we’ll want to apply our list of jQuery selectors. Since jQuery is a browser library, we’ll use the excellent cheerio library, which gives us a jQuery-style API outside the browser.
We can put all of this together, and build a full configuration object, which when applied, gives us every interesting piece of metadata about a video:
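A sketch of what such a configuration might look like: each Solr field name maps to an extractor. In the real crawler these are cheerio/jQuery selectors; plain functions over the HTML string stand in for them here, and the field names are illustrative.

```javascript
// Per-site configuration: one extractor per metadata field.
const siteConfig = {
  title_s: html => (html.match(/<h1[^>]*>([^<]+)<\/h1>/) || [])[1],
  speaker_ss: html =>
    [...html.matchAll(/class="speaker">([^<]+)</g)].map(m => m[1]),
  year_i: html => {
    const m = html.match(/Recorded in (\d{4})/);
    return m ? Number(m[1]) : undefined;
  },
};

// Applying the configuration yields one metadata object per page.
function applyConfig(config, html) {
  const doc = {};
  for (const [field, extract] of Object.entries(config)) {
    const value = extract(html);
    if (value !== undefined) doc[field] = value;
  }
  return doc;
}

const page = applyConfig(
  siteConfig,
  '<h1>Sample Talk</h1><span class="speaker">Jane Doe</span> Recorded in 1999'
);
```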
Before we decide to go to the next page, we’ll want to check against robots.txt, to ensure the site owner is ok with us crawling.
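A deliberately simplified version of that check (prefix matching on Disallow rules for all user agents); a production crawler should use a full parser such as the robots-parser package 1:

```javascript
// Parse Disallow rules under "User-agent: *" and test a path against them.
// This ignores Allow rules, wildcards, and per-bot sections.
function isAllowed(robotsTxt, path) {
  const disallowed = [];
  let applies = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim();
    const m = line.match(/^(user-agent|disallow):\s*(.*)$/i);
    if (!m) continue;
    if (m[1].toLowerCase() === "user-agent") applies = m[2].trim() === "*";
    else if (applies && m[2]) disallowed.push(m[2].trim());
  }
  return !disallowed.some(prefix => path.startsWith(prefix));
}
```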
To make the site useful with potentially sparse metadata, I model this process after progressive enhancement. If a pass only finds basic metadata about a video, like the title and length, that is still more useful than YouTube, which doesn’t let you filter by length, year, or topic. A script performs an upsert, combining the output of each new data processing script with existing data.
URLs to a video can come in many forms:
“You should visit youtube.com/123…” (text comment on a forum)
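One way to detect that different URL forms refer to the same video is to reduce each form to a canonical video id. The patterns below cover common YouTube shapes; other hosts would need their own rules:

```javascript
// Reduce watch?v=, youtu.be/, and embed/ forms (including URLs buried
// in forum text) to a single YouTube video id, or null if none is found.
function youtubeId(text) {
  const patterns = [
    /youtube\.com\/watch\?(?:.*&)?v=([\w-]{6,})/,
    /youtu\.be\/([\w-]{6,})/,
    /youtube\.com\/embed\/([\w-]{6,})/,
  ];
  for (const p of patterns) {
    const m = text.match(p);
    if (m) return m[1];
  }
  return null;
}
```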
Subtitles can have duplicated content - one line for each highlighted word. This resembles the classic DNA re-assembly problem in bioinformatics - we need to find the overlaps and extract the minimum set of text.
Once we do this, we also need to store the text with timings, so that we can show where in a video each passage occurs.
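A sketch of the overlap-merging step, joining consecutive caption lines on their longest word-level overlap (for clarity, this version drops the per-line timings that the real pipeline keeps):

```javascript
// Collapse overlapping caption lines into minimal text by finding, for
// each line, the longest suffix of the accumulated words that matches
// a prefix of the next line.
function mergeCaptions(lines) {
  const words = lines.length ? lines[0].split(/\s+/) : [];
  for (let i = 1; i < lines.length; i++) {
    const next = lines[i].split(/\s+/);
    let overlap = 0;
    for (let k = Math.min(words.length, next.length); k > 0; k--) {
      if (words.slice(-k).join(" ") === next.slice(0, k).join(" ")) {
        overlap = k;
        break;
      }
    }
    words.push(...next.slice(overlap));
  }
  return words.join(" ");
}
```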
If you build a crawler, chances are you’ll want to start more threads. Using RabbitMQ to store a work list is a great way to do this.
If you come from a non-JavaScript background, it’s really tempting to write code like this:
While simple, this implementation has two potentially slow blocking calls (writeFileSync and JSON.stringify) - only the file writing can be converted to async.
You’ll recall from above that the crawler for each site is a series of lambdas or regular expressions, as well as static data:
Code serialized this way can be re-awoken using “eval”, which is safe because we wrote these lambdas ourselves. For the purposes of my own sanity, I’ve included GUIDs in the payloads sent to RabbitMQ, which you can also see in the code example:
Even though this system has no pre-defined schema, I found it valuable to write a script to do some basic validation. This includes a list of fields that can’t be blank, and verifying that type names match Solr definitions (i.e. a field named “speakerName_ss” should be an array of strings).
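A sketch of that validation pass (the required-field list is illustrative): required fields must be non-blank, and each field’s JavaScript type must match what its Solr suffix promises:

```javascript
// Validate one talk document against the suffix conventions.
function validateTalk(talk, required = ["title_s", "url_s"]) {
  const errors = [];
  for (const field of required) {
    if (!talk[field]) errors.push(`${field} is blank`);
  }
  for (const [field, value] of Object.entries(talk)) {
    if (field.endsWith("_ss") &&
        !(Array.isArray(value) && value.every(v => typeof v === "string"))) {
      errors.push(`${field} should be an array of strings`);
    } else if (field.endsWith("_i") && !Number.isInteger(value)) {
      errors.push(`${field} should be an integer`);
    }
  }
  return errors;
}
```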
Coding in a functional style makes it easier to spot defects. Side-effecting code can change behavior far from where it acts. Seamless-immutable 12 is a great library for flushing out defects.
On the file system, you’ll need to watch how much content you download. If you put tens of thousands of files or folders in one location, querying the file system gets incredibly slow (on NTFS, at least). A simple fix is to hash part of a file name into a hex code (e.g. A1BDE1), and create an intermediate layer of folders.
For Node, you almost always want to use graceful-fs, which backs off when encountering the dreaded “too many open files” error. This library can easily monkey-patch itself into Node’s built-in fs module.
If you build a crawler and run it on your home network, your DNS queries will likely get rate-limited (thanks, FiOS!).
- 1. robots-parser: https://www.npmjs.com/package/robots-parser ↩
- 2. http://go.whiteops.com/rs/179-SQE-823/images/WO_Methbot_Operation_WP.pdf ↩
- 3. Cheerio.js: https://github.com/cheeriojs/cheerio ↩
- 4. Relevant Search [TODO amazon link] ↩
- 5. http://backlinko.com/google-ranking-factors ↩
- 6. https://venturebeat.com/2015/11/08/wordpress-now-powers-25-of-the-web/ ↩
- 7. https://developers.google.com/search/docs/guides/intro-structured-data ↩