- Effect on culture
- Benefits for the leader / developer
Conference software I’ve used:
- GoToMeeting (compare to GoToWebinar?)
- AT&T Connect
- Google Hangouts
- The Flash-based one [TODO]
FindLectures.com is a discovery engine for talks, including historic speeches and interviews. The site organizes audio and video content by topic, speaker, and time period. I’ve included videos of every U.S. President from Theodore Roosevelt on, some hard-to-find speeches by post-colonial African leaders, talks by activists recorded in the 70s at UCLA, and interviews from libraries (via DPLA). This essay will discuss interesting lessons learned from crawling and organizing historical videos.
The earliest audio recordings date from the late 1800s [citation], and the earliest video from the early 1900s [citation] - the first recorded speech by a U.S. President is from Woodrow Wilson [citation]. Most videos from this time were silent, with still frames containing a caption - this text is significantly easier to extract than modern videos of a speaker with their slides, although the typeface is unusual.
[graph length vs. time]
If you survey popular YouTube channels, many of them have hundreds of short videos, so when I built FindLectures.com, I looked for longer talks that would be suitable for someone to watch on their lunch break (15-45 minutes). With older recordings this is hard to do, because early recording media such as wax cylinders and silent film didn’t lend themselves to long content.
As recording technologies improved, the content got much longer. It’s worth noting that online video didn’t take off until YouTube became the primary destination for consumption, which was in the mid-2000s, after YouTube launched in 2005 [citation].
Looking at U.S. Presidential archives, Obama was the first president to target YouTube as a primary video destination - the George W. Bush library claims to have tens of thousands of hours of VHS content that is not indexed. Presidential libraries outsource curation work to the National Archives, but the libraries provide funding - how many videos are eventually digitized depends on the popularity of the president, and how much they care about historical preservation. There are also numerous small museums with video collections that don’t have the resources to digitize them (e.g. the Ayn Rand Library or the Foxfire library).
In a few cases, major libraries digitize significant portions of their collections - the Digital Public Library of America is helping a lot in this area. When digitization is done by fans of the individual, as with presidential libraries, the results tend to include only videos that help the individual’s image (try finding videos about Bill Clinton’s extracurricular activities in his library, for instance).
Many important historical figures from the era of video are just not available online, because they are not well-known enough. Many countries that gained independence from colonial rule after WWII had important leaders who are rarely available online unless they are really popular. For example, Haile Selassie, the Ethiopian leader, is available online because of the popularity of Rastafarianism and his involvement in the League of Nations, but other English-speaking leaders are harder to find - to the best of my knowledge, the first leader of Ghana (Kwame Nkrumah) is unavailable.
For older videos, categorization can be tricky, because the modern video production styles didn’t exist.
Documentaries are a modern video format [insert definition]. Precursors to this video style were actualities and [insert] - actualities are simply footage of everyday life, more like the videos people record on camera phones and post to social media.
When I built the first version of FindLectures, I used the IBM Watson API to tag videos by topic, and to find countries referenced in a video. The most common topic tagging systems use one of two taxonomies: the IAB (ad industry) taxonomy or [insert news format], which is used for newspaper tagging. The ad industry taxonomy is the richest (five layers), but it has some categories which are really strange for text.
Libraries use several taxonomies for books which initially appear to be promising alternatives. The Dewey Decimal system is popular, but requires paid licensing, and the licensing is not priced to accommodate my use case. There is a book store tagging system which is promising [citation], and which is more “user friendly.” There is also a European book cataloging system, which is licensed freely, but lacks depth in technical areas (some companies sell add-ons for legal or medical taxonomies, for instance - I had to build my own for software). The Library of Congress system is fantastically rich, but in order to use it effectively, one would need a ton of text to train each category, or a way to map videos into a subset. Essentially, there is no way to train an AI against it without the text of hundreds of books for each category.
Tools which do entity recognition can often tag countries, but they assume a “modern” concept of countries, and a list that more or less matches what is on a world map today, although they may deal well with disputed territories. Prior to the modern diplomatic system, and during colonialism, these categories differed from today’s in definition, name, and boundaries.
Language identification is tricky, because closely related languages are hard to tell apart - I found the accuracy of language detection for Spanish, Portuguese, and French to be poor, for instance. Tools that do language identification are also biased towards European languages, and are not trained on native texts.
Presenting historical information is also a challenge in several ways. Titles have specific cultural meaning - it means a different thing to be a Baptist bishop than a Catholic one, and many countries have certifications that are official titles, which do not translate well. For this site, I’ve tended to remove titles, for egalitarian purposes, and to make it harder to discriminate for or against someone on this basis.
Years are also interesting - for presidential talks I’ve fixed the facets so they are chronological. This lets you learn things about the talks just by seeing the lists over time.
Amazon retains author identifiers, but these are managed and created by publishers, so dead presidents from earlier eras do not have their metadata managed particularly well, and you have to do it yourself.
Many libraries have tons of cool artifacts, which they’ve digitized into obscure formats, like RealPlayer, or custom applications that are a pain to scrape and have no metadata.
You get a lot more metadata from Wikipedia - rich bios about speakers’ lives and what they were involved in, measures of controversy and influence, a way to find people with similar details in their bios, and the ability to correct for what I learned in school.
to talk about: cultural sensitivity, accent folding, titles
- Data Structure
- Crawler Design
- Parsing HTML
- Using RabbitMQ
- Handling Sparse Data For Ranking
FindLectures.com is a discovery engine for tech talks, historic speeches, and academic lectures. The site rates audio and video content for quality, showing different recommended talks each day on a variety of topics.
FindLectures.com crawls conference sites to get talk metadata, such as speaker names and bios, descriptions, and the date a video was recorded. Often these attributes are sparsely populated, or available across multiple websites. Additional attributes are inferred from audio and video content, but require more sophisticated data extraction to be useful in a text-oriented search engine like Solr.
This essay will discuss interesting lessons learned from crawling historical videos, demonstrate information extraction with machine learning, and show how to map real world problems to search engine functionality.
In 2015, VentureBeat reported that WordPress powered 25% of the web 6 - the rise of content management applications means sites are typically structured in a formulaic way, which makes them easy to scrape.
Building a custom crawler is not the most efficient way to obtain data, but few alternatives exist, and obtaining data directly from websites is getting easier. Well-funded sites often offer APIs to deter scraping, but these tend to impose severe restrictions on what data you can obtain. Very few sites have APIs that are easy to use (notable exceptions are the DPLA 9 and Confreaks.tv 10).
In the future, more sites will incorporate structured data 7 into pages, which tags information within a page. Major search engines are the driver for this change - Google offers custom search result rendering for recipes and reviews, for instance, and seems to be adding additional search integrations as sites adopt the technology.
FindLectures.com crawls a site at a time, which limits the amount of data it can collect, but this allows obtaining high quality metadata and prevents the inclusion of spam videos. The crawler extracts information from different types of content: HTML, text, video, and audio. Additionally, talks are the central concept, so if a talk is mentioned by multiple sites, new metadata is merged as it is found.
Search engines maintain many attributes that affect ranking, but which are hidden from end users. Google claims to consider over 200 ranking factors, for instance. 5 Youtube ranks popularity heavily, which allows manifestos and conspiracy theories to dominate legitimate content. A key decision in FindLectures.com was to encourage quality content in search results, while also discouraging homogeneity of topics (i.e. ideological diversity for political videos, a range of difficulties for science videos)
Each video is stored in a single JSON file. All files are checked into a git repository.
This makes it easy to test a change by inspecting file diffs:
This also gives you a simple way to track changes that need to be published to your Solr index:
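As a rough sketch of that step: if each talk lives in its own JSON file, the output of `git diff --name-only` can be filtered down to the records that need to be re-sent to Solr. The function name and the sample paths here are made up for illustration, not taken from the FindLectures.com code.

```javascript
// Hypothetical helper: turn the output of `git diff --name-only <rev>` into
// the list of talk files that changed and need to be re-indexed.
function changedTalkFiles(gitDiffOutput) {
  return gitDiffOutput
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.endsWith(".json"));
}

// Made-up sample output from git.
const sampleDiff = "talks/a1/lecture-one.json\nREADME.md\ntalks/b2/lecture-two.json\n";
console.log(changedTalkFiles(sampleDiff));
```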
JSON files are easy to handle in most programming languages (except Scala Play, which does not follow conventional JSON formatting). Multi-language support is important to me - this allows using the best libraries for any language. There are many good machine learning tools for Python, but the search engine runs on the JVM.
In the future, Apache Arrow 11 may be a compelling alternative to JSON.
Some miscellaneous notes on data types:
- FindLectures.com uses Solr’s auto-field syntax - by using a special naming scheme, Solr knows the type of every attribute, so you don’t have to change Solr configs.
- A field name ending in “_s” is a string, and one ending in “_i” is an integer.
- A field name ending in “_ss” is an array of strings (e.g. speaker list, topic list).
- Fields can be hierarchical (e.g. collection, topic).
- Fields can be integers or floating point (length, year given)
- Many fields are sparsely populated.
- Fields can have uncertainty associated to them. When the search engine is populated, this is resolved in a step that computes a ranking boost.
- Attributes can correspond to features in the product (e.g. “can_play_inline”, “speaker_has_books_for_sale”). This is common - e-commerce sites include fields like gross margin in ranking 4.
- Descriptions, transcripts, and closed captions are run through the IBM Watson API, to obtain more attributes.
- Speakers typically write their own bios for conferences, so you get high quality biographical information (e.g. preferred gender pronouns).
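To make the notes above concrete, here is a small sketch of what a few records might look like, plus a report of how sparsely each field is populated. The field names follow the suffix scheme described above, but the records, values, and the `fieldCoverage` helper are invented for illustration.

```javascript
// Hypothetical talk records; names and values are made up.
const talks = [
  { id: "talk-1", title_s: "An Example Talk", speakerName_ss: ["Ada Example"], year_i: 1969, length_f: 1825.5 },
  { id: "talk-2", title_s: "Another Talk", topic_ss: ["Science/Physics"] },
  { id: "talk-3", title_s: "A Third Talk", year_i: 2014, speakerName_ss: [] },
];

// Fraction of records in which each field is present and non-empty.
function fieldCoverage(records) {
  const counts = {};
  for (const record of records) {
    for (const [field, value] of Object.entries(record)) {
      const empty = value == null || value === "" || (Array.isArray(value) && value.length === 0);
      if (!empty) counts[field] = (counts[field] || 0) + 1;
    }
  }
  const coverage = {};
  for (const [field, count] of Object.entries(counts)) {
    coverage[field] = count / records.length;
  }
  return coverage;
}

console.log(fieldCoverage(talks));
```

A report like this is a quick way to see which attributes are too sparse to use as facets or ranking inputs.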
Conceptually, building a crawler is simple:
- Load the site’s robots.txt file, parse it, and test future URLs against it
- Load a starting page, or a sitemap to obtain a list of URLs
- Parse any desired metadata from the page
- For each page, note any links, filtering out previously seen URLs
- Load remaining pages, in sequence
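The steps above can be sketched as a short loop. This is a minimal outline rather than the actual crawler: `fetchPage`, `extractLinks`, and `isAllowed` are hypothetical callbacks that the caller supplies (for example, wrappers around curl, an HTML parser, and a robots.txt check), and `maxPages` is an arbitrary safety limit.

```javascript
// A sketch of the crawl loop with the slow or site-specific parts injected.
async function crawl(startUrl, { fetchPage, extractLinks, isAllowed, maxPages = 100 }) {
  const seen = new Set([startUrl]);
  const queue = [startUrl];
  const pages = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift();
    if (!isAllowed(url)) continue;        // test each URL against robots.txt

    const html = await fetchPage(url);    // pages are loaded in sequence
    pages.push({ url, html });

    for (const link of extractLinks(html, url)) {
      if (!seen.has(link)) {              // filter out previously seen URLs
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return pages;
}

// Usage with a made-up three-page site instead of real HTTP requests.
const fakeSite = {
  "https://example.org/": ["https://example.org/a", "https://example.org/b"],
  "https://example.org/a": ["https://example.org/", "https://example.org/b"],
  "https://example.org/b": [],
};
crawl("https://example.org/", {
  fetchPage: async (url) => url,
  extractLinks: (html) => fakeSite[html] || [],
  isAllowed: (url) => !url.endsWith("/b"),
}).then((pages) => console.log(pages.map((page) => page.url)));
```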
Regardless of language, crawling involves hard problems: handling HTTP (especially HTTPS), parsing HTML, or working with unknown video/audio formats. Any of these “standards” could take years to handle in a robust way, so it’s important to choose strong libraries.
Currently, I’m using several fantastic and well-maintained command line tools, which are available for basically any OS: curl (for HTTP), sox (audio), ffmpeg (video decoding), youtube-dl (obtaining video titles, descriptions, subtitles).
To scale the process, I store the list of URLs to crawl in a RabbitMQ queue, which allows new processes to pull work from it. This also aids reliability - if a process crashes or runs out of RAM, it can simply be restarted.
The crawler is designed around the principle of progressive enhancement. Each content type runs at a different pace (crawling, video metadata from Youtube, analysis from Watson, audio, and video processing). Each successive stage is substantially slower, so they get their own scripts, allowing intermediate results to be uploaded and useful while later stages run. Additionally, metadata on the same video found by crawling multiple sites is merged.
To support the principle of progressive enhancement, the crawler implements an upsert routine, which detects many different forms of URL to the same video.
Upon completion, results are pushed to a Solr index. This is re-built from scratch periodically, for instance each time the ranking algorithm is changed.
Recently, a security firm uncovered a major botnet that automated fake page views on video ads to generate ad revenue. This botnet was written in Node.js, and used a lot of libraries that you might consider in a crawler 2 3
In this application, there are two types of URLs we care about: links to video or audio files, and links to other pages on the current site.
Video links can be iframes for embedded content, hrefs, or pure text (this is sometimes true in forums).
From there, we can apply a filter to establish whether the URL is one we want to follow or not. If a site has speaker pages, you may wish to treat this differently from video pages, for instance.
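One possible shape for that filter is sketched below. The list of video hosts and media extensions is an assumption made for the example, and `classifyLink` is a made-up name; a real site would likely need its own rules (for instance, a separate category for speaker pages).

```javascript
// Hosts and extensions treated as media; both lists are illustrative assumptions.
const VIDEO_HOSTS = new Set(["youtube.com", "youtu.be", "vimeo.com"]);
const MEDIA_EXTENSIONS = [".mp3", ".mp4", ".m4a", ".ogg", ".webm"];

// Classify a link found on `pageUrl` as "media", "page" (same site), or "ignore".
function classifyLink(href, pageUrl) {
  let url;
  try {
    url = new URL(href, pageUrl);          // resolves relative links
  } catch (err) {
    return "ignore";                       // not a parseable URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return "ignore";

  const host = url.hostname.replace(/^www\./, "");
  const path = url.pathname.toLowerCase();
  if (VIDEO_HOSTS.has(host) || MEDIA_EXTENSIONS.some((ext) => path.endsWith(ext))) {
    return "media";
  }
  return url.hostname === new URL(pageUrl).hostname ? "page" : "ignore";
}

console.log(classifyLink("/speakers/jane", "https://conf.example.org/talks"));
console.log(classifyLink("https://www.youtube.com/embed/abc_DEF-123", "https://conf.example.org/talks"));
```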
We need to do a second pass on the page to get metadata. To do this, we’ll want to apply our list of jQuery selectors. Since jQuery is a browser library, we’ll use the excellent cheerio library, which gives us a jQuery-style API in Node.
We can put all of this together, and build a full configuration object, which when applied, gives us every interesting piece of metadata about a video:
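A configuration object along these lines might look like the sketch below. The selectors, field names, and the `extractMetadata` helper are hypothetical. To keep the example free of dependencies, the selector engine is passed in as a `select` function; with cheerio, that function could be built from `cheerio.load(html)` as shown in the comment.

```javascript
// Hypothetical per-site configuration: a field name, a jQuery-style selector,
// and an optional clean-up function for the matched text.
const siteConfig = {
  title_s: { selector: "h1.talk-title" },
  speakerName_ss: { selector: ".speaker-name", many: true },
  year_i: {
    selector: ".talk-date",
    transform: (text) => {
      const match = text.match(/\d{4}/);
      return match ? parseInt(match[0], 10) : null;
    },
  },
};

// `select(selector)` must return an array with the text of every match.
// With cheerio this could be:
//   const $ = cheerio.load(html);
//   const select = (sel) => $(sel).map((i, el) => $(el).text().trim()).get();
function extractMetadata(config, select) {
  const result = {};
  for (const [field, rule] of Object.entries(config)) {
    const matches = select(rule.selector);
    if (matches.length === 0) continue;    // leave sparse fields out entirely
    const values = matches.map(rule.transform || ((text) => text));
    result[field] = rule.many ? values : values[0];
  }
  return result;
}

// Stand-in for a parsed page, so the example runs without an HTML parser.
const fakePage = {
  "h1.talk-title": ["On Examples"],
  ".speaker-name": ["Ada Example", "Bob Sample"],
  ".talk-date": ["Recorded May 3, 1971"],
};
console.log(extractMetadata(siteConfig, (sel) => fakePage[sel] || []));
```

Keeping the per-site rules as plain data makes it easy to add a new site without touching the crawler itself.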
Before we decide to go to the next page, we’ll want to check against robots.txt, to ensure the site owner is ok with us crawling.
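To show the idea, here is a deliberately simplified check that only understands `User-agent: *` groups and `Disallow` path prefixes. It is a stand-in written for this example; a library such as robots-parser (footnote 1) handles `Allow` rules, wildcards, and per-crawler groups, which this sketch ignores.

```javascript
// Collect the Disallow prefixes that apply to every crawler ("User-agent: *").
function parseDisallowRules(robotsTxt) {
  const rules = [];
  let appliesToUs = false;
  let inGroupHeader = false;   // true while reading consecutive User-agent lines
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim();   // strip comments
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    const key = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    if (key === "user-agent") {
      appliesToUs = inGroupHeader ? appliesToUs || value === "*" : value === "*";
      inGroupHeader = true;
    } else {
      inGroupHeader = false;
      if (key === "disallow" && appliesToUs && value !== "") rules.push(value);
    }
  }
  return rules;
}

// True when no Disallow prefix matches the URL's path.
function isAllowed(url, disallowRules) {
  const path = new URL(url).pathname;
  return !disallowRules.some((prefix) => path.startsWith(prefix));
}

const sampleRobots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp # scratch\n\nUser-agent: BadBot\nDisallow: /\n";
const sampleRules = parseDisallowRules(sampleRobots);
console.log(sampleRules, isAllowed("https://example.org/talks/1", sampleRules));
```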
To make the site useful with potentially sparse metadata, I model this process after progressive enhancement. If you can only find basic metadata about a video on a pass, like the title and length, it’s still more useful than YouTube, because YouTube doesn’t let you filter by length, year, or topic. I have a script that does an upsert, combining the output of a new data processing script with existing data.
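A minimal version of such an upsert is sketched below. The merge policy is an assumption chosen for the example (existing scalar values win, arrays are unioned), and the `store` is an in-memory Map standing in for the JSON files on disk.

```javascript
// Merge newly found metadata into an existing record without losing data:
// empty values are skipped, arrays are unioned, existing scalars are kept.
function mergeTalk(existing, incoming) {
  const merged = { ...existing };
  for (const [field, value] of Object.entries(incoming)) {
    if (value == null || value === "") continue;
    if (Array.isArray(value)) {
      const current = Array.isArray(merged[field]) ? merged[field] : [];
      merged[field] = [...new Set([...current, ...value])];
    } else if (merged[field] == null || merged[field] === "") {
      merged[field] = value;
    }
  }
  return merged;
}

// `store` is a Map keyed by a canonical video key (a hypothetical format).
function upsert(store, key, incoming) {
  store.set(key, mergeTalk(store.get(key) || {}, incoming));
  return store.get(key);
}

const store = new Map();
upsert(store, "youtube:abc_DEF-123", { title_s: "An Example Talk", length_f: 1800 });
upsert(store, "youtube:abc_DEF-123", { year_i: 1970, topic_ss: ["History"] });
console.log(store.get("youtube:abc_DEF-123"));
```

Because each pass only adds to a record, a slow stage (audio analysis, say) can run days after the first crawl and still land in the same file.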
URLs to a video can come in many forms:
“You should visit youtube.com/123…” (text comment on a forum)
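One way to treat these forms as the same video is to reduce each to a video ID. The sketch below covers a few common YouTube URL shapes only; the pattern and the `findYoutubeIds` name are assumptions for this example, and other hosts would need their own rules.

```javascript
// Matches watch?v=, /embed/, /v/ and youtu.be/ links, with or without a scheme.
const YOUTUBE_PATTERN =
  /(?:youtube\.com\/(?:watch\?(?:[^\s"'<>]*&)?v=|embed\/|v\/)|youtu\.be\/)([A-Za-z0-9_-]{11})/g;

// Return the distinct video IDs mentioned anywhere in a blob of text or HTML.
function findYoutubeIds(text) {
  const ids = new Set();
  for (const match of text.matchAll(YOUTUBE_PATTERN)) {
    ids.add(match[1]);
  }
  return [...ids];
}

// Made-up IDs in three different forms: iframe, short link, plain forum text.
const samplePost = [
  '<iframe src="https://www.youtube.com/embed/abc_DEF-123"></iframe>',
  "See https://youtu.be/abc_DEF-123?t=30 for the same talk.",
  "You should visit youtube.com/watch?feature=share&v=ZZZZZZZZZZZ",
].join("\n");
console.log(findYoutubeIds(samplePost));
```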
Subtitles can have duplicated content - one line for each highlighted word. This resembles the classic DNA re-assembly problem in bioinformatics - you need to find overlaps and get the minimum set of text.
Once we do this, we also need to store text with timings, so that we can show where in a video text is.
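A greedy version of this overlap search is sketched below, assuming each caption cue has already been parsed into a start time in seconds and its text (the cue shape and the `mergeCues` name are made up for the example). Each new cue is aligned against the end of the transcript so far, and only the words that are actually new are kept, along with the time they first appeared.

```javascript
// Each cue is { start: seconds, text: string }.
function mergeCues(cues) {
  const words = [];      // de-duplicated transcript, one entry per word
  const segments = [];   // { start, text } for each newly added run of words
  for (const cue of cues) {
    const cueWords = cue.text.split(/\s+/).filter(Boolean);
    // Largest k where the transcript's last k words equal the cue's first k words.
    let overlap = Math.min(words.length, cueWords.length);
    while (overlap > 0) {
      const tail = words.slice(words.length - overlap);
      if (tail.every((word, i) => word === cueWords[i])) break;
      overlap -= 1;
    }
    const fresh = cueWords.slice(overlap);
    if (fresh.length > 0) {
      words.push(...fresh);
      segments.push({ start: cue.start, text: fresh.join(" ") });
    }
  }
  return { transcript: words.join(" "), segments };
}

const sampleCues = [
  { start: 0, text: "hello and welcome" },
  { start: 1.5, text: "hello and welcome to the" },
  { start: 3, text: "to the talk today" },
];
console.log(mergeCues(sampleCues));
```

Being greedy, this can wrongly collapse a phrase that the speaker genuinely repeats across a cue boundary, so it is a starting point rather than a complete solution.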
If you build a crawler, chances are you’ll want to start more threads. Using RabbitMQ to store a work list is a great way to do this.
If you come from a non-JavaScript background, it’s really tempting to write code like this:
While a simple implementation, this has two potentially slow blocking calls (writeFileSync and JSON.stringify) - only the file writing can be converted to async.
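The difference can be sketched as follows. Both functions and the file name are illustrative. The second version hands the disk write to Node’s asynchronous file API, while `JSON.stringify` still runs on the main thread in both.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Blocking version: nothing else runs while the file is being written.
function saveTalkSync(file, talk) {
  fs.writeFileSync(file, JSON.stringify(talk, null, 2));
}

// Non-blocking write. JSON.stringify is still synchronous; only the disk
// I/O is handed off, so other work can proceed while the write completes.
async function saveTalk(file, talk) {
  const body = JSON.stringify(talk, null, 2);
  await fs.promises.writeFile(file, body);
}

const exampleFile = path.join(os.tmpdir(), "example-talk.json");
saveTalk(exampleFile, { id: "example-talk", title_s: "An Example Talk" })
  .then(() => console.log("wrote", exampleFile));
```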
You’ll recall from above that the crawler for each site is a series of lambdas or regular expressions, as well as static data:
Code serialized this way can be re-awoken using “eval”, which is safe because we wrote these lambdas ourselves. For the purposes of my own sanity, I’ve included GUIDs in the payloads sent to RabbitMQ, which you can also see in the code example:
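A sketch of this round trip is below. The site definition, the payload shape, and the `messageId` field name are all invented for the example; the queue itself is left out, so the message is just a string that would be published to RabbitMQ.

```javascript
const crypto = require("crypto");

// Hypothetical site definition: static data plus a lambda and a RegExp.
const site = {
  startUrl: "https://conference.example.org/talks",
  isTalkPage: (url) => /\/talks\/\d+$/.test(url),
  idPattern: /\/talks\/(\d+)$/,
};

// JSON cannot carry functions or regular expressions, so send their source text.
function serialize(config) {
  const payload = { messageId: crypto.randomUUID(), fields: {} };
  for (const [key, value] of Object.entries(config)) {
    const isCode = typeof value === "function" || value instanceof RegExp;
    payload.fields[key] = isCode
      ? { kind: "code", source: value.toString() }
      : { kind: "data", value };
  }
  return JSON.stringify(payload);
}

// Only safe for payloads we produced ourselves: eval runs arbitrary code.
function deserialize(message) {
  const payload = JSON.parse(message);
  const config = {};
  for (const [key, field] of Object.entries(payload.fields)) {
    config[key] = field.kind === "code" ? eval(`(${field.source})`) : field.value;
  }
  return { messageId: payload.messageId, config };
}

const received = deserialize(serialize(site));
console.log(received.messageId, received.config.isTalkPage("https://conference.example.org/talks/42"));
```

Note that a function revived this way loses any variables it closed over, so the lambdas need to be self-contained.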
Even though this system has no pre-defined schema, I found it valuable to write a script to do some basic validation. This includes a list of fields that can’t be blank, and verifying that type names match Solr definitions (i.e. a type named “speakerName_ss” should be an array of strings).
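A validation script of this kind could look roughly like the following. The list of required fields and the set of suffixes checked are assumptions for the sketch, not the site’s actual rules.

```javascript
const REQUIRED_FIELDS = ["id", "title_s"];   // hypothetical list

// Type checks implied by the field-name suffixes.
const SUFFIX_CHECKS = [
  ["_ss", (v) => Array.isArray(v) && v.every((item) => typeof item === "string")],
  ["_s", (v) => typeof v === "string"],
  ["_i", (v) => Number.isInteger(v)],
  ["_f", (v) => typeof v === "number" && Number.isFinite(v)],
  ["_b", (v) => typeof v === "boolean"],
];

// Returns a list of human-readable problems; an empty list means the record passed.
function validateTalk(talk) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (talk[field] == null || talk[field] === "") errors.push(`${field} is required`);
  }
  for (const [field, value] of Object.entries(talk)) {
    const check = SUFFIX_CHECKS.find(([suffix]) => field.endsWith(suffix));
    if (check && value != null && !check[1](value)) {
      errors.push(`${field} does not match its "${check[0]}" suffix`);
    }
  }
  return errors;
}

console.log(validateTalk({ id: "talk-1", speakerName_ss: "Ada Example", year_i: "1970" }));
```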
Coding in a functional style makes it easier to spot defects. Side-effecting code can change behavior far from where it acts. Seamless-immutable 12 is a great library for flushing out defects.
On the file system, you’ll need to watch how much content you download. If you put tens of thousands of files or folders in one location, querying the file system gets incredibly slow (on NTFS, at least). A simple fix is to hash some part of a file name into a hex code (e.g. A1BDE1), and create an intermediate layer of folders.
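For instance, the first two hex digits of a hash of the file name give up to 256 intermediate folders. The choice of MD5 and of two digits is arbitrary here, and `shardedPath` is a made-up helper name.

```javascript
const crypto = require("crypto");
const path = require("path");

// Spread files across up to 256 subfolders using the first two hex digits
// of a hash of the file name. The same name always maps to the same folder.
function shardedPath(root, fileName) {
  const digest = crypto.createHash("md5").update(fileName).digest("hex");
  return path.join(root, digest.slice(0, 2), fileName);
}

console.log(shardedPath("data", "an-example-talk.json"));
```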
For node, you almost always want to use graceful-fs, which does a backoff when encountering the dreaded “too many open files” error. This library can easily monkey-patch itself into Node:
If you try to build a crawler and run it on your home network, you will likely get rate-limited DNS queries (thanks FiOS!)
- [TODO] Make a report that shows how sparse attributes are
- [TODO] Model examples with Wolfram Alpha
- [TODO] Put data in Postgres, fiddle with adding stuff together
TODO - CTA for scraping weekly
TODO - Github link?
- 1. Robots.txt: https://www.npmjs.com/package/robots-parser
- 2. http://go.whiteops.com/rs/179-SQE-823/images/WO_Methbot_Operation_WP.pdf
- 3. Cheerio.js: https://github.com/cheeriojs/cheerio
- 4. Relevant Search [TODO amazon link]
- 5. http://backlinko.com/google-ranking-factors
- 6. https://venturebeat.com/2015/11/08/wordpress-now-powers-25-of-the-web/
- 7. https://developers.google.com/search/docs/guides/intro-structured-data?visit_id=1-636306694207203455-1887896684&hl=en&rd=1
A few months ago, I launched a discovery engine for lectures, called FindLectures.com. To keep quality high, I manually select individual presentations, speakers, and video collections for inclusion — now over 125,000 talks.
Many people struggle to finish Coursera courses, so I prioritize standalone talks. Even so, a search engine full of options can be overwhelming, so I offer an email list where I send the best talks I’ve seen. I was initially unsure if anyone would care, so I didn’t write any emails until a dozen people signed up.
As it turned out, once I launched the site over 500 people signed up. This isn’t a huge list, but it’s comparable in size to the email list of a typical church or small non-profit.
I periodically get wonderful responses like this:
“I just love this newsletter! I usually find at least one of the lectures very interesting and watch it later.
Thank you so much for this =)”
Most people sign up through a pop-up in the lower right hand corner. One person wrote a blog post reviewing FindLectures.com, including a note about how this annoyed him. Another asked me to add a form so he could send a link to his coworkers and students.
So far, 3% of people who visit the site request the emails — but a surprising number fail to confirm the subscription. I’m not sure why this is. I found that the meaning of “subscribe” varies by cultural background—one gentleman wrote in concerned that he had purchased something.
I thought I might run out of material to send, but it hasn’t happened yet!
Emails go out at 9:00 AM on a Monday, and “opens” spike around this time. I suspect a lot of people watch videos over their lunch break - you can see some evidence in the spike around noon:
Recommending that someone watch a half hour video is not a neutral act. The best videos seep into your mind and change how you think. It’s difficult to know if a talk will interest a global audience — depending on your perspective, a single lecture could be too technical, fluffy, or edgy. Many people also prefer the convenience of podcasts, especially for listening in the car or at jobs without wi-fi.
When people don’t get value from the emails, they eventually unsubscribe. Typically this is visible in their usage history, by either not opening emails or not clicking links.
The email list is set up like a class - every person starts from the beginning, so these emails are still useful if I’m ever unable to continue. A nice benefit of this is that if there is a spelling error in an email, it only goes out to a few people.
Occasionally someone unsubscribes and later returns, so for the time being they will resume where they left off.
Email software removes an address if the recipient hits ‘report spam’ or the address bounces. For a list of this size, I only need 2–3 new emails per week to keep growing, at the rate people are unsubscribing.
A few friends asked if I have plans to “monetize” this project. I did include Amazon affiliate links to speakers’ books - they get a few clicks, but on average each click is worth pennies. I imagine this would be more valuable if there were a call-to-action, or if I were selling an item directly.
The following screenshot shows the clicks for an individual email:
Notice that I copied a marketing tactic from Cooper Press - URLs to talks include UTM parameters, which cause click-throughs to display in Google Analytics for the target website as an advertising campaign.
If you read along this far, you might be interested to know where all these people came from — there was an initial spike from The Next Web and Life Hacker, followed by smaller traffic spikes from some smaller blogs and large Facebook pages.
Below, I’ve taken screenshots of each month:
This is still a fun project, so I’m going to keep writing the emails.
I moved the list to AWeber. This has allowed me to seamlessly add a Typeform survey on signup. Several people already filled out the survey, and I learned that there is huge demand for curated software development talks.
AWeber is big on ‘personalization’ of emails, so I’m looking forward to experimenting with this aspect of the product. I’m considering adding a personalized ‘track’ to the emails, which would give me more opportunities to promote speakers directly.
“Lunch and Learn” programs are a great way for a team to develop common knowledge by watching presentations together. This can be helpful for research or training - software development teams often start lunch and learn programs to learn about the latest tools in the field. Lunch and learn programs also change the culture of a team - ideas that come from an outside party tend to have inherent credibility.
Here are some tips for ensuring your lunch and learn is a success:
Consistency is important - having the same time and location each week helps people get in the habit of attending.
If you are choosing videos, you’ll get better attendance if you cater topics to the audience. Highly technical content is suitable for a very focused team, whereas general interest content is suitable for building relationships in a heterogeneous team.
Often, the best material from a lunch and learn is the discussion at the end of a presentation. To facilitate this, it’s important to leave time.
If you’re watching a video, make sure to leave time to set up! This can be more time consuming than you’d expect.
If you don’t have lunch places nearby, have a team member get a group order. It takes extra work to coordinate, but food is usually more interesting than any video.
Use the lunch and learn program to practice talks for conferences.
Invite sales team members to demonstrate what they show to clients - this is a great way to get a different perspective on your business.
If you choose videos, preview them in advance to check for problems with the audio or speaker. Videos that discuss sensitive issues can be “not safe for work” in a mixed audience, even if the content is otherwise of high quality.
Use a shared spreadsheet to vote on lunch places or video topics
Encourage team members to develop their public speaking skills. If you have summer interns, have them work on a project and demo their work at the end of the term.
FindLectures.com started as a list of videos from our lunch and learn.
If you’re looking for a place to start, here are ten places with great collections of videos, that aren’t just TED talks:
- Gresham College - This UK university has offered public lectures for over 400 years. Many of these are fascinating historical topics.
- Startup School - This educational series by Y Combinator covers aspects of starting a tech business.
- Hacker News - Also a Y Combinator production, searching for interviews and conference talks turns up great crowd-sourced suggestions.
- InfoQ puts on sizeable tech conferences every year - they have a ton of content.
- Lanyrd - this is a social media site for conference speakers / organizers - lots of great material buried in the site.
- Confreaks.tv is a video recording company that does a lot of tech-oriented conferences.
- Google has a great lunch and learn program - why not use their videos?
- University of California TV - This is a public broadcasting channel with tons of great videos.
- UCLA Archives - UCLA maintains a fantastic collection of speeches by cultural icons from the 1960s and 1970s. I’ve found that videos by famous musicians are especially popular for company lunch and learns.
- The Chicago Humanities Festival brings in cultural icons to speak about a range of contemporary topics.
FindLectures.com is a curated list of collections of video and audio selections. Entries in a collection go through automated checks, which eliminate talks or lower their rank.
Now that the collection has grown, more people have been asking what “curated” means. When I started FindLectures.com, I used videos selected for Wingspan’s lunch and learn program. Most came from larger collections - talks at specific conferences, university lecture series, book authors, or lectures by renowned speakers. For this site, I’ve expanded the last category to include anyone of historical significance, as these are often quite interesting, and this technique saves me from having to distinguish between truth and alternative facts.
While not all conferences are organized the same, they tend to include speakers that are a draw for attendees, or that at least allow the conference organizers to enter the social networks of respected people in their chosen field.
For universities, good commencement speakers get the institution in the news, and lecture series are a way to showcase the research of staff. In some cases, a generous alumnus will set aside money to pay for a lecture series. From what I have seen, the “vetting” of speakers for these tends to be based on seniority, but this often allows an experienced professor to explore a topic of interest.
I typically use the presence of a Wikipedia page as a crude indication of influence. Having written a book for a recognized publisher is also a positive signal, although some publishers clearly do much better vetting of their authors.
There are two common types of presentation that don’t map well to this method of curation: YouTube stars, and people paid to hold a specific opinion (e.g. lobbyists, some activist organizations, politicians). Both types tend to be prolific, which risks starving out better content, so when these are included, they get significant ranking penalties.
Length tends to be a good quality filter - most university YouTube channels include dozens of short interviews with students, so by filtering to talks longer than 5-10 minutes, you get a significantly better experience. FindLectures.com is often used by people on their lunch break, so talks in the 20-40 minute range get a ranking boost. Similarly, for historic videos, newer videos are not ranked as highly.
Social media sites can be good sources of recommended videos, but come with some issues. Reddit tends to have a lot of bootlegged videos, for instance. If a video is recommended by multiple sources (e.g. a Hacker News link to a tech conference video), it gets a ranking boost.
Finally, videos may be removed entirely if the audio is unintelligible (Youtube does not do this), or ranked lower for significant issues (clipping, lots of ums, and so on).
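To illustrate how rules like these can combine into a single number, here is a sketch of a boost calculation. Every weight and field name below is a made-up placeholder, not a value used by FindLectures.com, and treating short talks as a penalty rather than a hard filter is a simplification for the example.

```javascript
// Combine the curation signals into one multiplier; 0 means "remove entirely".
function rankingBoost(talk) {
  if (talk.audioUnintelligible) return 0;
  let boost = 1.0;
  const minutes = talk.lengthSeconds / 60;
  if (minutes >= 20 && minutes <= 40) boost *= 1.5;            // lunch-break length
  if (minutes < 10) boost *= 0.5;                              // short clips
  if ((talk.recommendedBy || []).length > 1) boost *= 1.25;    // multiple sources
  if (talk.prolificSource) boost *= 0.5;                       // e.g. YouTube stars, lobbyists
  boost *= Math.pow(0.9, talk.audioIssues || 0);               // clipping, lots of ums
  return boost;
}

console.log(rankingBoost({ lengthSeconds: 1800, recommendedBy: ["hacker-news", "conference-site"] }));
```

Multiplying the factors keeps each rule independent, so a new signal can be added without rebalancing the others.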
Like software architecture, there is some structure around this - this is like debugging.
Drip emails set up like a course.
Google analytics - tracks usage, search terms, facet usage
- HN Comments (lot of individual feedback on missing talks)
- Reddit posts in /r/startups (being found by writers)
- dev.to (90k twitter list)
- Something that probably will work is Cooper Press (weekly dev emails - they do a lot from reddit posts)
This shows changes after adding ‘cards’ about speakers:
- The feature seems really instructive, to me. Getting data was not difficult, as most of the people who are famous enough to end up here are the first results in book searches already.
- Data comes from Open Library. There is an opportunity to use Wikidata to get some additional trivia, but that will require learning their arcane query language.
- These could also be useful as topic cards (books on Python, etc).
- These changes are to make the site look better, so that it’s more credible.
- Semantic UI is mostly compatible with Bootstrap, so I was able to get this to work piecemeal
- Semantic UI has a lot more components than Bootstrap - I chose it because it had cards. A lot of the “Bootstrap alternatives” or themes have obvious rendering defects on their sites.
- The footer looks much more professional than before.
- Cards are more complex than I expected - you have to establish rules for when to show them (e.g. I filtered to a speaker, all the talks from a search result are by one speaker, I searched for a speaker by name)
- I set it up so that hovering over a speaker name brings up the card, like Rapportive. There is a bit of code to ignore the case where your mouse moves fast across the screen, catching the hover in the process.
- I tried to ‘sticky’ the cards. I had to do a custom implementation - I think you need to convert the whole site for this Semantic UI feature to work. It also would otherwise require jQuery. Right now there are some issues, e.g. if you get a big card and scroll to the bottom, it can overlap the footer.
- Introducing Semantic UI made the search results look better, but in some cases the rendering is glitchy, especially on a Mac. The play buttons in headers don’t wrap correctly, and the vertical alignment is wrong for some of the metadata (speaker names).
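The hover behavior mentioned in the list above (ignoring a mouse that only passes over a speaker name) can be sketched with a short delay. The `hoverIntent` name and the 150 ms default are assumptions; in the browser, `enter` and `leave` would be wired to the `mouseenter` and `mouseleave` events of each speaker name.

```javascript
// Returns handlers that call `show` only if the pointer stays for `delayMs`.
function hoverIntent(show, delayMs = 150) {
  let timer = null;
  return {
    enter(target) {
      clearTimeout(timer);
      timer = setTimeout(() => show(target), delayMs);
    },
    leave() {
      clearTimeout(timer);   // pointer left before the delay elapsed
      timer = null;
    },
  };
}

// A quick pass over a name: enter and leave immediately, so nothing is shown.
const cardHover = hoverIntent((name) => console.log("show card for", name));
cardHover.enter("Ada Example");
cardHover.leave();
```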
This shows the search suggestion box after adding Semantic UI:
- This is based on the top prior searches - a fixed list of around 400.
- It pulls suggestions doing ‘starts with’.
- There is an opportunity to use full text search (e.g. lunr) to make that better.
- The search box is from Semantic UI, and I think it looks much more professional than the original Bootstrap one.
- People have some trouble finding download buttons.
- Some people have suggested it could be “slicker”
- I think the facet changing is distracting, because it refreshes too fast.
- The inline help doesn’t render well.
- No one uses the inline player (Reddit has this, but their icons are larger)
- A couple people have suggested “tabs” for categories
- Maybe show some options for things you can search for (to help with discovery)
- I’d like a version where the search box is tied to the top of the screen
- This doesn’t show what matched your query, or why you should watch a talk
- Show talks from a variety of topics (novel sorting technique)
- Show talks from speakers of a different background
- Include historical talks as a category, as well as personal interviews
- Topics assigned by a machine (lets you browse stuff you didn’t know was there)
- Year filtering
- Speaker name filtering