- TypeScript + React + React-Bootstrap
- Node (TypeScript)
JSON stored in Git.
The size of the Solr index is approximately 2GB per 100k videos.
- Speaker Bio
- Speaker Names
- Speaker Amazon IDs
- Speaker Wikipedia URLs
- IAB Categories (actual taxonomy extended a bit)
- Collection name
- Closed captions
- Auto-generated Transcript
- Audio Quality Score
- Text Quality Score
- Metadata Quality Score
- Year Given
- Talk Type
- Features (Closed Captions, etc)
- Link to video / audio player
- Small files
- Easy to data correct
- Any language can access this trivially (except Scala Play, which uses an awkward JSON format)
- Diffs work great for testing [TODO: get image]
Types of attributes:
- Datatype (string, integer, date)
- Array valued vs non-array
More things are hierarchies and arrays than you’d think -
e.g. the ‘collection’ a video is from. A video might come
from two places (Hacker News, some agency website). It could
be a hierarchy if you pull from Reddit and mark the subreddit
as something you can drill into.
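As a sketch, one of these Git-stored JSON records could be typed like this in TypeScript (the field names are my guesses, not the real schema):

```typescript
// Hypothetical record shape; field names are illustrative, not the real schema.
interface VideoRecord {
  id: string;
  title: string;
  speakers: string[];   // array-valued: a talk can have several speakers
  yearGiven?: number;   // scalar, optional until a later pass fills it in
  // Array-valued AND hierarchical: a video found on both Hacker News and an
  // agency site keeps both paths, and a Reddit path carries the subreddit
  // as a level you can drill into.
  collections: string[][];
}

const video: VideoRecord = {
  id: "yt-abc123",
  title: "Example Talk",
  speakers: ["Jane Doe"],
  collections: [["hackernews"], ["reddit", "r/programming"]],
};

console.log(JSON.stringify(video, null, 2)); // this is what gets committed to Git
```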
Designed around Progressive enhancement:
This allows using the “best” tool for the job:
Node, Python, Scala, Bash scripts
Progressive enhancement occurs by timescale.
Short - scraping a site. Typically 1-500 pages.
E.g., a speaker agency might have 20 pages, one per speaker.
Some conferences put everything on one page.
At this stage, you might get:
- title
- length
- speaker bio (typically written by the speaker)
- transcripts (sometimes)
- talk description
This means getting basic info from YouTube, Vimeo, SoundCloud, etc.
This might get you:
- title (can tell year video made)
- video description (can tell you year video made)
- closed captions
- category (YouTube uses the ad-industry IAB taxonomy, as does Watson)
These are scripts that get run once in a while.
- Checking for 404’ed pages
- Checking for “ums” in texts
- Adding a Flesch-Kincaid difficulty score
- Batch uploads to Watson for category tagging
- Batch uploads to a speech-recognition API
Index-time changes
- Quality scoring
- Merge captions into a “transcript”
- Normalize how frequently people / topics show up so you get breadth
Processing-heavy data
Download videos, audio, run these files through various processing techniques.
This is very dependent on “use the best tool for the job”, so language independence
is most important at this stage.
Attributes obtained at this stage:
- Audio quality (clipping, SNR, channel imbalance)
- Speech to text
- Facial recognition?
First version was Python + BeautifulSoup (lots of mini scrapers).
Migrated to Node.js for the “glue” code of ETL.
The good parts:
Use the “best tools for the job”, i.e. use child_process.spawn to call curl for downloads. Some tools
write to stderr when they should write to stdout, so you have to fiddle with this to capture the right stream.
If a step takes a long time, key everything by the video ID and write the processed output of each source video to its own file.
Node’s async behavior lets you parallelize file system operations but you can get ‘too many open files’
errors. I used graceful-fs to fix this.
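A minimal sketch of the spawn pattern (demonstrated with node itself so it runs anywhere; in the real pipeline the command would be curl, youtube-dl, ffmpeg, etc.):

```typescript
import { spawnSync } from "node:child_process";

// "Best tool for the job": shell out instead of binding. Some tools write
// data to stderr and progress to stdout (or vice versa), so capture both
// streams and pick the one you actually need.
function run(cmd: string, args: string[]): { out: string; err: string } {
  const res = spawnSync(cmd, args, { encoding: "utf8" });
  if (res.status !== 0) {
    throw new Error(`${cmd} exited with ${res.status}: ${res.stderr}`);
  }
  return { out: res.stdout, err: res.stderr };
}

const { out } = run(process.execPath, ["--version"]);
console.log(out.trim()); // e.g. "v20.11.0"
```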
Curl gives you text. Run “unescape” on this, then use the “url-regex” library from npm to get ALL URLs.
This avoids differences between URLs in hrefs, URLs written in plain text (e.g. on a forum), and iframes.
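Roughly like this, with a simplified hand-rolled regex standing in for the url-regex package:

```typescript
// Simplified stand-in for the "url-regex" npm package: pull every URL out of
// raw page text, whether it sits in an href, an iframe src, or plain prose.
const URL_RE = /https?:\/\/[^\s"'<>)\]]+/g;

function extractUrls(text: string): string[] {
  // crude "unescape" first, so HTML- and JS-escaped URLs match too
  const unescaped = text.replace(/&amp;/g, "&").replace(/\\u0026/g, "&");
  return unescaped.match(URL_RE) ?? [];
}

const page = `<a href="https://example.com/talk?id=1&amp;hd=1">talk</a>
watch this: https://youtu.be/abc123`;
console.log(extractUrls(page));
// → [ 'https://example.com/talk?id=1&hd=1', 'https://youtu.be/abc123' ]
```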
YouTube has a ton of different URL formats. These get normalized later (we just want the video ID).
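A sketch of that later normalization (the two patterns here are illustrative; the real list of YouTube URL shapes is longer):

```typescript
// Two illustrative patterns; real YouTube URLs come in more shapes
// (watch?v=, youtu.be/, /embed/, /v/, attribution links, ...).
function youtubeId(url: string): string | null {
  const m =
    url.match(/[?&]v=([\w-]{11})/) ??
    url.match(/(?:youtu\.be\/|\/embed\/|\/v\/)([\w-]{11})/);
  return m ? m[1] : null;
}

console.log(youtubeId("https://www.youtube.com/watch?v=abcdefghijk")); // "abcdefghijk"
console.log(youtubeId("https://youtu.be/abcdefghijk"));                // "abcdefghijk"
console.log(youtubeId("https://example.com/not-youtube"));             // null
```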
For sites with structure, use Cheerio -
jQuery selector API in Node server side
To parallelize this, make a bunch of jobs in RabbitMQ. The definition
of a job is:
- Links to crawl
- Rules (regex) for links to traverse
- Pre-set data (e.g. if you know the speaker’s name already)
- Rules to find data (lambdas with jQuery selectors)
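A job message might be shaped like this (field names are hypothetical; regexes and selector names travel as strings because functions don’t survive JSON serialization onto the queue):

```typescript
// Hypothetical crawl-job shape as it might sit on the RabbitMQ queue.
interface CrawlJob {
  links: string[];                     // links to crawl
  traverseRules: string[];             // regexes for which links to follow
  presetData: Record<string, string>;  // e.g. speaker name known up front
  extractors: Record<string, string>;  // field -> name of a selector lambda on the worker
}

const job: CrawlJob = {
  links: ["https://example-agency.com/speakers"],
  traverseRules: ["/speakers/[a-z-]+$"],
  presetData: { speaker: "Jane Doe" },
  extractors: { bio: "agencyBioSelector" },
};

// The worker tests each discovered link against the traversal rules:
const follow = (link: string) =>
  job.traverseRules.some((r) => new RegExp(r).test(link));

console.log(follow("https://example-agency.com/speakers/jane-doe")); // true
console.log(follow("https://example-agency.com/contact-page"));      // false
```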
There is also a script that does ‘upsert’ at the end of crawling.
If a video has already been seen, we want to update any newly
found fields so there aren’t duplicates.
I had to add locking at this stage, since I’m not using a real database.
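The upsert itself is roughly this shape (a Map stands in for the JSON-files-in-Git store, which is why the real version needs locking):

```typescript
// Upsert sketch: keep fields we already have, fill in anything new.
interface Video {
  id: string;
  [field: string]: unknown;
}

function upsert(db: Map<string, Video>, incoming: Video): void {
  const existing = db.get(incoming.id);
  if (!existing) {
    db.set(incoming.id, incoming);
    return;
  }
  for (const [key, value] of Object.entries(incoming)) {
    if (existing[key] === undefined && value !== undefined) {
      existing[key] = value; // only fill gaps; never clobber existing data
    }
  }
}

const db = new Map<string, Video>();
upsert(db, { id: "abc", title: "A Talk" });
upsert(db, { id: "abc", title: "Retitled Duplicate", year: 2016 });
console.log(db.get("abc")); // { id: 'abc', title: 'A Talk', year: 2016 }
```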
Use youtube-dl, ffmpeg, and sox here. Not hosting videos puts DMCA enforcement mostly on YouTube
(there are a ton of bootlegged videos posted to Reddit, either from YouTube or from sites designed
to facilitate copyright infringement).
Captions are generated by machines if not by a person (on YouTube). Captions are interesting
in that they are just timings + text, potentially with words highlighted. Big players in
music do fun CSS things with their captions, which means the caption text repeats (once per
highlighted word). These repeats can be spliced back together (essentially the same idea as
genome-sequencing algorithms).
Highlighting in search results is a bit of a pain, since you can only highlight the timed text
within a boundary. However, this lets you skip to a portion of a video you care about.
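A crude version of the splice (real overlap merging is closer to sequence alignment, but prefix matching covers the common karaoke-style repeats):

```typescript
// Karaoke-style captions repeat the same line once per highlighted word.
// Crude splice: strip markup, then fold a cue into the previous one when
// its text starts with the previous text, keeping the earliest start time.
interface Cue { start: number; end: number; text: string }

function splice(cues: Cue[]): Cue[] {
  const out: Cue[] = [];
  for (const cue of cues) {
    const text = cue.text.replace(/<[^>]+>/g, "").trim(); // strip highlight tags
    const prev = out[out.length - 1];
    if (prev && text.startsWith(prev.text)) {
      prev.text = text; // extended repeat of the same line
      prev.end = cue.end;
    } else {
      out.push({ start: cue.start, end: cue.end, text });
    }
  }
  return out;
}

const raw: Cue[] = [
  { start: 0.0, end: 0.4, text: "<b>hello</b> world" },
  { start: 0.4, end: 0.9, text: "hello <b>world</b>" },
  { start: 0.9, end: 1.5, text: "next line" },
];
console.log(splice(raw));
// → [ { start: 0, end: 0.9, text: 'hello world' }, { start: 0.9, end: 1.5, text: 'next line' } ]
```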
70,000 videos @ 45 minutes each - around 24 TB
This would cost about $300/mo on S3, similar on other services.
Hard drives cost ~$45/TB at the moment, and you can purchase up to
12 TB at once. It’s double if you do RAID, and there’s a big
performance / electricity hit.
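Back-of-envelope check of those figures (the ~8 MB-per-minute rate is my assumption; the real rate depends on resolution and codec):

```typescript
// Sanity-check the "70,000 videos @ 45 minutes ≈ 24 TB" claim.
const videos = 70_000;
const minutesEach = 45;
const mbPerMinute = 8; // assumed average for conference-talk video

const totalTb = (videos * minutesEach * mbPerMinute) / 1_000_000;
console.log(`${totalTb.toFixed(1)} TB`); // "25.2 TB" — in the ~24 TB ballpark
```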
Can download / split 4-6k videos per day.
Ways to mitigate:
- Use ffmpeg to split audio / video portions (audio: re-encode as MP3)
- Download lower-quality video (risky for OCR)
- Quality tasks may only need the first N minutes of video
- ffmpeg can compress video, but this is very time-consuming
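Since the pipeline shells out to ffmpeg rather than binding to it, the TypeScript side only builds argument lists. A sketch for the “split off the audio as MP3” item (flags are standard ffmpeg; file names are examples):

```typescript
// Build the ffmpeg argument list for extracting a talk's audio track as MP3.
function audioExtractArgs(input: string, output: string): string[] {
  return [
    "-i", input,
    "-vn",                    // drop the video stream entirely
    "-codec:a", "libmp3lame", // re-encode the audio as MP3
    "-q:a", "4",              // LAME VBR ~165 kbps, plenty for speech
    output,
  ];
}

// e.g. spawn("ffmpeg", audioExtractArgs("talk.mp4", "talk.mp3"))
console.log(audioExtractArgs("talk.mp4", "talk.mp3").join(" "));
// → "-i talk.mp4 -vn -codec:a libmp3lame -q:a 4 talk.mp3"
```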
- Use Tesseract for OCR: grab frames every N seconds to find out if there’s text (this indicates either slides or an intro)
- Store OCR results in the same format as closed captions (time window + text)
- [TODO insert paper reference] takes a more sophisticated approach (detecting when a slide becomes visible). They indicate their processing takes 1/10th (verify?) the time of the actual video. I think this can be improved a lot with my cruder approach.
Some people train models on these. One neat trick is to use partial video.
Easy to compute SNR in Python.
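The notes do this in Python/numpy; for consistency with the other examples, here is the same arithmetic sketched in TypeScript, treating one array as signal and one as noise (e.g. a stretch of the recording believed to be silence):

```typescript
// SNR as mean signal power over mean noise power, expressed in dB.
function meanPower(samples: number[]): number {
  return samples.reduce((acc, s) => acc + s * s, 0) / samples.length;
}

function snrDb(signal: number[], noise: number[]): number {
  return 10 * Math.log10(meanPower(signal) / meanPower(noise));
}

const signal = [0.5, -0.5, 0.5, -0.5];    // mean power 0.25
const noise = [0.05, -0.05, 0.05, -0.05]; // mean power 0.0025
console.log(snrDb(signal, noise).toFixed(1)); // "20.0" (dB)
```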
Many machine learning tools are native to Python, e.g. numpy for matrix stuff.
Scala / Spark has some useful tools as well (good thing is it can max your CPU).
Sox (the audio Swiss Army knife) detects clipping as a side effect, which is useful. It also
spits out audio stats.