My apologies for my laxness in posting of late. It’s the general “ugh school life everything such big much busy” story, so I’ll spare you any excuses and jump straight in.
Last week, Matt told us about Beautiful Soup, a Python package that parses scraped web pages (HTML and XML) and helps clean up the extracted data. He’s been using it to build corpora for some of the field folks whose languages don’t have any corpora available, and says it’s working like a charm. We went over some of his code, and it was extensive and pretty and somewhat overwhelming. If you’re interested in building your own text databases, definitely talk to Matt!
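For a taste of what that kind of cleanup involves, here is a minimal sketch. It is not Matt's code, and it uses Python's built-in `html.parser` as a stand-in for Beautiful Soup so that it runs with no installs. The sample HTML and the names `TextExtractor` and `extract_text` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # cleaned text pieces, in document order
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace (including newlines) to single spaces.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text chunks found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page, standing in for something fetched from the web.
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body><h1>A  short
story</h1><p>First sentence.</p><script>var x = 1;</script><p>Second   sentence.</p></body></html>
"""

sentences = extract_text(SAMPLE_HTML)
for sentence in sentences:
    print(sentence)
```

With Beautiful Soup installed, something like `BeautifulSoup(html, "html.parser").get_text()` does similar work in one line, though details such as whitespace and script handling may differ.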
Tomorrow, I’ll be presenting on gensim, a Python library designed for finding semantic structure in text. I use it with Word2Vec from Google, which learns dense word vectors with a shallow neural network (similar in spirit to LSA, but it doesn’t actually build an LSA matrix). I build my models off of the pre-trained vectors available here; Google trained them on a news corpus of about 100 billion words and has provided the 300-dimensional vectors for 3 million words and phrases for use in research. It’s pretty rad. Also pretty huge. You can do all sorts of neat things with it and gensim, which I’ll tell you all about tomorrow!
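To make "neat things" a little more concrete, here is a toy sketch of the best-known trick, word analogies by vector arithmetic. It is pure Python with made-up 3-dimensional vectors, not the real 300-dimensional Google ones, and `most_similar` here is a simplified imitation written for this example, not gensim's implementation.

```python
import math

# Toy vectors invented for illustration; real Word2Vec vectors have 300 dimensions.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative).

    Query words are excluded from the results. This is a simplification:
    it does not normalize the input vectors before combining them.
    """
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]

    query_words = set(positive) | set(negative)
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in query_words
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# "king" - "man" + "woman" should land nearest to "queen" in this toy space.
result = most_similar(TOY_VECTORS, positive=["king", "woman"], negative=["man"])
print(result[0][0])
```

With gensim and the downloaded vectors, the real thing would look roughly like `KeyedVectors.load_word2vec_format(path, binary=True)` followed by `.most_similar(positive=[...], negative=[...])`, where `path` points at wherever you saved the file.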