I cracked on with the code to find things to display on the FACT Metroscopes this week. We're restoring the original artwork, so we need to search the Web for pages that complete the phrases "Liverpool is...", "Dublin is...", etc. for Liverpool and her twin cities: Dublin, Odessa, Shanghai and Cologne.
I'd already created a custom Google search engine to let us perform the searches programmatically. I've been running the search periodically in the background over the past few weeks, so we have a reasonable set of pages to parse for the right sentences.
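As a rough sketch of what a programmatic search request could look like (not the actual Metroscope code), the snippet below builds request URLs for Google's Custom Search JSON API, one per city. The `MY_API_KEY` and `MY_ENGINE_ID` values are placeholders, and the function name and the exact-phrase query format are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Endpoint of Google's Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

CITIES = ["Liverpool", "Dublin", "Odessa", "Shanghai", "Cologne"]


def build_search_url(city, api_key, engine_id, start=1):
    """Build a request URL that searches for the exact phrase "<city> is"."""
    params = {
        "key": api_key,       # API key (placeholder in the usage below)
        "cx": engine_id,      # ID of the custom search engine (placeholder)
        "q": f'"{city} is"',  # double quotes ask for an exact phrase match
        "start": start,       # 1-based index of the first result to return
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# One URL per city; fetching them and storing the results is left out here.
urls = [build_search_url(city, "MY_API_KEY", "MY_ENGINE_ID") for city in CITIES]
```

Each URL can then be fetched on a schedule and the returned result links saved for later parsing.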
A naive parsing with Python's Beautiful Soup library showed we were on the right track, but also threw up some gotchas. We needed a better way to filter out extraneous text like menus and sidebar text, and also to better understand the sentences so that we could only match those that start with "[City] is..." rather than just contain those words.
Using some of the page navigation features of Beautiful Soup I wrote a better parser, which breaks the page into blocks of text that we can process further. Then I pulled in the Natural Language Toolkit (NLTK) library for its ability to identify full sentences for matching.
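To illustrate the two-stage idea (blocks first, then sentences), here is a minimal, standard-library-only sketch. It is not the real parser: Python's `html.parser` stands in for Beautiful Soup, a simple regular expression stands in for NLTK's sentence tokenizer, and the tag lists, class name and function names are all assumptions.

```python
import re
from html.parser import HTMLParser

# Assumed tag lists: elements that delimit a text block, and page furniture to skip.
BLOCK_TAGS = {"p", "li", "blockquote", "h1", "h2", "h3", "td", "div"}
SKIP_TAGS = {"nav", "aside", "header", "footer", "script", "style"}


class BlockExtractor(HTMLParser):
    """Collect the text of each block-level element, ignoring menus and sidebars."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Collapse whitespace and store the block if anything is left.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def extract_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text
    return parser.blocks


def matching_sentences(blocks, city):
    """Return sentences that start with '<city> is', not ones that merely contain it."""
    pattern = re.compile(rf"{re.escape(city)} is\b")
    found = []
    for block in blocks:
        # Naive sentence split on ., ! or ? followed by whitespace (NLTK stand-in).
        for sentence in re.split(r"(?<=[.!?])\s+", block):
            if pattern.match(sentence):
                found.append(sentence)
    return found


SAMPLE = """
<html><body>
<nav><ul><li>Liverpool is on the menu.</li></ul></nav>
<p>Liverpool is a port city. I think Liverpool is great!</p>
<p>Dublin is full of stories.</p>
</body></html>
"""

blocks = extract_blocks(SAMPLE)
liverpool = matching_sentences(blocks, "Liverpool")
```

The regex splitter will trip over abbreviations such as "St. John's", which is exactly the sort of case a proper sentence tokenizer handles better.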
That tidied things up nicely. Then I just needed to add a denylist, which we can populate with words or phrases that further filter the results. That lets us avoid matches related to the football clubs rather than the cities themselves.
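A denylist filter along these lines can be very small. The sketch below is illustrative only: the entries in `DENYLIST` are made-up examples and the whole-word, case-insensitive matching rule is an assumption rather than a description of the real filter.

```python
import re

# Hypothetical denylist entries; the real list is populated by hand.
DENYLIST = ["football club", "fc", "premier league"]


def is_allowed(sentence, denylist=DENYLIST):
    """Return True if the sentence contains none of the denylisted words or phrases."""
    lowered = sentence.lower()
    # \b keeps short entries like "fc" from matching inside longer words.
    return not any(
        re.search(rf"\b{re.escape(term.lower())}\b", lowered) for term in denylist
    )


candidates = [
    "Liverpool is a port city.",
    "Liverpool is top of the Premier League.",
]
kept = [sentence for sentence in candidates if is_allowed(sentence)]
```

Matching on whole words means a short entry does not accidentally discard sentences that happen to contain those letters inside another word.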
The next Museum in a Box production batch is starting to come together. The custom PCBs are in production, so it's time to get the RFID readers that go with them ready. I spent Friday separating out the parts we need (the spare cards and tags go into the DoES Liverpool stock cupboard for member access passes), soldering on the header sockets and testing that the readers work.
We still need some more to have enough for this manufacturing run, then they'll get sent down to Museum in a Box HQ for assembly.