Friday, September 9, 2022

Looking it up with computer vision

My mother introduced me to a wide range of topics when I was growing up. She had fascinations with botany, ornithology, entomology and paleontology, among the so-called hard sciences. As a teacher, she had adopted certain practices from studying child development and psychology in her master's degree program about the best way to help a young mind learn without just teaching at it. One of her greatest mantras from my childhood was "Let's look it up!" Naturally she probably already knew the Latin name for the plant, animal or rock I was asking about. But rather than just telling me, which would only send me back to her the next time, she taught me to always seek out the answers to my questions on my own. 

This habit of always-be-looking-things-up proved a valuable skill when it came to learning languages beyond Latin terms. I would seek out new mysteries and complex problems everywhere I went. When I traveled through lands with written scripts very different from English, I was fascinated to learn the etymologies of words and the way that languages were shaped. Chinese and Japanese script became a particularly deep well that has rewarded me with years of fascinating study. Chinese pictographs represent objects and narrative themes through shape rather than sound, much like the gestures of sign language. I'd read that pictographic languages are considered right-brain dominant because understanding them depends on pattern recognition rather than the decoding of alphabetic syllables and sounds, which are typically processed in the left brain. I had long been fascinated by psychology, so I thought that learning a right-brain language would give me an interesting new avenue to conceive of language differently and potentially thereby think in new ways. It didn't ultimately change me that much. But it did give me a fascinating depth of perspective into new cultures.

Japanese study became easier by degrees. The more characters I recognized, the faster the network of comprehensible compound words grew. The difficulty of learning Japanese as a non-native had to do with representing language by brush strokes instead of phonemes. To look up a word you don't know how to pronounce, you must first identify a particular shape within the broader character, called a radical. You then scan a list of potential matches containing that radical, sorted by total brush-stroke count. It takes a while to get used to. While living in Japan, I'd started with the paper-dictionary lookup process, which is like using a slide rule to zero in on the character, which can then be researched elsewhere. Electronics manufacturers later invented calculator-like dictionaries that sped up searching by radical. Still, it typically took me 40 to 60 seconds with an electronic kanji dictionary to identify a random character I'd seen for the first time. That's not so convenient when you're walking around outside in Tokyo. So I got in the habit of photographing characters for future reference, when I had time for the somewhat tedious process.
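
For the curious, here is a minimal sketch of that two-step lookup in Swift, using a tiny hypothetical in-memory index. A real dictionary would be keyed by all 214 traditional radicals and hold thousands of entries; the sample data and function names here are just illustrative.

```swift
import Foundation

// A minimal sketch of the radical-and-stroke-count lookup flow described above.
// The tiny sample index is hypothetical; a real dictionary would hold thousands
// of entries keyed by every radical.
struct KanjiEntry {
    let character: String
    let totalStrokes: Int
    let meaning: String
}

// Index keyed by radical, the way a paper dictionary organizes its tables.
let radicalIndex: [String: [KanjiEntry]] = [
    "木": [
        KanjiEntry(character: "林", totalStrokes: 8,  meaning: "grove"),
        KanjiEntry(character: "桜", totalStrokes: 10, meaning: "cherry blossom"),
        KanjiEntry(character: "森", totalStrokes: 12, meaning: "forest"),
    ]
]

// Step 1: pick out the radical you recognize inside the unknown character.
// Step 2: narrow the candidate list by total brush-stroke count.
func lookup(radical: String, strokeCount: Int) -> [KanjiEntry] {
    (radicalIndex[radical] ?? []).filter { $0.totalStrokes == strokeCount }
}

// e.g. a character containing the "tree" radical with 12 strokes → 森 (forest)
print(lookup(radical: "木", strokeCount: 12).map(\.character))
```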

Last month I was reviewing some vocabulary on my phone when I noticed that Apple had introduced optical character recognition (OCR) into the operating system of new iPhones. OCR has been around for years on large desktop computers with expensive supplemental software. But having it at my fingertips made the lookup of kanji characters very swift. I could read any text through a camera capture and copy it into my favorite kanji dictionaries (jisho.org or the imiwa app). From there I could explore compound words using those characters and their potential translations. Phones have been able to read barcodes for a decade. Why hadn't the same been applied to Chinese characters until now? Just like barcodes, they are specific image blocks that refer directly to specific meanings. My guess is that recognizing barcodes had a financial convenience behind it, while deciphering words for polyglots was an afterthought that only recently became worth supporting. This is now my favorite feature of my phone! 
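
For developers, this capability is exposed through Apple's Vision framework. The sketch below shows one way an app might run on-device text recognition on a photographed sign; it's a simplified illustration rather than how Live Text itself is implemented, and the function name is my own.

```swift
import Vision
import UIKit

// A sketch of on-device text recognition with Apple's Vision framework,
// roughly the capability that Live Text surfaces in the OS.
// Error handling is kept minimal for brevity.
func recognizeKanji(in image: UIImage, completion: @escaping ([String]) -> Void) {
    guard let cgImage = image.cgImage else { completion([]); return }

    let request = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        // Take the top candidate string for each detected block of text.
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        completion(lines)
    }
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["ja-JP"]   // prefer Japanese over Latin script

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    DispatchQueue.global(qos: .userInitiated).async {
        try? handler.perform([request])
    }
}

// Usage: feed in a photographed sign, then paste the result into jisho.org or imiwa.
// recognizeKanji(in: signPhoto) { lines in print(lines.joined(separator: "\n")) }
```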

What's more, the same Vision API allows you to select text in any language, and even objects in pictures, and send it to search engines for further assistance. For instance, if you remember taking a picture of a tree recently but don't know what folder or album you put it in, Spotlight search lets you query across the photo library on your phone even if you never tagged the photo with a label for "tree." Below you can see how the device-based OCR indexing looked for occurrences of the word "tree" and picked up the image of the General Sherman Tree exhibit sign in my photo collection from a trip to Sequoia National Park. You can see how many different parts of the sign the Vision API detected the word "tree" in, all within a static image. 
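
As a rough illustration of the idea, here is how an app could attach OCR'd text to a photo so that a Spotlight query like "tree" finds it, using Apple's Core Spotlight framework. This is only a sketch of the concept; the built-in Photos indexing presumably runs through private plumbing, and the identifiers below are made up.

```swift
import CoreSpotlight
import UniformTypeIdentifiers

// A sketch of making a photo findable by the words OCR'd from it.
// This is not how the system's own Photos indexing works internally;
// it just illustrates attaching recognized text to a searchable item.
func indexPhoto(identifier: String, recognizedText: [String]) {
    let attributes = CSSearchableItemAttributeSet(contentType: .image)
    attributes.title = "Photo \(identifier)"
    attributes.textContent = recognizedText.joined(separator: " ") // e.g. "General Sherman Tree ..."

    let item = CSSearchableItem(uniqueIdentifier: identifier,
                                domainIdentifier: "photos.ocr",    // hypothetical domain name
                                attributeSet: attributes)

    CSSearchableIndex.default().indexSearchableItems([item]) { error in
        if let error = error { print("Indexing failed: \(error)") }
    }
}

// Once indexed, a Spotlight query for "tree" can surface the photo
// even though the user never tagged it manually.
```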

But then I noticed that even if I put the word "leaf" into my Spotlight search, my Photos app would pull up images that had the shape of a leaf in them, often on trees or nearby flowers that I had photographed. The automatic semantic identification takes place inside the Photos application with a machine learning process, which then has a hook to surface relevant potential matches to the phone's search index. This works much like the face identification feature in the camera, which allows the phone to isolate and focus on faces in the viewfinder when taking a picture. Several different layers of technology achieve this. The first is identifying figure/ground relationships in the photo, which is usually done at the time the photo is taken with the adjustable focus option selected by the user. (The automated focus indicator hovers over the viewfinder while you're selecting the area of the photo to pinpoint as the subject or depth of focus.) Once the subject can be isolated from the background, a machine learning algorithm can run on a batch of photos to find inferred patterns, like whose face matches which person in your photo library. 
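
Apple's internal Photos pipeline isn't public, but the Vision framework does expose a general-purpose image classifier that hints at how this kind of labeling works: it returns semantic labels such as "plant" or "leaf" with confidence scores. A minimal sketch, with a made-up function name and threshold:

```swift
import Vision
import UIKit

// A sketch of on-device semantic labeling. The Photos pipeline itself is
// proprietary, but Vision's built-in classifier returns labels with
// confidence scores that a search feature could match queries against.
func semanticLabels(for image: UIImage, minimumConfidence: Float = 0.3) -> [String] {
    guard let cgImage = image.cgImage else { return [] }

    let request = VNClassifyImageRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])

    let observations = request.results as? [VNClassificationObservation] ?? []
    return observations
        .filter { $0.confidence >= minimumConfidence }
        .map { $0.identifier }               // e.g. ["plant", "leaf", "tree"]
}

// A search feature could then map a query like "leaf" onto any photo
// whose label list contains that term.
```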

From this you can imagine how powerful a semantic-discovery tool would be if you had such a camera in your eyeglasses, helping you read signs in the world around you, whether in a foreign language or even your own native language. It makes me think of Morgan Freeman's character "Easy Reader," who'd go around New York looking for signs to read in the popular children's show The Electric Company. The search engines of yester-decade looked for semantic connections between the words written on webpages and the hypertext links that string them together. This utility we draw on every day uses machine-derived indications of significance, based on the way people write web pages about subjects and on which terms those authors link to which subject webpages. The underlying architecture of web search is all based on human action. A secondary layer of interpretation of those inferences is based on the number of times people click on results that address their query well. Algorithms make the inferences of relevancy. But it's human authorship of the underlying webpages, and human preference among those links thereafter, that informs the machine learning. Consider that all of web search is based only on what people decide to publish to the web. Then think about all that is not published to the web at present, such as much of the offline world around us. You can just imagine the semantic connections that could be drawn through the interconnectedness of the tangible world we move through every day. Assistive devices that see the codes we humans use to thread together our spatially navigable society will reveal a web of inter-relations, one easily mapped by the optical web crawlers we employ over the next decade.

To test how the Vision API deals with ambiguity, you can throw a picture of any flower of varying shape or size at it. The image will be compared against potential matches inferred from a database of millions of flower images in the archives of WikiCommons, the free-use media files which appear on Wikipedia. This is accessed via the "Siri knowledge" engine at the bottom of the screen on your phone when you look at an image (see the small star shape next to the "i" below). While WikiCommons is a public database of free-use images, the approach could easily be expanded to any corpus of information in the future. For instance, there could be a semantic optical search engine that only matches against images in the Encyclopedia Britannica. Or if you'd just bought a book on classic cars, the optical search engine could fuzzy-match input from your future augmented-reality lenses against only the cars you see in the real world that match the model types you're interested in.
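
The "Siri knowledge" matching itself isn't something developers can call directly, but the domain-restricted idea can be sketched with Vision's image feature prints: compute an embedding for the query photo and compare it against embeddings for a reference corpus of your own choosing (say, the plates from that classic-car book). The corpus dictionary and function names below are hypothetical.

```swift
import Vision
import UIKit

// Compute a Vision feature print (an image embedding) for one image.
func featurePrint(for image: UIImage) -> VNFeaturePrintObservation? {
    guard let cgImage = image.cgImage else { return nil }
    let request = VNGenerateImageFeaturePrintRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

// Return the name of the reference image whose feature print is closest
// to the query photo. Smaller distance means a more similar image.
func closestMatch(query: UIImage, corpus: [String: UIImage]) -> String? {
    guard let queryPrint = featurePrint(for: query) else { return nil }
    var best: (name: String, distance: Float)?

    for (name, referenceImage) in corpus {
        guard let referencePrint = featurePrint(for: referenceImage) else { continue }
        var distance: Float = 0
        try? queryPrint.computeDistance(&distance, to: referencePrint)
        if best == nil || distance < best!.distance {
            best = (name, distance)
        }
    }
    return best?.name
}

// Usage sketch: restrict matching to your own corpus, e.g. classic-car plates.
// let match = closestMatch(query: streetPhoto, corpus: classicCarPlates)
```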


Our world is full of meanings that we interpret from it or layer onto it. The future semantic web of the spatial world won't be limited to only what is on Wikipedia. The utility of our future internet will be as boundless as our collective, collaborative minds are. If we think of the great things Wikimedia has given us, including the birth of utilities like the brains of Siri and Alexa, we can understand that our machines face only the limits that humans themselves impose on the extensibility of their architectures.