In recent weeks, I have been looking into the intersection between two kinds of language. On the one hand, we have English, a natural language widely understood by human beings. On the other, we have a host of programming languages with English-like syntax, designed to be read by some human beings before being compiled down to bytes for machines to execute.
My study was prompted by a desire to extract insight from text, a form of unstructured data that the average human readily understands but that had been notoriously difficult for machines to parse and analyse in a systematic fashion. Until recently.
Currently I am looking into natural language processing for an old classification problem that has been collecting dust in one of my personal spreadsheets. TensorFlow with Keras has some good tools for preprocessing text and building neural networks suited to text problems, e.g. embedding and dense layers. I am still scaling the steep learning curve, with lots of data wrangling to do, so it will probably be a while before I share anything worthwhile.
But for today I could talk about something simpler.
Let’s say we want to process a sentence in English.
The section below is inspired by this Edabit challenge so there are some partial spoilers ahead!
To simplify the example, we will ignore punctuation. Consider the string “This is a sentence”.
How do we reverse the entire sentence?
How do we reverse individual words while keeping the original order in the sentence intact?
In Python, string slicing and generator expressions give us the tools to accomplish both tasks. Both lists and strings can be sliced. The third value in the [start:stop:step] slicing operator can be negative: a step of -1 walks the sequence backwards, so sentence[::-1] gives the characters of the string in reverse order.
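As a minimal sketch of the slicing approach, here is the example sentence reversed with a negative step. The variable names are my own choice for illustration.

```python
sentence = "This is a sentence"

# A step of -1 walks the string from the last character to the first.
reversed_sentence = sentence[::-1]
print(reversed_sentence)  # ecnetnes a si sihT

# The same slice works on a list, here reversing the word order instead.
words = sentence.split()
reversed_words = words[::-1]
print(reversed_words)  # ['sentence', 'a', 'is', 'This']
```

Leaving start and stop empty means the slice covers the whole sequence, so only the step needs to be given.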
To reverse individual words while keeping their order intact, we can split the string into words, reverse each word inside a generator expression, and join the results back together with spaces.
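One way to sketch the word-by-word reversal is shown here. The helper name reverse_words is hypothetical, and the sketch assumes that words are separated by whitespace and that punctuation has already been removed.

```python
def reverse_words(text):
    """Reverse each word while keeping the words in their original order."""
    # split() with no arguments breaks the text on any run of whitespace.
    # The generator expression reverses one word at a time for join().
    return " ".join(word[::-1] for word in text.split())


sentence = "This is a sentence"
print(reverse_words(sentence))  # sihT si a ecnetnes
```

Note that split() followed by join() collapses repeated spaces into one, which is fine for this example but may matter if the original spacing has to be preserved.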
In PHP, strrev reverses a string directly, and the dramatically named explode and implode functions break the sentence into an array of words and join it back together when we need to reverse word by word.
Sentence manipulation is useful in tokenisation, an important data pre-processing technique that converts unstructured text into structured numerical data that can be fed into a suitable machine learning model (e.g. a neural network with embedding and dense layers) to extract insight from the information. Packages such as NLTK, scikit-learn and TensorFlow contain high-level functions that abstract away the complexity of implementing the algorithms, but it is good to appreciate what goes on under the hood once in a while.
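To get a feel for what happens under the hood, here is a deliberately simplified, pure-Python sketch of word-level tokenisation. It is not how NLTK, scikit-learn or TensorFlow implement it. The function names, the lowercasing step and the choice to reserve 0 for unknown words are all assumptions made for illustration.

```python
def build_vocabulary(sentences):
    """Map each distinct lowercased word to an integer id, starting at 1."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            # Assign the next free id the first time a word is seen.
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary


def encode(sentence, vocabulary, unknown_id=0):
    """Convert a sentence to a list of ids; unseen words map to unknown_id."""
    return [vocabulary.get(word, unknown_id) for word in sentence.lower().split()]


corpus = ["This is a sentence", "This is another sentence"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)
print(encode("Is this a new sentence", vocabulary))
```

Each sentence becomes a list of integers, which is the kind of structured input an embedding layer expects. Real tokenisers also deal with punctuation, padding and vocabulary size limits, all of which this sketch leaves out.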