This week is ALTA’s Machine Learning for Digital ELT Summer School here in Crete, and ELTjam will be blogging (hopefully each day) from the event.

Things kicked off last night with a drinks reception at the Santa Maria beach resort just outside Chania, and then today, the conference began! I think I speak for all of us here when I say that Day 1 was an intense and at times almost overwhelming insight into the fields of Natural Language Processing (NLP), Machine Learning (ML), automated assessment and error correction for English. I hope this post effectively summarises some of the key points and that this will be helpful for those of us here on the island as well as those with an interest in this technology who weren’t able to be here.

As a disclaimer I must point out that the info here is my understanding of the content from today’s sessions and may contain some inaccuracies (that I will edit if brought to my attention) or omissions. As a request I would ask that anyone add clarification, corrections or additional info into the comments below.

(You can read the summaries of the other days here: Day 2, Day 3, Day 4, Day 5)

Machine Learning for Natural Language Processing

The day started with Andreas Vlachos of the University of Sheffield, who discussed two key elements of natural language processing and machine learning: Classification and Language Modelling. With both, the basic concept is the use of existing data sets to help an algorithm acquire knowledge which it is then able to use to make inferences about new data that is presented to it.

Andreas showing us how the learning happens!

Text Classification

The important thing is the features you judge a text by and the amount of data you have to train the model

Text classification can be used to assign a text a category (e.g. spam/not spam, positive/negative). A big part of this classification is sentiment analysis: understanding what is meant by a combination of words and sentences, and what the intention or aim is. Take the example of film reviews. By feeding a model a number of reviews that are known to be generally positive or generally negative, the model can build up an understanding of what to expect from different types of review. Then, when shown a new review, it is able to analyse it and make a judgement about what the author felt about the film. This falls into the category of machine learning as the algorithm is learning what constitutes a good and bad review and then applying that knowledge to make inferences about data it has never seen before.

An important algorithm used to build up this understanding is the perceptron. In simplified terms (the only terms I’m capable of), this algorithm assigns a ‘weight’, either positive or negative, to a certain feature of the text. Sticking with the film review example, it might see more instances of the word ‘amazing’ in positive reviews and assign it a positive weight, and more instances of ‘terrible’ in negative reviews and assign it a negative weight. Then, when analysing a new review, it would treat an instance of ‘amazing’ as an indication of a positive review.
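To make the idea of weights a bit more concrete, here is a minimal Python sketch of a perceptron trained on four made-up film reviews. It is my own illustration rather than anything shown in the session: the toy reviews, the function names and the choice of single words as features are all assumptions.

```python
from collections import defaultdict

def train_perceptron(examples, epochs=5):
    """Learn one weight per word from (text, label) pairs; label is +1 or -1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            words = text.lower().split()
            score = sum(weights[w] for w in words)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Only update on a mistake: nudge each word's weight
                # towards the correct label.
                for w in words:
                    weights[w] += label
    return weights

def predict(weights, text):
    """Return +1 (positive review) or -1 (negative review)."""
    score = sum(weights.get(w, 0.0) for w in text.lower().split())
    return 1 if score > 0 else -1

# Hypothetical training data, invented for this example.
reviews = [
    ("an amazing film", 1),
    ("amazing acting and a great story", 1),
    ("a terrible film", -1),
    ("terrible acting and a boring story", -1),
]
weights = train_perceptron(reviews)
print(weights["amazing"], weights["terrible"])
print(predict(weights, "amazing"), predict(weights, "terrible"))
```

After training, ‘amazing’ carries a positive weight and ‘terrible’ a negative one, which is all the model knows about either word.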

The important thing here is what ‘features’ are used when training the model. Are we looking at the frequency of individual words, or combinations of words, or both? Do we look at capitalisation and punctuation or not? Do we place more importance on the words in the first or last sentence, or is the location of a feature within a text of no importance? For example, if we look only at individual words to train a model, the algorithm would see the sentence “It was amazing just how bad this film was” and would (wrongly) see the word ‘amazing’ as an indication of a positive review. To guard against this, combinations of words would need to be used to analyse the reviews more effectively.
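As a rough illustration of what choosing features can mean in practice, the sketch below turns a review into single-word features and, optionally, two-word combinations. The function name and the underscore-joined format for word pairs are my own assumptions, not something from the talk.

```python
def extract_features(text, use_bigrams=True):
    """Return the words of a text, plus adjacent word pairs if requested."""
    words = text.lower().split()
    features = list(words)
    if use_bigrams:
        # Join each pair of neighbouring words, e.g. "how_bad".
        features += [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return features

sentence = "It was amazing just how bad this film was"
unigrams_only = extract_features(sentence, use_bigrams=False)
with_bigrams = extract_features(sentence)
print(unigrams_only)
print(with_bigrams)
```

With single words only, the model sees ‘amazing’ and ‘bad’ in isolation; with word pairs it also gets features such as ‘how_bad’, which give it a chance of learning that this is not a glowing review.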

The other factor in determining the effectiveness of a model is just how much data is available to train it. How many examples can the algorithm learn from? Does it have to make judgements based on being fed 10 film reviews, or 100, 1000, a million?

The features chosen plus the quantity of data used to train the model will be what defines the quality of the processing that the classification model is able to achieve.

You can view Andreas’s slides here (without the interactivity from the session)

Language Modelling

Another common question for ML is ‘Is this sentence acceptable / likely / good enough English?’. To get an answer to this, language modelling is used. By using and analysing corpora of data we can build up an understanding of how probable certain language or combinations of language are. New language is then compared to this model and we are able to tell how likely it is that the new language is ‘correct’ or ‘acceptable’ in a certain context.

Here we train the model by feeding in corpus data and analysing it. How often do individual words (unigrams) appear? How often do combinations of two words (bigrams) or three words (trigrams) appear? Let’s take the following sentence as an example:

“She poured oil on the salad”

We can analyse the individual words and use their frequency to say how likely this sentence is compared to another combination of six words, e.g. “She poured oil on the house”. But by analysing only the frequency of the individual words we may conclude that the second sentence is more likely, as ‘house’ is a higher-frequency word than ‘salad’. So this approach gives us some information about how acceptable a sentence is, but not enough.

So we need to look at combinations of words: how common is the bigram pour|oil or the trigram on|the|house? By analysing these longer combinations of words we can build up an understanding that “she poured oil on the…” is more likely to be finished with ‘salad’ than ‘house’, even if ‘house’ is the higher-frequency word. Through this analysis we can also see that “She poured oil on the salad” is far more likely than “She oil poured on salad the”. So when presented with a previously unseen string of words, the model is able to analyse it and say how probable/correct it is, based on all the other instances of n-grams it’s been exposed to.
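Here is a small sketch of that salad/house comparison using a five-sentence toy corpus that I made up for the purpose. It counts single words and also records which word follows each two-word context, which is the trigram idea in its simplest form. The corpus and the function name are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Invented toy corpus: 'house' is more frequent overall than 'salad'.
corpus = [
    "she poured oil on the salad",
    "he poured oil on the salad",
    "they live in the house",
    "the house is on the hill",
    "the house is old",
]

word_counts = Counter()
following = defaultdict(Counter)  # (w1, w2) -> counts of the next word
for sentence in corpus:
    words = sentence.lower().split()
    word_counts.update(words)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        following[(w1, w2)][w3] += 1

def trigram_probability(w1, w2, w3):
    """Estimate P(w3 | w1 w2) from raw counts; 0.0 if the context is unseen."""
    seen = following.get((w1, w2))
    if not seen:
        return 0.0
    return seen[w3] / sum(seen.values())

print(word_counts["house"], word_counts["salad"])
print(trigram_probability("on", "the", "salad"))
print(trigram_probability("on", "the", "house"))
```

Even though ‘house’ appears more often than ‘salad’ in this corpus, ‘salad’ comes out as the more probable word after “on the”, because that is the only context the trigram counts look at.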

However, a key problem here is that no dataset will ever contain all possible words or n-grams, so sometimes the algorithm will come across new text that it has never seen all or part of before. If it therefore treated the probability of this word or combination being correct as zero, we would get many correct uses of language returning a zero probability. There are various methods of ensuring this doesn’t happen; for more information, see the various smoothing techniques.
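One of the simplest smoothing techniques is add-one (Laplace) smoothing, where every possible word pair is treated as if it had been seen one extra time. The sketch below is only meant to show the principle; I don't know which techniques were covered in the session, and the two-sentence corpus and function names are invented.

```python
from collections import Counter

# Invented toy corpus.
corpus = [
    "she poured oil on the salad",
    "she poured water on the plant",
]

bigram_counts = Counter()
context_counts = Counter()
vocabulary = set()
for sentence in corpus:
    words = sentence.split()
    vocabulary.update(words)
    for prev, word in zip(words, words[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def unsmoothed(prev, word):
    """P(word | prev) from raw counts: zero for any unseen pair."""
    total = context_counts[prev]
    return bigram_counts[(prev, word)] / total if total else 0.0

def add_one(prev, word):
    """P(word | prev) with add-one smoothing: never exactly zero."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocabulary))

print(unsmoothed("the", "oil"), add_one("the", "oil"))
print(unsmoothed("the", "salad"), add_one("the", "salad"))
```

The unseen pair “the oil” goes from a probability of zero to a small positive value, while the seen pair “the salad” still scores higher.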

Automated Assessment of EFL Learners’ Writing

Helen showing how input data is used to train the model.

After lunch Helen Yannakoudakis of the University of Cambridge talked about the work that ALTA are doing around the automated assessment of learner essays. This can have a real advantage for learners in that the results can be quicker to obtain and free from subjective bias, plus it gives access to feedback to learners who may not have a teacher or a tutor who can go over their writing with them.

In order to be able to assess an essay, we need:

  1. Representative training data. For ALTA this comes in the form of the graded Cambridge Learner Corpus (CLC), which consists of essays from Cambridge English exams, marked up in an XML format with grades and, in some cases, errors
  2. Sophisticated methods for extracting textual features – i.e. the ability to analyse a text effectively
  3. An algorithm that can be trained to then look for and analyse these features in a text it has not seen before

In terms of point two, by applying a range of different analysis techniques to a text, ALTA are able to determine how accurate an essay is and therefore what grade it should receive, based on the scores allocated to the texts in the CLC by human markers. In order to do this, the algorithm looks at:

  • Frequency of unigrams
  • Frequency of n-grams (bigrams and trigrams)
  • Grammatical constructions (based on parsing of the sentences)
  • PoS tagging of text and probability of certain syntactic features being present

The function of the machine learning algorithm is to take all this data from the CLC and work out the weight, positive or negative, that should be assigned to each feature. Once this has been established it will know which features are characteristic of essays at different grades. Then, when presented with a new essay, the algorithm will be able to make a judgement on the overall quality of the text by combining the weightings of the different parts of the essay. A higher incidence of positively weighted features will make for a higher grade than a higher incidence of negatively weighted features.

By combining these different methods of analysis and by assigning weightings to different features, ALTA have been able to create an automated scorer that agrees with human marking nearly as much as humans agree with humans. In fact the algorithm has been effective enough to weight positively certain features that are explicitly called out in the exam rubrics, indicating that the model works well. But does this open the system up to gaming? Could a learner just feed in specific things the algorithm wants to hear rather than actually answering the question?

To test this, ALTA tried randomising certain features and comparing automated scores to human scores. On almost all tests the correlation was good enough; the exception was the randomisation of sentence order within an essay, because none of the sentence-level analysis covered cohesion within a full text. For this reason, they now use an entity grid to analyse the instances of nouns across different sentences in the text and how often they occur in adjoining sentences. This has helped overcome that issue.
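To give a feel for why tracking nouns across adjoining sentences helps catch shuffled text, here is a heavily simplified sketch. It is not a real entity grid and not ALTA's implementation: it assumes the nouns in each sentence have already been extracted (the sets below are invented), and it just measures how many neighbouring sentence pairs share at least one noun.

```python
def adjacent_overlap(sentence_nouns):
    """Fraction of adjacent sentence pairs that share at least one noun."""
    pairs = list(zip(sentence_nouns, sentence_nouns[1:]))
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if set(a) & set(b))
    return shared / len(pairs)

# Hypothetical nouns from four consecutive sentences of an essay.
original = [
    {"city", "museum"},
    {"museum", "painting"},
    {"painting", "artist"},
    {"artist", "life"},
]
# The same sentences with the middle two swapped.
reordered = [original[0], original[2], original[1], original[3]]

print(adjacent_overlap(original))
print(adjacent_overlap(reordered))
```

In the original order every sentence picks up a noun from the one before it, whereas the reordered version breaks most of those links, so a score like this drops when sentences are shuffled.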

So we are now at a point where the ALTA team feel that they are confidently able to automate the scoring of an essay based on lexical and grammatical control, and text-level coherence. There was no discussion today about how able the algorithm is to assess based on the learner’s answering of the question they were asked, but apparently more on this tomorrow!

Another area in need of research is the assessment of spoken output from learners. Helen mentioned that there are three key research challenges here:

  1. Segmenting spoken language into chunks
  2. Removing hesitation
  3. Dealing with repetition

PhD anyone?

Error Detection

The final session of the day was on error detection and correction. Marek Rei, of the University of Cambridge, gave a detailed account of the various analyses that contribute to the detection of an error in a learner’s text. At a broad level, in order to detect errors we first need a data set that has been corrected by humans. From this data, algorithms are able to discover regularities. The trained model can then be applied to data without human marking to detect the errors in the text itself.

Marek showing he really knows his stuff!

Marek went into detail about the different types of analysis that layer on top of each other in order to give more accurate detection. Accuracy here is measured by both precision (how many predicted errors are actually errors) and recall (how many real errors are found). These scores are combined to give an overall rating, which improves with every additional layer of complexity in the analysis of the data.
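For anyone who wants to see the arithmetic, the sketch below computes precision and recall for a made-up set of error positions, and combines them with the F-score, which is one common way of doing so. I'm not certain which combined measure was used in the session, so treat the F-score here as an assumption; the token positions are invented too.

```python
def precision_recall_f(predicted, actual, beta=1.0):
    """Return (precision, recall, F-beta) for two sets of error positions."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score

# Hypothetical token positions: ten real errors, four predicted by the model.
actual_errors = {2, 5, 7, 11, 14, 18, 21, 25, 30, 33}
predicted_errors = {2, 5, 7, 9}

precision, recall, f1 = precision_recall_f(predicted_errors, actual_errors)
print(precision, recall, f1)
```

In this example three of the four predictions are real errors (precision of 0.75), but only three of the ten real errors were found (recall of 0.3).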

Artificial neural networks – the ability for multiple combinations of input data to result in multiple possible outputs, rather than a linear approach to data analysis such as with the perceptron (see above).

Word embeddings – representations of words as vectors in order to see which different words occupy a similar semantic or linguistic space. Similar words end up having similar vector representations.

Word-level and character-level embedding – a way of analysing both the sequence of words and also characters and using existing data to make predictions about possible next strings of words and characters.

Dropout – the addition of randomisation into the feature weighting in order to make the models more robust.
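Two of the ideas above are easy to sketch in a few lines of Python. The first function compares word vectors using cosine similarity, a standard way of measuring how close two embeddings are; the second applies dropout by randomly zeroing values during training and scaling up the survivors. The three-number ‘embeddings’ are invented for illustration (real ones are learned from data and have many more dimensions), and none of this is taken from Marek's slides.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means very similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, made up for this example.
embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "salad": [0.1, 0.9, 0.2],
}

def dropout(values, rate, rng, training=True):
    """Zero each value with probability `rate` during training.

    Surviving values are scaled by 1 / (1 - rate) so the expected total
    stays the same; outside training the values pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

print(cosine_similarity(embeddings["good"], embeddings["great"]))
print(cosine_similarity(embeddings["good"], embeddings["salad"]))

rng = random.Random(0)  # fixed seed so the example is repeatable
activations = [1.0] * 8
dropped = dropout(activations, 0.5, rng)
print(dropped)
```

‘good’ and ‘great’ come out as much more similar to each other than either is to ‘salad’, and the dropout output is a mix of zeros and doubled values that changes on every training pass.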

Once all these methods are combined, Marek said that there was a recall accuracy of around 30%, so only 3 out of 10 errors were correctly identified. And this is with absolutely cutting edge uses of technology and brainpower. Compare this to essay assessment where we are already at an accuracy level comparable to that of a human (or better) and we see that error correction and detection are still areas where more research and study could bring about significant improvements for learners and teachers.

You can view Marek’s presentation here.


This is a lot of theory and over the course of the week we expect to think more about how these developments relate to teaching and learning and the contexts within which they could have an impact. For now, I’m off outside to get some evening sun and a cold drink!

