Conference notes: ACL 2018, understanding representations and evaluation in more challenging settings

Leifeng Network AI Technology Review editor's note: This article belongs to the “top conference notes” series. Every year there are many excellent artificial intelligence and machine learning conferences. It is a pity to miss them, and even on site it is easy to lose the overview. Reading other researchers' summaries afterwards may therefore bring new insights.

Sebastian Ruder is a PhD student at the Insight Research Centre for Data Analytics and a research scientist at Aylien. He previously worked at Microsoft, IBM's Extreme Blue, and Google Summer of Code. His main research interest is deep learning for domain adaptation. This article, published by Sebastian Ruder on the Aylien blog, is an in-depth and comprehensive review of the research highlights among the ACL 2018 papers. The full text, compiled by Leifeng Network AI Technology Review, follows.

From July 15th to 20th this year, I was fortunate to attend the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) in Melbourne, Australia, where I presented three papers. It is hard to summarize the whole of ACL 2018 under a single theme. On reflection, however, some important themes clearly stand out. At the natural language processing conferences of 2015 and 2016, word embeddings ruled the field. At the time, many people even felt that EMNLP was better read not as “Empirical Methods in Natural Language Processing” but as “Embedding Methods in Natural Language Processing”.

Christopher Manning, head of the Stanford NLP group, mentioned in a talk that 2017 was the year of the BiLSTM with attention (a bidirectional LSTM with an attention mechanism). Although BiLSTMs with attention are still everywhere, in my view the main themes of this conference were gaining a better understanding of the representations these models capture and evaluating them in more challenging settings. I focus mostly on work that touches on these themes, along with some other topics that interest me.

Understanding representations

Probing models

A refreshing development is that many papers analyze existing models and the information they capture, rather than continuing to introduce new and shinier models. Currently, the most common way to do this is to automatically create a dataset that focuses on one particular aspect of generalization ability, and then to evaluate different trained models on this dataset:

For example, Conneau et al. (http://) evaluate different sentence embedding methods on ten datasets designed to capture particular linguistic properties, such as predicting the length of a sentence, recovering word content, and sensitivity to bigram shift. They find that different encoder architectures can yield embeddings with different characteristics, and that bag-of-embeddings captures sentence-level information well relative to its results on other tasks.

Zhu et al. (http://) evaluate sentence embeddings by observing how the similarity changes across triplets of sentences generated to differ in some semantic or syntactic aspect. Among many other findings, they report that SkipThought and InferSent can distinguish negation from synonymy, while InferSent is better at identifying semantic equivalence and at handling quantifiers.

Pezzelle et al. (http://) focus specifically on quantifiers. They test the ability of different CNN and LSTM models to predict quantifiers in single-sentence and multi-sentence contexts. They find that the models outperform humans in single-sentence contexts, while humans are slightly better in multi-sentence contexts.

Kuncoro et al. (http://) evaluate how well LSTMs model subject-verb agreement. They find that, given sufficient capacity, LSTMs can model subject-verb agreement, but that more syntax-sensitive models such as recurrent neural network grammars (https://arxiv.org/abs/1602.07776) perform even better.

Blevins et al. (http://) evaluate models trained on different tasks to see whether they capture the hierarchical structure of syntax. Specifically, they train the models to predict part-of-speech tags as well as constituent labels at different depths of a parse tree. They find that all models do encode a large amount of syntactic information, and in particular that language models also learn some syntax.

Lau et al. (http://) obtain an interesting result about the generalization ability of language models: a language model trained on a corpus of sonnets learns meter at a human level.

Language models also have their limitations, however. Spithourakis and Riedel find that language models are poor at modeling numerals, and they propose several strategies to improve language models in this respect.

Liu et al. (http://) show at the RepL4NLP workshop that LSTMs trained on natural language data can recall tokens from longer sequences than models trained on non-language data.
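The probing recipe described above (automatically build a dataset that isolates one property, then test frozen representations on it) can be sketched in a few lines. This is only a minimal illustration and not the setup of any of the cited papers: the random word vectors, the `encode` bag-of-embeddings encoder, the three length bins in `length_bin`, and the nearest-centroid probe are all assumptions chosen to keep the example self-contained.

```python
import random

DIM = 16
_rng = random.Random(0)
_word_vectors = {}


def word_vector(word):
    # Hypothetical stand-in for pretrained embeddings: one fixed random vector per word.
    if word not in _word_vectors:
        _word_vectors[word] = [_rng.gauss(0.0, 1.0) for _ in range(DIM)]
    return _word_vectors[word]


def encode(sentence):
    # Frozen bag-of-embeddings encoder: the mean of the word vectors.
    vectors = [word_vector(w) for w in sentence.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def length_bin(sentence, edges=(4, 8)):
    # Probing label: 0 for up to 4 words, 1 for 5 to 8 words, 2 for longer sentences.
    n = len(sentence.split())
    return sum(n > edge for edge in edges)


def fit_centroids(examples):
    # The probe: one centroid per label, fitted on the frozen embeddings only.
    grouped = {}
    for vector, label in examples:
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    # Assign the label of the closest centroid (squared Euclidean distance).
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vector))
    return min(centroids, key=squared_distance)


def probe_accuracy(train_sentences, test_sentences):
    train = [(encode(s), length_bin(s)) for s in train_sentences]
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, encode(s)) == length_bin(s) for s in test_sentences)
    return hits / len(test_sentences)


def random_sentence(rng, vocabulary, max_len=12):
    # Automatically generated probing data: random sentences of 1 to max_len words.
    return " ".join(rng.choice(vocabulary) for _ in range(rng.randint(1, max_len)))


demo_rng = random.Random(1)
vocabulary = ["model", "syntax", "probe", "sentence", "length", "word", "vector", "task"]
train_sentences = [random_sentence(demo_rng, vocabulary) for _ in range(200)]
test_sentences = [random_sentence(demo_rng, vocabulary) for _ in range(50)]
print("probe accuracy:", probe_accuracy(train_sentences, test_sentences))
```

An accuracy well above chance would suggest that the property is recoverable from the representation. An accuracy near chance suggests either that it is not, or that the probe is too weak to find it, which is why probing studies usually compare several encoders under the same probe.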


It is worth noting that I think a better understanding of what LSTMs and language models capture will become increasingly important, because they seem to be an important driver of progress in NLP, as shown by our ACL 2018 paper on language model fine-tuning (http://arxiv.org/abs/1801.06146) and by the article discussing the arrival of the ImageNet era in NLP.

Understanding state-of-the-art models

While the work mentioned above tries to understand a particular aspect of the generalization ability of a certain class of models, several papers at this ACL aim to better understand the best current models for a specific task:

Glockner et al. (http://arxiv.org/abs/1805.02266) focus on natural language inference. They create a dataset whose sentences differ from sentences in the training data by a single word, in order to test whether models can draw simple lexical inferences. They find that the current best models fail on many simple inferences.

Mudrakarta et al. (https://arxiv.org/abs/1805.05492) analyze current top QA models across modalities and find that these models often ignore key question terms. They then perturb the questions to craft adversarial examples that greatly reduce the accuracy of the models.

I found that many papers probed different aspects of models. I hope that these new datasets become standard tools in every NLP researcher's toolkit. That way we will not only see more such papers in the future, but this kind of analysis may also become part of standard model evaluation, alongside error analysis and ablation studies.

Analyzing the inductive bias

Another way to better understand a model is to analyze its inductive bias. The Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP (RELNLP) set out to explore how useful it is to incorporate linguistic structure into models. One of the key points of Chris Dyer's talk at the workshop was whether recurrent neural networks (RNNs) provide a useful inductive bias for NLP. In particular, he argued that several pieces of evidence indicate that RNNs prefer sequential recency, namely:

Gradients attenuate over time. LSTMs or GRUs may help to slow this down, but they also forget.

People have used training tricks such as reversing the input sequence when training machine translation models.

People have used enhancements such as attention to create direct connections back to earlier content.

For modeling subject-verb agreement, the error rate increases with the number of attractors.

According to Chomsky, sequential recency is not the right bias for learning human language, so the bias that RNNs bring does not seem well suited to language modeling. In practice, this can lead to statistical inefficiency and poor generalization. Recurrent neural network grammars, a class of models that sequentially generate both a tree and a sequence by compressing a sentence into its constituents, prefer syntactic recency (rather than sequential recency).

However, it is often hard to determine whether a model has a useful inductive bias. For subject-verb agreement, Chris hypothesizes that LSTM language models learn a non-structural “first noun” heuristic, which matches the verb to the first noun in the sentence. In general, perplexity (and other evaluation metrics) correlates with syntactic or structural competence. However, perplexity is not particularly sensitive when it comes to distinguishing structure-sensitive models from models that use a simple heuristic.
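To make the “first noun” heuristic concrete, here is a toy sketch that contrasts it with a purely recency-based heuristic on four hand-written sentence prefixes. The examples, the tag names, and both heuristic functions are illustrative assumptions and are not taken from the talk or from any evaluation dataset.

```python
# Each token is (word, part_of_speech, number); number is None for non-nouns.
# Each example pairs the words before the verb with the number the verb must agree with.
EXAMPLES = [
    # "the dog [barks]": no attractor
    ([("the", "DET", None), ("dog", "NOUN", "sg")], "sg"),
    # "the keys to the cabinet [are]": one attractor between subject and verb
    ([("the", "DET", None), ("keys", "NOUN", "pl"), ("to", "ADP", None),
      ("the", "DET", None), ("cabinet", "NOUN", "sg")], "pl"),
    # "the author of the books on the shelf [is]": two intervening nouns
    ([("the", "DET", None), ("author", "NOUN", "sg"), ("of", "ADP", None),
      ("the", "DET", None), ("books", "NOUN", "pl"), ("on", "ADP", None),
      ("the", "DET", None), ("shelf", "NOUN", "sg")], "sg"),
    # "in the gardens , the dog [barks]": the subject is not the first noun
    ([("in", "ADP", None), ("the", "DET", None), ("gardens", "NOUN", "pl"),
      (",", "PUNCT", None), ("the", "DET", None), ("dog", "NOUN", "sg")], "sg"),
]


def noun_numbers(prefix):
    # Grammatical number of every noun in the prefix, in order of appearance.
    return [number for _, pos, number in prefix if pos == "NOUN"]


def first_noun_heuristic(prefix):
    # Non-structural shortcut: agree with the first noun in the sentence.
    return noun_numbers(prefix)[0]


def most_recent_noun_heuristic(prefix):
    # Sequential-recency shortcut: agree with the noun closest to the verb.
    return noun_numbers(prefix)[-1]


def accuracy(heuristic, examples):
    return sum(heuristic(prefix) == gold for prefix, gold in examples) / len(examples)


print("first noun:", accuracy(first_noun_heuristic, EXAMPLES))
print("most recent noun:", accuracy(most_recent_noun_heuristic, EXAMPLES))
```

In this toy set the first-noun shortcut survives the attractor cases and only breaks when the subject is not the first noun, so a test set made up mostly of subject-initial sentences would not separate it from a genuinely structural model.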

Using deep learning to understand language

Mark Johnson argued in his workshop talk that although deep learning has revolutionized natural language processing, its main benefit is economic: end-to-end models replace complex pipelines of components, and target accuracy can often be reached faster and more easily. Deep learning has not changed our understanding of language. In this sense, the main contribution of deep learning is to show that a neural network (that is, a computational model) can perform certain natural language processing tasks, which also shows that these tasks are not indicators of intelligence. Although deep learning methods are good at pattern matching and perceptual tasks, they still struggle with tasks that rely on conscious reflection and thought.

Linguistic structure

In his talk, Jason Eisner questioned whether linguistic structures and categories really exist, or whether “scientists just like to sort data into piles”, given that approaches that ignore linguistic structure can also perform surprisingly well on machine learning tasks. He found that even a distinction such as the one between the phonemes “/b/” and “/p/” takes on meaning once it is further reinforced. Neural network models, in contrast, are like good sponges: they soak up whatever is not modeled explicitly.

He mentioned four common ways to introduce linguistic structure into a model: a) via a pipeline-based approach, where linguistic categories are introduced as features; b) via data augmentation, where the data is augmented with linguistic categories; c) via multi-task learning; d) via structured modeling, for example using a transition-based parser, a recurrent neural network grammar, or even interdependent classes such as BIO tags.
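As a small illustration of the last option, BIO notation encodes span structure directly in per-token labels, so that the classes depend on each other (an `I-` tag continues a span opened by a `B-` tag of the same type). The sketch below converts labeled spans into BIO tags. The function name `spans_to_bio`, the span format, and the example labels are assumptions made for this illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)  # O: the token is outside any span
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # B: first token of a span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # I: token inside the span, after the first
    return tags


tokens = ["Jason", "Eisner", "spoke", "in", "Melbourne"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```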

Emily Bender, in her workshop talk, questioned the whole idea of linguistics-free learning: even if you have a huge corpus in a language you know nothing about, without any prior knowledge (for example, what a function word is) you cannot learn the structure or meaning of sentences. She also pointed out that many machine learning papers describe their methods as similar to how infants learn, without citing any of the actual literature on developmental psychology or language acquisition. In fact, infants learn in a special environment: situated, shared with others, and emotionally engaged, which carries a great deal of signal and meaning.

Understanding the failure modes of LSTMs

Better understanding representations was also a theme at the Workshop on Representation Learning for NLP (RepL4NLP). In his talk at the workshop, Yoav Goldberg described in detail his group's efforts to better understand the representations of RNNs. In particular, he discussed recent work on extracting a finite state automaton from an RNN in order to better understand what the model has learned (http://arxiv.org/abs/1711.09576). He also reminded the audience that LSTM representations, even when trained on one particular task, are not specific to that task. They are often predictive of properties of the data distribution that go beyond what people expect. Even when a model is trained with an adversarial loss to make its representations invariant to some property, the representations remain somewhat predictive of that property. Completely removing unwanted information from encoded language data is therefore a challenge, and even a seemingly perfect LSTM model may have hidden failure modes.

On the topic of the failure modes of LSTMs, Mark Steedman, who received this year's ACL Lifetime Achievement Award, made a remark that fits this theme well: “LSTMs work in practice, but can they work in theory?”

Evaluation in more challenging settings

Adversarial examples

A topic closely related to better understanding the limits of the current best models is how to improve them. Similar to the adversarial example paper mentioned above, several papers tried to make models more robust to adversarial examples:

Cheng et al. (https://) propose making both the encoder and the decoder of neural machine translation models more robust to input perturbations.

Ebrahimi et al. (http://) propose white-box adversarial examples that fool a character-level neural classifier by swapping a small number of tokens.

Ribeiro et al. (http://) improve on the previous method. They introduce perturbations that preserve semantics but change the model's predictions, and then generalize them into rules that induce adversarial examples on many instances.

Bose et al. (https://arxiv.org/abs/1805.03642) combine adversarial examples with noise contrastive estimation by means of an adversarially learned sampler. The sampler finds harder negative examples, which lets the model learn better representations.

Learning robust and fair representations

At the RepL4NLP workshop, Tim Baldwin discussed different ways to make models more robust under domain shift. The slides are available on Google Drive. For the single source domain case, he discussed a method that linguistically perturbs training instances based on different types of syntactic and semantic noise. For the setting with multiple source domains, he proposed training an adversarial model on the source domains. Finally, he discussed a method for learning robust and privacy-preserving representations.

Margaret Mitchell focused on fair and privacy-preserving representations. She particularly emphasized the difference between a descriptive and a normative view of the world. The representations learned by machine learning models reflect a descriptive view of the training data, and the training data represents “the world as people talk about it”. Research on fairness, in contrast, tries to create representations that reflect a normative view of the world, that is, to take our values and instill them in the representations.

Improving evaluation methodology

Besides making models more robust, several papers try to improve the way models are evaluated:

Finegan-Dollak et al. (http://arxiv.org/abs/1806.09029) identify limitations in the evaluation of existing text-to-SQL systems and propose improvements. They argue that the existing train-test splits and the practice of variable anonymization are flawed, and they release standardized, improved versions of seven datasets to fix these defects.

Dror et al. (https://) focus on a practice that should be commonplace but is rarely done, or rarely done well: statistical significance testing. In particular, they survey the empirical papers published at ACL and in TACL in 2017 and find that statistical significance tests are often ignored or misused. They therefore propose a simple protocol for selecting a statistical significance test for natural language processing tasks.

Chaganty et al. (http://) investigate the bias of automatic metrics such as BLEU and ROUGE and find that even an unbiased estimator achieves only a comparatively small reduction in error. This work highlights the need both to improve how well automatic metrics correlate with human judgments and to reduce the variance of human annotation.
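As an illustration of what a significance test for two systems evaluated on the same test set can look like, here is a minimal sketch of a paired approximate randomization (permutation) test. It is one commonly used non-parametric test and is not claimed to be the protocol proposed in the paper above. The function name, the number of trials, and the toy per-example scores are assumptions.

```python
import random


def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the difference between paired per-example scores.

    Under the null hypothesis the two systems are interchangeable, so the sign
    of each per-example difference can be flipped at random.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    for _ in range(trials):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Hypothetical per-example correctness (1 = correct) of two systems on 20 test items.
system_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
system_b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1]
print("p-value:", paired_permutation_test(system_a, system_b))
```

The test only needs per-example scores for both systems on the same items, which is why it is a common choice when the metric's sampling distribution is unknown.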

Strong baselines

Another way to improve model evaluation is to compare new models against stronger baselines, in order to make sure that improvements are actually significant. Here are some papers that focus on this line of research:

Shen et al. (https://) systematically compare simple word embedding-based methods that use pooling with more complex models such as LSTMs and CNNs. They find that on most datasets the word embedding-based methods show comparable or even better performance.

Ethayarajh proposes a strong baseline for sentence embedding models at the RepL4NLP workshop.

Meanwhile, Ruder and Plank find that tri-training makes for a strong baseline for semi-supervised learning, and that its results are even better than those of current state-of-the-art methods.
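To show how little machinery a pooling-based word embedding baseline of the kind compared above needs, here is a minimal sketch of average pooling, max pooling, and their concatenation over a sentence's word vectors. The function names and the two-dimensional toy vectors are assumptions for illustration. A real baseline would use pretrained embeddings and feed the pooled vector to a classifier.

```python
def average_pool(vectors):
    # Element-wise mean over all word vectors in the sentence.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def max_pool(vectors):
    # Element-wise maximum: keeps the strongest value in every dimension.
    return [max(column) for column in zip(*vectors)]


def concat_pool(vectors):
    # Concatenate both views so a classifier can use either signal.
    return average_pool(vectors) + max_pool(vectors)


# Toy 2-dimensional "embeddings" for a two-word sentence.
sentence_vectors = [[1.0, -2.0], [3.0, 4.0]]
print(average_pool(sentence_vectors))
print(max_pool(sentence_vectors))
print(concat_pool(sentence_vectors))
```

Both pooling operations ignore word order and have no parameters of their own, which is what makes them a cheap reference point for order-aware models such as LSTMs and CNNs.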

In the paper above, we also emphasize the importance of evaluating in more challenging settings, such as on different tasks. Had we only looked at a single task or only at in-domain data, our results would have been different. We need to test models under such adverse conditions to better understand their robustness and how well they generalize to practical problems.

Creating more challenging datasets

To evaluate in such settings, more challenging datasets need to be created. In the RepL4NLP panel discussion (summary), Yejin Choi argued that the community pays too much attention to tasks that are too easy or already solved, such as SQuAD or bAbI. Yoav Goldberg even said that “SQuAD is the MNIST of NLP” (MNIST being one of the most basic datasets in image recognition). Instead, we should focus on more challenging tasks and develop harder datasets. However, if a dataset is too hard, people do not work on it. In fact, the community should not spend too long on any one dataset, because datasets are now being solved very quickly. Creating new and more challenging datasets is therefore all the more important. At this ACL, researchers presented two reading comprehension datasets that try to go beyond SQuAD:

QAngaroo (http://arxiv.org/abs/1710.06481) focuses on reading comprehension that requires gathering several pieces of information through multiple inference steps.

NarrativeQA (http://arxiv.org/abs/1712.07040) requires the reader to understand the underlying narrative by reading entire books or movie scripts in order to answer questions about the story.

Richard Socher stressed the importance of training and evaluating models across multiple tasks in his talk at the Machine Reading for Question Answering workshop (summary). In particular, he pointed out that natural language processing requires many different types of reasoning, such as logical, linguistic, and emotional reasoning, and that a single task clearly cannot cover all of them.

Evaluation on multiple and low-resource languages

Another important issue is evaluating models on multiple languages. Emily Bender surveyed 50 NAACL 2018 papers and found that 42 of them evaluate on an unnamed mystery language (which is, of course, English). She emphasized that it is important to name the language each piece of work deals with, because different languages have different linguistic structures. Not mentioning the language blurs the conclusions of the research.

If we design our natural language processing methods to be cross-lingual, we should additionally evaluate them on low-resource languages. For example, the following two papers both point out that existing methods for unsupervised bilingual dictionary induction fail when the target language is dissimilar to the source language, as with Estonian or Finnish:

Søgaard et al. (https://) further explore the limitations of existing methods and point out that these methods also fail when the embeddings are trained on different domains or with different algorithms. They finally propose a metric to quantify the potential of these methods.

Artetxe et al. (https://arxiv.org/abs/1805.06297) propose a new unsupervised self-training method that uses a better initialization to steer the optimization process and that is more robust across different language pairs.

In addition, several other papers evaluate their methods on low-resource languages:

Riley and Gildea (https://) propose using orthographic features for bilingual dictionary induction. Although these features mainly help for related languages, the authors also evaluate on a dissimilar language pair, English-Finnish.

Ren et al. (http://) propose using another, resource-rich language to assist translation into a resource-poor language. They find that their model significantly improves translation quality for rare languages.

Currey and Heafield propose an unsupervised tree-to-sequence model for neural machine translation based on the Gumbel Tree-LSTM. Their results show that the model is particularly useful for low-resource languages.

Progress in NLP research

Another theme of the conference was the notable progress the field of natural language processing has made. ACL president Marti Hearst touched on this in her presidential address. She used to illustrate what our models can and cannot do using Stanley Kubrick's HAL 9000 (see the figure below). In recent years this exercise has become somewhat dull, because our models have learned to perform tasks that seemed out of reach a decade or more ago. We are certainly still very far from tasks that require deep language understanding and reasoning, but the progress made in natural language processing is nevertheless remarkable.

HAL 9000. (Source: CC BY 3.0, Wikimedia)

Marti also quoted Karen Spärck Jones, a pioneer of natural language processing (NLP) and information retrieval (IR), saying that research does not go around in circles but climbs a spiral staircase or, perhaps more fittingly, several staircases that are not necessarily connected but all lead in the same direction. She also expressed a view that resonates with many people: in the 1980s and 1990s there were only a few papers to read, and it was much easier to keep up with the latest research. To make it easier to keep up with the state of the art, I have recently created a new document that collects the latest results for different natural language processing tasks.

With the field of natural language processing growing, she encouraged people to get involved in the ACL and contribute their part. She also presented the ACL Distinguished Service Award to the most dedicated ACL members. In addition, ACL 2018 saw the launch of AACL, the Asia-Pacific Chapter of the Association for Computational Linguistics, which is the ACL's third chapter after EACL and NAACL (2000).

The business meeting at this ACL focused on how to cope with the challenges that come with the growth of the field: the number of submitted papers keeps increasing, so more reviewers are needed. We can expect to see new efforts to handle the large number of submissions at next year's conferences.

Reinforcement learning

Let us look back to 2016, when people were looking for a place for reinforcement learning in natural language processing and applying it to more and more tasks. More recently, supervised learning seems to be better suited to most tasks, while the dynamic nature of reinforcement learning makes it most useful for tasks with an inherent temporal dependency, such as selecting data during training and modeling dialogue. Another important application of reinforcement learning is to directly optimize an end metric such as ROUGE or BLEU, instead of optimizing a surrogate loss such as cross-entropy. Text summarization and machine translation are successful applications in this area.

Inverse reinforcement learning is valuable in settings that are too complex for a reward to be specified. Visual storytelling is a successful application in this area. Deep reinforcement learning is particularly well suited to sequential decision-making problems in natural language processing, such as playing text-based games, navigating web pages, and completing the corresponding tasks. The tutorial “Deep Reinforcement Learning for Natural Language Processing” provides a comprehensive overview of this area.


There were other great tutorials as well. I particularly liked the tutorial on variational inference and deep generative models. The tutorial on semantic parsing and “100 things you always wanted to know about semantics and pragmatics” (/ebender/100things-sem_prag.html) are also well worth a look. See the following link for the full list of tutorials: https://

Compiled by Leifeng Network AI Technology Review

Author: ArticleManager