Facebook open-sources Blender, a chatbot that people say ‘feels more human’




Facebook AI Research (FAIR), Facebook’s AI and machine learning division, today detailed its work on Blender, a comprehensive AI chatbot framework. FAIR claims that Blender, which is available in open source on GitHub, is the largest-ever open-domain chatbot and outperforms existing approaches to generating dialogue while “feel[ing] more human,” according to human evaluators.

FAIR says that Blender is the culmination of years of research to combine empathy, knowledge, and personality in a single system. To this end, the underlying models, which benefit from improved decoding and skill-blending techniques, contain up to 9.4 billion parameters (configuration variables that define skill at a given problem), or 3.6 times more than previous systems.

Blender promises to make interactions with conversational AI systems like Alexa, Siri, and Cortana more natural than before, whether in enterprise, industrial, or consumer-facing contexts. That’s because it can ask and answer a wide range of questions; display knowledge on specific topics; and express feelings such as empathy, seriousness, or playfulness as circumstances dictate.

Blending skills and generation strategies

To achieve Blender’s state-of-the-art performance, the FAIR researchers focused on two engineering steps: blending skills and generation strategy.


“Blending skills” refers to selecting tasks that outperform larger models lacking fine-tuning. As the FAIR researchers point out in a paper, chatbot improvements can be achieved by fine-tuning models on data that emphasizes desirable conversational skills. Fine-tuning can also minimize undesirable traits learned from large datasets, such as toxicity.

Regarding generation strategy, the choice of decoding algorithm (the algorithm used to generate text from a language model) has an outsize impact on a chatbot’s responses. Because the length of a bot’s responses tends to correspond with human judgments of quality, it is desirable for decoders to strike an appropriate balance. Responses that are too short are typically perceived as dull or showing a lack of interest, while responses that are too long imply waffling or going off-topic.

Facebook Chatbot Blender

Above: A conversation with a Blender chatbot. Blender’s answers are in blue.

Image credit: Facebook

Throughout these engineering steps, the researchers tested three types of model architectures, all of which used Transformers as a base. Transformers, a Google innovation, contain neurons (mathematical functions) arranged in layers that transmit signals from input data and adjust the strength (weights) of each connection, as with all deep neural networks. That’s how they extract features and learn to make predictions, but Transformers also have attention, which means every output element is connected to every input element and the weightings between them are calculated dynamically.
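
The dynamic weighting at the heart of attention can be sketched in a few lines. The snippet below is illustrative only, assuming single-head scaled dot-product attention over hand-written vectors; real Transformers add learned projection matrices and many attention heads.

```python
# Minimal sketch of attention: every output element attends to every input
# element, with weights computed dynamically from query/key similarity.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # one dynamic weight per input element
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# A query aligned with the first key attends mostly to the first value.
out = attention(queries=[[1.0, 0.0]],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
```

Because the softmax weights sum to one, the output is always a convex mixture of the value vectors, tilted toward whichever inputs best match the query.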

The first was a retriever model that, given a dialogue history (or context) as input, selected the next dialogue response by scoring a large set of candidate responses and outputting the highest-scoring one. The FAIR researchers employed a poly-encoder architecture that encoded features of the context using representations attended to by each candidate response, which they said yielded improved performance while remaining “tractable” to compute, compared with other architectures such as cross-encoders.
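
The retrieve-by-scoring idea can be illustrated with a toy example. The bag-of-words “encoder” and cosine scorer below are stand-in assumptions; the actual poly-encoder uses learned Transformer encoders and attention over multiple context codes.

```python
# Toy retrieval: encode context and candidates, score every pair,
# return the highest-scoring candidate response.
from collections import Counter
import math

def encode(text):
    # stand-in encoder: a sparse bag-of-words vector
    return Counter(text.lower().split())

def score(context_vec, cand_vec):
    # cosine similarity between the two sparse vectors
    dot = sum(context_vec[w] * cand_vec[w] for w in context_vec)
    na = math.sqrt(sum(v * v for v in context_vec.values()))
    nb = math.sqrt(sum(v * v for v in cand_vec.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(context, candidates):
    cvec = encode(context)
    return max(candidates, key=lambda c: score(cvec, encode(c)))

best = retrieve("do you have a pet dog",
                ["yes i have a pet dog and i love him",
                 "the weather is nice today"])
```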

The second model was a generator that produced responses rather than retrieving them from a fixed set. Three model sizes were considered, ranging from 90 million parameters to 2.7 billion parameters to 9.4 billion parameters.

The third model attempted to address issues with the generator, namely its tendency to synthesize repetitive responses and to “hallucinate” knowledge. It took a “retrieve and refine” (RetNRef) approach, in which the retrieval model described above produced a response given the dialogue history, which was then appended to the generator’s input sequence. In this way, the generator learned when to copy elements of the retriever’s responses and when not to, enabling it to output more interesting and “vibrant” responses. (Retriever models return human-written responses, which tend to include more vibrant language than standard generative models produce.)
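
The mechanical part of RetNRef, appending the retrieved response to the generator’s input, is simple to sketch. The separator token name below is an illustrative assumption, not Blender’s actual vocabulary item.

```python
# Sketch of "retrieve and refine" input construction: the retriever's
# response is appended to the dialogue history behind a separator marker,
# and the combined string becomes the generator's input sequence.
def build_retnref_input(dialogue_history, retrieved_response,
                        sep="<retrieved>"):
    """Concatenate history turns plus the retrieved response."""
    history = " ".join(dialogue_history)
    return f"{history} {sep} {retrieved_response}"

inp = build_retnref_input(
    ["hi, do you like music?", "yes, mostly jazz."],
    "i saw a great jazz trio live last year.",
)
```

During training, the generator sees the retrieved text after the separator and learns when copying from it helps and when it should be ignored.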

Facebook Chatbot Blender

The FAIR team combined a wizard generative model with another retriever that together determined when to incorporate knowledge into chatbot responses. The two models produce a set of initial knowledge candidates and then rank those candidates, after which they select a single sentence and use it to condition response generation. A classifier chooses on a per-dialogue basis whether or not to perform retrieval, to avoid serving up knowledge when it isn’t necessary.
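
The gating step can be sketched as a simple decision function. The threshold and the idea of reducing the classifier to a single score are illustrative assumptions; the actual system uses a learned classifier over the dialogue.

```python
# Sketch of per-dialogue knowledge gating: a classifier score decides
# whether the top-ranked knowledge sentence should condition generation.
def select_knowledge(ranked_candidates, gate_score, threshold=0.5):
    """Return the top knowledge sentence, or None when the gate says skip."""
    if gate_score < threshold or not ranked_candidates:
        return None  # chit-chat turn: generate without external knowledge
    return ranked_candidates[0]

used = select_knowledge(["Bach was a Baroque composer."], gate_score=0.9)
skipped = select_knowledge(["Bach was a Baroque composer."], gate_score=0.1)
```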

Decoding

For the generative models, the FAIR researchers used a beam search decoder to generate responses to given dialogue contexts. Beam search maintains a set of partially decoded sequences, called hypotheses, that are appended to form new sequences and then scored so that the best sequences bubble to the top.
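
The procedure can be demonstrated over a toy character-level model. The hand-written probability table below is purely an assumption for illustration; in Blender the scores come from the Transformer generator.

```python
# Minimal beam search: keep `beam_width` partial hypotheses, extend each
# with every possible next token, rescore, and keep the best.
import math

# Toy next-token log-probabilities, keyed by the sequence decoded so far.
MODEL = {
    "":    {"h": math.log(0.6), "w": math.log(0.4)},
    "h":   {"i": math.log(0.9), "o": math.log(0.1)},
    "w":   {"o": math.log(1.0)},
    "hi":  {"<eos>": math.log(1.0)},
    "ho":  {"<eos>": math.log(1.0)},
    "wo":  {"w": math.log(1.0)},
    "wow": {"<eos>": math.log(1.0)},
}

def beam_search(beam_width=2, max_len=4):
    beams = [("", 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in MODEL.get(seq, {}).items():
                if tok == "<eos>":
                    finished.append((seq, score + logp))
                else:
                    candidates.append((seq + tok, score + logp))
        # keep only the `beam_width` highest-scoring partial sequences
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if not beams:
            break
    return max(finished, key=lambda b: b[1])[0]

best = beam_search()
```

With a beam width of 2, the decoder explores both the “h…” and “w…” branches before settling on the completed hypothesis with the highest total score.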

To control the length of the chatbot’s responses, the FAIR team considered two approaches: a hard constraint on the minimum generation length, and a classifier that predicted response length and set the minimum generation length constraint to its corresponding prediction. The latter was more complex but resulted in responses whose length varied with the question, ensuring that the chatbot answered at length when that seemed appropriate.
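
The hard minimum-length constraint amounts to forbidding the end-of-sequence token until the response is long enough, which can be sketched as a score mask. The token names here are illustrative.

```python
# Sketch of a hard minimum-length constraint: mask out the end-of-sequence
# token's score until the generated response reaches the minimum length.
import math

def apply_min_length(logprobs, generated_len, min_len, eos="<eos>"):
    """Return a copy of the scores with EOS forbidden while too short."""
    if generated_len < min_len:
        logprobs = dict(logprobs)
        logprobs[eos] = -math.inf  # decoder can never pick EOS yet
    return logprobs

step_scores = {"<eos>": math.log(0.7), "fun": math.log(0.3)}
masked = apply_min_length(step_scores, generated_len=2, min_len=5)
chosen = max(masked, key=masked.get)
```

The classifier-based variant would simply compute `min_len` per response from the predicted length rather than fixing it globally.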

Training the models

To prepare the various models that make up Blender, the researchers first performed pre-training, a step that conditions machine learning models for particular tasks. They used Facebook’s own Fairseq, a toolkit that supports custom language model training, with data samples drawn from a Reddit corpus containing 1.5 billion comments (with two sets of 360,000 comments each reserved for validation and testing), pruned of known bots, non-English subreddits, deleted comments, comments containing a URL, and comments outside a certain length range.
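
The pruning rules above can be expressed as a simple filter. The bot list and length bounds below are illustrative assumptions, since the article doesn’t give the exact thresholds.

```python
# Sketch of the corpus-pruning rules: drop bot authors, non-English
# subreddits, deleted comments, comments with URLs, and outlier lengths.
KNOWN_BOTS = {"AutoModerator"}  # illustrative placeholder entry

def keep_comment(author, body, subreddit_is_english,
                 min_chars=5, max_chars=500):
    if author in KNOWN_BOTS:
        return False
    if not subreddit_is_english:
        return False
    if body in ("[deleted]", "[removed]"):
        return False
    if "http://" in body or "https://" in body:
        return False
    if not (min_chars <= len(body) <= max_chars):
        return False
    return True

ok = keep_comment("alice", "I really enjoyed this album.", True)
bot = keep_comment("AutoModerator", "This thread is now locked.", True)
```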

Facebook Chatbot Blender

The FAIR team then fine-tuned the models using ParlAI, another Facebook-developed suite designed for training and testing dialogue models. One training corpus selected was ConvAI2, which contains 140,000 utterances involving paired volunteers getting to know each other by asking and answering friendly questions. Another was Empathetic Dialogues, which consists of 50,000 crowdsourced utterances grounded in an emotional situation. Yet another dataset, Wizard of Wikipedia, comprises 194,000 utterances across 1,250 topics, where each conversation begins with a randomly chosen topic and the goal is to display expert knowledge.

A fourth fine-tuning dataset, Blended Skill Talk, aimed to blend the previous three sets (ConvAI2, Empathetic Dialogues, and Wizard of Wikipedia) so that their respective skills would combine during dialogue. Here, 76,000 utterances were collected from a guided and an unguided human speaker, where the guided speaker could select utterances suggested by bots trained on the three individual datasets.

Evaluations

After training, the researchers evaluated Blender’s performance by comparing it against Google’s latest Meena chatbot, a machine learning model with 2.6 billion parameters. Human volunteers were tasked with answering two questions: “Who would you rather talk to for a long conversation?” and “Which speaker sounds more human?” They were given 100 randomly selected logs from Meena’s publicly released conversations and the same number of logs generated by Blender. In each case, the volunteers were shown series of dialogues between humans and the respective chatbots.

The topics of conversation ranged from cooking, music, movies, and pets to yoga, veganism, instruments, and malls; the Blender models often went into detail when asked and named relevant stores, bands, movies, actors, pet species, and pet names. In one example, Blender offered a nuanced answer to a question about how Bach compared with Justin Bieber, while a request that Blender write a song indeed yielded lyrics, albeit nothing particularly poetic.

Facebook Chatbot Blender

When presented with chats showing Meena in action and chats showing Blender in action, 67% of the evaluators said that Blender’s best-performing chatbot (a generative model containing 9.4 billion parameters fine-tuned on the Blended Skill Talk corpus) sounded more human. About 75% said they would rather have a long conversation with the fine-tuned 2.7-billion-parameter model than with Meena. And in an A/B comparison between human-to-human and human-to-Blender conversations, the volunteers expressed a preference for the models fine-tuned on Blended Skill Talk 49% of the time, while models trained only on public-domain conversations were preferred just 36% of the time.

Problematically, other experiments showed that Blender sometimes produced responses in the style of offensive samples from the training corpus, mostly from Reddit comments. The FAIR researchers say that fine-tuning on the Blended Skill Talk dataset mitigated this to an extent, but addressing it comprehensively would require using an unsafe-word filter and a kind of safety classifier.
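
The two-stage safety idea, a fast blocklist check backed by a learned classifier, can be sketched as below. The word list entries and the stubbed classifier are illustrative assumptions, not Blender’s actual components.

```python
# Sketch of a two-stage safety check: blocklist first, then a (stubbed)
# safety classifier for anything the word list misses.
UNSAFE_WORDS = {"slur1", "slur2"}  # placeholder entries, not a real list

def classifier_flags(text):
    """Stand-in for a learned safety classifier; always passes here."""
    return False

def is_safe(response):
    tokens = {t.strip(".,!?").lower() for t in response.split()}
    if tokens & UNSAFE_WORDS:
        return False  # blocked by the unsafe-word filter
    return not classifier_flags(response)

clean = is_safe("I love talking about music!")
blocked = is_safe("that was a slur1 thing to say")
```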

Facebook Chatbot Blender

Above: Here, Blender repeats and contradicts itself, forgets, and hallucinates knowledge.

Image credit: Facebook

Of course, the FAIR researchers do not claim to have solved the problem of open domain conversation. In fact, they describe several of Blender’s main limitations:

  1. Vocabulary usage: Even the best Blender models tend to generate common phrases too often, like “do you like,” “lots of fun,” and “have any hobbies.”
  2. Nontrivial repetition: The models often repeat what is said to them. For instance, they’ll say they had a pet dog if a conversation partner mentions a pet dog, or that they like the same bands as the person they’re speaking with.
  3. Contradiction and forgetfulness: Blender models contradict themselves, albeit to a lesser degree in the larger models. They also fail to make the logical link that they shouldn’t ask questions they asked before (to avoid the appearance of “forgetting”).
  4. Knowledge and factual correctness: It’s relatively easy to goad Blender models into making factual errors, particularly when exploring a topic deeply.
  5. Conversation length and memory: Blender conversations would likely be dull and repetitive over the course of several days or weeks of conversation, the FAIR researchers say, especially considering Blender can’t remember earlier conversations.
  6. Deeper understanding: The Blender models lack the ability to learn concepts through further conversation, and they have no way of grounding themselves in real-world entities, actions, and experiences.

Addressing all of this will likely require new model architectures, which the FAIR team says it is exploring. It is also focused on building stronger classifiers to filter out harmful language in dialogues, as well as techniques to reduce gender bias in chatbots generally.

“We are excited about the progress we have made in improving open-domain chatbots,” Facebook wrote in a blog post. “However, building a truly intelligent dialogue agent that can chat like a human remains one of the greatest open challenges in AI today … Real progress in the field depends on reproducibility: the opportunity to build on the best possible technology. We believe that releasing models is essential to allow a complete and reliable view of their capabilities.”

Pretrained and fine-tuned Blender models with 90 million parameters, 2.7 billion parameters, and 9.4 billion parameters are available on GitHub, along with a script for interacting with the bot (with the safety filter built in). All of the code for model evaluation and fine-tuning, including the datasets themselves, is available in ParlAI.
