Dynamic Fusion Networks for Machine Reading Comprehension

Yuan Sun1,2*, Sisi Liu1,2, Chaofan Chen1,2, Zhengcuo Dan1,2, Xiaobing Zhao1,2
1School of Information Engineering, Minzu University of China, Beijing, China.
2Minority Languages Branch, National Language Resource and Monitoring Research Center, Beijing, China.

Abstract

Machine reading comprehension (MRC), which teaches machines to comprehend a passage and answer the corresponding questions, has attracted much attention in recent years. However, most models are designed for English or Chinese MRC tasks. Because of the lack of MRC datasets, it is hard to achieve high performance on low-resource language MRC tasks such as Tibetan. To solve this problem, this paper constructs a span-style Tibetan MRC dataset named TibetanQA and proposes a hierarchical attention network model for the Tibetan MRC task, which includes word-level attention and re-read attention. Experiments demonstrate the effectiveness of our model.

Share and Cite:

Sun, Y., Liu, S., Chen, C., Dan, Z. and Zhao, X. (2021) Teaching Machines to Read and Comprehend Tibetan Text. Journal of Computer and Communications, 9, 143-152. doi: 10.4236/jcc.2021.99011.

1. Introduction

Machine reading comprehension (MRC) aims to teach machines to read and understand human language text. The MRC task asks the machine to read a text such as an article or a story and then answer questions related to it. The questions can be designed to query the aspects that humans care about. Based on the answer form, MRC is commonly categorized into four tasks: cloze tests, multiple choice, span extraction, and free-form answering. In recent years, many Chinese and English machine reading comprehension datasets have emerged, such as SQuAD [1], MCTest [2], MS-MARCO [3], and the DuReader dataset [4]. Following these datasets, many models have been proposed, such as S-Net [5], AS Reader [6], and IA Reader [7], and they have achieved great performance. However, machine reading comprehension for low-resource languages such as Tibetan is rarely studied. The main reasons are as follows: 1) Lacking large-scale open Tibetan MRC datasets, the relevant experiments cannot be carried out; this is also the primary factor that hinders the development of Tibetan MRC. 2) Compared with English MRC, word segmentation tools for Tibetan are underdeveloped; wrong word segmentation results lead to semantic ambiguity, which is propagated to downstream tasks. 3) For low-resource MRC tasks, it is hard to achieve good performance on small-scale datasets. Therefore, the MRC model needs to strengthen its understanding ability.

To address these problems, this paper proposes an end-to-end model for Tibetan MRC. In order to reduce the error propagation caused by word segmentation, the model incorporates syllable-level information. In addition, to enhance the model's understanding ability, we adopt a hierarchical attention structure. In summary, our contributions are as follows:

• In order to solve the problem of lacking a Tibetan MRC corpus, we construct a high-quality Tibetan MRC dataset named TibetanQA (the Tibetan Question Answering dataset), which covers multi-domain knowledge and is constructed by crowdsourcing.

• To mitigate segmentation errors, we combine syllable and word embeddings, so that the model can learn the more complex information in Tibetan.

• To reduce the impact of long passage text that is irrelevant to the question, this paper uses a word-level attention mechanism to focus on the key words of the answer. To enhance the understanding ability of the model, this paper adopts a hierarchical attention network, which includes word-level attention and re-read attention, to provide clues for answering the question.

2. Related Work

Machine reading comprehension is an important step in natural language processing, moving from perceiving text to understanding text. In the early days, lacking large-scale datasets, most MRC systems were rule-based or statistical models. Later, researchers began to focus on MRC dataset construction. They treat machine reading comprehension as a supervised learning problem and use manual annotation to construct question-answer pairs. Hermann et al. propose a cloze-style English machine reading comprehension dataset, CNN & Daily Mail [8]. Hill et al. release the Children's Book Test dataset [9]; this dataset requires only shallow semantic understanding and does not involve deep reasoning. To address this, Lai et al. publish the RACE dataset in 2017 [10], which pays more attention to reasoning ability. For span-extraction MRC, Rajpurkar et al. collect a large-scale, high-quality dataset named the Stanford Question Answering Dataset (SQuAD).

Following these large-scale datasets, much important research based on deep learning methods has emerged for MRC. The Match-LSTM model is proposed by Wang et al. [11]. They adopt Long Short-Term Memory (LSTM) [12] to encode the question and passage respectively, and then introduce an attention-based weighted representation of the question in the LSTM unit. Subsequently, to capture long-term dependencies between words within a passage, a team at Microsoft proposed the R-Net model [13]. Cui et al. propose the Attention-over-Attention Reader model [14].

Different from the previous work, Seo et al. propose the BiDAF model [15], which adopts attention in two directions. Xiong et al. propose the DCN model [16], which uses an interactive attention mechanism to capture the interaction between a question and a paragraph.

The above models, based on single-layer attention, suffer from weak semantic interaction between questions and paragraphs due to the small number of attention layers and shallow network depth. To solve this problem, a series of recent works have enhanced the model by stacking several attention layers [17]. Huang et al. propose FusionNet [18]; the model uses a fully-aware multilayer attention architecture to obtain the complete information in the question and integrate it into the paragraph representation. Wang et al. [19] propose a multi-granularity hierarchical attention fusion network that calculates the attention distribution at different granularities and then performs hierarchical semantic fusion. Their experiments show that multiple layers of attention interaction can achieve better performance. Tan et al. [20] propose an extraction-generation model. They employ an RNN and an attention mechanism to construct question and context representations, and then use seq2seq to generate answers based on key information.

3. Dataset Construction

Considering the lack of Tibetan machine reading comprehension datasets, this paper constructs a span-style Tibetan machine reading comprehension dataset named TibetanQA. The process is mainly divided into three stages: passage collection, question collection, and answer verification.

3.1. Passage Collection

We obtain a large amount of text from the Yunzang website. In order to improve the quality of TibetanQA, the articles cover a wide range of topics, including nature, culture, education, geography, history, life, society, art, people, science, sports, and technology. In addition, we delete noisy information in the articles, such as images, tables, and website links, and discard articles shorter than 100 characters. Finally, 763 articles are selected for the dataset.

3.2. Question Construction

In order to collect questions effectively, we develop a QA collection web application, and students whose native language is Tibetan are invited to use it. For each passage in an article, they first need to select a segment of text, or span, in the article as the answer, and then write the question in their own words into the input field. Students are asked to pose and answer up to 5 questions on the contents of one article, and the answer must be part of the paragraph. When they finish an article, the system automatically assigns the next article to them. To construct a more challenging corpus, we conduct short-term training to guide them in providing effective and challenging questions. For each student, we first teach them how to ask and answer questions, and then use a small amount of data to test them; only students with an accuracy rate of 90% can do the subsequent work. We do not impose restrictions on the form of questions and encourage annotators to ask questions in their own words.

3.3. Answer Verification

In order to further improve the quality of the dataset, we invite another group of Tibetan students to check the dataset after obtaining the initial version. They select the valid QA pairs, discard incomplete answers or questions, and remove questions with incorrect grammar. In the end, we construct 10,881 question-answer pairs. To better train our model, we organize TibetanQA in JSON format and add a unique ID to each question-answer pair (see Table 1). Finally, these question-answer pairs are partitioned at random into a training set, a development set, and a test set (see Table 2).
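To make the storage format concrete, the snippet below sketches what one TibetanQA record might look like. The field names (context, qas, answer_start, etc.) are our assumption, modeled on the SQuAD layout; the paper only states that the data is stored in JSON and that each question-answer pair carries a unique ID.

import json

# A hypothetical TibetanQA record; field names follow the SQuAD convention
# and are assumptions, not the authors' published schema.
example = {
    "title": "<article title in Tibetan>",
    "context": "<passage text in Tibetan>",
    "qas": [
        {
            "id": "tibetanqa-000001",             # unique ID per QA pair
            "question": "<question written by the annotator>",
            "answer": {
                "text": "<answer span copied from the passage>",
                "answer_start": 42                # character offset, illustrative
            }
        }
    ]
}

print(json.dumps(example, ensure_ascii=False, indent=2))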

4. Model Details

In this section, we introduce our model in detail.

4.1. Data Preprocessing

Different from English, Tibetan is a phonetic (Pinyin-like) language. The smallest unit of a word is the syllable, and some syllables can indicate a meaningful "case". The "case" in Tibetan refers to a type of function syllable that distinguishes between words and explains the role of a word in a phrase or sentence.

Table 1. An example from the Tibetan MRC corpus.

Table 2. Tibetan MRC dataset statistics.

In fact, many syllables in Tibetan can provide key information for the MRC task, just as the "case" does. Therefore, it is necessary to embed syllable information in the encoding layer. On the other hand, embedding syllables can reduce the semantic ambiguity caused by wrong word segmentation. Based on the above considerations, this paper combines syllable and word information. Next, we introduce the word-level and syllable-level Tibetan text preprocessing used in our experiments.

• Syllable-level preprocessing: It is easy to split syllables, because there is a delimiter between them. With the help of this delimiter (the tsheg mark "་"), we can separate the syllables.

• Word-level preprocessing: Each word is composed of several syllables, which makes it harder to split words in sentences. For word-level segmentation, we use a Tibetan word segmentation tool [21].

Finally, the specific format is as shown in Table 3.
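As a rough illustration of the two preprocessing levels, the sketch below splits syllables on the tsheg delimiter and leaves word segmentation to an external tool. The helper name segment_words is hypothetical; the actual segmenter is the tool cited as [21].

# Tibetan syllables are separated by the tsheg mark (U+0F0B).
TSHEG = "\u0F0B"

def split_syllables(text: str) -> list:
    """Syllable-level preprocessing: split the text on the tsheg delimiter."""
    return [syllable for syllable in text.split(TSHEG) if syllable]

def segment_words(text: str) -> list:
    """Word-level preprocessing: placeholder for the external Tibetan segmenter [21]."""
    raise NotImplementedError("call the Tibetan word segmentation tool here")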

4.2. Input Embedding Layer

With strong grammatical rules, Tibetan is made up of syllables, and the syllable is the smallest unit of Tibetan. It is noteworthy that some syllables carry information such as reference, subordination, gender, etc. This information helps to predict the correct answer. Therefore, at the input encoding layer, we also embed the syllables into the word representation, which can extract more information for the network.

Suppose there is a question $Q = \{q_1, q_2, q_3, \ldots, q_n\}$ and a passage $P = \{p_1, p_2, p_3, \ldots, p_m\}$. We turn them into syllable-level embeddings ($\{s_1^q, s_2^q, s_3^q, \ldots, s_n^q\}$ and $\{s_1^p, s_2^p, s_3^p, \ldots, s_m^p\}$) and word-level embeddings ($\{w_1^q, w_2^q, w_3^q, \ldots, w_n^q\}$ and $\{w_1^p, w_2^p, w_3^p, \ldots, w_m^p\}$), respectively. We use a pre-trained model to encode the question and the passage: each word token is encoded into a 100-dimensional vector with fastText through a lookup table. For the syllable-level encoding, we use a bi-directional long short-term memory network (BiLSTM) and take its final state as the syllable-level token. Finally, we fuse the two vectors of different levels through a two-layer highway network, and the question and passage are finally encoded as $\{M_t^q\}_{t=1}^{n}$ and $\{M_t^p\}_{t=1}^{m}$.
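The sketch below is a minimal PyTorch rendering of this layer, assuming frozen 100-dimensional fastText word vectors, a syllable BiLSTM whose final states form the syllable-level token, and a two-layer highway network for fusion. The module names and syllable dimensions are our assumptions; the authors have not released code. In the full model, this module would be applied to both the question and the passage before the attention layers.

import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))
            x = g * torch.relu(transform(x)) + (1 - g) * x
        return x

class InputEmbedding(nn.Module):
    def __init__(self, word_vectors, syll_vocab, syll_dim=64, syll_hidden=50):
        super().__init__()
        # 100-d fastText vectors used as a frozen lookup table
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        self.syll_emb = nn.Embedding(syll_vocab, syll_dim)
        self.syll_lstm = nn.LSTM(syll_dim, syll_hidden, bidirectional=True, batch_first=True)
        self.highway = Highway(word_vectors.size(1) + 2 * syll_hidden)

    def forward(self, word_ids, syll_ids):
        # word_ids: (batch, seq_len); syll_ids: (batch * seq_len, syllables_per_word)
        w = self.word_emb(word_ids)                          # (batch, seq_len, 100)
        _, (h_n, _) = self.syll_lstm(self.syll_emb(syll_ids))
        s = torch.cat([h_n[0], h_n[1]], dim=-1)              # final forward/backward states
        s = s.view(w.size(0), w.size(1), -1)                 # (batch, seq_len, 2*syll_hidden)
        return self.highway(torch.cat([w, s], dim=-1))       # fused representation M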

4.3. Word-Level Attention

Just as when people take a reading comprehension test, they read the questions first, then briefly read the passage, mark the words relevant to the question, and pay more attention to the keywords. Finally, they search for the correct answer. Inspired by this, we propose word-level attention.

Table 3. Data preprocessing sample.

We perform word-level attention to calculate the importance of each word in the passage with respect to the question. As above, assume that the passage word-level embedding is $\{M_t^p\}_{t=1}^{m}$ and the question word-level embedding is $\{M_t^q\}_{t=1}^{n}$. The attention score for each word in the passage is calculated by Equation (1):

$S_u = V^{T} \tanh(W_u^{Q} M_i^{q} + W_u^{p} M_j^{p})$ (1)

where $W_u^{Q}$ and $W_u^{p}$ are trainable weight matrices, and $S_u$ denotes the similarity matrix. Next, we normalize $S_u$: every row is normalized by a softmax function, as shown in Equation (2):

$a_u \propto \exp(S_u)$ (2)

To determine which words in the passage are helpful for answering the question, the query-to-context word-level attention $A_i^p$ is computed as in Equation (3):

$A_i^p = a_u M_j^q$ (3)

Finally, we use a BiLSTM to obtain the sentence-pair representation $V_t^p$, as shown in Equation (4):

$V_t^p = \mathrm{BiLSTM}(V_{t-1}^p, [A_t^p, M_t^p])$ (4)
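The following is a minimal PyTorch sketch of Equations (1)-(4): a tanh-based similarity between every question/passage word pair, a row-wise softmax, a weighted sum of question words for each passage word, and a BiLSTM over the concatenation. The hidden sizes and module layout are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class WordLevelAttention(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_q = nn.Linear(dim, hidden, bias=False)   # W_u^Q
        self.w_p = nn.Linear(dim, hidden, bias=False)   # W_u^p
        self.v = nn.Linear(hidden, 1, bias=False)       # V^T
        self.fuse = nn.LSTM(2 * dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, m_q, m_p):
        # m_q: (batch, n, dim) question words; m_p: (batch, m, dim) passage words
        q = self.w_q(m_q).unsqueeze(1)                  # (batch, 1, n, hidden)
        p = self.w_p(m_p).unsqueeze(2)                  # (batch, m, 1, hidden)
        s_u = self.v(torch.tanh(q + p)).squeeze(-1)     # Eq. (1): (batch, m, n)
        a_u = torch.softmax(s_u, dim=-1)                # Eq. (2): normalize over question words
        attended = torch.bmm(a_u, m_q)                  # Eq. (3): (batch, m, dim)
        v_p, _ = self.fuse(torch.cat([attended, m_p], dim=-1))   # Eq. (4)
        return v_p                                      # sentence-pair representation V^p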

4.4. Re-Read Attention

The word-level attention layer is a shallow attention calculation. To enhance the attention, we adopt a higher-level attention that considers which sentence contains the correct answer span. Therefore, we introduce "re-read attention". Re-read attention aims to calculate the attention between the passage and the question at the sentence level. Before calculating this attention, we need to understand the question. Namely, for each token in the question, we apply a BiLSTM to generate a higher-level question embedding $y_i^q$, as shown in Equation (5):

$y_i^q = \mathrm{BiLSTM}(y_{i-1}^q, [s_i^q, w_i^q])$ (5)

where $y_{i-1}^q$ denotes the previous hidden vector, $s_i^q$ is the syllable-level representation after the input embedding layer, and $w_i^q$ is the output of the word-level attention layer.

Based on this understanding of the question, we similarly perform the re-read attention; the calculations are given in Equations (6)-(8):

$S_v = V^{T} \tanh(W_v^{Q} y_i^{q} + W_v^{p} V_j^{p})$ (6)

$a_v \propto \exp(S_v)$ (7)

$A_i^p = a_v y_i^q$ (8)

where $S_v$ is the similarity matrix between the passage and question semantic embeddings, $y_i^q$ is the question embedding vector, and $V_j^p$ is the output of the word-level attention layer.

Finally, we use a BiLSTM to encode the output of the re-read attention layer. The final vector is encoded as shown in Equation (9):

$M_t^p = \mathrm{BiLSTM}(M_{t-1}^p, [A_t^p, y_t^q])$ (9)
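A minimal PyTorch sketch of this layer is given below, under the same assumptions as the word-level sketch. It re-encodes the question from its syllable-level and word-level features (Equation (5)), attends from each passage position to the re-encoded question (Equations (6)-(8)), and fuses the result with a BiLSTM (Equation (9)); for the fusion input we concatenate the attended question with the passage-side representation, which is one reading of Equation (9).

import torch
import torch.nn as nn

class ReReadAttention(nn.Module):
    def __init__(self, q_in_dim, p_dim, hidden):
        super().__init__()
        self.q_encoder = nn.LSTM(q_in_dim, hidden, bidirectional=True, batch_first=True)
        self.w_q = nn.Linear(2 * hidden, hidden, bias=False)    # W_v^Q
        self.w_p = nn.Linear(p_dim, hidden, bias=False)         # W_v^p
        self.v = nn.Linear(hidden, 1, bias=False)               # V^T
        self.fuse = nn.LSTM(2 * hidden + p_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, q_features, v_p):
        # q_features: (batch, n, q_in_dim) = [syllable-level ; word-level attention output]
        # v_p: (batch, m, p_dim) output of the word-level attention layer
        y_q, _ = self.q_encoder(q_features)                     # Eq. (5): (batch, n, 2*hidden)
        s_v = self.v(torch.tanh(self.w_q(y_q).unsqueeze(1) +    # Eq. (6)
                                self.w_p(v_p).unsqueeze(2))).squeeze(-1)
        a_v = torch.softmax(s_v, dim=-1)                        # Eq. (7)
        attended = torch.bmm(a_v, y_q)                          # Eq. (8): (batch, m, 2*hidden)
        m_p, _ = self.fuse(torch.cat([attended, v_p], dim=-1))  # Eq. (9)
        return m_p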

4.5. Output Layer

The main goal of this layer is to predict the start and end positions of the answer. We use a softmax layer to predict the probability of each position in the given passage being the start or the end of the answer, as described in Equations (10) and (11):

$p^{start} = \mathrm{softmax}(W_1 M^p)$ (10)

$p^{end} = \mathrm{softmax}(W_2 M^p)$ (11)

where $W_1$ and $W_2$ are trainable parameters, and $p^{start}$ and $p^{end}$ are the probability distributions of the start and end positions of the answer.
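Below is a minimal sketch of Equations (10)-(11) together with a simple span decoder: two linear projections over the final passage encoding give the start and end distributions, and the highest-scoring valid span is selected. The decoding rule (start ≤ end, bounded span length) is our assumption and is not described in the paper.

import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w1 = nn.Linear(dim, 1, bias=False)   # W_1
        self.w2 = nn.Linear(dim, 1, bias=False)   # W_2

    def forward(self, m_p):
        # m_p: (batch, m, dim) final passage representation
        p_start = torch.softmax(self.w1(m_p).squeeze(-1), dim=-1)   # Eq. (10)
        p_end = torch.softmax(self.w2(m_p).squeeze(-1), dim=-1)     # Eq. (11)
        return p_start, p_end

def decode_span(p_start, p_end, max_len=30):
    """Pick (i, j) maximizing p_start[i] * p_end[j] with i <= j < i + max_len."""
    scores = p_start.unsqueeze(-1) * p_end.unsqueeze(-2)            # (batch, m, m)
    scores = torch.triu(scores) - torch.triu(scores, diagonal=max_len)
    flat = scores.flatten(start_dim=1).argmax(dim=-1)
    m = scores.size(-1)
    return torch.div(flat, m, rounding_mode="floor"), flat % m      # start, end indices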

5. Experimental Results and Analysis

5.1. Dataset and Evaluation

We conduct experiments on SQuAD, SQuAD (8K), and TibetanQA. Table 4 shows the statistics of the datasets.

To evaluate the model, this paper uses two common evaluation metrics, EM and F1. EM is the percentage of predicted answers that exactly match the true answers. F1 is the average word overlap between the predicted answer and the true answer.
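For reference, a minimal sketch of the two metrics for a single prediction is shown below. It assumes tokens separated by whitespace (for Tibetan one would first segment on the tsheg or into words) and omits answer normalization rules, which the paper does not specify.

from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 if the predicted answer string equals the gold answer, else 0.0."""
    return float(pred.strip() == gold.strip())

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers."""
    pred_toks, gold_toks = pred.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)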

5.2. Experiments on Different Models

Before the experiments, we would like to introduce our baseline models. They are foundational models that have achieved great performance on English MRC tasks.

Because there are no syllables in English, we remove the syllable embedding from our model on SQuAD. Next, we conduct experiments on SQuAD, SQuAD (8K), and our dataset. All models use fastText embeddings and are implemented by us; the experimental results are shown in Table 5.

Table 4. Answer types with proportion statistics.

• SQuAD: The Stanford Question Answering Dataset (SQuAD) is a challenging reading comprehension dataset. It was constructed by crowdsourcing and published in 2016.
• SQuAD (8K): A dataset consisting of about 8000 question-answer pairs, randomly selected from the SQuAD dataset.
• TibetanQA: The dataset constructed by us using a manual construction method. We collected 5039 texts about knowledge entities in various fields from the Tibetan encyclopedia website and manually constructed 8213 question-answer pairs.

Table 5. Experimental results of different models on three datasets.

• R-Net: This model was proposed by the Microsoft Research Asia team (Wang, Yang & Zhou, 2017). It pays more attention to the interaction between question and passage through a gated attention-based network.
• BiDAF: The BiDAF model was proposed by Seo et al. (Seo, Kembhavi, Farhadi & Hajishirzi, 2016). Different from R-Net, BiDAF adopts a two-direction interaction layer. It does not use self-matching as R-Net does, but calculates two attentions: query-to-context and context-to-query.
• QANet: This model was proposed by Yu et al. [22]. It combines local convolution with global self-attention and achieved better performance on the SQuAD dataset. Most notably, they improve their model with data augmentation; for a fair comparison, we remove the data augmentation in the following experiments.
• Ti-Reader: This is our model, which includes a hierarchical attention network.

It can be found that our model achieves better performance on the three different datasets. For SQuAD, our model achieves 73.1% EM and 81.2% F1; compared with BiDAF, it improves F1 by 4.6%. For SQuAD (8K), the EM reaches 64.9% and the F1 reaches 75.8%; compared with R-Net, our model improves EM by 3.7% and F1 by 6.5%, and compared with BiDAF, it improves EM and F1 by 4.1% and 7.3%. Compared with QANet, our model also shows an improvement in F1. Thus, we can see that our model performs better on SQuAD (8K). On our dataset, our model is superior to the other models: Ti-Reader achieves 53.8% EM and 63.1% F1, and when we include the syllable embedding, the gain is +9.6% EM and +8.2% F1.

Additionally, we examine the two attention mechanisms, word-level attention and re-read attention, through ablation. The experiments show that the performance of the model decreases when either is removed. The EM value decreases by 3.1% and the F1 value decreases by 3.9% when removing word-level attention. This result illustrates that the word-level attention mechanism can dynamically assign a weight to each word, so that the model can focus on the valuable words and improve its performance. The re-read attention mechanism models the interaction between the passage and the question; the EM of the model decreases by 5.1% and the F1 value decreases by 4.8% when the re-read attention is removed.

6. Conclusions

In this paper, we proposed the Ti-Reader model for Tibetan reading comprehension. The model uses a hierarchical attention mechanism, including word-level attention and re-read attention. At the same time, we conducted ablation experiments to show their effectiveness. Compared with two classic English MRC models, BiDAF and R-Net, the experiments show that our model has clear advantages for Tibetan MRC. However, there are still some incorrect answers.

In the future, we will continue to improve the accuracy of the model's predicted answers and design lighter models.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61972436).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P. (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proc. EMNLP, Austin, TX, 1-10. https://doi.org/10.18653/v1/D16-1264
[2] Richardson, M., Burges, C.J.C. and Renshaw, E. (2013) MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. Proc. EMNLP, Seattle, Washington, 193-203.
[3] Nguyen, T., Rosenberg, M., et al. (2017) MS MARCO: A Human Generated Machine Reading Comprehension Dataset. Proc. ICLR, Toulon, France, 1-10.
[4] Wei, H., Kai, L., Jing, L., et al. (2018) DuReader: A Chinese Machine Reading Comprehension Dataset from Real-World Applications. Proc. ACL, Melbourne, Australia, 37-46.
[5] Tan, C.Q., Wei, F.R., Nan, Y., et al. (2018) S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension. Proc. AAAI, 5940-5947.
[6] Kadlec, R., Schmid, M., et al. (2016) Text Understanding with the Attention Sum Reader Network. Proc. ACL, Berlin, Germany, 908-918. https://doi.org/10.18653/v1/P16-1086
[7] Alessandro, S., Philip, B., Adam, T. and Yoshua, B. (2016) Iterative Alternating Neural Attention for Machine Reading. arXiv preprint arXiv:1606.02245
[8] Hermann, K.M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M. and Blunsom, P. (2015) Teaching Machines to Read and Comprehend. Advances in Neural Information Processing Systems, 1693-1701.
[9] Hill, F., Bordes, A., Chopra, S. and Weston, J. (2015) The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. arXiv preprint arXiv:1511.02301
[10] Lai, G., Xie, Q., Liu, H., Yang, Y. and Hovy, E. (2017) RACE: Large-Scale Reading Comprehension Dataset from Examinations. arXiv preprint arXiv:1704.04683 https://doi.org/10.18653/v1/D17-1082
[11] Wang, S. and Jiang, J. (2016) Machine Comprehension Using Match-LSTM and Answer Pointer. arXiv preprint arXiv:1608.07905
[12] Hochreiter, S. and Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
[13] Wang, W., Yang, N., Wei, F., Chang, B. and Zhou, M. (2017) Gated Self-Matching Networks for Reading Comprehension and Question Answering. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 189-198. https://doi.org/10.18653/v1/P17-1018
[14] Cui, Y., Chen, Z., Wei, S., Wang, S., Liu, T. and Hu, G. (2016) Attention-over-Attention Neural Networks for Reading Comprehension. arXiv preprint arXiv:1607.04423 https://doi.org/10.18653/v1/P17-1055
[15] Seo, M., Kembhavi, A., Farhadi, A. and Hajishirzi, H. (2016) Bidirectional Attention Flow for Machine Comprehension. arXiv preprint arXiv:1611.01603
[16] Xiong, C., Zhong, V. and Socher, R. (2016) Dynamic Coattention Networks for Question Answering. arXiv preprint arXiv:1611.01604
[17] Yin, J., Zhao, W.X. and Li, X.M. (2017) Type-Aware Question Answering over Knowledge Base with Attention-Based Tree-Structured Neural Networks. Journal of Computer Science and Technology, 32, 805-813. https://doi.org/10.1007/s11390-017-1761-8
[18] Huang, H.Y., Zhu, C., Shen, Y. and Chen, W. (2017) FusionNet: Fusing via Fully-Aware Attention with Application to Machine Comprehension. arXiv preprint arXiv:1711.07341
[19] Wang, W., Yan, M. and Wu, C. (2018) Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering. arXiv preprint arXiv:1811.11934 https://doi.org/10.18653/v1/P18-1158
[20] Tan, C., Wei, F., Yang, N., Du, B., Lv, W. and Zhou, M. (2017) S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension. arXiv preprint arXiv:1706.04815 https://doi.org/10.1007/978-3-319-99495-6_8
[21] Long, C.J., Liu, H.D., Nuo, M.H. and Wu, J. (2018) Tibetan POS Tagging Based on Syllable Tagging. Chinese Information Processing, 29, 211-216. (In Chinese)
[22] Yu, A.W., Dohan, D., Luong, M.T., Zhao, R., Chen, K., Norouzi, M. and Le, Q.V. (2018) QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. arXiv preprint arXiv:1804.09541
