The 2016 Asia Pacific Conference on Multimedia and Broadcasting (APMediaCast)
Isolated Word Automatic Speech Recognition (ASR) System using MFCC, DTW & KNN
978-1-4673-9791-9/16/$31.00 ©2016 IEEE
Muhammad Atif Imtiaz
Faculty of Electronics & Electrical Engineering University of Engineering and Technology,
Taxila atif.imtiaz@uettaxila.edu.pk
Gulistan Raja
Faculty of Electronics & Electrical Engineering University of Engineering and Technology,
Taxila gulistan.raja@uettaxila.edu.pk
Abstract— An Automatic Speech Recognition (ASR) system is defined as the transformation of acoustic speech signals into a string of words. This paper presents an approach to an ASR system based on an isolated-word structure using Mel-frequency cepstral coefficients (MFCCs), dynamic time warping (DTW) and k-nearest neighbor (KNN) techniques. The Mel-frequency scale is used to capture the significant characteristics of the speech signal; features of speech are extracted using MFCCs. DTW is applied for speech feature matching, and KNN is employed as a classifier. The experimental setup includes English words collected from five speakers, spoken in an acoustically balanced, noise-free environment. The experimental results of the proposed ASR system are obtained in the form of a confusion matrix. The recognition accuracy achieved in this research is 98.4 %.
Keywords—ASR; MFCC; DTW; KNN
I. INTRODUCTION
Speech is the propagation of periodic variations in the air from the human lungs. The production and shaping of the actual sound is performed by the human vocal tract with the help of the pharynx, nasal cavity and mouth. An Automatic Speech Recognition (ASR) system automatically interprets human speech in a digital device and is defined as the transformation of acoustic speech signals into a string of words; the general goal of every ASR system is to extract a word string from an input speech signal [1]. In the ASR process the input is a speech utterance and the output is textual data associated with that input. The performance of ASR systems mainly relies on factors such as vocabulary size, the amount of training data and the computational complexity of the system. ASR has numerous applications: it is extensively used in domestic appliances, security devices, cellular phones, ATM machines and computers.
This paper describes an ASR system for the English language, tested on a small vocabulary of words. The rest of the paper is organized as follows: Section II gives an overview of the ASR system and its major blocks. The implementation of the ASR system using feature extraction and classification techniques is described in Section III. Section IV gives a brief description of the experimental setup as well as some experimental results. Concluding remarks are presented in Section V.
II. ASR SYSTEM OVERVIEW
The ASR system comprises two main blocks, a feature extraction block and a classification block, as shown in Fig. 1.
Fig. 1. Block Diagram of Proposed ASR System Design
The input to the system is speech and the output is textual data. The working of the blocks is described below:
A. Feature Extraction Block
Feature extraction is one of the most vital modules in an ASR system. In ASR, the speech signal is split into smaller frames, usually 10 to 25 ms long. Since the speech signal contains redundant information, a feature extraction technique is applied to take out the important and useful information; this also helps reduce dimensionality. Perceptual linear prediction (PLP) coefficients, wavelet-transform-based features, linear predictive coefficients (LPC), wavelet-packet-based features and Mel-frequency cepstral coefficients (MFCC) are the most widely used features in ASR [2]. MFCC is used in this research and is discussed in detail in Section III.
B. Classification Block
After features are extracted from the speech signal, they are given to the classification block for recognition. In classification, the input speech feature vectors are used to train on known feature patterns; the classifier is then tested on a test dataset and its performance is evaluated as percentage recognition accuracy. In this research, DTW is used for feature matching and KNN is used for classification; both are discussed further in Section III.
The inner blocks shown in Fig. 2 are individually described below in detail:
1) Pre-Processing: The audio signals are recorded at a sampling rate of 16 kHz, and each word is stored in a separate audio file. The pre-processing step includes pre-emphasis of the signal to boost its energy at high frequencies. The transfer function of the pre-emphasis filter is given by equation (2).
H(z) = B(z) / A(z) = (b0 + b1 z^-1) / 1 = 1 ? 0.97 z^-1    (2)
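The pre-emphasis filter of equation (2) is equivalent to the difference equation y[n] = x[n] ? 0.97 x[n?1]. A minimal sketch in NumPy (not the authors' MATLAB code; the coefficient 0.97 follows the paper):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].

    The first sample is passed through unchanged, since x[-1] is undefined.
    """
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# A toy signal: the slowly varying part is attenuated, while the
# abrupt (high-frequency) change at n = 3 is amplified.
x = np.array([1.0, 1.0, 1.0, -1.0, 1.0])
y = pre_emphasis(x)   # [1.0, 0.03, 0.03, -1.97, 1.97]
```

The flat samples shrink to 0.03 while the sign change grows in magnitude to 1.97, which is exactly the high-frequency boost the pre-emphasis step is meant to provide.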
The output response of the pre-emphasis filter, showing the original signal against the filtered signal, is shown in Fig. 3.
C. Database
In an ASR system, the database is a group of speech samples. These speech samples are collected so as to illustrate the different variable aspects of a language. Selection of a dataset is of significant importance for successfully conducting ASR research: it provides a platform for comparing the performance of different speech recognition techniques [3], and it offers researchers a balance across different speech aspects, i.e. gender, age and dialect. A database may be of large, medium or small size depending upon the word count. Data can be gathered from sources such as books, newspapers, magazines, lectures and TV commercials. Owing to the unavailability of volunteers and to identity concerns, speech databases are not easily available; some standard speech databases exist for a few languages, such as BREF for French, TIMIT for English and ATR for Japanese [4].
Fig. 3. Pre-Emphasis Filter Output
III. IMPLEMENTATION OF ASR SYSTEM
In this section the implementation of the feature extraction technique, Mel-frequency cepstral coefficients (MFCC), the feature matching technique, dynamic time warping (DTW), and the classification technique, k-nearest neighbor (KNN), is discussed in detail.
2) Framing and Windowing: The speech signal is not stationary in nature; framing is used to treat it as quasi-stationary over short intervals. Framing is the next step after pre-processing: the speech signal is split into smaller frames that overlap with each other. After framing, windowing is applied to remove discontinuities at the edges of the frames. The window used in this research is the Hamming window, defined by equation (3).
w(n) = 0.54 ? 0.46 cos(2πn / (N ? 1)),  0 ≤ n ≤ N ? 1;  w(n) = 0 otherwise    (3)
where N is the total number of samples in a single frame. The original signal and the windowed signal are shown in Fig. 4.
A. Mel Frequency Cepstral Coefficients
Human speech is not linear as a function of frequency; therefore the pitch of an acoustic speech signal of a single frequency is mapped onto the “Mel” scale. On the Mel scale, the frequency spacing below 1 kHz is linear and the frequency spacing above 1 kHz is logarithmic [5]. The Mel frequency corresponding to a frequency f in Hertz is calculated using equation (1):
fmel = 2595 log10(1 + f / 700)    (1)
The block diagram for Mel-frequency cepstral coefficient (MFCC) computation is shown in Fig. 2.
Fig. 4. Original Signal vs. Windowed Signal
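The framing and windowing steps can be sketched as follows. This is an illustrative NumPy version, not the authors' code; the frame length of 400 samples (25 ms) and hop of 160 samples (10 ms) at 16 kHz are assumptions consistent with the 10–25 ms frames mentioned in Section II:

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a signal into overlapping frames and apply a Hamming window.

    The window follows equation (3): w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = 0.54 - 0.46 * np.cos(
        2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames

x = np.arange(16000, dtype=float)    # one second of dummy samples at 16 kHz
frames = frame_and_window(x)         # shape (98, 400)
```

Each frame tapers to 0.08 at its edges (the Hamming window's endpoint value), which removes the edge discontinuities the text describes.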
Fig. 2. Block Diagram for MFCC Computation
3) Fast Fourier Transform (FFT): The fast Fourier transform is used to calculate the discrete Fourier transform (DFT) of the signal; a size of N = 512 has been used [6]. This step transforms the signal into the frequency domain. The FFT is calculated using equation (4):
X[k] = Σ_{n=0}^{N?1} x[n] e^(?j2πkn/N)    (4)
where N is the size of the FFT. The magnitude spectrum of the FFT is shown in Fig. 5.
Fig. 5. Fast Fourier Transform Magnitude Spectrum
4) Mel Filter Bank: The next step after taking the FFT of the signal is the transformation from the Hertz scale to the Mel scale; the power spectrum is mapped onto the Mel scale [7]. The Mel filter bank comprises triangular, overlapping filters, as shown in Fig. 6.
Fig. 6. MFCC Filter Bank Output
5) Delta Energy: In this step the base-10 logarithm of the output of the previous step is taken. The log-energy computation is essential because the human ear's response to acoustic signal level is not linear: the ear is less sensitive to amplitude differences at higher amplitudes, and the logarithmic function tends to duplicate this behavior. The energy is calculated using equation (5); the resulting log energy of the frames is shown in Fig. 7.
E = Σ_{t=t1}^{t2} x^2(t)    (5)
Fig. 7. Signal Log Energy Output
6) Discrete Cosine Transform (DCT): The discrete cosine transform (DCT) is applied after taking the logarithm of the Mel filter bank output; it finally produces the Mel-frequency cepstral coefficients. In this research, 39-dimensional features are extracted for an isolated word: 12 MFCCs, one energy feature, one delta-energy feature, one double-delta-energy feature, 12 delta-MFCC features and 12 double-delta-MFCC features. An N-point DCT [8] is defined by equation (6):
X[k] = Σ_{n=0}^{N?1} 2x[n] cos[πk(2n + 1) / (2N)],  k = 0, 1, 2, … , N ? 1    (6)
The MFCCs for a single word are shown in Fig. 8.
Fig. 8. MFCC’s for Single Word
B. Classification & Recognition
The role of the classifier is very significant in determining the performance of a system, particularly an ASR system. In this research, dynamic time warping (DTW) and k-nearest neighbors (KNN) have been used for speech feature matching and classification. DTW measures the resemblance between two time series that differ in time or speed; dynamic programming is used to optimize the similarity between the two series. For the continuous speech recognition case, hidden Markov models (HMM) and artificial neural networks (ANN) are considered suitable classifiers. ANNs tend to replicate the activity of the human brain: an ANN comprises a set of interconnected neurons, and its output is computed from the weighted sum of its inputs. HMM, one of the most popular classification techniques for continuous speech recognition, is a statistical technique that models a time series in the presence of two stochastic variables [9]. The proposed research focuses on ASR based on an isolated-word structure and does not require any language model. In this research, DTW and KNN techniques have been used for feature matching and classification based upon the MFCCs. The classification step includes two stages:
i) Training
ii) Testing
The results and percentage recognition accuracy are obtained in the form of a confusion matrix. DTW and KNN are discussed further in the next section.
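The MFCC computation chain described above (FFT magnitude, Mel filter bank, log compression, DCT) can be sketched for a single frame as below. This is a simplified illustration, not the authors' implementation: the filter-bank size of 26 is an assumption, the FFT size 512 and 12 cepstral coefficients follow the paper, and the delta/double-delta features are omitted:

```python
import numpy as np

def hz_to_mel(f):
    # Equation (1): linear below ~1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    """Triangular overlapping filters spaced uniformly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    hz_points = 700.0 * (10 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge of triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):       # falling edge of triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_frame(frame, n_fft=512, n_filters=26, n_ceps=12):
    """FFT magnitude (eq. 4) -> Mel filter bank -> log10 -> DCT (eq. 6)."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft))              # equation (4)
    energies = mel_filterbank(n_filters, n_fft) @ (spectrum ** 2)
    log_e = np.log10(energies + 1e-10)                        # log compression
    n = np.arange(n_filters)
    # DCT over the log filter-bank energies, mirroring equation (6)
    ceps = np.array([np.sum(2 * log_e * np.cos(np.pi * k * (2 * n + 1)
                                               / (2 * n_filters)))
                     for k in range(n_ceps)])
    return ceps

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # a 440 Hz test tone
coeffs = mfcc_frame(frame)                                # 12 coefficients
```

Stacking such vectors over all frames, together with the energy and delta terms, yields the 39-dimensional feature described in Section III.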
1) Dynamic Time Warping (DTW): The DTW algorithm is based on measuring the closeness of two time series that may vary in time and speed. The comparison is made between the positions of the two time series when one series is warped non-linearly by stretching or contracting it along its time axis. The warping between the two series can further be used to discover corresponding regions, or to determine the similarity between the two series. Mathematically, DTW compares two time-ordered patterns and measures the similarity between them with the help of a minimum-distance formula. Consider two time series P and Q of lengths n and m:
P = p1, p2, p3, … , pi, … , pn
Q = q1, q2, q3, … , qj, … , qm
The (i, j)th element of the distance matrix contains the distance d(pi, qj) between the two points pi and qj [10]. The Euclidean distance formula in equation (7) measures the absolute distance between two points:
d(pi, qj) = √((pi ? qj)^2)    (7)
Every matrix element (i, j) corresponds to the alignment of points pi and qj. The accumulated distance is then calculated using equation (8):
D(i, j) = min[D(i ? 1, j ? 1), D(i ? 1, j), D(i, j ? 1)] + d(i, j)    (8)
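The recurrence of equation (8) can be sketched directly as a dynamic program. For clarity this illustration uses 1-D sequences with the scalar distance of equation (7); in the actual system, each element would be an MFCC frame vector and d would be the Euclidean distance between frames:

```python
import numpy as np

def dtw_distance(P, Q):
    """Accumulated DTW distance between two sequences, equations (7)-(8)."""
    n, m = len(P), len(Q)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-distance matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(P[i - 1] - Q[j - 1])          # equation (7), 1-D case
            D[i, j] = d + min(D[i - 1, j - 1],    # equation (8): best of
                              D[i - 1, j],        # match, insertion,
                              D[i, j - 1])        # deletion
    return D[n, m]

print(dtw_distance([1, 2, 3], [1, 2, 3]))     # 0.0 (identical sequences)
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0 (time-stretched copy)
```

The second call shows the point of DTW: a non-linearly stretched copy of a sequence still has zero accumulated distance, which is what makes it robust to speaking-rate variation.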
2) K-Nearest Neighbor (KNN): The working of the KNN classifier in this research is discussed below:
• The KNN method assigns the index of the feature vector that is nearest to a given score in the feature space.
• The minimum-score indices from DTW are processed in the KNN method.
• KNN converges the current feature onto the respective feature of the feature space.
• The same number of features is returned by KNN, but these features are from the feature space.
• The mode of the features returned by KNN gives the class in which the most frequent feature lies, and that class is the recognized word.
3) Confusion Matrix: In order to check the efficiency of the system, i.e. the recognition accuracy and the percentage of error, a confusion matrix is formed. For N words, it is an N×N matrix. The diagonal entries, Aij for i = j, show the number of times a word i is matched correctly [11]; the non-diagonal entries, Aij for i ≠ j, show the number of times a word i is confused with the word j:
A11  A12  A13  …  A1N
A21  A22  A23  …  A2N
A31  A32  A33  …  A3N
 .    .    .   …   .
AN1  AN2  AN3  …  ANN
4) Percentage Error: The calculation of the percentage of error is very important for checking overall system performance, and it is computed from the confusion matrix. A single isolated word is tested, and the number of times it is recognized successfully is recorded in the diagonal entry of row i; the correct match is the number of successful entries divided by the total number of entries in that row. Thus, the correct match C and the percentage error E for a particular word are given by equations (9) and (10); the results obtained from the confusion matrix are discussed further in Section IV.
Correct Match C = Aij / (Ai1 + Ai2 + Ai3 + ? + AiN),  where i = j, j = 1, 2, 3, … , N    (9)
% of error E = (1 ? C) × 100    (10)
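Equations (9) and (10) can be sketched on a small hypothetical confusion matrix (the 3-word matrix below is invented for illustration; the paper's actual matrix covers ten words, each tested 200 times):

```python
import numpy as np

def correct_match(conf, i):
    """Correct match C for word i: diagonal entry over row sum, equation (9)."""
    return conf[i, i] / conf[i].sum()

def error_percent(C):
    """Percentage error, equation (10): E = (1 - C) * 100."""
    return (1.0 - C) * 100.0

# Hypothetical 3-word confusion matrix; row i = word spoken, column j =
# word recognized, so off-diagonal entries are confusions.
conf = np.array([[196,   3,   1],
                 [  2, 198,   0],
                 [  1,   0, 199]])

C0 = correct_match(conf, 0)   # 196 / 200 = 0.98
E0 = error_percent(C0)        # 2.0 % error for word 0
```

Each row sums to the number of test utterances for that word, so C is a per-word recognition rate and E its complement in percent.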
IV. EXPERIMENTAL RESULTS AND DISCUSSION
The experiments were performed on a small English vocabulary. The setup includes words spoken by five different speakers in an acoustically balanced, noise-free environment. The implementation and experimental results were analyzed with the help of MATLAB R2014b. The testing and training results of the ASR system are obtained in the form of a confusion matrix, as shown in Fig. 10.
Fig. 10. Confusion Matrix Graph of Words
Fig. 9. Flow Diagram of KNN
Fig. 9 shows the flow diagram of the KNN classifier; here K_N is the number of nearest neighbors, N_S is the number of speakers and N_W is the number of words in the vocabulary.
In the confusion matrix graph of Fig. 10, the x-axis and the y-axis show the indices of the words, while the z-axis shows the number of times an individual word was successfully recognized or confused with another word. The diagonal slots show the heights of successful recognition. The maximum attainable height is 200, which is the total number of times each word was tested. The values of the correct match C and the error percentage E for the words are summarized in Table I.
TABLE I: RECOGNITION & ERROR PERCENTAGE OF WORDS
Word                 | Correct Match C | Recognition Accuracy (%) | Error (%) = (1?C)×100
“Dark”               | 0.98            | 98                       | 2
“Wash”               | 0.99            | 99                       | 1
“Water”              | 0.995           | 99.5                     | 0.5
“Year”               | 0.975           | 97.5                     | 2.5
“Don’t”              | 0.97            | 97                       | 3
“Carry”              | 0.995           | 99.5                     | 0.5
“Greasy”             | 0.98            | 98                       | 2
“Like”               | 0.985           | 98.5                     | 1.5
“Oily”               | 0.975           | 97.5                     | 2.5
“That”               | 0.995           | 99.5                     | 0.5
Accumulative Average | 0.984           | 98.4                     | 1.6
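The accumulative average in Table I is simply the arithmetic mean of the per-word correct-match values, which can be verified directly:

```python
# Per-word correct-match values C from Table I, in table order.
C = [0.98, 0.99, 0.995, 0.975, 0.97, 0.995, 0.98, 0.985, 0.975, 0.995]

avg = sum(C) / len(C)
print(round(avg, 3))              # 0.984 -> 98.4 % recognition accuracy
print(round((1 - avg) * 100, 1))  # 1.6  -> 1.6 % error rate
```

This reproduces the accumulative average row of Table I exactly.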
Table I gives the recognition and error rates of the dataset. Each word is first evaluated on an individual basis, and then the accumulative average over the dataset is calculated. The data are obtained in the form of a confusion matrix as a result of testing the ASR system. The accumulative average success rate for the dataset above is 98.4 %, with a 1.6 % error rate.
V. CONCLUSION
The proposed research on an ASR system employs MFCC, DTW and KNN techniques. Feature extraction is performed using MFCC, DTW is used for speech feature matching and KNN is used for classification; the minimum-score indices acquired from DTW are processed in KNN. The experimental results are obtained in the form of a confusion matrix. It was observed throughout the research that the proposed ASR system shows good recognition performance when MFCC, DTW and KNN are used jointly. The recognition accuracy achieved in this research is 98.4 %, with an error of 1.6 %.
REFERENCES
[1] J. M. Gilbert, S. I. Rybchenko, R. Hofe, S. R. Ell, M. J. Fagan, R. K. Moore, and P. Green, “Isolated word recognition of silent speech using magnetic implants and sensors,” Medical Engineering and Physics, vol. 32, pp. 1189-1197, August 2010.
[2] C. Vimala and V. Radha, “A review on speech recognition challenges and approaches,” World of Computer Science and Information Technology Journal (WCSIT), vol. 2, no. 1, pp. 1-7, 2012.
[3] S. Atkins, J. Clear, and N. Ostler, “Corpus design criteria,” Literary and Linguistic Computing, vol. 7, no. 1, pp. 1-16, 1992.
[4] L. F. Lamel, J. L. Gauvain, and M. Eskenazi, “Design considerations and text selection for BREF, a large French read-speech corpus,” in Proc. 1st International Conference on Spoken Language Processing (ICSLP), 1990, pp. 1097-1100.
[5] M. Murugappan, N. Q. I. Baharuddin, and S. Jerritta, “DWT and MFCC based human emotional speech classification using LDA,” in Proc. International Conference on Biomedical Engineering (ICoBE), Penang, 27-28 February 2012, pp. 203-206.
[6] S. Molau, M. Pitz, R. Schlüter, and H. Ney, “Computing Mel-frequency cepstral coefficients on the power spectrum,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), USA, 2001, pp. 73-76.
[7] I. Patel and Y. S. Rao, “Speech recognition using HMM with MFCC - an analysis using frequency spectral decomposition technique,” Signal & Image Processing: An International Journal (SIPIJ), vol. 1, no. 2, pp. 101-110, December 2010.
[8] A. Milton, S. Sharmy Roy, and S. Tamil Selvi, “SVM scheme for speech emotion recognition using MFCC feature,” International Journal of Computer Applications, vol. 69, no. 9, pp. 34-39, May 2013.