SPEAKER IDENTIFICATION AND VERIFICATION OVER SHORT DISTANCE TELEPHONE LINES USING ARTIFICIAL NEURAL NETWORKS

Ganesh K Venayagamoorthy, Narend Sunderpersadh, and Theophilus N Andrew
Electronic Engineering Department, M L Sultan Technikon, P O Box 1334, Durban, South Africa.

ABSTRACT

Crime and corruption have become rampant in our society today, and large sums of money are lost each year to white collar crime, fraud, and embezzlement. This paper presents ongoing work on a technique to combat white collar crime in telephone transactions by identifying and verifying speakers using Artificial Neural Networks (ANNs). Results are presented to show the potential of this technique.

1. INTRODUCTION

Several countries today are facing rampant crime and corruption. Large sums of money are lost each year to white collar crime, fraud, and embezzlement. In today’s complex economic times, businesses and individuals alike are falling victim to these devastating crimes. Employees embezzle funds or steal goods from their employers, then disappear or hide behind legal issues.
Individuals can easily become helpless victims of identity theft, stock schemes and other scams that rob them of their money.

White collar crime occurs in the gray area where the criminal law ends and civil law begins. Victims of white collar crimes are faced with navigating a daunting legal maze in order to effect some sort of resolution or recovery. Law enforcement is often too focused on combating “street crime” or does not have the expertise to investigate and prosecute sophisticated fraudulent acts. Even if criminal prosecution is pursued, a criminal conviction does not mean that the victims of fraud are able to recover their losses. They have to rely on the criminal courts awarding restitution after the conviction, and by then the perpetrator has disposed of or hidden most of the assets available for recovery.
From the civil law perspective, resolution and recovery can be just as difficult as pursuing criminal prosecution. Perpetrators of white collar crime are often difficult to locate and serve with civil process. Once the perpetrators have been located and served, proof must be provided that the fraudulent act occurred and that recovery or damages are due. This usually takes a lengthy legal fight, which can often cost the victim more money than the fraud itself. If a judgement is awarded, the task of collecting is made difficult by the time that has passed and by the perpetrator’s efforts to hide the assets.
Often, after a long legal battle, the victims are left with a worthless judgement and no recovery.

One way to avoid white collar crime, and to shorten the lengthy process of locating perpetrators and serving them with a judgement, is to use biometric techniques for identifying and verifying individuals. Biometrics are methods for recognizing a user based on his or her unique physiological and/or behavioural characteristics. These characteristics include fingerprints, speech, face, retina, iris, hand-written signature, hand geometry, wrist veins, etc. Biometric systems are being commercially developed for a number of financial and security applications.
Many people today access their company’s information systems by logging in from home. Internet services and telephone banking are also widely used by the corporate and private sectors. Protecting one’s resources or information with a simple password is therefore neither reliable nor secure in today’s world. Conventional methods such as keys, passwords and access cards are easily overcome by people with criminal intent.

Voice signals, as a unique behavioural characteristic, are proposed in this paper for speaker identification and verification over short distance telephone lines using artificial neural networks. This addresses white collar crime committed over telephone lines.
Speaker identification [1] and verification [2] over telephone lines have been reported, but not using artificial neural networks.

Artificial neural networks are intelligent systems that are related in some way to a simplified biological model of the human brain. Voice signals are attenuated and distorted over telephone lines, but artificial neural networks remain good at recognizing and verifying unique characteristics of signals even in a nonlinear, noisy and nonstationary environment. Multilayer perceptron (MLP) feedforward neural networks trained with the backpropagation algorithm have been applied to identify bird species using recordings of birdsongs [3].
Speaker identification based on direct voice signals using different types of neural networks has been reported [4, 5]. This paper extends the work reported in [5] to short distance telephone networks, using the ANN architectures described in section 4.

The feature extraction, the neural network architectures, and the software and hardware involved in the development of the speaker identification and verification system are described in this paper. Success rates of up to 90% in speaker identification and verification over short distance telephone lines using artificial neural networks are reported.

2. SPEAKER IDENTIFICATION AND VERIFICATION SYSTEM

A block diagram of a conventional speaker identification/verification system is shown in figure 1. The system is trained to identify a person’s voice by having each person speak a specific utterance into the microphone. The speech signal is digitized, and digital signal processing is carried out to create a template for the voice pattern, which is stored in memory.

The system identifies a speaker by comparing the utterance with the respective template stored in the memory.
When a match occurs, the speaker is identified. The two important operations in an identifier are parameter extraction and pattern matching. In parameter extraction, distinct patterns are obtained from the utterances of each person and used to create a template. In pattern matching, the templates created in the parameter extraction process are compared with those stored in memory. Correlation techniques are usually employed for traditional pattern matching.
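As an illustration of the correlation-based matching mentioned above, the following is a minimal sketch rather than the method of any particular system. It assumes each template is a fixed-length feature vector; the function names and the acceptance threshold of 0.8 are hypothetical.

```python
import math

def normalized_correlation(a, b):
    """Correlation coefficient between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a constant vector carries no shape information
    return sum(x * y for x, y in zip(da, db)) / denom

def identify(utterance, templates, threshold=0.8):
    """Return the name of the best-matching template, or None when no
    template correlates above the (hypothetical) acceptance threshold."""
    best_name, best_score = None, -1.0
    for name, template in templates.items():
        score = normalized_correlation(utterance, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Hypothetical stored templates and a new utterance close to speaker A.
templates = {"A": [1.0, 2.0, 3.0, 4.0], "B": [4.0, 3.0, 2.0, 1.0]}
match = identify([1.1, 2.0, 2.9, 4.2], templates)
```

The threshold is what separates identification from rejection: an utterance that resembles no stored template yields `None` instead of the nearest speaker.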

Figure 1: Block Diagram of a Conventional Speaker Identification/Verification System. [Blocks: mic, ADC, Parameter Extraction, Pattern Matching, Memory Template, Output Device]

The speaker identification/verification system over telephone lines investigated in this paper, which uses artificial neural networks, is shown in figure 2.

Figure 2: Block Diagram of the Speaker Identification/Verification System using an ANN. [Blocks: Telephone Speech Signal, Feature Extraction, Neural Network Classification, Speaker Identity or Speaker Authenticity]

The speaker identification/verification system reported in this paper is text-dependent. The system is trained on the group of people to be identified, with each person speaking the same phrase. The voice is recorded from the telephone handset receiver using a standard 16-bit computer sound card.
Although the frequency of the human voice ranges from 0 kHz to 20 kHz, most of the signal content lies in the 0.3 kHz to 4 kHz range. The frequency band over telephone lines is limited to 0.3 kHz to 3.4 kHz, and this is the band of interest in this work. A sampling rate of 16 kHz, which satisfies the Nyquist criterion, is therefore used. The voices are stored as sound files on the computer. Digital signal processing techniques are used to convert these sound files into input vectors suitable for a neural network. The output of the neural network identifies and verifies the speaker in the group.
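As a quick check of the sampling rate stated above: the Nyquist criterion requires the sampling rate $f_s$ to be at least twice the highest frequency present in the signal. Taking the 3.4 kHz upper edge of the telephone band as that highest frequency $f_{\max}$:

```latex
f_s = 16~\text{kHz} \;\ge\; 2 f_{\max} = 2 \times 3.4~\text{kHz} = 6.8~\text{kHz}
```

The representable band extends to $f_s / 2 = 8$ kHz, which agrees with the 0 to 8000 Hz frequency axis of the PSD plots in figures 3 to 5.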

3. FEATURE EXTRACTION

The process of feature extraction consists of obtaining characteristic parameters of a signal that can be used to classify it. The extraction of salient features is a key step in solving any pattern recognition problem. For speaker recognition, the features extracted from a speech signal should be consistent for the desired speaker while exhibiting large deviations from the features of an impostor.
The selection of speaker-unique features from a speech signal is an ongoing issue. Findings report that certain features yield better performance for some applications than others do. Ref. [5] has shown how performance can be improved by combining different types of features as inputs to an ANN classifier.

Speaker identification and verification over the telephone network present the following challenges:
a) Variations in handset microphones, which result in severe mismatches between speech data gathered from these microphones.
b) Signal distortions due to the telephone channel.
c) Inadequate control over speaker/speaking conditions.

Consequently, speaker identification and verification systems have not yet reached acceptable levels of performance over the telephone network. Several feature extraction techniques were explored, but only the technique based on Power Spectral Densities (PSDs) is reported in this paper. The discrete Fourier transform of the telephone voice samples is obtained and the PSDs are computed.
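The PSD computation described above can be sketched as follows. This is a minimal periodogram estimate, not the exact estimator used in this work, which is not specified. It assumes a real-valued sample sequence and normalises the result to 0 dB at the peak. A naive O(N^2) DFT is used for clarity; a practical implementation would use an FFT. The function names and the synthetic test tone are hypothetical.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a real-valued sequence."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]

def psd_db(samples, sample_rate):
    """One-sided periodogram in dB relative to its peak (assumed scaling).

    Returns (frequencies_hz, psd_db) for bins 0 .. N/2.
    """
    n = len(samples)
    spectrum = dft(samples)
    half = n // 2 + 1
    power = [abs(spectrum[k]) ** 2 / (n * sample_rate) for k in range(half)]
    peak = max(power) or 1.0   # guard against an all-zero input
    floor = 1e-12              # avoids log10(0) for empty bins
    freqs = [k * sample_rate / n for k in range(half)]
    db = [10.0 * math.log10(max(p / peak, floor)) for p in power]
    return freqs, db

# Synthetic 1 kHz tone sampled at 16 kHz, standing in for a voice sample.
fs = 16000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(64)]
freqs, db = psd_db(tone, fs)
peak_frequency = freqs[db.index(max(db))]
```

With 64 samples at 16 kHz the bins are 250 Hz apart, so the tone falls on a single bin and `peak_frequency` identifies it. The resulting list of dB values is the kind of vector that would be presented to a neural network as input.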
The PSDs of three different speakers A, B and C uttering the same phrase are shown in figures 3, 4 and 5 respectively.

Figure 3: PSD of Speaker A. [Axes: Frequency (Hz), 0 to 8000; Power Spectrum Magnitude (dB), -80 to 0]
Figure 4: PSD of Speaker B. [Axes: Frequency (Hz), 0 to 8000; Power Spectrum Magnitude (dB), -100 to 0]
Figure 5: PSD of Speaker C. [Axes: Frequency (Hz), 0 to 8000; Power Spectrum Magnitude (dB), -150 to 0]

It can be seen from these figures that the PSDs of the speakers differ from each other. Ref. [5] has reported speaker identification success rates of up to 66% and 90% with PSDs as input vectors to multilayer feedforward neural networks and Self-Organizing Maps (SOMs) respectively.

4. PATTERN MATCHING USING ARTIFICIAL NEURAL NETWORKS

Artificial Neural Networks (ANNs) are intelligent systems that are related in some way to a simplified biological model of the human brain.
They are composed of many simple elements, called neurons, operating in parallel and connected to each other by multipliers called connection weights or strengths. Neural networks are trained by adjusting the values of these connection weights between the neurons.

Neural networks have a self-learning capability, are fault tolerant and noise immune, and have applications in system identification, pattern recognition, classification, speech recognition, image processing, etc. In this paper, ANNs are used for pattern matching.
The performance of different neural network architectures is investigated for this application. This paper presents results for the MLP feedforward network and the self-organizing feature map. Descriptions of these networks are given below.

4.1. MLP FEEDFORWARD NETWORK

A three layer feedforward neural network with a sigmoidal hidden layer followed by a linear output layer is used in this application for pattern matching. The neural network is trained using the conventional backpropagation algorithm. An adaptive learning rate is used; that is, the learning rate is adjusted during training to achieve faster convergence.
A momentum term is also used in the backpropagation algorithm to achieve faster convergence.

The MLP network in figure 6 is constructed in the MATLAB environment [6]. The input to the MLP network is a vector containing the PSDs. The hidden layer consists of thirty neurons for four speakers.
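The forward pass of such a network can be sketched as follows. This is an illustrative sketch, not the MATLAB implementation used in this work. The thirty hidden neurons, the four outputs and the input length of about 500 PSD points are the values quoted in this paper; the weight initialisation, the placeholder input and all function names are assumptions.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialisation is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, hidden, output):
    """Sigmoidal hidden layer followed by a linear output layer."""
    w_h, b_h = hidden
    w_o, b_o = output
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w_h, b_h)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w_o, b_o)]
    return h, y

rng = random.Random(0)
n_inputs, n_hidden, n_speakers = 500, 30, 4   # sizes quoted in the paper
hidden = init_layer(n_inputs, n_hidden, rng)
output = init_layer(n_hidden, n_speakers, rng)

# Placeholder PSD vector in dB; a real input would come from feature extraction.
psd_vector = [rng.uniform(-80.0, 0.0) for _ in range(n_inputs)]
h, y = forward(psd_vector, hidden, output)
speaker_index = max(range(n_speakers), key=lambda i: y[i])
```

With one output neuron per speaker, the index of the largest output is a natural choice of decision rule, although the paper does not state the rule it uses. The weights here are untrained, so the chosen index is meaningless until backpropagation has been run.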
The number of neurons in the output layer depends on the number of speakers; in this paper it is four.

Figure 6: MLP Network. [Input: vector of PSDs; hidden layer: sigmoidal activation function; output layer: linear activation function, one output per speaker, 1st speaker to Nth speaker]

An initial learning rate, an allowable error and the maximum number of training cycles (epochs) are the parameters specified to the MATLAB neural network program during the training phase.

4.2. SELF-ORGANIZING FEATURE MAPS

The second type of neural network selected for this investigation is the self-organizing feature map [7].
This neural network is selected because of its ability to learn a topological mapping of an input data space into a pattern space that defines discrimination or decision surfaces. The operation of this network resembles the classical vector-quantization method called k-means clustering. Self-organizing feature maps are more general because topologically close nodes are sensitive to inputs that are physically similar, so the output nodes become ordered in a natural manner.

Typically, the Kohonen feature map consists of a two dimensional array of linear neurons.
During the training phase, the same pattern is presented to the inputs of each neuron, the neuron with the greatest output value is selected as the winner, and its weights are updated according to the following rule:

$$w_i(t+1) = w_i(t) + \alpha \, [x(t) - w_i(t)] \qquad (1)$$

where $w_i(t)$ is the weight vector of neuron $i$ at time $t$, $\alpha$ is the learning rate and $x(t)$ is the training vector.

Those neurons within a given distance (the neighborhood) of the winning neuron also have their weights adjusted according to the same rule. This procedure is repeated for each pattern in the training set to complete a training cycle, or epoch. The size of the neighborhood is reduced as training progresses. In this way the network generates, over many cycles, an ordered map of the input space: neurons tend to cluster together where input vectors are clustered, and similar input patterns tend to excite neurons in similar areas of the network.

5. IMPLEMENTATION OF THE SPEAKER IDENTIFICATION AND VERIFICATION SYSTEM

The work reported in this paper is implemented in software. The telephone speech is captured and processed on a Pentium II 233 MHz computer with a 16-bit sound card. The telephone receiver is interfaced to the sound card. Telephone speech is captured from signals transmitted within 10 kilometres of transmission network. Digital signal processing and neural network implementations are carried out using the MATLAB signal processing and neural network toolboxes respectively. This work is ongoing, and a real-time implementation of the speaker identification and verification system over telephone lines on a digital signal processor is envisaged.
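As an illustration of the kind of software implementation involved, the self-organizing map training rule of equation (1) in section 4.2 might be sketched as below. This is not the MATLAB toolbox code used in this work. Following the description in section 4.2, the winner is the neuron with the greatest linear output; many SOM implementations instead pick the neuron with the smallest Euclidean distance to the input. The square neighbourhood, the linear shrinking schedule and all names are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def winner(weights, x):
    """Grid position of the neuron with the greatest linear output w . x."""
    return max(weights, key=lambda pos: dot(weights[pos], x))

def train_step(weights, x, alpha, radius):
    """Apply w_i(t+1) = w_i(t) + alpha * (x(t) - w_i(t)) to the winner
    and to every neuron within `radius` grid steps of it."""
    wr, wc = winner(weights, x)
    for (r, c), w in weights.items():
        # Square (Chebyshev-distance) neighbourhood: an assumption.
        if max(abs(r - wr), abs(c - wc)) <= radius:
            weights[(r, c)] = [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]
    return (wr, wc)

def train_som(weights, patterns, epochs, alpha=0.01, start_radius=2):
    """One epoch presents every training pattern once. The neighbourhood
    radius shrinks linearly towards zero (the schedule is an assumption)."""
    for epoch in range(epochs):
        radius = round(start_radius * (1 - epoch / epochs))
        for x in patterns:
            train_step(weights, x, alpha, radius)
    return weights

# Tiny 1 x 2 map with two-dimensional inputs, for illustration only.
som = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
won = train_step(som, [0.0, 2.0], alpha=0.5, radius=0)
```

Here `weights` maps each grid position `(row, column)` to that neuron's weight vector. With a radius of zero only the winning neuron moves towards the input; a larger radius also drags its grid neighbours along, which is what produces the ordered map.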

6. EXPERIMENTAL RESULTS

The MLP network is trained with the PSDs of eight voice samples of four different speakers, recorded at different times under controlled and uncontrolled speaking conditions, with the speakers always uttering the same phrase. Controlled speaking conditions refer to conditions free of noise and distortion, whereas uncontrolled speaking conditions have noise and distortion on the transmission lines. The number of PSD points for each voice sample is about 500.
As mentioned in section 4.1, an adaptive learning rate is used for the MLP network. The initial learning rate is 0.01. The allowable sum squared error and the maximum number of epochs specified to the MATLAB neural network program are 0.01 and 10000 respectively. The sum squared error goal is reached within 1000 epochs.

A success rate of 100% is achieved when the trained MLP network is tested with the same samples used in the training phase. However, when samples not used in training are presented, a success rate of only 63% is obtained. This is due to inconsistency between the PSDs of these input samples and those used in the training phase.
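The training set-up described above (an initial learning rate of 0.01, a sum squared error goal of 0.01 and a cap of 10000 epochs, with momentum and an adaptive learning rate) can be sketched generically as follows. This is not the MATLAB routine used in this work. The momentum constant, the rate increase and decrease factors and the error-growth tolerance are illustrative assumptions, and a one-parameter least-squares problem stands in for the MLP so that the sketch stays short.

```python
def train_adaptive(params, loss_and_grad, lr=0.01, goal=0.01, max_epochs=10000,
                   momentum=0.9, lr_inc=1.05, lr_dec=0.7, max_err_ratio=1.04):
    """Batch gradient descent with momentum and an adaptive learning rate.

    `loss_and_grad(params)` must return (sum_squared_error, gradient).
    Stops when the error goal is met or max_epochs is reached.
    Returns (params, final_error, epochs_used).
    """
    velocity = [0.0] * len(params)
    error, grad = loss_and_grad(params)
    for epoch in range(max_epochs):
        if error <= goal:
            return params, error, epoch
        new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
        trial = [p + v for p, v in zip(params, new_velocity)]
        trial_error, trial_grad = loss_and_grad(trial)
        if trial_error > max_err_ratio * error:
            # Error grew too much: reject the step, shrink the rate, drop momentum.
            lr *= lr_dec
            velocity = [0.0] * len(params)
        else:
            if trial_error < error:
                lr *= lr_inc  # error fell: try a slightly larger rate next time
            params, velocity = trial, new_velocity
            error, grad = trial_error, trial_grad
    return params, error, max_epochs

# Stand-in problem (hypothetical): fit y = w * x to three points with w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def sse_and_grad(params):
    w = params[0]
    residuals = [w * x - y for x, y in data]
    sse = sum(r * r for r in residuals)
    grad = [sum(2.0 * r * x for r, (x, _) in zip(residuals, data))]
    return sse, grad

fitted, final_error, epochs_used = train_adaptive([0.0], sse_and_grad)
```

The two stopping conditions mirror the parameters given to the training program: training ends as soon as the sum squared error reaches the goal, and otherwise at the epoch cap. For an MLP, `loss_and_grad` would be replaced by a backpropagation pass over the training set.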
The MLP network is also tested with voice samples of people who are not included in the training set, and the network successfully classifies these voice samples as unidentified.

Four speakers are identified using the self-organizing feature map, as in the case of the MLP network. An initial learning rate of 0.01, an allowable sum squared error of 0.01 and a maximum of 70000 epochs are specified to the MATLAB neural network program at the start of the training process. The results with the self-organizing feature map show a marked improvement in the success rate in identifying the speakers, as reported in [5].
With PSDs as inputs, success rates of 85% and 90% are achieved under uncontrolled and controlled speaking conditions respectively.

Ref. [5] has reported that the success rate can be increased to 98% under uncontrolled speaking conditions by using Linear Prediction Coefficients (LPCs) as inputs to SOMs; this has yet to be tried in this work. Currently, with PSDs as inputs, a large amount of computation is involved and the SOM takes a long time to learn.

7. CONCLUSIONS

This paper has reported on the feasibility of using neural networks for speaker identification and verification over short distance telephone lines, and has shown that performance with the self-organizing map is higher than that with the multilayer feedforward neural network.
Different feature inputs to the self-organizing map remain to be tried in order to achieve higher identification/verification rates while minimizing the training time and the size of the network. Speaker identification with telephone speech signals over long distance telephone lines is currently being investigated using similar techniques.

This paper has shown that speaker identification is possible over telephone lines, and therefore that telephone banking and other transactions can be authenticated. This offers a technique to combat and/or reduce white collar crime.

8. REFERENCES

[1] D. A. Reynolds, “Large population speaker identification using clean and telephone speech”, IEEE Signal Processing Letters, vol. 2, no. 3, March 1995, pp. 46 – 48.
[2] J. M. Naik, L. P. Netsch, G. R. Doddington, “Speaker verification over long distance telephone lines”, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 23-26 May 1989, pp. 524 – 527.
[3] A. L. McIlraith, H. C. Card, “Birdsong Recognition Using Backpropagation and Multivariate Statistics”, IEEE Transactions on Signal Processing, vol. 45, no. 11, November 1997.
[4] G. K. Venayagamoorthy, V. Moonasar, K. Sandrasegaran, “Voice Recognition Using Neural Networks”, Proceedings of IEEE South African Symposium on Communications and Signal Processing (COMSIG 98), 7-8 September 1998, pp. 29 – 32.
[5] V. Moonasar, G. K. Venayagamoorthy, “Speaker identification using a combination of different parameters as feature inputs to an artificial neural network classifier”, accepted for publication in the Proceedings of IEEE Africon 99 conference, Cape Town, 29 September – 2 October 99.
[6] H. Demuth, M. Beale, MATLAB Neural Network Toolbox User’s Guide, The MathWorks Inc., 1996.
[7] T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, third edition, 1989.