Machine Learning Research and Big Data Analytics A Centre of Excellence Under FAST, MHRD

Dhruba K Bhattacharyya, FIETE, Professor, CSE, Tezpur University

Group Members • Prof Dhruba K Bhattacharyya, PI • Prof Shyamanta M Hazarika, Co-PI • Prof Utpal Sharma, Co-PI • Prof Nityananda Sarma, Co-PI • Dr Swarnajyoti Patra, Member • Dr B Borah, Member • Dr Sanjib Deka, Member • Dr Siddharta S Satapathy, Member • Dr Rajib Goswami, Member • Mr Debojit Boro, Member • Ms Sanghamitra Nath, Member

Thrust Areas • Machine Learning • Network Security • Natural Language Processing • Robotics • Bio-informatics • Cognitive Radio Networking • Multi-spectral and Hyper-spectral Satellite Data Processing

Summary of Achievements

Some Achievements: Network Security • Development of a tool called TUCANNON+ to (i) capture traffic, (ii) launch DDoS attacks of all types, and (iii) monitor packet and flow traffic. • Development of a test-bed for attack traffic simulation, capturing, monitoring and validating defense methods. • Development of defense methods for both low-rate and high-rate DDoS attacks using statistical and information theoretic measures. • Development of an effective correlation measure to discriminate DDoS attack traffic from legitimate traffic. • Development of a real-time defense implemented on hardware to detect both low-rate and high-rate DDoS attacks. • Development of an effective defense to counter XSS attacks.

Network Security Test-bed & TUCANNON+ The tool has two components: a server program and a client program. The server program comes with a user interface through which one can specify parameters such as protocol, SIP type, attack pattern and attack strength (in terms of threads). The client program in turn generates the attack traffic based on these specifications. R C Baishya, N Hoque and D K Bhattacharyya. DDoS Attack Detection Using Unique Source IP Deviation. In the Journal of Network Security, November 2016 (in press).

TUCANNON+: Network Traffic Monitoring Tool (TUMONITOR) The tool allows the user to observe a set of selected features, viz. packet count per interval, protocol specific packets per interval, TCP flag specific packets per interval, and number of unique source IP addresses per interval. The user can also monitor the value of an arithmetic expression involving a subset of the features. A researcher can use the tool to understand the traffic under different conditions. TUMONITOR is not an IDS; however, a network administrator can use it to keep an eye on the traffic passing through the monitoring point. D K Bhattacharyya and Jugal Kalita, DDoS Attacks: Evolution, Detection, Prevention, Reaction, and Tolerance, CRC Press, Taylor & Francis Group, May 2016.

SSM Based TCP Targeted LRDDoS Attack Detection Method
Self-similarity: the scale invariance property of an object or process that, at some time scale, looks just like an appropriately scaled version of itself measured over a different time scale.
Self-Similarity Matrix (SSM): Let V = (v1, v2, …, vn) be an ordered sequence of feature vectors extracted from a data series, where each vector vi describes the relevant features of the data series in a given local interval. The self-similarity matrix is formed by computing the similarity of every pair of feature vectors: S(j,k) = s(vj, vk), where j, k ∈ {1, …, n} and s(vj, vk) is a function measuring the similarity of the two vectors.
Features used: • Average packets per network flow (f1) • Number of packets per interval or sample (f2) • Number of network flows (f3) • Server outflow performance (f4)
Similarity measure used: Euclidean distance.
The SSM based TCP targeted LRDDoS attack detection method measures network traffic self-similarity across multiple time scales, over a subset of relevant features. The method has been evaluated on a real-life low-rate dataset for multiple scenarios, and the results confirm its efficacy.
[Figure: fractal structures]
[Equation: self-similarity matrix S for M traffic samples; not recoverable from the extracted text]
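The construction of S from per-interval feature vectors can be sketched as follows. This is a minimal illustration that assumes the pairwise measure s is the plain Euclidean distance named on the slide; the function names and the sample feature values are hypothetical and are not taken from the published method.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def self_similarity_matrix(vectors):
    """Return the n x n matrix S with S[j][k] = s(v_j, v_k)."""
    return [[euclidean(vj, vk) for vk in vectors] for vj in vectors]

# Hypothetical feature vectors (f1, f2, f3, f4), one per traffic sample.
samples = [
    (10.0, 200.0, 20.0, 0.9),
    (10.0, 200.0, 20.0, 0.9),
    (13.0, 204.0, 20.0, 0.9),
]
S = self_similarity_matrix(samples)
```

Identical samples give a zero entry, and the matrix is symmetric with a zero diagonal, so only the upper triangle needs to be computed in a real implementation.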

Flowchart of the detection method (box contents in extracted order; the branch arrows are not recoverable from the extracted text):
• Incoming traffic: sample a data series into N samples. Set a value for the total matrix size M. Set seed pointer sptr = 1.
• If sptr < N, compute the features of sample sptr if not computed and set sample i = sptr + 2.
• Set sptr = i - 2. Set scale count m = 3.
• Decision: m ≤ M? (Yes/No)
• Compute the features of sample i if not computed.
• Construct the m × m SSM S starting from sample sptr to i. Ignore the rejected sample if present.
• Decision: i ≤ N? (Yes/No)
• Calculate the standard deviation σm and compute I using Equation 4. Increment i. Increment m.
• Decision: I = 1 or 0? Outcomes: indicate matrix S as self-similar; alarm anomaly as LRDDoS attack.
• Decision: i < N? (Yes/No) Reject sample i. Increment i. Reset m = m
• End of N samples.
Figures: Performance of SSM corresponding to matrix size M; SSM periodic LRDDoS attack detection for 4 attackers.
Boro, D., Haloi, M. and Bhattacharyya, D.K., “A Self-Similarity Based TCP Targeted Low-Rate DDoS (LRDDoS) Attack Detection Method”, Security and Communication Networks, Wiley, 2016 [minor revision].

Cross-Site Scripting (XSS) Attack Detection • Introduced a client-server architecture for XSS attack detection that balances the load between client and server. • Presented an attribute clustering method, supported by rank aggregation, to detect confounded JavaScripts. • The unsupervised method shows high detection accuracy with an optimal feature subset. Figure 1: XSS attack detection architecture. Table 1: Results of attribute clustering in terms of true positive rate, false positive rate and accuracy. S Goswami, N Hoque and D K Bhattacharyya. An Unsupervised Method for Detection of XSS Attack. In the Journal of Network Security, November 2016 (in press).

LTDS: An Effective Low-rate TCP DDoS Attack Defense LTDS is a DDoS defense solution capable of detecting low-rate TCP DDoS attacks with high detection accuracy. The core of our method is to observe, at every interval, the amount of traffic transmitted without a two-way ACK exchange between the communicating IPs. In a TCP DDoS attack the victim does not send ACKs. Hence, under an attack we can observe a significant hike in the amount of traffic transmitted without a two-way ACK exchange. We use a non-parametric change point modeling technique to detect such a change in the network traffic.
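The slide does not say which non-parametric change point technique LTDS uses. As one common choice, a one-sided non-parametric CUSUM over the per-interval volume of un-ACKed traffic could look like the sketch below; the function name, the parameters (baseline, slack, threshold) and the sample series are all assumptions made for illustration.

```python
def cusum_alarm(series, baseline, slack, threshold):
    """Return the index of the first interval at which the one-sided
    CUSUM statistic exceeds the threshold, or None if it never does.

    series    -- per-interval amount of traffic without two-way ACK exchange
    baseline  -- assumed normal level of that quantity
    slack     -- allowance subtracted each step so normal jitter decays to zero
    threshold -- alarm level for the accumulated excess
    """
    g = 0.0
    for t, x in enumerate(series):
        # Accumulate only the excess above baseline + slack; never go below zero.
        g = max(0.0, g + (x - baseline - slack))
        if g > threshold:
            return t
    return None

# Hypothetical per-interval counts: normal traffic, then a sustained hike.
unacked = [5, 6, 5, 4, 20, 22, 25]
alarm_at = cusum_alarm(unacked, baseline=5, slack=2, threshold=20)
```

Because the statistic accumulates excess over several intervals, a single noisy interval does not raise an alarm, while a sustained hike does.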

FFSc: Low-rate and High-rate DDoS Attack Detection Low-rate DDoS attacks are very difficult to identify because the behavior of low-rate attack traffic is very similar to that of normal traffic. For effective identification of both low-rate and high-rate DDoS attacks, a feature-feature score (FFSc) is computed for each network traffic sample. A normal profile, which stores the mean, maximum and minimum FFSc values, is generated from an analysis of normal network traffic. During analysis of captured traffic, FFSc is computed for each unknown sample. If the deviation of FFSc between normal and captured traffic is greater than a threshold value, an attack alarm is generated. Figure: Performance analysis of the proposed method. N Hoque, D K Bhattacharyya and J K Kalita, FFSc: A Novel Measure for Low-rate and High-rate DDoS Attack Detection, Security and Communication Networks, 9(13), 2032-2041, Wiley, 2016.
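The profile-and-threshold step described above can be sketched as follows. The FFSc formula itself is not given on the slide, so the scores are treated as precomputed inputs, and measuring deviation as the absolute difference from the profile mean is an assumption. All names and values are hypothetical.

```python
from statistics import mean

def build_profile(normal_scores):
    """Summarize FFSc values of normal traffic as mean, minimum and maximum."""
    return {
        "mean": mean(normal_scores),
        "min": min(normal_scores),
        "max": max(normal_scores),
    }

def is_attack(score, profile, threshold):
    """Raise an alarm when the score deviates from the normal mean by more
    than the threshold (assumed definition of 'deviation')."""
    return abs(score - profile["mean"]) > threshold

# Hypothetical precomputed FFSc values for normal traffic samples.
profile = build_profile([1.0, 2.0, 3.0])
```

With this toy profile, `is_attack(2.5, profile, 1.0)` is False and `is_attack(4.0, profile, 1.0)` is True.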

High Performance Computing for BDA • Parallel Computing • Multiple processor cores perform similar or dissimilar tasks simultaneously • Parallel computing technologies: • Cluster and supercomputer • General purpose graphics processing unit (GPGPU) • We are working on developing efficient deep learning systems using GPUs • GPU cores: simpler architecture than CPU cores, energy efficient • Consume fewer IC resources, so a huge number of cores can be put on a single chip • Suitable for stream processing of graphs with a large number of vertices • NVIDIA GPU currently being used in our lab: 384 cores and 6 Gbps bandwidth • Major computing tools for handling big data: • Parallel computing • Distributed computing • Application specific hardware Distributed Computing • Computing components are located on networked computers, and they communicate and coordinate via message passing • We categorize the distributed computing architectures into three classes, along with their generic architectures: • MapReduce architecture • Fault tolerant graph architecture • Streaming graph architecture H Kashyap, H A Ahmed, N Hoque, S Roy, D K Bhattacharyya, Big Data Analytics in Bioinformatics: Architectures, Techniques, Tools and Issues, Network Modeling Analysis in Health Informatics and Bioinformatics 5 (1), 28

A Hardware Solution for DDoS Defense Application specific hardware • Advantages: • An optimized application specific datapath requires fewer computation cycles (compared to the general datapath in CPUs) • Disadvantages: • High development cost and time (compared to software development) • Types: • Non-configurable: Application Specific IC (ASIC) • Configurable: Configurable Programmable Logic Devices (CPLDs) and Field Programmable Gate Arrays (FPGAs) • Developed a hardware module to detect Distributed Denial of Service (DDoS) attacks in real time • The implementation targets a Xilinx Virtex-5 FPGA device • The FPGA design implements our proposed VERC measure for DDoS attack detection. • Resource requirements: • Slices: 750/7200 (10% of the available) • Block RAM: 0 (0% of the available) • DSP Slices: 3/48 (6% of the available) • Performance of the detection module: • Maximum frequency: 118 MHz • Time required to classify the traffic instances of a 1-second window as either attack or normal: 354 ns
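The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes the classification runs at the maximum reported clock frequency, which the slide does not state explicitly.

```python
# Figures quoted on the slide.
max_freq_hz = 118e6        # maximum frequency: 118 MHz
classify_time_s = 354e-9   # classification time: 354 ns

# Clock period, and the cycle count implied if the module runs at max frequency.
clock_period_ns = 1e9 / max_freq_hz
cycles = classify_time_s * max_freq_hz

# Resource utilization percentages from the raw counts.
slice_pct = 100 * 750 / 7200
dsp_pct = 100 * 3 / 48
```

Under that assumption the clock period is about 8.47 ns and the classification takes roughly 42 clock cycles. The utilization figures work out to about 10.4% of slices and 6.25% of DSP slices, matching the rounded values on the slide.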

NaHiD Correlation Measure for DDoS Attack Detection A real-time DDoS detection solution demands that a minimum number of features be used during traffic analysis, whereas correlation measures such as Spearman, Pearson and Kendall often fail to provide high detection accuracy over a small number of features. An effective measure called NaHiD is designed for network anomaly detection, with a focus on DDoS detection. For any two objects X and Y of n dimensions, the proposed correlation measure considers the standard deviation and mean of the two objects. The measure is implemented on both software and hardware (FPGA) platforms. From normal network traffic a normal profile is generated and from captured traffic s… Figure 1: Performance analysis on CAIDA. Figure 2: Performance analysis on DARPA 2000. N Hoque, H Kashyap and D K Bhattacharyya, “A Real-Time DDoS Attack Detection Method using FPGA”, IEEE Transactions on Network and Service Management, November 2016 (under review).

Some Achievements: Bioinformatics • Development of a robust correlation measure to support identification of co-expressed patterns that show shifted, scaled and shifted-and-scaled correlations. • Development of a robust biclustering technique to identify co-expressed gene patterns with high biological relevance. • Development of a robust PPI complex finding method using an unsupervised machine learning approach. • Development of efficient methods to extract co-expressed network modules using both traditional and soft-computing approaches and to rank the modules against a given disease query. • Development of a triclustering method to identify co-expressed gene patterns over Gene-Sample-Time space.

SSSim Measure Introduced an effective shifting-and-scaling correlation measure named SSSim (Shifting and Scaling Similarity), which can detect highly correlated gene pairs in any gene expression data.
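To make the notion of shifted, scaled and shifted-and-scaled patterns concrete, the sketch below builds such variants of a toy expression profile. The SSSim formula is not given on the slide, so the standard Pearson coefficient is used here purely as a stand-in to show that these variants are perfectly correlated with the original; the profile values are made up.

```python
import math

def pearson(x, y):
    """Standard Pearson correlation coefficient (a stand-in, not SSSim)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy expression profile of one gene across four conditions.
gene_a = [1.0, 2.0, 4.0, 3.0]

# Three kinds of co-expression pattern named on the slides.
shifted = [v + 5.0 for v in gene_a]
scaled = [2.0 * v for v in gene_a]
shifted_and_scaled = [2.0 * v + 5.0 for v in gene_a]
```

All three derived profiles differ from `gene_a` in absolute value, so a plain distance measure would separate them, yet they follow the same expression pattern.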

SSSim in ICS Biclustering Introduced the ICS (Intensive Correlation Search) biclustering algorithm, which uses SSSim to extract biologically significant biclusters from a gene expression dataset. The technique performs satisfactorily on a number of benchmark gene expression datasets when evaluated in terms of functional categories in the Gene Ontology database. Figures: Comparison of ICS with iBBiG on a subset of the Yeast dataset; some p-values on the Yeast Sporulation dataset. Ahmed, H A, Mahanta, P, Bhattacharyya, D K and Kalita, J K, "Shifting-and-Scaling Correlation Based Biclustering Algorithm", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6 (2014): 1239-1252.

Core and Peripheral connectivity based Cluster Analysis over PPI Network (CPCA) CPCA exploits the core-periphery structural features of complexes. A complex consists of a dense core region with some proteins weakly connected to it, often called the periphery. CPCA uses two connectivity criterion functions to identify the core and the periphery. To locate the initial node of a cluster, a measure called the DNQ (Degree-based Neighborhood Qualification) index is introduced. CPCA performs well when compared with well-known counterparts in terms of sensitivity, PPV, precision, recall and accuracy. Figures: Comparison using co-localization score; comparison using MIPS gold standard. Ahmed, H A, D K Bhattacharyya, and J K Kalita. "Core and Peripheral connectivity based Cluster Analysis over PPI Network" in Elsevier’s Computational Biology and Chemistry (CBAC), 59, 32-41, 2015.

FUMET: A Fuzzy Network Module Extraction Technique for CEN A soft thresholding co-expression network construction technique based on fuzzy logic, which can handle both positive and negative correlations among genes and allows a single gene to belong to multiple network modules. P Mahanta, H A Ahmed, D K Bhattacharyya and A Ghosh, FUMET: A Fuzzy Network Module Extraction Technique for Gene Expression Data, Journal of Biosciences, vol 39, no 2, June 2014, Springer.

GeCON: Reconstruction of Gene CEN Gene pairs showing negative or positive co-regulation under a given number of conditions are used to construct a gene co-expression network with signed edges that reflect up- and down-regulation between pairs of genes. Most existing techniques lack computational efficiency. A fast correlogram matrix is used to capture the support of each gene pair to construct the network. S Roy, D K Bhattacharyya & J K Kalita, “Reconstruction of Gene Co-expression Network from Microarray Data Using Local Expression Patterns”, BMC Bioinformatics, Vol. 15, S10, 2014.

CoBi: Polynomial Time Co-regulated Biclustering A novel expression pattern-based polynomial time biclustering technique for grouping both positively and negatively regulated genes together as co-regulated genes from microarray expression data in a deterministic way. Roy, S., Bhattacharyya, D. K. and Kalita, J. K. CoBi: Pattern Based Co-Regulated Biclustering of Gene Expression Data, Pattern Recognition Letters, Elsevier, 34(04), 1669-1678, 2013.

Tricluster Analysis in GST Microarray Data Developed a triclustering method to find groups of co-expressed genes over sample and time domains using the SSSim measure. The triclustering results are better in terms of biological significance than those of the pre-existing algorithms TRICLUSTER and ICSM. Developed a shared memory shared nothing architecture to parallelize our THD-Tricluster and reduce the execution time. T Kakati, H A Ahmed, D K Bhattacharyya, J K Kalita. THD-Tricluster: An Effective TriCluster Algorithm with Shifting-and-Scaling Patterns, in Elsevier’s CBAC, November 2016 (under review).

CEN Module Extraction in Finding Disease Related Genes The work addresses the analysis of CEN using both gene expression similarity and semantic similarity. It considers not only the highly co-expressed genes but also genes with lower expression similarity yet high semantic similarity, which are termed border genes. The border genes obtained are found to be involved in biological pathways related to a neurodegenerative disease, Alzheimer’s disease. T Kakati, H J Kashyap, D K Bhattacharyya, THD-Module Extractor: An Application for CEN Module Extraction and Interesting Gene Identification for Alzheimer’s Disease, in Nature’s Scientific Reports, 2016 (under minor revision)

Some Achievements: Multi-spectral and Hyper-spectral Data Analysis • To develop efficient supervised and semi-supervised classification methods to identify objects of interest from multi-spectral and hyper-spectral satellite data. • To develop an ensemble classification approach for accurate classification of objects over hyper-spectral satellite data. • To identify an optimal subset of relevant features for classification of satellite data.

Classification of Hyperspectral satellite data using Object Based Image Classification (OBIC) Technique Classification of Hyperion data Chutia and Bhattacharyya (2014): Effective feature extraction approach for fused images of Cartosat-I and Landsat ETM+ satellite sensors. Applied Geomatics (Springer), 6(3), 181-195 Chutia and Bhattacharyya (2014): OBCsvmFS: Object-Based Classification supported by Support Vector Machine Feature Selection approach for hyperspectral data. Journal of Geomatics, 8 (1), 12-19

An Effective Ensemble Classification Approach using Random Forests and Correlation-based Feature Selection (CFS) Technique Classification of QuickBird image (a part of Shillong city) Chutia and Bhattacharyya (2015): An Effective Ensemble Classification Framework using Random Forests and Correlation-Based Feature Selection Technique, IEEE J. of Remote Sensing Letters, November, 2016 (under minor review)
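Correlation-based Feature Selection is usually described through Hall's merit heuristic, which rewards subsets whose features correlate with the class but not with each other. The slide does not say whether the framework uses this exact heuristic, so the sketch below is an assumption; the function and parameter names are illustrative.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Hall's CFS merit for a subset of k features:
    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)
    where r_cf is the mean feature-class correlation and
    r_ff is the mean feature-feature correlation."""
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )

# Four equally relevant features: uncorrelated versus fully redundant.
merit_independent = cfs_merit(4, 0.5, 0.0)
merit_redundant = cfs_merit(4, 0.5, 1.0)
```

The uncorrelated subset scores twice as high as the fully redundant one, which is the behavior that lets a search procedure discard redundant spectral bands before the Random Forest is trained.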

Some Achievements: Big Data Mining • To develop efficient supervised and unsupervised feature selection methods for accurate classification of real-life data. • To develop an integrated classifier that operates over an optimal subset of features and ensures the best possible classification accuracy. • To develop efficient supervised and unsupervised incremental feature selection methods for accurate classification of real-life data. • Multi-objective optimization for selection of views over large data warehouses.

MIFS-ND: Mutual Information-based Feature Selection Method • MIFS-ND is used to select an optimal subset of features from a large dataset • Using feature-class and feature-feature mutual information, the method selects relevant features and removes redundancy • To select high-ranked features, the NSGA-II optimization technique is used • Classification accuracy of MIFS-ND is high on many real-life datasets N Hoque, D K Bhattacharyya and J K Kalita, MIFS-ND: A Mutual Information-based Feature Selection Method, Expert Systems with Applications 41 (2014), 6371-6385.
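Both quantities used by the method, feature-class and feature-feature mutual information, come from the same estimator. A minimal sketch for discrete values is shown below; it covers only the mutual information computation, not the NSGA-II based ranking, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits for two equal-length sequences
    of discrete values, estimated from empirical frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Toy data: one feature mirrors the class label, the other is independent of it.
labels = [0, 0, 1, 1]
relevant_feature = [0, 0, 1, 1]
irrelevant_feature = [0, 1, 0, 1]
```

Calling the function on a feature and the label gives its relevance; calling it on two features gives their redundancy.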

IFS-KNN: An Incremental Feature Selection for Classification using KNN+ • Used to select an optimal subset of features in a dynamic way from high dimensional datasets. During feature selection, a dynamic profile is created for every new class of instance. It selects only the high weightage features. • The traditional KNN gives equal priority to all features during nearest neighbor computation. Hence, a noisy value of a feature may yield unpredictable behavior in KNN. • The KNN+ classifier does not consider all the features during nearest neighbor computation. • Performance is evaluated on gene expression, network and text categorization datasets using DT, RF, NB, KNN and SVM classifiers. KNN+ performs better than traditional KNN. N Hoque, H A Ahmed, D K Bhattacharyya and J K Kalita, IFS-KNN: An Incremental Feature Selection Method, in the Journal of Machine Learning, Elsevier, 2016 (in press)

Multi-Objective Optimization in Selection of Views to Materialize in Data Warehouses • Materialized views in a data warehouse are a promising solution to speed up analytical processing of huge volumes of historical data for running decision support applications. • The problem is NP-hard. • With the advent of big data and the MapReduce programming paradigm, we investigate the view selection problem for materializing in a big data framework. • The Forma analysis based multi-objective DE for binary encoded data has been modified and applied in designing a view selection and recommendation system for materializing in a Hadoop Distributed File System (HDFS) data warehouse framework by promoting diversity of solutions in the solution vector space. • The popular elitist multi-objective GA termed NSGA-II and the Archived Multi-objective Simulated Annealing (AMOSA) algorithm are customized for materialized view selection in a MapReduce based distributed file system framework for comparative performance analysis. Goswami, R., Bhattacharyya, D.K., Dutta, M. and Kalita J.K.: Approaches and Issues in View Selection for Materializing in Data Warehouse, International Journal of Business Information Systems, Vol. 21, No. 1, pp. 17–47, 2016, DOI: 10.1504/IJBIS.2016.073379.
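NSGA-II and AMOSA both rest on Pareto dominance between candidate solutions. The sketch below shows that comparison for candidate view sets scored on two objectives to be minimized. The objective pair (query cost, storage cost) and the numbers are hypothetical; the slide does not list the objectives actually used.

```python
def dominates(a, b):
    """True if solution a is no worse than b in every objective and
    strictly better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (query cost, storage cost) for four candidate view sets.
candidates = [(10, 2), (6, 6), (8, 7), (4, 9)]
front = pareto_front(candidates)
```

Here `(8, 7)` is dropped because `(6, 6)` is better on both objectives; the remaining three represent different trade-offs, and a multi-objective optimizer would return all of them rather than a single answer.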

Some Achievements: Cognitive Radio Networks • To develop an efficient opportunity prediction scheme at MAC-layer sensing for ad-hoc cognitive radio networks. • To develop a cooperative spectrum sensing technique in CRNs using coalitional game theory. • To maximize network throughput through joint routing and channel allocation in multi-hop cognitive radio networks. • To analyze empirically the effectiveness of classification methods in spectrum sensing in CRNs.

Opportunity Prediction at MAC-Layer Sensing for Ad-hoc CRNs In this work, two important issues of MAC-layer sensing have been investigated for underlay mode cognitive radio networks: (a) estimation and modeling of the licensed channel usage pattern of primary users (PUs), while tolerating interference from secondary users (SUs), and (b) use of the learnt channel usage patterns for discovery of opportunities by the SUs. A Hidden Markov Model based channel usage pattern of PUs is proposed for use by the SUs to predict the spectrum opportunity. The proposed model uses an estimated interference power constraint (IPC) in determining the interference due to the presence of SUs, to protect the PUs from harmful interference. A distributed MAC protocol for data dissemination (DMDD) in underlay mode CRNs, which utilizes the proposed channel usage model, is also proposed. Figures: Training the HMM for channel; performance analysis of the proposed DMDD using the designed channel model; Figure 4: HMM representing licensed channel observation sequence; channel ranking in DMDD; Figure 11: % of messages received w.r.t. number of channels, compared to SURF; Figure 12: % of messages received under different PU activity, compared to SURF. Deka, S. K., Sarma, N., 2016. Opportunity Prediction at MAC-Layer Sensing for Ad-hoc Cognitive Radio Networks. Journal of Network and Computer Applications (under review).
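How an SU might turn a trained HMM into an opportunity prediction can be sketched with the standard forward (filtering) recursion. The two-state idle/busy structure, the binary sensing observations and every probability below are assumptions for illustration; the slide gives neither the structure nor the parameters of the proposed model.

```python
def forward_filter(obs, init, trans, emit):
    """Return P(hidden state | all observations so far) after the last
    observation, using the normalized HMM forward recursion."""
    n = len(init)
    belief = [init[s] * emit[s][obs[0]] for s in range(n)]
    total = sum(belief)
    belief = [b / total for b in belief]
    for o in obs[1:]:
        # Predict one step ahead, then weight by the observation likelihood.
        pred = [sum(belief[r] * trans[r][s] for r in range(n)) for s in range(n)]
        belief = [pred[s] * emit[s][o] for s in range(n)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief

def predict_next(belief, trans):
    """One-step-ahead state distribution given the current belief."""
    n = len(belief)
    return [sum(belief[r] * trans[r][s] for r in range(n)) for s in range(n)]

# Hypothetical model: state 0 = channel idle, state 1 = channel busy (PU active).
INIT = [0.5, 0.5]
TRANS = [[0.8, 0.2], [0.3, 0.7]]   # TRANS[r][s] = P(next = s | current = r)
EMIT = [[0.9, 0.1], [0.2, 0.8]]    # EMIT[s][o] = P(sensed o | state = s)

belief = forward_filter([1, 1, 0, 0], INIT, TRANS, EMIT)
idle_next = predict_next(belief, TRANS)[0]
```

An SU could rank channels by `idle_next` and attempt transmission on the channel most likely to be idle in the next slot.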

Cooperative Spectrum Sensing in CRNs To overcome the issues of individual spectrum sensing, cooperative spectrum sensing (CSS) has emerged as a prominent solution; it exploits Secondary User (SU) spatial diversity to make a global decision about the availability of a Primary User (PU) in a licensed band. Considering the reliability factor of SUs may prove to be an important feature during cooperation among the SUs. In this work, we propose a distributed cooperative spectrum sensing (DCSS) scheme using a coalitional game theoretic model for cognitive radio networks, which improves the sensing performance in terms of detection probability. The utility function of the game is formulated by considering the trade-off between gain and cost during coalition formation. Figures: Algorithm for the proposed DCSS scheme; utility function for the proposed game theoretic CSS model; performance analysis of the proposed method. J Gupta, P Chauhan, M Nath, M Manvithasree, S K Deka and N Sarma, Coalitional Game Theory based Cooperative Spectrum Sensing in CRNs, ACM 18th International Conference on Distributed Computing and Networking, ICDCN 2017 (in press).

Network Throughput Maximization through Joint Routing and Channel Allocation in Multi-hop Cognitive Radio Network Existing spectrum sharing approaches only aim to maximize throughput/utilization by optimally allocating resources (channels). However, resource allocation alone leads only to sub-optimal results. To maximize spectrum utilization, spectrum sharing demands a cross-layer design of routing and resource allocation that efficiently allocates resources to CR nodes in a multi-hop CRN. The main contributions of the paper are: • Defining joint routing and channel allocation as an optimization problem with the objective of maximizing network throughput • An Integer Linear Programming (ILP) formulation to solve the optimization problem • Implementation of the formulation using CPLEX ILP formulation: • We present an ILP formulation for the optimization problem. • We introduce two decision variables, Variable 1 and Variable 2 (definitions not recoverable from the extracted text). Performance analysis, Test Case I: • Flows that need to be scheduled: (table not recoverable from the extracted text) • With the given input, the schedule obtained from the solver is shown in the table below. • All 4 flows are scheduled, achieving a maximum throughput of 8 Mbps. • The computational time taken to solve the problem is 0.13 sec (or 26.49 ticks). Z Ahmed and N Sarma, Network Throughput Maximization through Joint Routing and Channel Allocation in Multi-hop Cognitive Radio Network, accepted for publication in Proc. of IEEE Sponsored Intl Conference on Applications and Innovations in Mobile Computing (AIMoC) 2016, February 10-12, 2016, Kolkata, India.

Applying Classification Methods for Spectrum Sensing in Cognitive Radio Networks – An Empirical Study GNU Radio is used to collect real-time experimental data, and the existing sample Python scripts usrp_spectrum_sense.py and benchmark_tx.py are modified into sensing.py and transmission.py for sensing and transmission respectively. Other program parameters in the transmission.py script are: sampling rate = 1 Mega samples per second, modulation = GMSK, sub-channel bandwidth = 6.25 kHz. So the number of FFT bins collected in a particular channel is 160 (1e6 / 6.25e3). The receiver, which is tuned to the center frequency of the channel, can sweep only 8 MHz of channel bandwidth due to the USRP1 daughter board constraint. Of the total 160 bins, 75 percent are kept and 25 percent are discarded, split between the lower and upper cut-off frequencies of the channel (12.5 percent each). The program senses the power level from bin 20 to bin 140. The sensing data was captured with the power and SNR features, once with active transmission and once with no transmission. All the sensing data are labeled as the “Occupied” and “Free” classes with respect to the known occupied and free channels respectively. In a low SNR environment (fading channels), where the noise level is high, a low amplitude signal cannot be distinguished even though it is present. This work exploits the signal power and SNR features collected in the test bed to make a good spectrum decision in such conditions by employing supervised learning. The conventional energy detection method may cause misdetection of the signal, as it fails in a low SNR environment. Here, every supervised learning model is built not only on the received signal power but also on the SNR feature, so that even if there is a low power signal in a highly noisy environment the classifier can still detect the signal with a priori knowledge.
Our empirical study clearly reveals that supervised learning gives high classification accuracy by detecting low amplitude signals in a noisy environment. Figure 1: Experimental setup of the CRN test bed comprising USRP 1. Figure 2: Performance of classifiers with different numbers of testing samples and average F1 measure. N Basumatary, N Sarma, B Nath, Applying Classification Methods in Spectrum Sensing in Cognitive Radio Networks: An Empirical Study. In ETAEERE 2016 (Springer Conference) (in press).
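The bin arithmetic quoted above can be reproduced directly. The variable names are illustrative and are not taken from sensing.py.

```python
# Parameters quoted in the text.
sample_rate_hz = 1_000_000      # 1 Mega samples per second
sub_channel_bw_hz = 6_250       # 6.25 kHz sub-channel bandwidth

# Number of FFT bins per channel.
total_bins = sample_rate_hz // sub_channel_bw_hz

# 12.5 percent of the bins are discarded at each edge of the channel.
discard_each_side = int(total_bins * 0.125)
first_bin = discard_each_side
last_bin = total_bins - discard_each_side

# Share of bins that remain after trimming both edges.
kept_fraction = (last_bin - first_bin) / total_bins
```

This gives 160 bins in total, 20 discarded at each edge, and a sensed range from bin 20 to bin 140, which is 75 percent of the channel.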

Some Achievements: Natural Language Processing • Compute and analyze VOT (Voice Onset Time) values for the stops of the Assamese language and its dialectal variants to provide a better understanding of the phonological differences that exist among the different dialectal variants of a language. • To develop a speech corpus. • Computational Modeling of Morphology and Syntax of Manipuri, a resource poor Tibeto-Burman language. • Computational Modeling of Morphology and Syntax of Assamese, a resource poor inflectional language.

Speech and Natural Language Processing Incorporating Dialectal Features in Synthesized Speech • Objective 1: Compute and analyze VOT (Voice Onset Time) values for the stops of the Assamese language and its dialectal variants to provide a better understanding of the phonological differences that exist among the different dialectal variants of a language, which may prove useful for dialect translation and synthesis. • Tasks: Compute and analyze the VOT values for the stops of the Assamese language and its dialectal variants (Nalbaria variety) and find their position in the standard VOT continuum. • Subtask 1: Development of Speech Corpus • A list of words having the voiced/voiceless plosives in word-initial position followed by the vowel sounds ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’ is prepared and recorded from 4 speakers (2 speaking the AIR variety and 2 speaking the Nalbaria variety) at a sampling rate of 44.1 kHz and 16 bit resolution in a noise free environment. • Subtask 2: VOT measurement • PRAAT speech analysis software is used to generate the waveform and spectrogram for each word utterance containing the plosive in word-initial position. On each waveform, 2 points in time are located: the onset of burst release, marked by the onset of low amplitude aperiodic noise, and the onset of voicing, marked by the onset of high amplitude periodic energy. Figure: Classification of Assamese stop consonants. Sanghamitra Nath, Himangshu Sarma, and Utpal Sharma. A preliminary study on the VOT patterns of the Assamese language and its Nalbaria variety. In Computational Linguistics and Intelligent Text Processing, pages 542-552. Springer, 2014
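Once the two points in time have been marked, VOT is simply the interval between them. The sketch below assumes the marks are exported in seconds; the function name and the sample timestamps are hypothetical.

```python
def vot_ms(burst_onset_s, voicing_onset_s):
    """Voice onset time in milliseconds: time from burst release to the
    onset of voicing. A negative value means voicing starts before the
    burst (voicing lead); a positive value means it starts after (lag)."""
    return (voicing_onset_s - burst_onset_s) * 1000.0

# Hypothetical marks for two utterances.
lag_example = vot_ms(burst_onset_s=0.100, voicing_onset_s=0.125)   # about +25 ms
lead_example = vot_ms(burst_onset_s=0.100, voicing_onset_s=0.040)  # about -60 ms
```

The sign convention matters for the results that follow: lead values are negative and lag values are positive, so a voiced stop with both positive and negative VOT values spans both categories.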

Results: • The VOT (lead) range for both varieties of Assamese is similar to the standard range, although the maximum value is much larger. • The VOT (short lag) range for the AIR variety falls into the standard range, but the maximum value for the Nalbaria variety is larger. • The range for the aspirated stops needs to be extended on both ends. • VOT for the voiced stops in the Nalbaria variety has both positive and negative values. • Stops in the AIR variety are much more aspirated than the stops in the Nalbaria variety. Conclusions: • VOT values for the two varieties of Assamese under study show differences which can be used for dialect identification/recognition. • It is likely that VOT will also make a substantial difference in the synthesis of the Assamese dialects. Experiments on speech synthesis with varying VOT values are yet to be carried out. Objective 2: The formant structure of vowels and diphthongs is important for distinguishing the vowel/diphthong sounds from each other. Furthermore, accurate estimation of segmental duration is crucial for natural sounding text-to-speech synthesis. Therefore, analyzing the vowels and diphthongs with respect to formants and segmental duration may reveal information that might help in dialect recognition and synthesis. Tasks: • Development of Speech Corpus: A list of words having the vowels and diphthongs in word-initial, medial and final positions is prepared and recorded from 4 speakers (2 speaking the AIR variety and 2 speaking the Nalbaria variety) at a sampling rate of 44.1 kHz and 16 bit resolution in a noise free environment using a Sony recorder. • Measurement of formants and vowel duration: A PRAAT script extracts formants F1 and F2 at 25%, 50% and 75% of the vowel length and at 20%, 40%, 60% and 80% of the diphthong duration, and also the duration of vowel and diphthong segments. The Euclidean distance between the nucleus and the offglide of a diphthong is calculated and recorded in an Excel sheet.
• Observations: • In the Nalbaria variety, /a/ is closer to /aa/, i.e., the backness of /a/ is less than in the AIR variety, while /u/ is more central and /o/ is more back; in the AIR variety, /u/ is more back and /o/ is more central. S Nath and U Sharma. An analysis of the vowels and diphthongs of the Assamese language and its Nalbaria variety. In Computing and Communication Systems (I3CS), 2015 International Conference on, 2015.
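The nucleus-to-offglide distance mentioned in the tasks is a Euclidean distance in the F1-F2 plane. The sketch below assumes each point is an (F1, F2) pair in Hz. Which of the sampled time points count as nucleus and offglide is not stated on the slide, and the formant values here are made up.

```python
import math

def formant_distance(nucleus, offglide):
    """Euclidean distance in the F1-F2 plane between the nucleus and the
    offglide of a diphthong; each argument is an (F1, F2) pair in Hz."""
    (f1_n, f2_n), (f1_o, f2_o) = nucleus, offglide
    return math.hypot(f1_o - f1_n, f2_o - f2_n)

# Hypothetical formant measurements for one diphthong token.
distance_hz = formant_distance(nucleus=(700.0, 1200.0), offglide=(400.0, 1600.0))
```

A larger distance means the glide travels further across the vowel space, which is the sense in which a diphthong is described as more prominent.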

Observations (contd.):
• In almost all cases, the distance between nucleus and offglide is much larger in the AIR diphthongs, making them more prominent.
• The dynamic F1-F2 plot of most AIR diphthongs almost reaches the target vowel, while the dynamic F1-F2 plot of most Nalbaria diphthongs lies somewhere between the vowel sounds /i/ and /aa/.
• The duration of vowels in Nalbaria is much shorter than the duration of vowels in AIR.

Computational Modelling of Morphology and Syntax of Assamese: a resource-poor inflectional language
A. Stemming of Words
Objective: Automatic identification of the stem of words occurring in texts.
◦ Experimented with Assamese, Bengali, Bishnupriya Manipuri and Bodo.
◦ Developed a rule-based approach to remove suffixes from words, using a dictionary of frequent words to reduce over-stemming and under-stemming.
◦ To deal with problems caused by the large number of single-letter suffixes, proposed an HMM-based hybrid approach.
◦ Obtained accuracy of 94% for Assamese and Bengali using the hybrid approach, which is an improvement over existing methods.
◦ Obtained accuracy of 87% and 82% for Bishnupriya Manipuri and Bodo, respectively. This is the first reported work on these two languages.
Navanath Saharia, Utpal Sharma and Jugal Kalita. Stemming resource-poor Indian languages. ACM Transactions on Asian Language Information Processing (TALIP), vol. 13, no. 3, article 14, pp. 14.1-14.26 (26 pages), September 2014. DOI: http://dx.doi.org/10.1145/2629670
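The rule-based stemming step described above (suffix removal backed by a dictionary of frequent words) might look roughly like the following Python sketch. It is only an illustration of the general idea: the suffix list, the dictionary entries and the minimum stem length are invented placeholders, not the rules or data from the cited work, and the HMM-based hybrid stage is not shown.

```python
# Placeholder suffixes, longest first so that the longest match wins.
# These are invented tokens, not real Assamese, Bengali, Bishnupriya
# Manipuri or Bodo suffixes.
SUFFIXES = sorted(["tabo", "ta", "a"], key=len, reverse=True)

# Hypothetical dictionary of frequent words mapped to their known stems.
# A lookup here bypasses suffix stripping, which guards against
# over-stemming words that merely end in a suffix-like string.
FREQUENT_WORDS = {"dola": "dola"}

def stem(word, min_stem_len=2):
    """Return the stem of `word` using dictionary lookup, then suffix stripping."""
    if word in FREQUENT_WORDS:
        return FREQUENT_WORDS[word]
    for suffix in SUFFIXES:
        # Strip only if enough of the word would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["rolatabo", "rolata", "dola", "ka"]:
    print(w, "->", stem(w))
```

The single-letter placeholder suffix "a" shows why short suffixes are troublesome: without the dictionary entry, "dola" would be cut to "dol" even though it is already a stem.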

B. Parsing of Assamese Sentences
Objective: Recognising the syntactic structure of Assamese sentences.
◦ Experimented with Assamese, a morphologically rich, inflectional and resource-poor Indian language.
◦ Developed a hierarchical part-of-speech tagset suitable for Assamese.
◦ Performed part-of-speech tagging for Assamese using a rule-based approach augmented with a dictionary.
◦ Performed part-of-speech tagging for Assamese using an HMM-based approach.
◦ Identified multi-word units in texts.
◦ Explored three dependency parsing models for Assamese, viz. Link grammar parsing, Malt parsing and MST parsing.
◦ Developed an Assamese TreeBank, a repository to store the parsed sentences.
Navanath Saharia. Computational Morphology and Syntax for a Resource-Poor Inflectional Language. PhD Thesis, Tezpur University, 2014.

Computational Modeling of Syntax of Manipuri: a resource-poor Tibeto-Burman language
Objective: Syntax modeling and development of an effective parser for Manipuri.
◦ Collected a raw corpus of about 16 million words from Manipuri newspapers available in the public domain. This is in addition to a corpus of about 1.4 million words obtained under research license from the Technology Development for Indian Languages (TDIL) Programme, DeitY, MC & IT, Govt. of India.
◦ Developed transliteration software to convert the collected newspaper articles into Unicode (UTF-8) format.
◦ Studied the syntactic structure of Manipuri and identified candidate frameworks for the syntax model to be developed (CFG, TAG, etc.).
◦ Identified implementation issues for Manipuri parsers.

Some Achievements: Bio-mimetic and Cognitive Robotics
• Development of a combined diagrammatic reasoning and qualitative spatio-temporal reasoning framework to detect motion events in video.
• Extension of CORE9 for human activity recognition.
• Intent recognition in a generalized framework for collaboration.

Real-Time EMG-based Prosthetic Hand Control Design
Mantoo Kaibarta, Nayan M. Kakoty and Shyamanta M. Hazarika
Tezpur University, Tezpur, Assam - 784028

Abstract and Findings: An EMG-based five-fingered prosthetic hand control design that works in real time is being developed to assist people with upper-limb injury or impairment. The EMG is captured non-invasively from the surface of the subject's hand muscle. The experimental results show that, because the embedded system runs in real time, the system has great potential for portable applications. Although a number of EMG-based prosthetic hands exist, we focus on a design with fewer electrodes (channels), low cost and high efficiency.

Objective: Design a classifier that classifies the six grasps shown in Figure 1 in real time. The EMG signal pattern changes with the subject, the location of the electrode on the hand muscle and environmental conditions. Our goal is to implement a robust system.

Processing the EMG Signal in Real Time: The samples are collected at a 1 kHz sampling rate for 100 ms. The dsPIC33FJ128GP802 microcontroller from Microchip is chosen for the EMG signal processing. It has 28 pins, a 3.3 V DC power supply, a 16-bit data path, 128 KB of ROM and 16 KB of SRAM.

Conclusion and Future Work: We are able to detect and process the EMG signal with reasonable accuracy in real time with one channel. Our focus is on implementing classification of six grasp types using SVM with two channels of 16-bit EMG data. For filtering and extracting a better EMG signal pattern, we will take advantage of the wavelet technique. Our aim is to make the circuit board as small as possible so that it fits within the prosthetic hand.

Figures:
• Figure 1: Types of Grasp
• Figure 2: Basic Block Diagram of the Complete System (blocks: Subject, Data Acquisition, Data Processing, Motor Driver, Prosthetic Hand)
• Figure 3: Summation of Entire Sample (Testing on PC); panels: 1. EMG Signal, 2. Applying Summation Feature
• Figure 4: Prosthetic Hand

References
• N. M. Kakoty and S. M. Hazarika, "Recognition of Grasp Types through Principal Components of DWT based EMG Features", 12th International Conference on Rehabilitation Robotics, Zurich, Switzerland, June 2011.
• P.R.S. Sanches, A.F. Muller, L. Carro, A.A. Susin and P. Nohama, "Analog reconfigurable techniques for EMG signal processing", Sociedade Brasileira de Engenharia Biomedica, v.23, n.2, p. 153-157, April 2007.

Real-Time EMG-based Prosthetic Hand Control Design
Mantoo Kaibarta, Asst. Prof. Nayan M. Kakoty, Prof. Shyamanta M. Hazarika

Placement of Electrodes
• Figure 3: Placement of Electrodes (labels: Electrode1, Electrode2, Channel 1; Flexor Digitorum Profundus shown in blue, Extensor Digitorum Muscle shown in purple)
• Figure 3.4: Processing of EMG Signal (Testing on PC); panels: 1. Raw EMG Signal, 2. Filtered Signal (Difference Filter), 3. Applying Summation Feature
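The three processing stages named on the poster (raw EMG signal, difference filter, summation feature) can be sketched in Python for testing on a PC. This is a minimal sketch under stated assumptions: the difference filter is taken to be a first-order difference of consecutive samples, and the summation feature is taken to be the sum of absolute values over one window. Neither definition is given on the poster, and the function names and synthetic signals are hypothetical. The window length follows from the stated 1 kHz sampling rate and 100 ms capture.

```python
SAMPLE_RATE_HZ = 1000                                  # sampling rate from the poster
WINDOW_MS = 100                                        # capture duration from the poster
WINDOW_LEN = SAMPLE_RATE_HZ * WINDOW_MS // 1000        # 100 samples per window

def difference_filter(samples):
    """First-order difference (assumed form); removes the constant baseline."""
    return [samples[i] - samples[i - 1] for i in range(1, len(samples))]

def summation_feature(samples):
    """Sum of absolute values over the window (assumed definition)."""
    return sum(abs(x) for x in samples)

def window_feature(raw_window):
    """Raw window -> difference filter -> summation feature."""
    return summation_feature(difference_filter(raw_window))

# Synthetic ADC-like data, not recorded EMG: a flat baseline for a relaxed
# muscle and an alternating signal around the same baseline for a contraction.
rest = [512] * WINDOW_LEN
active = [512 + 50 if i % 2 == 0 else 512 - 50 for i in range(WINDOW_LEN)]

print(window_feature(rest))
print(window_feature(active))
```

With these definitions, a single number per 100 ms window separates the flat signal from the active one, which is the kind of low-cost feature that suits a small microcontroller.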

Development of Cluster Facilities to Support Big Data Analytics
The facilities are intended to support the following objectives:
• Generate a DDoS attack-centric alert dataset using multiple defense sensors to validate alert correlation methods.
• Develop an Alert Correlation Analyzer using Granger causality over very large alert datasets.
• Use the Theano or PyCUDA platform for classification of big data using Deep Learning with Alternate Dropping.
• Extract network modules from voluminous gene expression data using a multi-objective approach towards disease gene identification.
• Develop an unsupervised differential analysis method to analyze disease genes in progression.
