Vectorization of Nepali sentences

Vectorization is the process of converting categorical data into numerical form. Sentiment analysis works on categorical, sentence-level data, so vectorization is an essential part of sentiment analysis for Nepali sentences as well as for other languages.

In this article, we continue the discussion of sentiment analysis of Nepali sentences with Python. The previous two articles, the introduction and part one, covered the basics of sentiment analysis. This article continues by walking through the Python code and explaining it.

Set of Bag of Words
Sentiment Analysis of Nepali sentences(Part-3)| Vectorization
Bags of words

The word_arrays variable stores the documents split by the split_docs() function, and length_of_docs stores the total number of documents in the data set.

The individual_words function collects the bag of words from the collection of documents as a Python set. The map function takes two arguments: a function to apply, such as the set constructor, and an iterable, here the array of documents. The union operation keeps each word only once.
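As a minimal sketch of this step: the names split_docs, individual_words, word_arrays and length_of_docs come from the article, but the function bodies and the sample documents below are assumptions, since the original code is not reproduced here. Splitting on whitespace is also an assumption.

```python
# Hypothetical sample documents; the article's data set is not shown here.
documents = [
    "म यो मनपरौछु",
    "मेरो फायर",
    "म र मेरो",
]

def split_docs(docs):
    """Split each document into a list of whitespace-separated words."""
    return [doc.split() for doc in docs]

def individual_words(word_arrays):
    """Collect the unique words of all documents as a Python set."""
    # map(set, word_arrays) turns each document's word list into a set;
    # set().union(*...) merges those sets, so every word is kept only once.
    return set().union(*map(set, word_arrays))

word_arrays = split_docs(documents)
length_of_docs = len(word_arrays)
bag_of_words = individual_words(word_arrays)
```

Here "म" and "मेरो" each appear in two documents but only once in bag_of_words.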

Convert Set into List
Set to list conversion

Further calculations work with a list, so the set of individual words is converted into a list. Nepali has many words with slightly different forms but similar meanings, so handling every word form is computationally expensive in sentiment analysis of Nepali sentences.
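The conversion might look like the sketch below. The helper name set_to_list and the sample words are hypothetical, and sorting the words is an assumption made here so that every word gets a stable position; the article only states that the set is converted into a list.

```python
# Hypothetical bag of words; in the article it comes from individual_words.
bag_of_words = {"म", "यो", "मेरो", "र"}

def set_to_list(words):
    """Turn the set of unique words into a list with a stable order."""
    # Sets are unordered, so sorting (an assumption, not taken from the
    # article) gives each word the same position on every run.
    return sorted(words)

convert_into_list = set_to_list(bag_of_words)

# Position of each word in the list, useful when building vectors later.
word_index = {word: i for i, word in enumerate(convert_into_list)}
```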

Occurrence Count
Count occurrence of words on vocabulary list

We count how often each individual word occurs in the document set. Two variables are used for this: word_array (the collection of documents, as a list) and convert_into_list (the individual words, i.e. the bag of words). In this function, each word in the documents is counted and the result is stored as a dictionary.


{"मैले": 1, "मेरो": 5, "फायर": 1, "एचडीलाई": 2, "8": 1, "दुई": 2, "हप्तामा": 3, "गरेको": 2, "छु": 1, "र": 4, "म": 5, "यो": 5, "मनपरौछु": 2}
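A dictionary of this shape could be produced as in the following sketch. The function name count_occurrences and the sample data are hypothetical, and using collections.Counter is a choice made here, not necessarily what the original function does.

```python
from collections import Counter

# Hypothetical inputs: split documents and the vocabulary list.
word_array = [["म", "यो", "मनपरौछु"], ["म", "र", "मेरो"]]
convert_into_list = sorted({"म", "यो", "मनपरौछु", "र", "मेरो"})

def count_occurrences(docs, vocabulary):
    """Count how often each vocabulary word occurs across all documents."""
    # Flatten all documents into one stream of words and count them once.
    counts = Counter(word for doc in docs for word in doc)
    # Keep the vocabulary order and report 0 for words that never occur.
    return {word: counts[word] for word in vocabulary}

word_counts = count_occurrences(word_array, convert_into_list)
```

Counting once with Counter avoids rescanning every document for every vocabulary word, which matters as the vocabulary grows.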

Document vectorize

This function vectorizes each document with respect to the collection of individual words. If the collection contains 2000 individual words, each document is represented by a vector of 2000 values, one per word.
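The fixed vector length can be illustrated with a small sketch. It assumes that each vector entry is the number of times the corresponding vocabulary word occurs in the document; the article's own function is not shown, and the name vectorize and the sample data are hypothetical.

```python
def vectorize(document, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Assumption: the vector value is the word's count inside the document.
    return [document.count(word) for word in vocabulary]

# Hypothetical vocabulary and split documents.
vocabulary = ["म", "मेरो", "यो", "र"]
word_arrays = [["म", "यो"], ["मेरो", "र", "म", "म"]]

vectors = [vectorize(doc, vocabulary) for doc in word_arrays]
print(vectors)
# [[1, 0, 1, 0], [2, 1, 0, 1]]
```

Both documents have different lengths, yet each vector has exactly len(vocabulary) entries.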

Let’s take an example

This is our total word collection

{"मैले": 1, "मेरो": 5, "फायर": 1, "एचडीलाई": 2, "8": 1, "दुई": 2, "हप्तामा": 3, "गरेको": 2, "छु": 1, "र": 4, "म": 5, "यो": 5, "मनपरौछु": 2}

And this is the document:

["गरेको", "छु", "र", "म", "यो", "मनपरौछु"]

The vector array for this document then has one entry per word in the collection, 13 in total.
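As a worked sketch of this example, assume that each entry counts how often the corresponding word of the collection occurs in the document; the values the article's function would produce may differ. The variable names word_collection, document, vocabulary and vector are chosen here for illustration.

```python
word_collection = {
    "मैले": 1, "मेरो": 5, "फायर": 1, "एचडीलाई": 2, "8": 1, "दुई": 2,
    "हप्तामा": 3, "गरेको": 2, "छु": 1, "र": 4, "म": 5, "यो": 5, "मनपरौछु": 2,
}
document = ["गरेको", "छु", "र", "म", "यो", "मनपरौछु"]

# Dictionaries keep insertion order (Python 3.7+), so the keys give a
# fixed word order for the vector positions.
vocabulary = list(word_collection)

# Assumption: the value at each position is the word's count in the document.
vector = [document.count(word) for word in vocabulary]
print(vector)
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

The first seven words of the collection do not occur in the document, so their entries are 0; the remaining six occur once each.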


We will discuss further code in the upcoming article. Are you interested?

To learn more about vectorization, follow this link.

Vector keeps the valuable number with it!