Mapping the network of alphabets

04/03/2011

If you can detect the distances or proximity of named entities, you have a network on your hands. Whether these entities are neighboring countries or parts of the human body, you can analyze these networks using network analysis.

With this realization, every word is a network that you can analyze and visualize. Example: the word “kitten” is made of the letters K I T T E N. With the network approach we can say that there is an edge between the K and the I; the others are I-T, T-T, T-E and E-N. We can see that the letter T is used in 3 pairs of letters. So, what if we expand this same thinking to longer documents? Using a clever formula to separate the letters in a word, Automap to calculate the co-occurrence of the letters, and Gephi to visualize the end result, we get the following map:

The map is made from my master’s thesis (in Finnish). The bigger the letter, the more other letters appear next to it in the sample text. The thickness of an edge represents the frequency of the “interaction”. The most common pairing is S-I or I-S.
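For the curious, the pairing logic is simple enough to sketch in a few lines of Python. This is just an illustration of the idea, not the actual Automap workflow behind the map:

    from collections import Counter

    def letter_edges(word):
        """Return the adjacent-letter pairs of a word as edges."""
        letters = word.upper()
        return list(zip(letters, letters[1:]))

    # "kitten" -> K-I, I-T, T-T, T-E, E-N
    print(letter_edges("kitten"))

    # For a longer document, count how often each pair occurs;
    # the counts become the edge weights of the network.
    text = "kitten mitten sitting"
    pairs = Counter()
    for word in text.split():
        pairs.update(letter_edges(word))
    print(pairs.most_common(3))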

So, why do this? I don’t know, you tell me (this post is part of the mental disorder known as social network analysis). This could have applications in cryptography, language detection or filling in missing characters in documents. It could also be interesting to find out whether different writers have different kinds of alphabet networks. Using the method to detect plagiarism could be another application.

Any other ideas?

Twitter: jattipaa

www.verkostoanatomia.fi

Facebook: Verkostoanatomia

 

10 Responses to “Mapping the network of alphabets”

  1. Petro Says:

    I could have guessed you wrote your thesis about “Social Network Analysis” (= sosiaalisten verkostojen analyysi in Finnish) if I had not known that already!


  2. You’re right, that explains the I-S connection. Good call! It would be interesting to use this method in natural language settings: on different forums or discussion boards. We could get the texture of a language quite easily.

    • Petro Says:

      That’s right. Actually, it is possible to spell the headline of your thesis using only the letters represented in the “hubs” of that network. Is it a coincidence, or have you been using these central words more frequently… Yes, the question is indeed how to use this information. We might be able to locate a text discussing a certain topic by that mechanism, but there are already search engines that work quite well, isn’t that so? BUT: would we possibly be able to locate a text written by a certain person?


  3. I was also wondering about some applications in semantic analysis: are words with a similar semantic load constructed similarly? Is the network structure of a negative emotional text different from that of a positive one? I would need a large repository of such texts to test this.

  4. wudler Says:

    Hey!

    I was playing with the thought of building the same kind of network for the Hungarian language, to see the same connections… and partly to see any relation between the two – supposedly related – languages (although I want to use some literature as raw material, not a master’s thesis about SNA 😛).

    However, Automap has got me a bit stuck, and I can’t figure out how to import such a segmented text file and make a relevant letter network out of it. Could you point me to some guides or explain roughly what I should do?

    Thanks awfully!

    Cheers


    • Sure, will get back to you soon!


    • Forgot to mention: you’ll probably need NodeXL or Ucinet as well.

      First of all, load your text file into Notepad, then copy-paste it into Excel so that each line goes into a separate cell. Then use the macro to separate the individual characters. Copy-paste the result back into Notepad and save it (magyar.txt etc.). Now you should have a text file with all of the letters separated. Make sure the character encoding is UTF-8 throughout the process.
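      If you’d rather skip the Excel macro, something like this little Python script should do the same separation (an untested sketch, assuming a plain UTF-8 magyar.txt):

        # Put a space between every character, like the Excel macro does.
        with open("magyar.txt", encoding="utf-8") as src:
            text = src.read()

        # Keep line breaks; separate all other characters with spaces.
        separated = "\n".join(" ".join(line) for line in text.splitlines())

        with open("magyar_separated.txt", "w", encoding="utf-8") as dst:
            dst.write(separated)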

      The next step (which caused me a lot of headache before I stumbled upon a solution) is to import the text file into Automap. If you try to import the file directly, nothing happens (hence the headache). The trick is to first open some other text file and THEN the magyar.txt. It works fine then, go figure!

      Now go to Preprocess and remove all the numbers, exclamation marks and other characters you don’t want ruining your analysis. After this is done, go to Generate > Semantic Network > Semantic List. Play around with the choices and hit confirm. Automap should now produce a .csv file with the columns “concept, concept, frequency, average distance, shortest distance”. All of the distances should be 1.

      The next step is to separate these into different columns in Excel so that you have rows like | a | a | 1761 | with all of the letter pairings and their frequencies. The frequencies are likely to be high, so you may have to divide them by an appropriate number (relative to the overall character count). Then use the Ucinet DL importer (edgelist) or copy-paste the table into NodeXL so that frequency = tie strength.
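      (If you’re comfortable with Python, you could also skip the Automap/Excel round-trip and count the pairings directly. A rough sketch, with the file names assumed from the steps above:)

        import csv
        from collections import Counter

        # Read the separated letters and count adjacent pairs.
        with open("magyar_separated.txt", encoding="utf-8") as f:
            letters = f.read().split()
        pairs = Counter(zip(letters, letters[1:]))

        # Write a weighted edge list; divide by the total pair count
        # so the tie strengths stay comparable across texts.
        total = float(sum(pairs.values()))
        with open("edges.csv", "w", encoding="utf-8", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["source", "target", "weight"])
            for (a, b), count in pairs.most_common():
                writer.writerow([a, b, count / total])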

      Then export to a .net file and open it in Gephi. Let me know if this works/fails.
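      (And if the .net export gives you trouble, one alternative I haven’t verified end to end: build the graph with the networkx Python library and let it write the Pajek file for Gephi.)

        import csv
        import networkx as nx

        # Build a weighted graph from the edge list written above and
        # save it in Pajek (.net) format, which Gephi opens directly.
        G = nx.Graph()
        with open("edges.csv", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                G.add_edge(row["source"], row["target"], weight=float(row["weight"]))
        nx.write_pajek(G, "letters.net")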

      These are the kinds of network analyses that take the most time but receive the least attention. Fun to do, though!

  5. wudler Says:

    Awesome, thanks a lot for the details! I’ll let you know when I have the results, but that’s likely to happen only during the weekend.

  6. wudler Says:

    Hey!

    I posted the results a couple of days ago on my blog, but only in Hungarian. I wanted to do an English post, but I haven’t yet figured out how to go bilingual. I’ll do it sooner or later. Until then, here’s the post:
    http://latod.wordpress.com/2011/04/10/magyar-abece-halozata/

    Hope Google Translate helps 🙂 Your guide worked like a charm btw!

