Mapping the network of alphabets
If you can detect the distances or proximity between named entities, you have a network on your hands. Whether these entities are neighboring countries or parts of the human body, you can study them with network analysis.
With this realization, every word becomes a network that you can analyze and visualize. For example, the word “kitten” is made of the letters K, I, T, T, E, N. In network terms, there is an edge between K and I; the other edges are I-T, T-T, T-E and E-N. We can see that the letter T appears in three of these pairs. So what if we expand this same thinking to longer documents? Splitting each word into letters this way, using Automap to calculate the co-occurrence of the letters and Gephi to visualize the end result, we get the following map:
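The counting step can be sketched in a few lines of Python. This is a minimal stand-in for what Automap does here, not the actual pipeline used for the map below: it splits each word into letters, counts adjacent pairs as undirected edges, and prints a weighted edge list (which Gephi can import as a CSV):

```python
from collections import Counter

def letter_edges(text):
    """Count adjacent letter pairs (edges) within each word of the text."""
    edges = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for a, b in zip(letters, letters[1:]):
            # Treat the edge as undirected: K-I and I-K are the same pair.
            edges[tuple(sorted((a, b)))] += 1
    return edges

edges = letter_edges("kitten")
# "kitten" yields the edges K-I, I-T, T-T, T-E and E-N, each with weight 1
for (a, b), weight in sorted(edges.items()):
    print(f"{a}-{b}: {weight}")
```

Run on a longer document instead of a single word, the weights grow and the Gephi layout starts to show which letters sit at the center of the network.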
The map is built from my master's thesis (in Finnish). The bigger the letter, the more often other letters appear next to it in the sample text. The thickness of an edge represents the frequency of the “interaction”. The most common pairing is S-I (or I-S).
So, why do this? I don’t know, you tell me (this post is a symptom of the mental disorder known as social network analysis). The approach could have applications in cryptography, language detection or reconstructing missing characters in documents. It would also be interesting to find out whether different writers have different kinds of alphabet networks; using the method to detect plagiarism could be one application.
Any other ideas?