Constructing Social Networks in the Bible

Lemuel Kumarga

Apr 2018

Problem Description

Social circles form huge parts of our lives. They include individuals that we interact with, how often we interact with them and the mode of communication used for interactions. With the rise of digital communication and networking, these interactions are carefully recorded through the use of modern tools and algorithms. A brief look at our social networking sites, such as Facebook and LinkedIn, allows us to easily gather information that characterizes our social circles, such as network of friends, frequency of communication, etc.

Unfortunately, information about our social circles has not always been as readily available. People who lived before the 21st century did not have access to the data-rich information of digital communication, nor did they have the tools to analyze large quantities of daily interactions. However, by synthesizing modern concepts with historical records, we could potentially unearth some information regarding these individuals' social circles. In the case of this project, we will use Natural Language Processing (NLP) to construct a social network for the bible.

To skip the methodology and proceed straight into the network, please click here.


First load the necessary modules for this exercise.

In [1]:
import sys
import defaults as _d
import helper as _h

# Load All Main Modules

# Load All Submodules
from collections import OrderedDict
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import matplotlib.patches as mpatches
import plotly.offline as py
import plotly.graph_objs as py_go
from sklearn.cluster import MeanShift

# If you can't find the module, run in python
from nltk import sent_tokenize, word_tokenize


Also, we will construct helper functions to be used later on.

In [2]:
# -------------------------------------
# Genre-Related Functions
# -------------------------------------
def __get_genre_groups():
    global _genre_group
    if "_genre_group" not in globals():
        _genre_group = bible.groupby("Genre",sort=False)
    return _genre_group

def __get_genre_colors():
    global _genre_colors
    if "_genre_colors" not in globals():
        color_pal = _d.get_color("palette")(len(__get_genre_groups()))
        color_dict = dict()
        ind = 0
        for name, _ in __get_genre_groups():
            color_dict[name] = color_pal[ind]
            ind += 1
        _genre_colors = color_dict
    return _genre_colors

def __get_genre_legends(rev = True):
    global _genre_legends
    global _genre_legends_rev
    if "_genre_legends" not in globals():
        _genre_legends = [mpatches.Patch(color=_d.bg_color,label="Genre")]
        for name, group in __get_genre_groups():
            legend_text = name + " (" + group.index[0]
            if (len(group.index) > 1):
                legend_text += " - " + group.index[-1]
            legend_text += ")"
            _genre_legends.append(mpatches.Patch(color=__get_genre_colors()[name], label=legend_text))
        _genre_legends_rev = _genre_legends[:0:-1]
    if rev:
        return _genre_legends_rev
        return _genre_legends

# -------------------------------------
# Word-Cloud Related Functions
# -------------------------------------
def __word_cloud(input, fig_size = (20,10), image = None, colors = None):
    # Step 1: If there is an image specified, we need to create a mask
    mask = None
    if (image != None):
        mask = np.array(
        if (colors == "image_colors"):
            colors = wordcloud.ImageColorGenerator(mask)
    # Step 2: Set up default colors
    def_colors = mpl.colors.ListedColormap(_d.get_color())
    # Step 3: Generate Word Cloud
    wc = wordcloud.WordCloud(height=fig_size[1]*100,
                             mask = mask,
                             colormap = def_colors,
                             color_func = colors).generate_from_frequencies(input)

    # Step 4: Plot Word Cloud
    plt.imshow(wc, interpolation='bilinear')

def __wc_color_func(character_freq_by_genre):
    # Create color functions to determine the genre most associated with the character
    def color_func(word, font_size, position, orientation, **kwargs):
        most_common_genre = character_freq_by_genre[word].most_common(1)[0][0]
        intensity = 1. * character_freq_by_genre[word][most_common_genre] / sum(character_freq_by_genre[word].values())
        return _d.pollute_color(__min_color, __get_genre_colors()[most_common_genre],intensity)
    return color_func
__get_legend_separator = mpatches.Patch(color=_d.bg_color,label="")    
def __get_minmax_legends(input, title, key_format = "{:.2f}"):
    output = []
    max_item = max(input.items(), key=operator.itemgetter(1))
    output.append(mlines.Line2D([0], [0], marker='o', color=_d.bg_color, label="Max: " + key_format.format(max_item[1]) + " - " + max_item[0],
                      markerfacecolor=_d.ltxt_color, markersize=20))
    min_item = min(input.items(), key=operator.itemgetter(1))
    output.append(mlines.Line2D([0], [0], marker='o', color=_d.bg_color, label="Min: " + key_format.format(min_item[1]) + " - " + min_item[0],
                      markerfacecolor=_d.ltxt_color, markersize=10))
    return output

__min_color = _d.pollute_color(_d.bg_color,_d.txt_color,0.4)
def __get_saturate_legends(title):
    output = []
    output.append(mpatches.Patch(color=_d.get_color(0),label="Concentrated In 1 Genre"))
    output.append(mpatches.Patch(color=_d.pollute_color(__min_color,_d.get_color(0),0.3), label="Spread Out Across\nMultiple Genres"))
    return output


Loading the Data

In this exercise, we will be using the bible corpus from Kaggle. The data will be stored in abbreviated book keys, with each book containing the following attributes:

  • Book Name: Full name of the book
  • Testament: New (NT) or old (OT)
  • Genre: Genre of the book
  • Chapters: Number of chapters
  • Verses: Total number of verses
  • Text: The actual text of the book
In [3]:
# Get all book statistics
abb = pd.read_csv("data/key_abbreviations_english.csv")\
        .query('p == 1')[["a","b"]]\
        .rename(columns={"a" : "Key"})
ot_nt = pd.read_csv("data/key_english.csv")\
          .rename(columns={"n" : "Name", "t" : "Testament"})
genres = pd.read_csv("data/key_genre_english.csv")\
           .rename(columns={"n" : "Genre"})

# Load the main biblical text
bible = pd.read_csv("data/t_asv.csv")\
          .groupby("b", as_index=False)\
          .agg({"c": pd.Series.nunique, "v": "size", "t":" ".join})\
          .rename(columns={"c": "Chapters","v": "Verses","t": "Text"})
# Perform some cleaning
bible['Text'] = bible['Text'].apply(lambda t: re.sub("[`]|['][^s]","",t))

# Join the remaining book statistics
bible = bible.join(abb.set_index('b'), on='b')\
             .join(ot_nt.set_index('b'), on='b')\
             .join(genres.set_index('g'), on='g')\
             .drop(['b', 'g'], axis=1)\
# Show the first few lines
Name Testament Genre Chapters Verses Text
Gen Genesis OT Law 50 1533 In the beginning God created the heavens and t...
Exo Exodus OT Law 40 1213 Now these are the names of the sons of Israel,...
Lev Leviticus OT Law 27 859 And Jehovah called unto Moses, and spake unto ...
Num Numbers OT Law 36 1288 And Jehovah spake unto Moses in the wilderness...
Deut Deuteronomy OT Law 34 959 These are the words which Moses spake unto all...

About the Data

We will also derive some language statistics from each book, mainly:

  • Sentences: Number of sentences in the book, and
  • Words: Number of words in the book.
In [4]:
# Add Sentences and Words columns
bible["Sentences"] = pd.Series(0, index=bible.index)
bible["Words"] = pd.Series(0, index=bible.index)

# Save Tokens
def get_tokens():
    sent_tokens = OrderedDict()
    word_tokens = OrderedDict()
    for i, r in bible[["Text"]].iterrows():
        txt =
        sent_tokens[i] = sent_tokenize(txt)
        word_tokens[i] = word_tokenize(txt)
    return (sent_tokens, word_tokens)

sent_tokens, word_tokens = _h.cache(get_tokens, "bible_tokens")

for i in bible.index:[i,'Sentences'] = len(sent_tokens[i])
    # Remove Punctuation[i,'Words'] = len([w for w in word_tokens[i] if re.match('\w+',w)])

# Show
Name Testament Genre Chapters Verses Sentences Words
Gen Genesis OT Law 50 1533 1753 38040
Exo Exodus OT Law 40 1213 1112 32078
Lev Leviticus OT Law 27 859 662 23773
Num Numbers OT Law 36 1288 996 31911
Deut Deuteronomy OT Law 34 959 740 27890

Book Length

One of the most intuitive ways to understand the books' uneven distribution is to assume that we are doing devotions each chapter a day. Under such a scenario, we will have the following timeline:

In [5]:
# Create Plots
yticks = []
ylabels = []
x_progress = 0
x_length = sum(bible["Chapters"])
y_progress = 0
y_length = len(bible["Chapters"])               
for name, group in __get_genre_groups():

    row_ids = [ bible.index.get_loc(i) for i in group.index ]

    # Part 1: Bars When Genre Is Still Being Read
    length = 0
    # For each book in the genre
    for idx in row_ids:

        # If we are reading this book in the anniversary 
        if (math.floor((x_progress + length)/365) < math.floor((x_progress + length + bible["Chapters"][idx])/365)):
            yticks.append(idx + 1)
            ylabels.append("{} ({}%)".format(bible.index[idx],round(idx/y_length * 100)))

        plt.broken_barh([(x_progress + length, bible["Chapters"][idx])],
                        (y_progress, (idx + 1) - y_progress),
                        facecolors = __get_genre_colors()[name])
        length += bible["Chapters"][idx]
    # Part 2: Bars When Genre has Been Read
    plt.broken_barh([(x_progress + length, x_length - x_progress - length)],
                    (y_progress, max(row_ids) + 1 - y_progress), 
                    facecolors = __get_genre_colors()[name])
    x_progress += length
    y_progress = max(row_ids) + 1

# Add Titles and Grid
plt.title("Chapter Distribution by Book")
plt.grid(color=_d.fade_color(_d.ltxt_color,0.5), linestyle='dashed')

# Add X-Axis Details
plt.xlabel("Time Since Start")
xticks = [365, 2 * 365, 3 * 365 ,sum(bible["Chapters"])]
xlabels = [ "Year 1", "Year 2", "Year 3", "Year 3\nMonth 3" ]
plt.xticks(xticks, xlabels)

# Add Y-Axis Details
ylabels.append("{} ({}%)".format(bible.index[-1],round(1 * 100)))
plt.ylabel("% of Books Completed")
plt.yticks(yticks, ylabels)
plt.ylim(0, y_length)

# Add Legends
plt.legend(handles=__get_genre_legends(), bbox_to_anchor=[1.27, 1.0])

By the 1st year, we would have only completed 18% of the bible. If this is not discouraging enough, after a further year, we would still not have completed the Old Testament (Law to Prophets). However, upon reaching the New Testament (Gospels to Apocalyptic), we could complete the whole set of books within 9 months. The Old Testament is deceivingly at least 3 times longer than the New Testament!

Chapter Length

Assuming that the average human reads 200 words per minute, we can also estimate how long it will take to read 1 chapter a day:

In [6]:
bible["Minutes_p_Chapter"] = bible["Words"] / bible["Chapters"] / 200.
inputs = []

deg_incr = 360. / len(bible.index)
for name, group in __get_genre_groups():
    # Insert Legend Item
            r = [0, 0, 0, 0],
            theta = [0, 0, 0, 0],
            name = name,
            legendgroup = name,
            mode = 'none',
            fill = 'toself',
            fillcolor = __get_genre_colors()[name],
            showlegend = True
    # Insert Each Book
    for key, val in group["Minutes_p_Chapter"].items():
                r = [0, val, val, 0],
                theta = [0,bible.index.get_loc(key)*deg_incr,(bible.index.get_loc(key)+1)*deg_incr,0],
                name = bible["Name"][key],
                legendgroup = name,
                mode = 'none',
                hoverinfo ='text',
                text=bible["Name"][key] + ": " + "{:.1f}".format(val) + " min",
                fill = 'toself',
                fillcolor = __get_genre_colors()[name],
                showlegend = False

layout = py_go.Layout(_d.py_layout)
layout["autosize"] = False
layout["width"] = 450
layout["height"] = 350
layout["margin"] = dict(t=80,l=0,r=0,b=20)
layout["title"] = "Minutes Required to Read a Chapter"

fig = py_go.Figure(data=inputs, layout=layout)
py.iplot(fig, config=_d.py_config)

From the chart above, we conclude that chapter lengths across books are varied as well. For example, a chapter in 1 Kings will take around 5.5 minutes to read, while a chapter in Psalms will take around 1.5 minutes.

Preliminary Insights

After obtaining an overview of the bible, we move to investigate the occurrences of various characters in the book.

The Trinity

The first point of interest is how much God appears at different books in the bible:

In [7]:
def find_occurence(regex):
    output = OrderedDict()
    for name, group in __get_genre_groups():
        l = [len(re.findall(regex, for _, wt in group[["Text"]].iterrows()]
        output[name] = (len(l),sum(l)/len(l))
    return output

entityToSearch = OrderedDict([('God', 'God|Lord|GOD|LORD'),

ind = 0
# Construct Plots for Each Entity
f, splt = plt.subplots(1,len(entityToSearch.items()), figsize=(20,5))
for title, regex in entityToSearch.items():
    occurences = find_occurence(regex)
    x = 0
    for n, v in occurences.items():
        splt[ind].bar([x + v[0]/2],
                      color = __get_genre_colors()[n],
                      width = v[0])
        x += v[0]
    ind += 1

# Insert Legends
plt.legend(handles=__get_genre_legends(False), bbox_to_anchor = [2.2, 1.05])

Unsurprisingly, words associated with God the Father (Jehovah/Father) appear prominently in the Old Testament, while words associated with God the Son (Jesus/Christ) hit high frequencies in the Gospel narratives. Word counts of the Spirit appear the highest in Acts. This sequence is in line with the story of the Gospel, where the events first transcribed were between God the Father and His people, followed by Jesus Christ and his believers, and finally with the Holy Spirit and the church.

(Note: One limitation of such an approach is the failure to capture symbols pointing to God. For example, words such as "Lamb" in Revelations correspond to Christ, but such symbols were excluded as they would introduce false positives.)

Major Characters

Using external sources, we can also obtain a list of major characters in the bible. This list can then be used as a reference for detecting names:

In [8]:
# Characters obtained from
characters_regex = 'Adam|Seth|Enos|Kenan|Mahalalel|Jared|Enoch|Methuselah|Lamech|Noah|Shem|Adam|Cain|Enoch|Irad|Mehujael|Methusael|Lamech|Tubal-cain|Arpachshad|Shelah|Eber|Peleg|Reu|Serug|Nahor|Terah|Abraham|Isaac|Jacob|Judah|Perez|Hezron|Ram|Amminadab|Nahshon|Salmon|Boaz|Obed|Jesse|David|Abel|Kenan|Enoch|Noah |Abraham|Isaac|Jacob|Joseph|Sarah|Rebecca|Rachel|Leah|Moses|Aaron|Miriam|Eldad|Medad|Phinehas|Joshua|Deborah|Gideon|Eli|Elkanah|Hannah|Abigail|Samuel|Gad|Nathan|David|Solomon|Jeduthun|Ahijah|Elijah|Elisha|Shemaiah|Iddo|Hanani|Jehu|Micaiah|Jahaziel|Eliezer|Zechariah|Huldah|Isaiah|Jeremiah|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Beor|Balaam|Job|Amoz|Beeri|Baruch|Agur|Uriah|Buzi|Mordecai|Esther|Oded|Azariah|Abimelech|Saul|Ish-boseth|David|Solomon|Jeroboam|Nadab|Baasha|Elah|Zimri|Tibni|Omri|Ahab|Ahaziah|Jehoram|Jehu|Jehoahaz|Jehoash|Jeroboam|Zechariah|Shallum|Menahem|Pekahiah|Pekah|Hoshea|Rehoboam|Abijam|Asa|Jehoshaphat|Jehoram|Ahaziah|Athaliah|Jehoash|Amaziah|Uzziah|Jotham|Ahaz|Hezekiah|Manasseh|Amon|Josiah|Jehoahaz|Jehoiakim|Jeconiah|Zedekiah|Simon|John|Aristobulus|Alexander|Hyrcanus|Aristobulus|Antigonus|Herod|Herod|Herod|Philip|Salome|Agrippa|Agrippa|Simon|Aaron|Eleazar|Eli|Phinehas|Asher|Benjamin|Dan|Gad|Issachar|Joseph|Ephraim|Manasseh|Judah|Levi|Naphtali|Reuben|Simeon|Zebulun|Jesus|Mary|Joseph|James|Jude|Joses|Simon|Peter|Andrew|James|John|Philip|Bartholomew|Thomas|Matthew|James|Judas|Simon|Judas|Matthias|Paul|Barnabas|James|Jude|Caiaphas|Annas|Zechariah|Agabus|Anna|Simeon|John|Apollos|Aquila|Dionysius|Epaphras|Joseph|Lazarus|Luke|Mark|Martha|Mary|Mary|Nicodemus|Onesimus|Philemon'
character_freq = []
for name, group in __get_genre_groups():
    names = [re.findall(characters_regex, for _, wt in group[["Text"]].iterrows()]
    l = [(w,name) for l in names for w in l]

# The frequency of each character occurence by genre
character_freq = nltk.ConditionalFreqDist(character_freq)

# Plot word cloud for each name
inputs = {}
for n, fd in character_freq.items():
    inputs[n] = sum(fd.values())
__word_cloud(inputs, colors=__wc_color_func(character_freq))

# Titles
plt.title("Major Character Occurences")

# Legends
legend_cloud = list(__get_genre_legends(False))
legend_cloud.extend(__get_minmax_legends(inputs, "Word Count","{:d}"))
plt.legend(handles=legend_cloud, bbox_to_anchor = [1.31, 1.])

Based on the graph, David appears the most in the bible. In addition, his appearances are concentrated within the History genre. This is in stark-contrast to Jesus, whose name appeared across multiple genres.


In order to construct a social network, we first need to identify relevant characters in the bible. One approach is to find a list of names from external sources, and then use that to identify the names. However, this method is unscalable. To illustrate, suppose we would like to construct a similar network for "Oliver Twist". Then, we would need to find a list of names associated with the book. But what happens if we are not able to find such a list?

Therefore, to reduce reliance on external sources, we need to develop a more robust approach for name-identification.

Finding the Entities

Fortunately, we are able to capture names due to the nature of English linguistics. Names fall under the category of "Proper Nouns", which we can detect using Part-of-Speech (POS) tagging:

In [9]:
def get_proper_noun_tokens():
    tagged_word_tokens = OrderedDict((n, nltk.tag.pos_tag(wt)) for n, wt in word_tokens.items())
    # Extract Only Proper Nouns and Add Index
    proper_noun_tokens = OrderedDict((n, [(i, w[0]) for i, w in enumerate(wt) if w[1] == "NNP"]) for n, wt in tagged_word_tokens.items())
    return proper_noun_tokens
proper_noun_tokens = _h.cache(get_proper_noun_tokens, "ppn_tokens")
# Print 100 Most Common Words
noun_freq = nltk.FreqDist(w for n,wt in proper_noun_tokens.items() for i, w in wt)
", ".join([n for n, v in noun_freq.most_common(50)])
'Jehovah, God, Israel, David, Lord, O, Jesus, Judah, Jerusalem, Thou, Moses, Egypt, Behold, Christ, Saul, Jacob, Aaron, Solomon, Babylon, Spirit, Pharaoh, Son, Joseph, Abraham, Father, Joshua, Jordan, Levites, Go, Thy, Ye, Moab, Benjamin, Ephraim, My, Paul, Peter, Holy, Yea, Manasseh, Zion, A, Joab, Jeremiah, Samuel, Jews, Psalm, John, Isaac, Hezekiah'

Based on the text above, we have captured a majority of names in the bible. However, there are also some false positives such as O, Go, Thy, Ye that need to be removed. It is also interesting to see entities other than people being detected (e.g. Jerusalem, Babylon).

Managing the Cases

The first case to handle is the occurrence of words which are not proper nouns (O, Go, Thy, Ye). To solve this, we simply need to exclude them from consideration:

In [10]:
false_npp = ['O','Thou','Behold','Go','Thy','Ye','My','A','Yea','Thus','Come',
             'Are','Mine','See','Tell','Whoso','Gods','Wilt','Red','Holy','[',']','Mount', 'TR','Please',
# Extract Only Proper Nouns and Add Index
proper_noun_tokens = OrderedDict((n, [(i, w) for i, w in wt if w not in false_npp]) for n, wt in proper_noun_tokens.items())
# Print 100 Most Common Words after excluding False Proper Nouns
noun_freq = nltk.FreqDist(w for n,wt in proper_noun_tokens.items() for i, w in wt)
", ".join([n for n, v in noun_freq.most_common(50)])
'Jehovah, God, Israel, David, Lord, Jesus, Judah, Jerusalem, Moses, Egypt, Christ, Saul, Jacob, Aaron, Solomon, Babylon, Spirit, Pharaoh, Son, Joseph, Abraham, Father, Joshua, Jordan, Levites, Moab, Benjamin, Ephraim, Paul, Peter, Manasseh, Zion, Joab, Jeremiah, Samuel, Jews, John, Isaac, Hezekiah, Assyria, Samaria, Jonathan, Absalom, Ammon, Jeroboam, Gentiles, Gilead, Elijah, Philistines, Canaan'

The second case to consider is non-human entities. Some examples of these are nations (Jerusalem, Babylon), locations (Galilee), symbols (Lord, Father, Son) and false idols (Baal). Since the relationships between non-human entities can yield useful insights, we will not be excluding such words and instead expand our scope from humans to entities.

The Entity Cloud

Using the Proper Noun approach, we can subsequently plot these entities into a word cloud:

In [11]:
# The frequency of each character occurence by genre
character_freq = nltk.ConditionalFreqDist((w[1],bible["Genre"][n]) for n,wt in proper_noun_tokens.items() for w in wt)

# Plot word cloud for each name
inputs = {}
for n, fd in character_freq.items():
    inputs[n] = sum(fd.values())
__word_cloud(inputs, colors=__wc_color_func(character_freq))

# Titles
plt.title("Entities in the Bible")

# Legends
legend_cloud = list(__get_genre_legends(False))
legend_cloud.extend(__get_minmax_legends(inputs, "Word Count","{:d}"))
plt.legend(handles=legend_cloud, bbox_to_anchor = [1.31, 1.])