About
Dr. Phil Lieberman, Jewish Studies
Research Context 1
The Torah text was transcribed by medieval scholars as consonants only. This can produce ambiguity. Consider a line from the New York Times written without vowels: the words “sh rd ths” could resolve to many different phrases once the vowels are added. When people read the Torah aloud, however, they supply the vowels, and the text carries punctuation marks that indicate how words are stitched together (similar to hyphens and colons).
The Torah consists of 5 books. However, some scholars believe that it is really 4 books plus Deuteronomy, which may have different origins and may have been added to the Pentateuch later. The initial research question is whether data science methods can reveal similarities and differences between the first 4 books and Deuteronomy.
Proposed Project Approach
Lieberman’s initial approach is as follows:
- Encode the punctuation marks in all 5 books.
- Compare the punctuation patterns in the first 4 books with those in Deuteronomy.
Where do we find similar patterns? Where do we find different patterns? The answers may reveal something about the relationship between the first 4 books and Deuteronomy.
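The encoding and comparison steps could be sketched as follows. This is a minimal illustration, not Lieberman's stated method. It assumes the text is available as Unicode Hebrew that includes the accent marks (U+0591 to U+05AF) and the maqaf, paseq, and sof pasuq punctuation. The function names, the use of mark bigrams as "patterns", and the Jensen-Shannon divergence as the comparison measure are all assumptions made for illustration.

```python
import math
from collections import Counter

# Unicode Hebrew accents (cantillation marks): U+0591 .. U+05AF.
ACCENTS = range(0x0591, 0x05B0)
# Maqaf (hyphen-like), paseq, and sof pasuq (colon-like).
PUNCTUATION = {0x05BE, 0x05C0, 0x05C3}


def is_mark(ch):
    """True if ch is an accent or one of the punctuation marks above."""
    cp = ord(ch)
    return cp in ACCENTS or cp in PUNCTUATION


def mark_sequence(text):
    """Drop consonants and vowel points, keeping only the marks, in order."""
    return [ch for ch in text if is_mark(ch)]


def bigram_distribution(marks):
    """Relative frequency of each pair of adjacent marks."""
    pairs = Counter(zip(marks, marks[1:]))
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2): 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(dist):
        return sum(v * math.log2(v / mix[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

Under these assumptions, one distribution would be built from the concatenated first 4 books and one from Deuteronomy, and `jensen_shannon` would give a single number summarizing how far apart their mark patterns are. Comparing each of the first 4 books against the others would give a baseline for how large a "normal" difference is.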
The DSI-aligned approach
An alternative approach, which may support a more holistic examination of concepts, phrases, and style, is to use transformers. This would allow the full texts to be encoded into a common latent space and then compared by similarity search (for example, with FAISS). The language of the texts is Hebrew.
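The similarity step could look roughly like the sketch below. It assumes each passage has already been encoded into a fixed-length vector by some transformer model that handles Hebrew (the encoder is not shown and no particular model is implied). The brute-force cosine search here stands in for what a FAISS index would do at scale, and the function names and passage labels are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, corpus, k=3):
    """Return the k passages most similar to the query vector.

    corpus maps a passage label to its embedding vector.
    The result is a list of (label, cosine_similarity), best match first.
    """
    q = normalize(query)
    scored = []
    for label, vec in corpus.items():
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), label))
    scored.sort(reverse=True)
    return [(label, score) for score, label in scored[:k]]


# Placeholder vectors for illustration only; these are not real embeddings.
toy_corpus = {
    "passage_a": [1.0, 0.0],
    "passage_b": [0.0, 1.0],
    "passage_c": [1.0, 1.0],
}
nearest = top_k([1.0, 0.1], toy_corpus, k=2)
```

With real embeddings, one question to ask is whether passages from Deuteronomy tend to have other Deuteronomy passages as their nearest neighbors more often than passages from the first 4 books have neighbors from their own book.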
Research Context 2
Lieberman has 250,000 medieval texts (he is a medieval scholar) spanning a wide time period. They are written in Judeo-Arabic, with some Hebrew as well. He believes that the Hebrew vocabulary changes over time and would like to examine what this change looks like, and how it relates to the passage of time and to the historical developments that led to it.
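A first pass at tracking vocabulary over time could be sketched as below. It makes several assumptions that the project would need to check: that each text has at least an approximate year, that whitespace splitting is an adequate tokenizer, and that fixed-width periods (100 years here) are a sensible grouping. Separating Hebrew vocabulary from Judeo-Arabic would need a lexicon or classifier and is not attempted. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def period_of(year, width=100):
    """Start year of the fixed-width period containing year (1150 -> 1100)."""
    return (year // width) * width


def relative_frequencies(documents, width=100):
    """Relative token frequencies per period.

    documents is an iterable of (year, text) pairs.
    Returns {period_start: {token: share of all tokens in that period}}.
    """
    counts = defaultdict(Counter)
    for year, text in documents:
        counts[period_of(year, width)].update(text.split())
    freqs = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        freqs[period] = {tok: n / total for tok, n in counter.items()}
    return freqs


def trajectory(freqs, token):
    """Share of a single token in each period, in chronological order."""
    return [(period, freqs[period].get(token, 0.0)) for period in sorted(freqs)]
```

Calling `trajectory` for a word of interest gives one point per period, which can then be plotted against known historical events to see whether shifts in usage line up with them.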