A propitious start.
> Perhaps we can also discuss and share scripts for cleaning and preparing data here and anything else related to datasets.
I once wrote a crawler that walked through all the bibles published on Biblehub & Biblegateway, and then
'read' all their thousands and thousands of pages (i.e., downloaded them all), then parsed every one of them into JSON-like databases specific to each translation. Once that was complete, I had a way to directly correlate the various translations not only with each other, but also with the original words in the extant manuscripts.
I say all this simply to point out that working with textual data is both important and, for anything non-trivial, complex. Many people assume that dealing with text is easy and simple; both assumptions are incorrect. A lot of work is required first to clean & normalize textual data for our robowaifus, because as you said, GIGO.
High performance in the processing system seems pretty important to me, so when I created the Biblebot software I wrote everything in standard C++, which could work through the entire processing load in just seconds once I had all the thousands of files downloaded locally. A slower language such as Python would also suffice, I imagine, but the basic need would remain: the data must be cleaned and normalized.
Further, I began work on some graphical concepts to both display & refine the complex semantic and other interrelationships between words and phrases. That toolset hasn't materialized yet, but I believe the ideas are basically valid. The format for the interrelationships was based on Terry Halpin's Object Role Modeling, as shown in pic related.
I imagine that these experiences can at least be informative for this project, if not of some practical use. Good luck Anon.