Research Log #016

Welcome to Research Log #016, where we document weekly research progress across the various initiatives in the Manifold Research Group. The "Pulse of AI" section also covers breakthroughs in the broader research community that we find interesting!

Interested in contributing or just learning more? Check out the GitHub or join the conversation on Discord.


The NEKO Project aims to build the first large-scale, open-source "Generalist" model, trained on numerous modalities. You can learn more about it here.

  • Language: We fixed the tokenization issues and adopted best practices, and testing confirms that it is working correctly. Our loss is decreasing! We are now going to start doing some bigger runs. More can be found through here.
  • Vision + Language: The VQA objective is now working, and the model is training successfully! We are documenting the model's behavior so that anyone can better understand how it currently works. We are also searching for new datasets to train the model on. More can be found through here.
  • Datasets: We finished the dataset availability survey for our Language and Vision tasks. We are currently discussing which datasets would be best to use in the final model. More can be found here.

Agent Forge

The AgentForge Project aims to build models, tools, and frameworks that allow anyone to build much more powerful AI agents capable of using tools and interacting with the digital and physical worlds. 

We are continuing our survey of AI agents, focusing in particular on how different parts of RL and Cognitive Science relate to how LLM agents should work.

We are also exploring Fuyu and Mind2Web to scope out a new project: a model that can use tools in complex ways.

Pulse of AI

  • AlphaFold: AlphaFold is back and bigger than ever. The Google DeepMind team has been working on the next generation of this protein-folding model. The new version is more accurate and can predict more than proteins: it now also covers DNA, RNA, small molecules, and post-translational modifications (PTMs), reaching atomic accuracy. More can be found through here.
  • RedPajama-Data-v2: A massive new dataset consisting of 30.4 trillion tokens. It is exciting because it is the first one to cover five languages: English, German, French, Spanish, and Italian. It is generated by the and more can be found through the release blog post here.

