I am looking for help getting started. I would appreciate pointers, advice, … I know little or nothing about the Python interpreter.
I am tasked with modifying languages to support a research project involving a hardware implementation of content addressable memory.
This is a demonstration project at this time.
I was the second author of a paper covering our hardware implementation at MEMSYS 2023. We are now working towards an early end-to-end example implementation of our approach. We have an implementation of content addressable memory simulated in an FPGA, and will likely move towards an actual working test in silicon DDR within the next year. As the next stage in development, we are looking to demonstrate an end-to-end solution. There are significant benefits to what we are doing for AI (though that is NOT the only application), and since AI is where a lot of the action is, we are looking to use a relatively simple AI application as a test bed/proof of our concept.
In doing so we discovered that a lot of AI work is done in Python - including the app we are using as a test bed. We had started work developing a preprocessor for C and have a preliminary implementation there. I had also barely started looking at modifying GCC to support content addressable memory when I learned that the AI app we are targeting is in Python.
So I am looking for direction on modifying/extending Python to support content addressable memory, and specifically to support our hardware implementation of it.
I am a software developer with many decades of experience in many areas with many languages. I am not a compiler or interpreter expert, though I have worked on many projects that involved some form of interpreter. I have developed in many languages - but not a lot of Python. More than 50% of my work has been in C, but the rest has been in dozens of languages, including a bit of Python.
I’m no expert in content-addressable memory, but do you currently have a way to access it from C? If so, it should be possible to write a Python extension module. From the point of view of Python code, you simply import the module like any other and call its functions. From the C point of view, it’s a set of functions that broker data between the CAM and Python.
Building an extension module isn’t too hard, but it can be fairly complicated, so I would recommend looking into Cython (not to be confused with CPython) to see if you can create the module there.
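To make that concrete, here is a minimal sketch of the interface such an extension module might expose. The module name cam and its functions are invented for illustration, and a plain dict stands in for the hardware so the sketch runs as-is; a real extension would broker these calls through to your C API instead:

    # cam.py - hypothetical interface a CAM extension module might expose.
    # A dict stands in for the hardware; a real extension module would
    # forward these calls to the C API for the device instead.

    _storage = {}  # stand-in for the content-addressable memory

    def store(key, value):
        """Write value to CAM under key (simulated)."""
        _storage[key] = value

    def load(key):
        """Read the value stored under key (simulated)."""
        return _storage[key]

Python code would then just do: import cam; cam.store("123-45-6789", record); cam.load("123-45-6789").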
My objective is not to write some code to access the memory - we already have many ways to do that.
This is about extending the data types/attributes of the language, and then adding code to the interpreter to perform a memory access by content, rather than by address, when referencing a variable with the address-by-content attribute set.
Think of a CA variable as a normal data element, but with an attribute that indicates it is referenced by key instead of by index or pointer or …
So as an example, say you have an array of structs, where each struct is a data record for a person. But now you define the access key as a data type that is an SS#. So if you want the city a person is in, you would code:

city = person[123-45-6789].city
So what I am seeking to do is modify a copy of Python so that I can create the array of struct person, specifying that it is indexed by SS#.
So I have to have Python add an attribute to the array of struct person that specifies that it is addressed by SS#.
Then, when that array is accessed by user code, I have to alter the read/write code to tell the DDR that it is doing a CA access, and then send the current key as the address, not an index.
There are going to be some issues - because I do not think there can be bounds checking on the key - certainly not initially.
Right now I am seeking to know where to find the code for parsing variable definitions, so that I can modify it to add key attributes, and where I can find the code for processing variable references, so that I can modify it to do a reference by key for variables with a key attribute set.
You almost certainly don’t need to modify the interpreter to achieve this. The behaviour of almost everything, including indexing notation, is not baked into the interpreter, but is determined by the data types being operated on. You can create a custom type (either in pure Python, or in C via an extension module) that does whatever you want when it’s indexed.
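For example, here is a pure-Python sketch of such a type, using a dict as a stand-in for the hardware (the class name CAMTable is invented; a real version would route __getitem__/__setitem__ through a C extension talking to the device):

    from dataclasses import dataclass

    @dataclass
    class Person:
        name: str
        city: str

    class CAMTable:
        """Keyed table; a dict simulates the content-addressable memory."""

        def __init__(self):
            self._mem = {}  # stand-in for the CAM hardware

        def __getitem__(self, key):
            # A real implementation would issue a keyed load to the hardware.
            return self._mem[key]

        def __setitem__(self, key, value):
            # A real implementation would issue a keyed store to the hardware.
            self._mem[key] = value

    person = CAMTable()
    person["123-45-6789"] = Person(name="Ada", city="London")
    city = person["123-45-6789"].city  # the indexing syntax from your example

Note the key here is a string, since a bare 123-45-6789 would be integer arithmetic in Python.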
That surprises me, but should make the task much easier.
I am not looking to create a custom data type - or at least that is not my plan.
I am looking to add an attribute to existing data types, such that load/store operations are performed differently.
Though I am open to other ideas as to how to implement this.
At this stage there is nothing dictating syntax; I do not have to come up with the best solution. I am not looking to feed what I am doing into the standard Python implementation. If our memory work proves valuable, that will occur in the future, with people who better understand the language philosophy making those choices.
I am just looking to prove a concept.
Can you provide pointers as to where to start? I am not sure that an example custom data type is the best fit; is there an example of a custom attribute for variables?
Serious question - why? Is it simply to demonstrate that it’s possible? Are there real use cases where being able to do this with a built-in type rather than a custom type is necessary?
As this is a research project, I guess the goal might simply be to “see what Python would have looked like if this type of hardware had been available when it was designed” - is that correct? But the rest of your post suggests you are hoping to provide actual benefits for AI applications using Python on your hardware - and in that case, I don’t really understand why a 3rd-party module isn’t the quickest and best route to actual benefits. Libraries like PyTorch have proved this is an effective approach for making use of GPU hardware, and I imagine your situation is similar.
It seems to me like CAM pretty cleanly maps onto the Mapping interface that Python supports. In your case these would probably map integers to arbitrary objects, i.e. pointers. Alternatively, it might be a good idea to implement something like numpy arrays (i.e. contiguous blocks of memory representing a specific data type) and add corresponding operations to that.
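As a sketch of that mapping: subclassing collections.abc.MutableMapping only requires five methods, and the rest of the dict-like API (get, items, the in operator, …) comes for free. The backing dict here is a stand-in for the hardware, and the class name is invented:

    from collections.abc import MutableMapping

    class CAMMapping(MutableMapping):
        """Keyed storage with the full mapping API; a dict simulates the CAM."""

        def __init__(self):
            self._mem = {}  # stand-in for the hardware

        def __getitem__(self, key):
            return self._mem[key]

        def __setitem__(self, key, value):
            self._mem[key] = value

        def __delitem__(self, key):
            del self._mem[key]

        def __iter__(self):
            return iter(self._mem)

        def __len__(self):
            return len(self._mem)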
Generally, I would suggest first getting familiar with how Python programming generally works, and then looking into libraries like numpy and pandas for ideas on how custom C-level data structures can be quite easily exposed in Python, as well as the actual AI libraries (most likely TensorFlow?) being used for your AI application.
Once you have a good implementation of CAMs, you can then look into replacing some of the internals of the CPython interpreter; maybe you can make attribute lookup via the __dict__ a lot faster by using CAMs. If this works, that could potentially be a massive speedup.
However, generally, Python is not the language in which you would directly interact with these kinds of features, since implementing the algorithms that make use of them usually isn’t worth it because of how slow the Python interpreter is. Instead you should look into the existing AI libraries being used, whose cores are written in C, and add algorithms using CAM there.
Clarifying - we are using AI as a popular demonstration of the value of CAM. Our objective is not to write AI programs. It is to use an existing AI program to demonstrate the value of CAM.
We initially thought that sorting and searching would be the “killer app” - and that probably is still true - 40% of all CPU time is spent sorting and searching. The low-hanging fruit for CAM is actually operating systems, where there are data tables all over the place and constant searching of those tables. One of the “features” of CAM is that keeping data sorted is essentially accomplished via an insertion sort that takes place in the memory itself - not the CPU.
But AI is driving investment today. We presented our CAM paper at MEMSYS 2023 in October. One of the things we learned there - that we should have known - is that almost all leading-edge work and money is focused on AI. We listened to and read all the memory/AI papers that were presented and immediately realised - we can do “that” via CAM.
Two of the biggies: the first is “sparse arrays” - we can handle the efficient storage of sparse arrays in the memory, rather than the CPU.
The second is compressed arrays - there are a couple of notation standards used to efficiently store and multiply arrays, and this is again something we can do easily in CAM.
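As a toy illustration of why sparse arrays map naturally onto content addressing, the coordinate tuple itself can serve as the key, and only nonzero entries ever occupy storage (a dict stands in for the CAM; the class name is invented):

    class SparseCAM:
        """Toy sparse matrix: only nonzeros are stored, keyed by coordinate."""

        def __init__(self, shape, default=0.0):
            self.shape = shape
            self.default = default
            self._mem = {}  # (row, col) -> value; stand-in for keyed CAM storage

        def __getitem__(self, coords):
            # A CAM would resolve the coordinate key in memory; here a dict does.
            return self._mem.get(coords, self.default)

        def __setitem__(self, coords, value):
            if value != self.default:
                self._mem[coords] = value
            else:
                self._mem.pop(coords, None)  # storing the default deletes the entry

    a = SparseCAM((1_000_000, 1_000_000))
    a[12, 345] = 3.14
    print(a[12, 345], a[0, 0])  # 3.14 0.0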
There are two important facets of CAM.
The first is to think of the address of something in memory as a token or key, rather than a physical location in memory.
The second: several others are doing PIM - processor in memory - and while that is part of our approach, the core of what we are doing is redesigning the addressing logic in memory such that, with a TINY bit of help from a TINY processor, we can perform all kinds of complex address transformations.
I think in Python it would also be great to have direct support for special content-addressable memory. This would be tremendously useful, I think, for certain AI applications. But it would mean having either special functions or datatypes that support this, both at the Python library level and the underlying C level. Custom objects, like a special PyCAMArrayObject - also exposed at the Python level - seem to me the best way to implement this. Unless the host machine on which Python runs only supports content-addressable memory - so is totally different from current machines - it makes no sense IMO to change the Python interpreter code: anything that could be achieved by doing that is easier to achieve with custom data structures defined in a Python extension module.
Beware though - while it is true that all AI seems to be using Python, that doesn’t mean that most CPU cycles in typical AI programs are spent in the Python interpreter. The hard work is done by low-level libraries written in C++ (and on GPUs, of course).
I think the best way to demo this to a receptive audience would be to implement a sparse array package in C (or C++ or Rust or whatever) that follows the requirements for numpy interoperability. Potentially you could even write something that’s compatible with jax, although I’m not sure how much more involved that would be.
Compatibility with numpy would make it very easy for folks in data science, ML, etc. to try out your package. Compatibility with jax would bring it directly into the realm of accelerated ML (although probably not GPU support). So it’d be very accessible to your audience.
On the other side, it makes your life a whole lot simpler if you can use existing models to measure the performance impact, and this would be more convincing to people than toy examples.
The sparse package is a fairly small codebase that implements sparse arrays in a numpy-compatible way, so it might be a good place to start, to see how it implements the interface.
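For a feel of what the interoperability requirements involve: numpy lets a foreign container opt in through hooks such as the __array__ protocol, so np.asarray() and most numpy functions can consume it. A minimal sketch (the class is invented; a real package would expose its CAM-backed storage this way):

    import numpy as np

    class CAMVector:
        """Toy keyed storage that numpy can consume as a dense array."""

        def __init__(self, n):
            self.n = n
            self._mem = {}  # index -> value; stand-in for CAM-backed storage

        def __setitem__(self, i, value):
            self._mem[i] = value

        def __array__(self, dtype=None, copy=None):
            # numpy calls this hook to materialize a dense ndarray on demand.
            out = np.zeros(self.n, dtype=dtype or np.float64)
            for i, v in self._mem.items():
                out[i] = v
            return out

    v = CAMVector(5)
    v[3] = 2.5
    print(np.asarray(v))  # [0.  0.  0.  2.5 0. ]
    print(np.sum(v))      # 2.5, via the __array__ hook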
First, CAM is a generally applicable capability; we are addressing AI specifically as a marketing issue. The low-hanging fruit of CAM is operating systems, where significant performance gains are possible. But AI is where all the excitement and money is going right now.
CAM has significant benefits for AI, but those benefits are not unique to AI.
At a language level, CAM is not a language-specific issue. My interest in Python is again driven by the fact that it is a popular AI language and that the AI demo application we are using is written in Python.
I clearly do not know what the scope limits of modules in Python are, as I am being told that modules can be used to add new attributes to existing data types - which would not be my expectation.
Regardless, CAM is not a new data type; in fact it is not even new. There are languages that have some form of content addressing - Python may already be one. C/C++ does not. Perl has limited CA.
Fundamentally, CAM breaks the paradigm that memory is referenced by LOCATION. The programmer asks for what they want, and the memory, rather than the CPU, returns what was asked for.
We are not the first or only people working on this. The difference with our approach is that most of the work in our implementation is performed by altering the addressing logic in memory - we do have a CPU in memory in our approach, but that is to handle edge conditions that cannot be performed entirely in the addressing logic.
Other implementations put a more substantial CPU in the memory and have more of a resemblance to multiprocessing.
Regardless of whether our approach or another prevails, it is likely that CAM is coming. That was another takeaway from the MEMSYS 2023 conference. The memory side of computing has more headroom for future growth than the CPU side. I am reluctant to say that CPUs have hit a wall - that has been said many times before. But it is harder and harder to squeeze more performance from CPUs, and other systems are increasingly being explored for future performance improvements. Memory is the largest subsystem outside the CPU, and while certain aspects of memory are as cutting-edge as CPUs, there has been little effort until recently to go outside the standard memory paradigm. Now that is occurring.

It is entirely possible this will prove a dead end, or that our particular approach will be a dead end. I am betting otherwise. But it is near certain that something significant will happen in this space. Our approach is not focused on AI - except as a marketing vehicle. But others have done memory development specifically for AI. AI is a huge memory pig. Some of the AI-related memory work is seeking to move operations like matrix multiplies into memory. We are not doing that - for now. That would require significant additional complexity in our approach. But it is likely to occur eventually.
SOME of the AI-related development fits into the module or library model - as I understand it.
CAM, however, is a potential attribute of any data type.
I do not think it is very difficult to incorporate into most languages.
When a data item is defined, it is assigned (or not) a CAM attribute, and probably a key identifier. When the data item is loaded or stored, the CAM attribute triggers different code generation (for compilers) or execution (for interpreters). It is possible to have content addressing without changes to memory hardware - but that comes at a significant CPU time cost. As noted, there are languages that have some CA functionality today. CAM means the CPU cost is nearly the same as a normal memory reference; the work gets performed by the memory.
In many cases there is no performance cost at all. In many more, the additional time does NOT impact the CPU, and therefore is usually free.
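For what it is worth, Python's dict is exactly this kind of software content addressing: the CPU hashes the key and probes the table on every access, and that per-access work is what the hardware would absorb:

    # Software content addressing today: the CPU computes hash("123-45-6789")
    # and probes the hash table on every access. A CAM would move that work
    # into the memory itself.
    table = {}
    table["123-45-6789"] = {"city": "London"}  # keyed store, paid for by the CPU
    city = table["123-45-6789"]["city"]        # keyed load, hashed and probed again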
I am not a Python expert. I am not an expert in any computer language. I am close to the opposite - someone who has worked with many computer languages.
I do understand enough about Python to know that its high performance for an interpreted language comes from the fact that significant portions of Python code actually execute highly optimized lower-level code.
Regardless, my “focus” on AI is mostly marketing. Our work is NOT about AI; it is about the content addressable memory paradigm.
Our objective is to use a common AI application to demonstrate how CAM has value - partly to AI, but also generally.
Your reference to GPUs is relevant. One way of thinking of CAM is as an MPU. There is nothing a GPU does that cannot be done in a CPU; GPUs are just dedicated to, and better at, certain things.
The GPU analogy is imperfect, though - GPUs tend to perform mathematically complex tasks independent of the CPU; they are not closely coupled. Memory is closely coupled to the CPU. A CAM/MPU also performs a task that the CPU could do itself, but unlike a GPU, which executes a big task loosely coupled, a CAM/MPU executes a small task many, many times, tightly coupled.
I want to be careful here, because there is PIM (processor in memory) work being done that more strongly resembles a GPU - Samsung is working on that, as are others. What we are doing has some similarities to cache, but with more sophistication. Cache injects itself into memory loads/stores, allowing the CPU to run close to full speed and letting the cache controller deal with the actual load/store and the fact that memory is slower. So a cache controller is sort of a closely coupled MPU. CAM divorces the CPU from how and where the data is stored in memory. The “address” becomes a token. That token now means something different from the row/column location of the data.
The token means something at the software level, so we are abstracting away - on the CPU side - the physical location of something in memory. Now the memory becomes responsible for using the token to figure out what cell the data is in. One of the cores of our approach is grasping that a traditional address is also a token - it is just not one with a consequential meaning.
An example that is easy to understand - but NOT the best use of CAM - is this: instead of having the CPU calculate the memory address of the i-th element of an array of objects of some size, have the CPU tell the memory “I want the i-th element of the array”. The memory would need to know the specific object, the size of the object, and what element in the array of objects you want. This would save several instructions on the CPU and perform the load/store address calculation in the memory. As noted, this is not the best use case for CAM, but it should be one that is easy to understand.
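A sketch of that example under stated assumptions (the CAMMemory class and its calls are invented; a bytearray plays the role of DRAM): on the conventional path the CPU computes base + i * size itself, while on the CAM-style path the caller only supplies an object id and an index, and the memory-side logic owns the size and does the arithmetic:

    import struct

    RECORD = struct.Struct("<d")  # toy 8-byte record

    # Conventional path: the CPU does the address arithmetic.
    flat = bytearray(RECORD.size * 10)

    def cpu_load(base, i):
        offset = base + i * RECORD.size        # base + i * size, on the CPU
        return RECORD.unpack_from(flat, offset)[0]

    # CAM-style path: the CPU sends (object id, index); the memory-side
    # logic (simulated here) already knows the element size.
    class CAMMemory:
        def __init__(self):
            self._objects = {}  # object id -> (backing buffer, element size)

        def create(self, obj_id, count, size):
            self._objects[obj_id] = (bytearray(size * count), size)

        def store(self, obj_id, i, value):
            buf, size = self._objects[obj_id]
            RECORD.pack_into(buf, i * size, value)   # arithmetic "in memory"

        def load(self, obj_id, i):
            buf, size = self._objects[obj_id]
            return RECORD.unpack_from(buf, i * size)[0]

    mem = CAMMemory()
    mem.create("arr", count=10, size=RECORD.size)
    mem.store("arr", 3, 1.5)
    print(mem.load("arr", 3))  # 1.5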
So, putting that in terms of what I need to do with Python:
If I were looking to implement traditional arrays, but address them by token instead of address, I would add a CAM attribute to the array when it is created. Then, when I perform a load/store, instead of calculating the address of the data in regular memory using the CPU, I would communicate the object I want and the array index within that object; the memory would already know the size, as that gets sent to the memory when the object is created. Then the CAM/MPU would calculate the address from the object, the index, and the object size.
There would be fewer CPU cycles executed; some of the work normally done in the CPU would be done in the memory.
What I am after here is pointers on how to alter Python to extend the syntax to add an attribute to a data object when it is defined, and then, when loads/stores to that object occur, to use different code to perform the load/store - code that does not do an address calculation but instead passes a token to the memory/CAM/MPU.
To be clear - I used traditional array access by index as an example.
That is something CAM can do, but it is NOT the best use case; it is just the easiest to understand.
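As a starting point for finding that load/store path, it may help to look at what bytecode CPython compiles a subscript into; on CPython 3.11/3.12 a subscript load shows up as a BINARY_SUBSCR instruction (opcode names vary by version), whose generic handler in the interpreter loop ends up in PyObject_GetItem - one plausible place a CAM-aware access could be hooked:

    import dis

    # Disassemble the earlier example; the keyed load appears as a
    # BINARY_SUBSCR instruction on CPython 3.11/3.12.
    dis.dis('city = person["123-45-6789"].city')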
We hope this is really useful. Time will tell. We are not the only ones doing this, but our approach is both more general than the others we are familiar with and works by modifying the memory addressing logic.
That means that for MANY load/store operations there is little or no time-cost overhead - technically, most stores are close to free, because even if it takes the memory a long time to store the data, as long as the CPU does not immediately load the same data - which is highly unlikely - the time the memory uses is free. This is also why our “sort in memory” capability - one of the first things we did - is essentially an insertion sort. Storing is expensive, but the cost is hidden from the CPU, while lookup is nearly free.
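To illustrate the shape of that, purely as a software model (in hardware the point is that the insertion cost is hidden from the CPU, which no Python sketch can show): keeping data ordered on every store makes reads trivially cheap:

    import bisect

    class SortedCAMModel:
        """Software model of sort-in-memory: each store keeps the data
        ordered (the expensive step, hidden from the CPU in hardware),
        so reads in sorted order are nearly free."""

        def __init__(self):
            self._keys = []
            self._values = []

        def store(self, key, value):
            i = bisect.bisect_left(self._keys, key)  # the insertion-sort step
            self._keys.insert(i, key)
            self._values.insert(i, value)

        def smallest(self):
            return self._keys[0], self._values[0]    # sorted order for free

    m = SortedCAMModel()
    for k in (5, 1, 3):
        m.store(k, f"record {k}")
    print(m.smallest())  # (1, 'record 1')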
I am glossing over lots of details on the memory side, but that is not relevant here.
This is likely not far from implementation in real hardware. We have tested it in FPGAs for several years, and right now we are moving towards being able to provide demonstrations of the value.
That said, though our technology is applicable to memory generally (it is not even specific to DDR - it could be used for flash or static RAM or SSDs or hard disks), it is unlikely that you will be able to buy CAM/DDR to put in your laptop anytime soon.
Again, one of the reasons we are looking at AI is that it is an area where people will be happy to pay the initial 3-4x cost premium this will likely carry, in return for the performance benefits.
To be clear, the cost will not be high because the hardware is complex.
It will be high initially because what you make billions of is cheaper than what you make millions of.
I think the main message is that this type of specialized hardware manipulation wouldn’t get written in Python. It’d be in a low-level language and exposed via an extension module.
So if you know how to implement this in C, C++, Rust, etc., then you should build a useful data structure in that language and then expose an interface that can be used from Python.
This will be necessary anyway, because the relevant comparison is against code written in those languages (e.g. numpy, jax, pytorch).
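One lightweight way to do that exposing step is ctypes; the library name libcam.so and the function names below are hypothetical, and the sketch falls back to a pure-Python stub so it runs even without the hardware library:

    import ctypes

    try:
        # Hypothetical shared library built from the existing C CAM code.
        _lib = ctypes.CDLL("./libcam.so")
        _lib.cam_store.argtypes = [ctypes.c_uint64, ctypes.c_double]
        _lib.cam_load.argtypes = [ctypes.c_uint64]
        _lib.cam_load.restype = ctypes.c_double
        cam_store, cam_load = _lib.cam_store, _lib.cam_load
    except OSError:
        # Pure-Python stub standing in for the hardware library.
        _mem = {}
        def cam_store(key, value):
            _mem[key] = value
        def cam_load(key):
            return _mem[key]

    cam_store(123456789, 2.5)
    print(cam_load(123456789))  # 2.5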
David, it really, really sounds like Python is not the language to try this with. Even if all existing CPUs already supported this, it would likely not affect Python’s generated bytecode, which is simply too high-level to use this mechanism directly. You would be much better off creating a C or C++ library; those are languages that purport to have a fairly direct mapping between the constructs in the language (such as arrays) and the constructs of the hardware (such as address arithmetic).
Just to clarify - I do not care what language the code is written in.
I can write in anything. If I do not know a language, I can learn enough for this in a few days.
Where I am less sure about your remarks is your observation that this is a module. I do not know Python well enough to say for certain that the attributes of data types cannot be globally altered in a module - but I am skeptical that essential core language functionality can be globally changed in a module.
Content addressable memory is analogous to implementing integer array indexes.
Is there a howto, or primer, or design guide for modules? Or for the design of the Python interpreter?
And again, thanks.