The AI black box: is it any clearer?

The concept of Artificial Intelligence (AI) is inherently bound to patterns. From its origins in Feed-Forward Artificial Neural Networks (FFANNs), AI models have excelled at recognizing patterns and learning from them. This capability extends to modern Large Language Models (LLMs), which, by learning patterns, can provide accurate responses to a wide range of questions and ease our daily lives by handling routine tasks (e.g., answering emails, summarizing meetings).

Surprisingly, even in this new era where AI is all around us, little is known about the internal processing of these models. Few groups have attempted to unravel what happens inside the black box, largely because of the enormous dimensionality of the problem: many interconnected elements, such as neuron weights and activations or the learning algorithm itself, jointly shape a model's behaviour. Nevertheless, there are some success stories that may prove relevant for the future development of the field.

We decided to address this challenge by studying the internal processing structure of FFANNs. In our experiments, networks trained independently to recognize images developed similar configurations of neuron weights (Figure 1). These structural patterns, which persisted even after regularization, suggest that learning converges towards fixed motifs. Consequently, such motifs might be exploited to pre-train new models, reducing the requirements and associated costs of the training procedure.

Figure 1: Neuron weights from two Feed-Forward Artificial Neural Networks trained independently, showing strong similarities in the structure of their values.
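
To give a flavour of what such a comparison involves, here is a minimal sketch (not our actual pipeline; the dataset, network size, and matching procedure are illustrative assumptions). It trains two small feed-forward networks independently on the same image data and measures how similar their first-layer weights are once neurons are matched up, since neuron ordering is arbitrary across runs:

```python
# Minimal sketch: do two independently trained FFANNs develop similar
# first-layer weight structure? (Illustrative setup, not our real study.)
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

def first_layer_weights(seed):
    # Train a small network from a different random initialization each time.
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=seed)
    net.fit(X, y)
    return net.coefs_[0].T  # shape: (n_hidden, n_inputs)

W1, W2 = first_layer_weights(0), first_layer_weights(1)

# Cosine similarity between every pair of hidden neurons across the two nets.
W1n = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
W2n = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
sim = W1n @ W2n.T

# Neuron order is arbitrary, so find the best one-to-one matching
# (Hungarian algorithm) before comparing the two networks.
rows, cols = linear_sum_assignment(-sim)
print("mean matched cosine similarity:", sim[rows, cols].mean())
```

A matched similarity well above what random pairing would give is the kind of signal that points towards shared weight motifs across independent training runs.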

Recently, the Anthropic team shed light on the LLM black box from another perspective. By applying dictionary learning, their research uncovered neuronal activation patterns that ground the network's inner processing in human-comprehensible features. The team found millions of features covering a wide range of fields, including politics, science, art, and coding, and mapped them to neuron activation patterns. Additionally, tweaking neuron activations made it possible to adjust how strongly a feature is expressed in the response. For example, as the authors say:

‘Amplifying the “Golden Gate Bridge” feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked “what is your physical form?”, Claude’s usual kind of answer – “I have no physical form, I am an AI model” – changed to something much odder: “I am the Golden Gate Bridge… my physical form is the iconic bridge itself…”.’
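
For readers curious about the mechanics, the sketch below illustrates the core idea of dictionary learning on synthetic data (this is not Anthropic's actual setup; the data, dimensions, and sparsity settings are invented for illustration). Activation vectors are decomposed into sparse combinations of learned directions, and each direction is a candidate human-interpretable feature:

```python
# Minimal illustration of dictionary learning on synthetic "activations".
# (Toy example only; Anthropic worked with real model activations at scale.)
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Simulate 2000 activation vectors of width 64 that are secretly built
# from 10 underlying "features", each active only ~10% of the time.
true_features = rng.normal(size=(10, 64))
codes = rng.random((2000, 10)) * (rng.random((2000, 10)) < 0.1)
activations = codes @ true_features + 0.01 * rng.normal(size=(2000, 64))

# Learn an overcomplete dictionary under a sparsity penalty; each atom
# is a candidate interpretable direction in activation space.
dl = MiniBatchDictionaryLearning(n_components=20, alpha=0.5, random_state=0)
sparse_codes = dl.fit_transform(activations)

print("dictionary shape:", dl.components_.shape)  # (n_atoms, width)
print("mean active atoms per vector:",
      (np.abs(sparse_codes) > 1e-6).sum(axis=1).mean())
```

In the real setting, the learned atoms are then inspected (which inputs activate them?) and causally tested by amplifying or suppressing them, which is how interventions like the Golden Gate Bridge example above become possible.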
Overall, these findings, together with our own research on FFANNs, highlight the importance of understanding the internal processes of AI models in order to optimize and modulate their performance. The ability to tune different aspects of this processing is a first step towards the fine-grained control of AI behaviour we will need. As several studies have reported, current LLMs are not truly neutral on societal matters such as political ideology or race. Deploying trustworthy AI in sensitive domains (such as justice or human resources) will therefore require the co-development of control strategies designed to prevent hazardous biases, focused not only on post-processing answers but also on the model architecture itself and its tuning. All in all, understanding black boxes gives us the tools to ensure that the AI revolution we are witnessing contributes to a safe and equitable society without compromising technological progress.

Alvaro López-Maroto is currently working at Nuxia in the AI Engineering department, where he is collaborating on the development of a digital worker. At the same time, he is pursuing his PhD studies at the Artificial Intelligence Lab of UPM.
