A team of researchers from Skoltech, MIPT, the AIRI Institute of Artificial Intelligence, and other research centers has developed a method that not only distinguishes texts written by a person from those generated by a neural network, but also reveals which features the classifier relies on to decide whether a text is generated. By analyzing the internal states of the deep layers of a language model, the scientists identified and interpreted the numerical features responsible for a text's style, complexity, and “degree of confidence.”
The work has been accepted to Findings of ACL 2025 and published as a preprint on arXiv.
Why Detecting AI-Generated Text Is Getting Harder
The rapid development of large language models (LLMs) such as ChatGPT, Gemma, and LLaMA means that the texts they generate now fill the Internet, textbooks, tutorials, and even scientific articles. This creates an acute problem: how do we distinguish original human writing from the product of a machine? Existing systems for detecting generated text often work as “black boxes”: they issue a verdict of “human” or “AI” but cannot explain which specific properties of the text their decision is based on. This opacity limits their flexibility and reliability: when a detector makes a mistake, it can be very difficult to understand why it erred and how to avoid similar mistakes in the future.
Explaining the Science: Sparse Autoencoders and Neural States
The researchers decided to approach the problem from a different angle.
Instead of creating another “black box,” they set out to look “under the hood” of the neural network and turn its internal states into a set of clear, interpretable text characteristics. To do this, they used a well-known technique called sparse autoencoders (SAEs). If you imagine the internal state of a neural network as a complex cocktail of thousands of mixed signals, an SAE works as a high-precision separator that breaks this cocktail down into cleaner, atomic “ingredients” that are easier to interpret. Each such feature is responsible for a specific aspect of the text: for example, the complexity of sentences or the use of specific vocabulary.
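To make the idea concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. It illustrates the general technique rather than the authors' exact architecture; the dimensions are placeholders (2304 matches Gemma-2-2B's hidden size, while the 16,384-feature width is an assumption).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Expands a hidden state into a wide, mostly-zero feature vector
    and learns to reconstruct the original hidden state from it."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # hidden state -> features
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstruction

    def forward(self, h: torch.Tensor):
        # ReLU keeps features non-negative; the L1 penalty below pushes most
        # of them to zero, so each active feature tends to capture one
        # interpretable property of the text.
        f = torch.relu(self.encoder(h))
        return self.decoder(f), f

def sae_loss(h, h_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on feature activations.
    return ((h - h_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Toy usage on random vectors standing in for real hidden states.
sae = SparseAutoencoder(d_model=2304, d_features=16384)
h = torch.randn(8, 2304)
h_hat, f = sae(h)
sae_loss(h, h_hat, f).backward()
```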
What Makes AI Text Different from Human-Written Text?
Laida Kushnareva, Senior Academic Consultant at Huawei, commented: “People who regularly deal with texts generated by ChatGPT can often recognize such text by its characteristic features, such as inappropriately dry and formal language, excessively long and “watery” introductions before getting to the point, repeated formulations of the same idea, and low information density in general. However, most popular detectors of generated texts do not show to what extent these and other human-understandable features are present in the text.
In contrast, our SAE-based detector allows us to automatically decompose texts into “atomic” numerical features, many of which can be interpreted in terms understandable to humans. At the same time, the detector outperforms all existing solutions on the dataset we used. In addition, we showed that SAE can also detect some deliberate attempts to hide the fact that a text was generated, such as adding extra spaces, articles, or non-standard characters to confuse detectors. In other words, this technique allows us to automatically parse the text “bone by bone” and make a decision whose validity a person can then verify based on the identified features and their interpretation.”
How SAE-Based Detection Works — With Interpretable Features
In the study, the scientists fed various text samples to the Gemma-2-2B neural network and, for each text, saved the internal states from the model's deep layers. They then used an SAE to extract thousands of “atomic” features from these internal states. Using these features, they trained a classifier to recognize generated texts and moved on to the most interesting part: interpretation. They identified both “universal” features typical of many generative models and specific ones inherent to individual AI families or certain types of text (for example, scientific articles and reviews). In texts on scientific topics, for instance, the AI is prone to overly complex syntactic constructions, while in texts on financial topics it tends toward unfounded, verbose reasoning about simple facts.
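The pipeline can be sketched roughly as follows. This is a schematic under stated assumptions, not the paper's exact recipe: `sae` is a trained SparseAutoencoder from the sketch above, the layer index and mean-pooling over tokens are illustrative choices, and `texts`/`labels` stand in for a real labeled corpus.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModel.from_pretrained("google/gemma-2-2b", output_hidden_states=True)
model.eval()

def text_to_features(text: str, layer: int = 16) -> torch.Tensor:
    # Run the text through Gemma-2-2B, take one deep layer's hidden states,
    # decompose them into SAE features, and mean-pool over tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, d_model)
        _, f = sae(hidden.squeeze(0))                  # (seq_len, d_features)
    return f.mean(dim=0)

# Placeholder corpus: 1 = generated, 0 = human-written.
texts = ["A paragraph written by a person.", "A paragraph produced by an LLM."]
labels = [0, 1]

X = torch.stack([text_to_features(t) for t in texts]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)
# clf.coef_ weights every SAE feature, so the most influential features can
# be read off and interpreted one by one.
```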
Atomic Features and Text Properties
For example, the paper shows that feature #3608 of the SAE at the model's 16th layer is responsible for syntactic complexity. The scientists found that artificially strengthening this feature during text generation forces the neural network to create overly convoluted sentences that are difficult to read. Conversely, weakening it produces short, “chopped” phrases with minimal coherence. Another strong feature, #4645, is responsible for the degree of confidence of the text, and #6587 for verbose introductions and overly detailed explanations.
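Strengthening or weakening a feature amounts to shifting the model's hidden states along that feature's decoder direction during generation. The sketch below shows one common way to do this with a forward hook; feature #3608 comes from the article, while the hook mechanics, the layer choice, and the reuse of the `sae` from the first sketch are assumptions, not the authors' exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
model.eval()

FEATURE_ID = 3608  # the "syntactic complexity" feature named in the article

# Direction this feature writes back into the hidden state, read off the
# decoder of the (assumed already trained) SAE from the first sketch.
direction = sae.decoder.weight[:, FEATURE_ID].detach()

def make_steering_hook(scale: float):
    def hook(module, inputs, output):
        # Gemma decoder layers return a tuple whose first element is the
        # hidden states; shift them along the chosen feature's direction.
        hidden = output[0] + scale * direction.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

# Hook a deep decoder layer (index 16, matching the SAE layer above).
layer = model.model.layers[16]
```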
Steering the Generation Process
Anastasia Voznyuk, a student at MIPT, added: “In addition to analyzing what exactly the model pays attention to during detection, we tried to control the generating model. The features we identified earlier can be strengthened or weakened, and as a result we observe that in some cases the newly generated text exhibits a given feature more strongly or, conversely, more weakly. For example, when we change the feature that determines the level of “academicity” of the text's language, the style of the text shifts in the corresponding direction.”
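As a hypothetical illustration of such an experiment, the hook from the previous sketch can be applied at several strengths to the same prompt; the prompt and the scale values here are arbitrary placeholders.

```python
# Generate the same prompt while weakening, leaving alone, and strengthening
# the chosen feature, then compare the outputs side by side.
prompt = "Explain how photosynthesis works."
inputs = tokenizer(prompt, return_tensors="pt")

for scale in (-4.0, 0.0, 4.0):
    handle = layer.register_forward_hook(make_steering_hook(scale))
    out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
    print(f"scale={scale}:", tokenizer.decode(out[0], skip_special_tokens=True))
    handle.remove()  # detach so each run stays independent
```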
Can Personalized Prompts Fool Detectors?
The results show that when modern language models like ChatGPT are given standard generation prompts, they are likely to produce text with distinctive features that this and other detectors pick up easily. However, the researchers warn that if the network is given a more personalized task (for example, if it is asked to write in an unfamiliar style), these distinctive features may weaken or even disappear, which can make detection significantly harder.
Interpretability Matters: Bringing Transparency to AI Detection
The study uses a new multifaceted approach that combines automated feature extraction, manual interpretation, and experimental verification through a technique called “steering.” This lays the groundwork for more interpretable detectors that can not only deliver a verdict but also report which anomalies were found in the text. Such tools will be useful for educators, editors, and disinformation researchers. More broadly, this work is an important step toward demystifying artificial intelligence, allowing us to better understand how neural networks “think” and create texts.
What’s Next in AI Content Detection Research?
Future research will focus on applying the method to newer, more powerful language models and on studying subtler, more elusive features. The goal is to stay one step ahead of those trying to use AI for malicious purposes while reducing the likelihood of mistakenly and unfairly accusing a person of having their text generated.