Mapping the Sounds of Southern Spain

What is a linguistic atlas?

A linguistic atlas is a series of maps that records how speech varies across a specific geographical region. In layman's terms, it tells us how people say things differently depending on where they are from. Linguistic atlases use the phonetic alphabet to document this variation, representing sounds rather than letters. For example, a person from Boston pronounces “car” quite differently from someone from Texas. We can use the phonetic alphabet to capture this difference in writing.

Travelling back in time: Early linguistic atlas research methods

While the Linguistic Atlas of Andalusia (southern Spain) was published in the 1970s, data collection began in the 1950s. The technology available for this type of research back then wasn’t an audio recorder (let alone a recorder hosted on the internet), but pen and paper! Researchers visited towns across southern Spain and interviewed their inhabitants. During each visit, researchers would show pictures of objects and ask the inhabitants to say the word for each one. The researcher would then write out the phonetic notation for how that person said each word! The 1973 version of the linguistic atlas contains over 300 maps, all published on paper.

Dr. Alfredo Herrero de Haro has two goals for this research: to analyze how accents vary in southern Spain using updated technology, and to map that variation through a series of web-based interactive maps.

Building a linguistic atlas today

Collecting audio samples for a large dataset from anywhere in the world has previously been a challenge. With Phonic’s voice surveys, Dr. Alfredo Herrero de Haro is able to collect audio samples from across southern Spain at unprecedented scale and with far greater efficiency: 2,000 speakers from 500 Spanish towns will complete a voice survey within the span of one year, resulting in a dataset of 256,000 audio samples. Not only will this be a much more robust dataset than the original atlas, but it will also be more representative; the original only includes one speaker per town, and this speaker was typically male.
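To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the three figures quoted above and assumes, purely for illustration, that speakers are spread evenly across towns and that every speaker records the same number of samples; the article does not state how the survey is actually distributed.

```python
# Figures quoted in the article.
speakers = 2000
towns = 500
audio_samples = 256_000

# Assumption (not from the article): an even spread across towns and speakers.
speakers_per_town = speakers / towns
samples_per_speaker = audio_samples / speakers

print(f"Average speakers per town: {speakers_per_town:.0f}")
print(f"Average samples per speaker: {samples_per_speaker:.0f}")
```

On average, that works out to four speakers per town, compared with one in the original atlas, and 128 recordings per speaker.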

Speech analytics

The ability to collect audio samples in a more efficient way is great, but how do these raw audio files get translated into the phonetic alphabet? One item in Dr. Alfredo Herrero de Haro’s modern researcher tool-belt is speech analysis software that displays the spectrogram for an audio file. A spectrogram is a visual representation of how the frequencies in a sound change over time, and it can be interpreted to identify which sound is being produced. This method is much more accurate than identifying the sound by ear, as was the case in the 1950s.
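For the curious, here is a minimal sketch of what a spectrogram computes. It is not the speech analysis software used in the project; it is a toy, standard-library-only Python example run on a synthetic signal (a 500 Hz tone followed by a 1500 Hz tone) rather than real speech. All names and parameter values (`spectrogram`, `synthetic_signal`, the 8000 Hz sample rate, the 64-sample frame) are made up for illustration.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Slice the signal into overlapping frames and measure the energy
    at each frequency within every frame (a short-time Fourier transform)."""
    # Hann window: fades each frame in and out to reduce edge effects.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):  # one value per frequency bin
            total = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


def synthetic_signal(sample_rate, num_samples):
    """Illustrative stand-in for speech: 500 Hz for the first half,
    1500 Hz for the second half."""
    half = num_samples // 2
    return [math.sin(2 * math.pi * (500 if n < half else 1500) * n / sample_rate)
            for n in range(num_samples)]


SAMPLE_RATE = 8000   # samples per second (assumed for this example)
FRAME_SIZE = 64      # samples per analysis frame (assumed for this example)

frames = spectrogram(synthetic_signal(SAMPLE_RATE, 800), FRAME_SIZE, hop=32)

# Each frequency bin is SAMPLE_RATE / FRAME_SIZE hertz wide.
bin_hz = SAMPLE_RATE / FRAME_SIZE
# For every frame, find the frequency with the most energy.
peaks = [max(range(len(f)), key=f.__getitem__) * bin_hz for f in frames]

print("Strongest frequency in the first frame:", peaks[0], "Hz")
print("Strongest frequency in the last frame:", peaks[-1], "Hz")
```

The strongest frequency moves from 500 Hz in the early frames to 1500 Hz in the later ones. Real speech is far messier than two pure tones, but the principle is the same: different sounds leave different frequency patterns over time, and those patterns are what a trained eye reads off a spectrogram.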

That being said, processing the audio samples is still very much a (manual) labor of love. Even with today’s technology, the nuances of spoken language make it hard to automate speech analytics. So while an automated script will provide the initial phonetic label for each audio sample, this classification must be verified by a human. Dr. Alfredo Herrero de Haro says that it will take about 3 years to complete this process!

From audio dataset to interactive map

Once the speech analytics are complete, the results will be uploaded into a series of interactive maps. Dr. Alfredo Herrero de Haro’s team has already programmed the maps’ functionality. Soon researchers and members of the general public will be able to visit the linguistic atlas’s website and explore how speech varies across southern Spain! Not only will the phonetic classifications be available, but audio samples will be available for listening as well.

Dr. Alfredo Herrero de Haro hopes that the completed atlas will be a great open resource not only for linguistic researchers, but also for members of the general public who are interested in how accents differ across Spain.

“Users will be able to choose a word and listen to differences in how it is pronounced across any of the towns included on the map. They don't have to know linguistics. They don't have to know phonetics. You can click on a town on the map and then you can hear how someone from that town says a word, compared to someone six hours down the road.”

He plans to collect data until June 2024 and aims to complete the project by December 2026. We can’t wait to see the final product!
