Corpus: data extraction


Post by guilhermegarcia » Fri Oct 11, 2013 10:06 pm

I will be as specific and concise as possible (I'm new to programming, but I do enjoy it). I have a database with a corpus. Example:

    col 1       col 2
    bastante    bas$!tan$te

[that's Portuguese for 'a lot of']

! precedes the stressed syllable
$ indicates a syllable boundary

Here's what I want to know/do:

1. How many segments are there between ! and the vowel of that syllable? (answer: 1)
1.1. What kind of segment(s) are they? (answer: t)
2. How many segments are there between that same vowel and the next $? (answer: 1)
2.1. What kind of segment(s) are they? (answer: n)

Ideally, I want to add four columns to my dataset (1-2.1 above).

Code:
    col 1       col 2          1    1.1    2    2.1
    bastante    bas$!tan$te    1    t      1    n

Most of the time, the counts in columns 1 and 2 will be 0 or 1. If either count is greater than 1, columns 1.1 and 2.1 can each hold a single string with all the segments (there's no need for one extra column per segment).

Thanks a lot!

Re: Corpus: data extraction

Post by casevh » Sat Oct 12, 2013 5:54 am

I prototyped my code at the interactive prompt. Here are the steps I used; the lines beginning with # are comments that I added.

Code:
>>> word = "bas$!tan$te"
>>> # Split the word into syllables.
>>> word.split("$")
['bas', '!tan', 'te']
>>> # Now we just need to find the syllable that starts with !
>>> for syllable in word.split("$"):
...   if syllable.startswith("!"):
...     break
...
>>> # When you break out of a for loop, the loop variable keeps its last value.
>>> syllable
'!tan'
>>> # All we want are the characters after the first one.
>>> syllable = syllable[1:]
>>> syllable
'tan'
>>> # enumerate is useful when you want to step through a sequence and also know the position of the items
>>> list(enumerate(syllable))
[(0, 't'), (1, 'a'), (2, 'n')]
>>> # Because the definition of vowel can change....
>>> vowels=['a', 'e', 'i', 'o', 'u']
>>> # Step through all the characters and find the first vowel.
>>> for count, char in enumerate(syllable):
...   if char in vowels:
...     break
...
>>> count, char
(1, 'a')
>>> # Use string slicing and len() to get at the actual values.
>>> syllable[:count]
't'
>>> syllable[count + 1:]
'n'
>>> len(syllable[:count])
1
>>> len(syllable[count + 1:])
1
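
One caveat on the "definition of vowel can change" comment: Portuguese spelling also uses accented vowels, and the loop above would run off the end of a syllable like "fé" without ever matching. Here is a minimal sketch of one way to handle that, assuming the corpus stores ordinary accented letters. The vowel set and the helper name `split_stressed_syllable` are my own assumptions, not something from the original data, so adjust them to your transcription.

```python
# Assumed vowel inventory: plain vowels plus accented forms used in
# Portuguese spelling. Adjust this to match the corpus.
VOWELS = set("aeiouáàâãéêíóôõú")


def split_stressed_syllable(syllable, vowels=VOWELS):
    """Return (before, after) around the first vowel of a syllable.

    The syllable is expected without its leading "!" marker.
    """
    for position, char in enumerate(syllable):
        if char.lower() in vowels:
            return syllable[:position], syllable[position + 1:]
    # Fail loudly instead of silently returning a wrong slice.
    raise ValueError("no vowel found in %r" % syllable)


print(split_stressed_syllable("tan"))   # ('t', 'n')
print(split_stressed_syllable("mons"))  # ('m', 'ns')
print(split_stressed_syllable("fé"))    # ('f', '')
```

Because the result comes from slicing, a cluster of two or more segments comes back as a single string, which matches the request for one string per column. Note that only the first vowel is treated as the nucleus, so the second vowel of a diphthong would be counted as a following segment.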

With just a little editing, here is a function that returns the data from a given word.

Code:
# Make this script work in both Python 2.x and 3.x by
# enabling the new print function in Python 2.x.
from __future__ import print_function

word = "bas$!tan$te"
vowels = ['a', 'e', 'i', 'o', 'u']

def parse_word(s):
    # Find the stressed syllable, i.e. the one that starts with "!".
    for syllable in s.split("$"):
        if syllable.startswith("!"):
            break
    # Drop the leading "!".
    syllable = syllable[1:]

    # Find the position of the first vowel in that syllable.
    for count, char in enumerate(syllable):
        if char in vowels:
            break

    before = syllable[:count]
    after = syllable[count + 1:]

    return (len(before), before, len(after), after)

print(parse_word(word))  # (1, 't', 1, 'n')
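
To get from one word to the four extra columns, something along these lines might work. This is only a sketch under assumptions: I don't know how the corpus is stored, so I am pretending it is tab-separated text with the word in the first column and the syllabified form in the second, and `add_columns` is a hypothetical helper name. The parsing logic is the same as above, rewritten to raise an error when a row has no stressed syllable or no vowel. It is written for Python 3.

```python
import csv
import io

VOWELS = set("aeiou")  # assumed; extend if the corpus has accented vowels


def parse_word(s, vowels=VOWELS):
    """Return (n_before, before, n_after, after) for the syllable marked with "!"."""
    stressed = next(
        (syl[1:] for syl in s.split("$") if syl.startswith("!")), None
    )
    if stressed is None:
        raise ValueError("no stressed syllable in %r" % s)
    for count, char in enumerate(stressed):
        if char in vowels:
            before, after = stressed[:count], stressed[count + 1:]
            return len(before), before, len(after), after
    raise ValueError("no vowel in stressed syllable of %r" % s)


def add_columns(rows):
    """Yield each (word, syllabified) row extended with the four new values."""
    for word, syllabified in rows:
        yield [word, syllabified] + list(parse_word(syllabified))


# Stand-in for the real file: tab-separated, one word per line.
source = io.StringIO("bastante\tbas$!tan$te\n")
rows = list(add_columns(csv.reader(source, delimiter="\t")))
print(rows)  # [['bastante', 'bas$!tan$te', 1, 't', 1, 'n']]
```

With a real file you would pass an open file object to `csv.reader` instead of the `io.StringIO` stand-in, and write the extended rows back out with `csv.writer`.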


Does this help?
