Search a specific word in a file using python

Postby dren-haliti » Fri Dec 06, 2013 10:31 pm

How can I search for a specific whole word in a file with Python? For example, the file text1.txt contains the line "The world is not enough for both of us". I tried something like this:
Code: Select all
word = raw_input("what do you want to search? ")
for filee in open('text1.txt'):
    if word in filee:
        print "found"


When I enter the word 'no', it prints "found". It should not do that, but it does because 'no' is a substring of 'not', which appears in the text.

I want "found" to be printed when I search for the word 'not', but not when I search for 'no'.
Last edited by micseydel on Fri Dec 06, 2013 11:21 pm, edited 1 time in total.
Reason: Code tags, first post lock.

Re: Search a specific word in a file using python

Postby micseydel » Fri Dec 06, 2013 11:55 pm

Hello, and welcome to the forum! Please read the following before making your next post: viewtopic.php?f=10&t=145

Before I explain a solution to your problem, I want to point out a semantic issue in your code. You wrote:
Code: Select all
for filee in open(...):

Note that a loop like this iterates over the lines of the file passed to open(). It does not iterate over files, over a whole file all at once, or character by character. This is good in the sense that memory use depends on the length of a line rather than the size of the whole file, so for text files it should work in relatively low-memory environments.
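
To make the point about lines concrete, here is a minimal sketch that records what the loop variable actually holds. It is written in Python 3 syntax, and it uses io.StringIO as a stand-in for open('text1.txt') so that it runs without a file on disk; the two-line sample text and the name lines_seen are made up for illustration.

```python
import io

# Stand-in for open('text1.txt'); the contents are a made-up sample.
fake_file = io.StringIO("The world is not enough\nfor both of us\n")

lines_seen = []
for line in fake_file:
    # Each iteration yields one line, including its trailing newline.
    lines_seen.append(line)
    print(repr(line))
```

Each printed value is a full line of text, which is why a name like line describes the loop variable better than filee.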

As for your problem: the task gets more complicated the more you look into it. As you noticed, what you tried finds whole words but also produces false positives, because in, applied to a string, tests for substrings. The first solution that comes to my mind is to split each line on whitespace and then check the resulting list of strings. With your sample input, this works.
Code: Select all
for line in open('text1.txt'):
    words = line.split()
    if word in words:
        print "found"
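
The difference between the substring test and the list-membership test can be checked directly on the sample sentence. This is only a small sketch in Python 3 syntax, and the variable names are chosen for illustration:

```python
line = "The world is not enough for both of us"

# 'in' on a string is a substring test, so 'no' matches inside 'not'.
substring_hit = 'no' in line

# 'in' on a list compares whole elements, so 'no' matches none of the words.
words = line.split()
whole_word_hit = 'no' in words
not_hit = 'not' in words

print(substring_hit, whole_word_hit, not_hit)  # True False True
```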

However, splitting on whitespace doesn't account for two issues that can arise: differing capitalization (as in the first word of a sentence) and punctuation.
Code: Select all
>>> sentence = "Hello my dear friend!"
>>> words = sentence.split()
>>> 'my' in words
True
>>> 'Hello' in words
True
>>> 'hello' in words
False
>>> 'friend' in words
False


tl;dr warning about further content: proceed with caution

As long as you search for lowercase strings, you can account for capitalization easily:
Code: Select all
>>> lower_words = sentence.lower().split()
>>> 'hello' in lower_words
True
>>> 'Hello' in lower_words
False
>>> 'DEAR' in lower_words
False

You can filter out non-letters too:
Code: Select all
>>> from string import letters
>>> ok_chars = set(letters)
>>> ok_chars.add(' ') # allow spaces; could add tabs or other characters if needed
>>> sentence = "Look, there's punctuation here!"
>>> simple_sentence = filter(lambda c: c in ok_chars, sentence.lower())
>>> print simple_sentence
look theres punctuation here

This bit of code could be written more simply, from a beginner's point of view, but some of its choices are good for efficiency. You're welcome to ask questions.
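
For readers on Python 3, the session above will not run as written: string.letters no longer exists there, and filter() returns an iterator rather than a string. The following is one possible sketch of the same idea (lowercase, keep only letters and whitespace, then split) in Python 3. The names OK_CHARS, normalize and contains_word are invented for this example, and string.ascii_letters covers only unaccented English letters, which is an assumption about the input.

```python
import string

# Letters plus space, tab and newline; a set gives fast membership tests.
OK_CHARS = set(string.ascii_letters) | set(" \t\n")


def normalize(text):
    """Lowercase text and drop every character that is not in OK_CHARS."""
    return "".join(c for c in text.lower() if c in OK_CHARS)


def contains_word(lines, word):
    """Return True if any line contains word as a whole, normalized word."""
    target = normalize(word)
    for line in lines:
        if target in normalize(line).split():
            return True
    return False


sample = ["Look, there's punctuation here!", "The world is not enough."]
print(normalize(sample[0]))          # look theres punctuation here
print(contains_word(sample, "NOT"))  # True
print(contains_word(sample, "no"))   # False
```

Because contains_word only needs something it can loop over line by line, an open file object could be passed in place of the sample list.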

Things can get more complicated. My solution above doesn't differentiate between the two English words "won't" and "wont". Stripping capitalization also loses meaning, since some words are clearly names when capitalized but are otherwise regular nouns (or even adjectives). There is also the possessive apostrophe: "Michael's" becomes "michaels", and after that processing you can't tell whether it was meant as a possessive or whether there are multiple people with that name. I also don't account for the possibility of a word being split across two lines with a hyphen. Regular expressions can sometimes help, although they're a sublanguage that you might not want to learn. I would imagine that NLTK (the Natural Language Toolkit), which is available online, has solutions to this, though I've never used it, so I can't say for sure.
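
As a rough illustration of the regular-expression route mentioned above, the sketch below uses the standard re module with \b word-boundary anchors. It is written in Python 3 syntax, the function name find_whole_word is made up for this example, and it is one possible approach rather than a complete answer to the issues just listed.

```python
import re


def find_whole_word(text, word):
    """Return True if word occurs in text as a whole word, ignoring case."""
    # re.escape keeps characters such as '.' or '?' in word from being
    # treated as regex syntax; \b matches at a word boundary.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


text = "The world is not enough for both of us!"
print(find_whole_word(text, "no"))   # False
print(find_whole_word(text, "NOT"))  # True
print(find_whole_word(text, "us"))   # True
```

Note that \b treats only letters, digits and underscores as word characters, so an apostrophe counts as a boundary: searching for "Michael" would still match inside "Michael's".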

My point in all this is that the things you're doing in Python are considered quite high level, but are still lower level than natural language, such as English. The more Python you learn, the better you can account for these things. And the more "computer science" you learn the more efficiently you can do so (I used a set above for a reason). I hope I've provided a good basis here for you, but feel free to ask any further or clarifying questions.