Me:
I have lines of text. I want to remove duplicate words per line. Help. Thanks!
My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please.
Example?
I’m gonna try to solve this problem. Is this for G’MIC scripting? I have some python files that can help.
Also, I found a solution for you by googling, since I’m bad at Python:
def unique_list(text_str):
    # Keep only the first occurrence of each word, preserving order
    l = text_str.split()
    temp = []
    for x in l:
        if x not in temp:
            temp.append(x)
    return ' '.join(temp)

lines_of_text="""My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
lines=lines_of_text.splitlines()
for line in lines:
    print(unique_list(line))
This gives:
My mind is melting, you I am not good at remembering Python.
Python 3 would be nice. Not a real python please.
Sure, this could eventually be in G’MIC. How about making it case insensitive? I would use lower(), but that would affect the sentence case.
The simple solution: you could have two arrays, one lower case, the other unchanged, and use pretty much the same algorithm, using the lower-cased version for the search.
Case insensitive solution here.
def unique_list(text_str):
    # Compare lower-cased words, but keep the original spelling in the result
    lower_case=text_str.lower().split()
    l = text_str.split()
    temp = []       # lower-cased words seen so far
    new_words=[]    # original words to keep
    for n in range(len(l)):
        x = lower_case[n]
        t = l[n]
        if x not in temp:
            temp.append(x)
            new_words.append(t)
    return ' '.join(new_words)

lines_of_text="""My mind is melting, Mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
lines=lines_of_text.splitlines()
for line in lines:
    print(unique_list(line))
Output:
My mind is melting, you I am not good at remembering Python.
Python 3 would be nice. Not a real please.
If you keep two lists, better make temp a set: the membership test will be faster.
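For what it's worth, the set idea could look roughly like this. It is only a sketch: unique_words, seen and kept are names made up here, not taken from the posts above.

```python
def unique_words(line):
    """Drop repeated words in a line, comparing case-insensitively."""
    seen = set()   # lower-cased words already emitted
    kept = []      # original-cased words, in first-seen order
    for word in line.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)

text = """My mind is melting, Mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
for line in text.splitlines():
    print(unique_words(line))
```

The list is still needed to remember the order; the set only answers "have I seen this word already?" without scanning everything each time.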
Just for fun, the nearly one-liner (no support for case):
from functools import reduce
lines_of_text="""My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
print(" ".join(reduce(lambda ul,i: ul if i in ul else ul+[i], lines_of_text.split(),[])))
Another one-liner, also without support for case:
l = """My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
print("\n".join([" ".join(dict.fromkeys(row.split())) for row in l.splitlines()]))
Thanks everyone and @ilmioalias for becoming a member.
How about removing certain words or phrases? E.g., mind you, a, please.
My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please.
Feel free to write an isolated example and then a one-liner to combine this with the previous task.
This doesn’t work, but I think @ofnuts or @ilmioalias can fix it.
init_lines_of_text = """My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
lines_of_text=init_lines_of_text.splitlines()
phrases=['mind you','a','please']
n=int(0)
for phrase in phrases:
    t=phrases[n]
    if t[-1]!=' ':
        phrases[n]=t+' '
    n+=1
for line in lines_of_text:
    for n in range(len(phrases)):
        line=line.replace(phrases[n],'')
    print(line)
replace() returns a new string; it is not an in-place replacement.
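A tiny illustration of that point, in case it helps. The variable names are made up for the example.

```python
original = "Not a real python please."

original.replace("please", "")            # result is thrown away
print(original)                            # still the full sentence

cleaned = original.replace("please", "")  # keep the returned string
print(cleaned)
```

Strings are immutable in Python, so the only way to "change" one is to bind a name to the string that replace() hands back.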
Try this. I added stop words from NLTK and it does something funky. Might have something to do with the mix of " and ' quotes.
Output
Mminmelting, minI gooremembering Python.
Pyth3 woulnice. Noreal pythplease.
Input
init_lines_of_text = """My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
lines_of_text=init_lines_of_text.splitlines()
phrases=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
n=int(0)
for phrase in phrases:
    t=phrases[n]
    if t[-1]!=' ':
        phrases[n]=t+' '
    n+=1
for line in lines_of_text:
    for n in range(len(phrases)):
        line=line.replace(phrases[n],'')
    print(line)
With all of those phrases, that sounds like a hard problem.
Edit:
Found out the solution.
I think sorting the phrases by the length of the string would solve the problem.
Err, I tested it with QPython 3L on Android. My theory unfortunately did not hold up.
Here’s the code:
init_lines_of_text = """My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
lines_of_text=init_lines_of_text.splitlines()
phrases=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
phrases=sorted(phrases,key=len)
phrases=phrases[::-1]
print(phrases)
n=int(0)
for phrase in phrases:
    t=phrases[n]
    if t[-1]!=' ':
        phrases[n]=t+' '
    n+=1
for line in lines_of_text:
    for n in range(len(phrases)):
        line=line.replace(phrases[n],'')
    print(line)
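A likely reason sorting doesn't help: str.replace() matches substrings, not words. Once a trailing space is added, short entries such as 'y ' and 'd ' match the tail of "My " and "mind ", whatever order they are tried in. Here is a small sketch with just two of those entries, comparing plain replacement with a word-boundary regex (the variable names are invented for the example):

```python
import re

line = "My mind is melting, mind you I am not good at remembering Python."

# Plain substring replacement: "y " also matches the end of "My ",
# and "d " the end of "mind " and "good ".
damaged = line
for fragment in ["y ", "d "]:
    damaged = damaged.replace(fragment, "")
print(damaged)

# Word-boundary match: only the whole words "y" or "d" would be removed,
# and the line contains neither, so it is left alone.
intact = line
for word in ["y", "d"]:
    intact = re.sub(r"\b" + re.escape(word) + r"\b", "", intact)
print(intact)
```

So the fix is probably less about ordering and more about matching whole words only.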
My uninformed solution is as follows:
stop = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
text = '''My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please.'''
l = '\\n'.join([line for line in text.splitlines()])
l = ' '.join([word for word in l.split() if word not in stop])
l = l.replace('\\n','\n')
print(l)
#!/usr/bin/env python3
import re
text = """My mind is melting, mind you I am not good at remembering Python. Python 3 would be nice. Not a real python please."""
# Can likely be simplified... and maybe some of these can be replaced by regexes as well
words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
# Sort the words longest first (technically the order should be by "contains()",
# but sorting on length also ensures this and is faster)
# words.sort(key=len,reverse=True) # start with longest first
total=0
for word in sorted(words,key=len,reverse=True):
    # bracket the word between word boundary markers
    # use re.sub() instead of str.replace() because we can use IGNORECASE
    # while we are at it, use subn() instead of sub() for statistics
    text,count=re.subn(r'\b'+word+r'\b','',text,flags=re.IGNORECASE)
    total+=count
text=re.sub(' +',' ',text) # cleanup (2 or more spaces to a single)
print(text)
print("---")
print(f'{total:3d} replacements made')
yields:
mind melting, mind good remembering Python. Python 3 would nice. real python please.
Ha ha, this self-talk is great:
My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please.
turned into
Mminmelting, minI gooremembering Python.
Pyth3 woulnice. Noreal pythplease.
has become
mind melting, mind good remembering Python. Python 3 would nice. real python please.
Still not perfect as the newline disappeared and there is a space at the beginning.
Try this, afre:
import re
text = """My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
# Can likely be simplified... and maybe some of these can be replaced by regexes as well
words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
total=0
lines=text.splitlines()
for line in lines:
    # Sort the words longest first (technically the order should be by "contains()",
    # but sorting on length also ensures this and is faster)
    # words.sort(key=len,reverse=True) # start with longest first
    for word in sorted(words,key=len,reverse=True):
        # bracket the word between word boundary markers
        # use re.sub() instead of str.replace() because we can use IGNORECASE
        # while we are at it, use subn() instead of sub() for statistics
        line,count=re.subn(r'\b'+word+r'\b','',line,flags=re.IGNORECASE)
        total+=count
    line=re.sub(' +',' ',line) # cleanup (2 or more spaces to a single)
    line=line.strip()
    print(line)
print("---")
print(f'{total:3d} replacements made')
Output:
mind melting, mind good remembering Python.
Python 3 would nice. real python please.
---
10 replacements made
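As a variation on the loop above, the stop words could be folded into one compiled pattern so that each line is scanned once instead of once per word. This is only a sketch: stop_words here is a shortened subset of the NLTK list picked for illustration, and strip_stop_words is a made-up name.

```python
import re

# Shortened stop-word list for illustration; the posts above use the full NLTK list.
stop_words = ["i", "my", "is", "you", "am", "not", "at", "be", "a", "don't"]

# Longest first, so a longer entry is tried before a shorter one could win.
alternatives = sorted(stop_words, key=len, reverse=True)
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b",
    flags=re.IGNORECASE,
)

def strip_stop_words(line):
    """Remove whole-word stop words, then tidy up the leftover spaces."""
    line, count = pattern.subn("", line)
    return re.sub(" +", " ", line).strip(), count

text = """My mind is melting, mind you I am not good at remembering Python.
Python 3 would be nice. Not a real python please."""
total = 0
for line in text.splitlines():
    cleaned, count = strip_stop_words(line)
    total += count
    print(cleaned)
print("---")
print(f"{total:3d} replacements made")
```

re.escape() is not strictly needed for this particular word list, but it keeps the pattern safe if an entry ever contains a regex metacharacter.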
If you need to capitalize, use this:
print(line.capitalize())
Output:
Mind melting, mind good remembering python.
Python 3 would nice. real python please.
---
10 replacements made
Edit: If you need capitalization after ". ", that can be done too.
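One possible way to do that, assuming sentences are separated by exactly a period and one space; capitalize_sentences is a name invented for this sketch:

```python
import re

def capitalize_sentences(line):
    """Upper-case the first letter of the line and of each sentence after '. '."""
    # Match a lowercase ASCII letter at the start of the line or right after ". ".
    return re.sub(
        r"(^|\. )([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        line,
    )

print(capitalize_sentences("mind melting, mind good remembering Python."))
print(capitalize_sentences("Python 3 would nice. real python please."))
```

Unlike capitalize(), this leaves the rest of the line untouched, so "Python" at the end of the first line keeps its capital P.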