[wp-polyglots] continents-cities translation help
Wacław Jacek
mail at waclawjacek.com
Wed May 27 20:58:19 GMT 2009
Hey guys!
I wrote a little program in Python which might help translate the contents of continents-cities. Hope it does its job for you!
Here's the code (works as of 27th of May 2009, might require adjustments later if Wikipedia's way of displaying stuff changes):
__________________________________________________________________________
# coding: utf-8
# Wacław Jacek <mail at waclawjacek.com>
# Licence: GPLv3 (it's available here: http://www.gnu.org/licenses/gpl.txt)
### CHANGE THE TWO BELOW!!! ###
locale_code = 'pl' # Wikipedia URL prefix, eg. 'pl' for pl.wikipedia.org
translation_file_name = 'continents-cities-pl_PL.po' # your .po file
headers = {'User-Agent' : 'Super Groovy Geo Name Translator/1.0'}
import urllib
import urllib2
# get the list of words from the file
words = []
translation_file = open( translation_file_name, 'r' )
for line in translation_file:
if line[ : line.find( ' ' ) ] == 'msgid':
begin_pos = line.find( '"' ) + 1 # "+ 1" to skip the quotation mark itself
end_pos = line.find( '"', begin_pos ) # closing quotation mark position
word = line[ begin_pos : end_pos ]
if word != '': # if the string isn't blank
words.append( word )
print '"' + word + '"' # debug
translation_file.close()
print '\nDone getting words from .po file. Will now attempt to translate.\n'
# and off we go! (getting the translations from English Wikipedia)
translations = []
for word in words:
# construct the URL
url = 'http://en.wikipedia.org/w/index.php?title=' + urllib.pathname2url( word ) + '&printable=yes'
# connect and get
request = urllib2.Request( url, None, headers )
try: # if could reach the site
connection = urllib2.urlopen( request )
except urllib2.HTTPError:
translations.append( '' ) # no translation found
continue
site = connection.read()
connection.close()
# find it
query = '<li class="interwiki-' + locale_code + '"><a href="http://' + locale_code + '.wikipedia.org/wiki/' # what to find in the source
begin_pos = site.find( query )
if begin_pos != -1: # if found
begin_pos = begin_pos + len( query )
end_pos = site.find( '"', begin_pos ) # where to trim the output
# format it
translation = site[ begin_pos : end_pos ]
translation = urllib.url2pathname( translation ) # decode the URL
translation = translation.replace( '_', ' ' ) # replace underscores with spaces
# print it (just for debug purposes)
print word + ' -- "' + translation + '" (' + str( begin_pos ) + ' : ' + str( end_pos ) + ')'
translations.append( translation )
else:
translations.append( '' ) # no translation available in Wikipedia's interwiki (links to local sites)
print '\nDone fetching translations from English Wikipedia. Will now put them in a file.\n'
# put the translations into the translation file
output_file = open( translation_file_name + '.autotranslated', 'w' )
for id in range( len( words ) ):
output_file.write( 'msgid "' + words[ id ] + '"\nmsgstr "' + translations[ id ] + '"\n\n' )
output_file.close()
print 'Done!'
__________________________________________________________________________
Cheers,
W. J.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3262 bytes
Desc: S/MIME Cryptographic Signature
Url : http://comox.textdrive.com/pipermail/wp-polyglots/attachments/20090527/30a04def/smime.bin
More information about the wp-polyglots
mailing list