I'm not in at the moment. Please leave a message after the beep and I'll get back to you as soon as possible.
For a quicker response you might want to try sending an @ reply to @mzsanford on Twitter.
I’ve been working for a pretty early stage and popular start up for a few years now and I’ve learned some things. None of what I’ve learned is news to people who have been through the start up mania, and I bet there are better posts out there on the internet. These are my personal ramblings about my experience and might not reflect anyone else’s experience. Having said that, when I was making my decision to join a start up I wanted an informal description of this arc and all I found were venture capitalists and people yearning for the bygone ’90s bubble. I’m neither of those. I’m just a guy who likes to play with badly formed analogies.
Start up life is complicated so no one analogy really explains it all. Instead I’ve opted to break it into three phases, all alike in dignity. It’s not like there is a day where you switch from one phase to the next … and I’m not even sure I could spot these again if I were in the middle of them. This is a hindsight look at the last two years. The craziest and possibly best years I’ve known.
When Prometheus stole fire from the gods he didn’t sell it to his neighbor. He didn’t barter with it or tell the person next to him about “the exciting investment opportunities in state-of-the-art home heating and illumination“. No, he gave it to humanity. I joined a start-up when it was still very small and making no money at all. At the same time we were becoming quite popular and busily working to stay online. I’m not saying we brought something as critical as fire to man, but what I am saying is that we didn’t show up looking like P.T. Barnum and trying to bilk the suckers. We showed up with what we could muster and gave it up for people to use as they like. And boy did they use it … and in ways we never imagined. Looking back on it the Prometheus myth would be more fitting if he was burned badly while delivering fire to man, but that’s another story.
To me the point is that it starts with an idea, yearning to spread, and making that idea available for free is a good start. Selfishly: you are more free to experiment when money isn’t your biggest worry. Pragmatically: Building a product people love will result in more people spending time with your product and becoming the people to whom you market your money-making idea. Bring the fire for free now and when you see they use it to see in the dark start making oil lamps to sell (or stoves to cook).
It’s hard to think of the Industrial Revolution without thinking of profit and loss, especially after the “free fire” talk above, but that’s what I need to ask you to do. Trust me, we’re still in the free period of the start up product but the Industrial Revolution has a good lesson. This phase of the arc is about the accumulation of “technical debt”.
During the Industrial Revolution the skies were filled with black smoke from coal fires. Those coal fires were powering some of the greatest inventions of the 18th and 19th centuries and altered the lives of the people living in those times, and the history of the world in some cases. The focus of all manufacturers was getting their product out to the people, often ignoring any consequences. Start ups are less reckless and unlike Industrial Revolution factories they don’t endanger workers’ lives or the environment, but they leave a wake of refuse all the same. That refuse is called “technical debt” and is sometimes the code left over by late-night, under-fire fixes, but more often than not it is the design work done when the product was being used differently than it would end up being used in the end.
This filling of your office sky with coal smoke is a good thing, actually. While there will be a time when you look back on it and realize you were rash, I would argue you could never have made it to that vantage point without a little rash action.
I’ll tie the second and third phases of this arc together, since they’re very related. In the real world there came a point where we started to talk about “peak oil” and “climate change” and ever since then we’ve looked back on the Industrial Revolution like a youthful indiscretion. Something we would do better if we had it to do again. We’ve more or less painted ourselves into a corner when it comes to energy and environmental issues and now we’re trying to fix it. Start ups are no different in my experience.
During the “Industrial Revolution” phase of the start up there was furious design and implementation work. We didn’t completely know where we were headed but we knew we wanted to get there and find out. Now that we’ve reached a point where we can see what the product is we can look back over the scorched earth behind us and wonder what we were thinking. Well, if we had been busy wondering back then we wouldn’t be around to think about how to fix it now; we would have failed. So I’m glad we did what we needed to survive.
So we’re in search of the promised land of renewable energy, but unlike the real world we can clean up all of the mess we’ve made. We want something that does not pollute our experience (or code base) but can sustain itself. This means fixing the things we’ve created in haste, turning a good product into a sustainable business, and continuing to do what made us a success: focusing on users. We have a group of great people evaluating a bevy of great options and I believe we’ll manage a balance we can be proud of. Maybe in the future I’ll add a fourth phase about utopia … or the afterlife. My bet is on utopia but life can be unpredictable.
I have many friends who are vegetarian to one degree or another and I’m happy to accommodate that. I respect their choice in the same way they respect mine. In asking around about people’s dietary restrictions over the years I’ve found a group who annoy me: The Dishonest Omnivore.
I am an omnivore and I’m not ashamed. I’m also not squeamish about where my food comes from. I know that pork used to be a pig, beef a cow. I can’t say I love the feel of raw chicken in my hands. What I can say is that I like the taste of chicken more than I dislike dealing with it in the raw … and because of that I’m unashamed to be an omnivore.
Anybody who has talked about food with me has heard my rant about burgers shaped like cows. Well, that’s pretty much the gist of this post and it sums up my feelings. If you order a hamburger you need to be comfortable with the fact you’re eating what was formerly a cow (or, depending on where you order it, several different cows). If you ordered and someone told you the cow’s name was “Bessy”, would you order something else? If so, perhaps vegetarianism is for you. Vegetarian food is just as tasty, sustaining and filling as the omnivore diet.
As time has progressed we’ve become farther and farther removed from the source of our food. Where our parents or grandparents might have raised livestock, today the vast majority of us don’t have that sort of connection to our food source. There has been a local food movement brewing, mostly focused on fruits and vegetables. This movement has been about distance in the traditional land area sense … I want less distance in the mental sense. I want to think about the previous form of my food. Be it vegetable:
or animal:
But let’s be honest – nobody is squeamish about what a vegetable looked like in the field.
Like all aspects of computers Unicode has its own security issues. And like all Unicode issues most engineers spend their entire professional career trying to avoid dealing with them. It’s ok, you can be honest, I understand. When I gave my talk about Twitter International at Chirp (the Twitter developer conference) I mentioned some of these issues. After that talk I was surprised how many people who know more about internationalization than I do said they hadn’t considered some of these issues.
I’m not going to go into a ton of detail since I’m not a security researcher. I am, however, an engineer focused on international and as such I think it’s my business to know where my push to internationalize everything reaches its limit. If you’re in a similar position, pushing people to internationalize, you should make sure you fully understand these issues. If you push people to internationalize and in the process create security flaws you’ll be spending your credibility. Don’t spend it on this – the cost is too high.
I recommend the awesome paper Unicode Security Considerations (Unicode Technical Report #36) as it served as the basis for this whole post. The problem is, technical reports are tedious to read so I’m adding this teaser. Here are some highlights (lowlights?) of Unicode security:
The most common security issue I’ve seen with internationalized products is character ambiguity. This same property is commonly used for spam but it also poses security risks. Character ambiguity is using characters from different writing systems that look very similar to the expected characters. People see what they expect to see … this is the enemy.
The security risk of ambiguity is mostly related to impersonation. Impersonation is the underlying mechanism for phishing, which we can all agree is a major security problem. While ASCII alone contained ambiguity (capital I looks an awful lot like lower case L in many fonts) Unicode expands the problem. For example, full-width Latin characters, like “ｆｏｏ”, can stand in for the expected Latin characters, “foo”. Within a sentence that’s easy to spot, but what about “Ｏ” by itself, like “David Ｏ. Selznick”? Cyrillic adds a host of characters very easily confused with Latin text.
This issue has implications beyond the impersonation of people. Any time you present a user-provided string to identify a given entity you run the risk of “impersonating” that entity. That’s a little too abstract, let’s ask Network Solutions and PaypaI.com … oh, did I say PaypaI.com? In many fonts, including most browser defaults, that’s almost indistinguishable. With the introduction of International Domain Names (IDN) there is a very real concern about this. Where does gооgle.com go? Did you notice the two Cyrillic о’s? Good news on this front specifically is that ICANN is more than aware of the issue and is on it.
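To make the homograph problem concrete, here’s a minimal sketch of one possible check: flag any label that mixes Latin and Cyrillic letters. The method name mixed_script? is made up for this illustration, it assumes Ruby 2.4 or later (for match?), and it only looks at two scripts. Real confusable detection, as described in the Unicode security reports, is far more involved.

```ruby
# Hypothetical helper: true when a label mixes Latin and Cyrillic letters.
# Only these two scripts are checked; this is an illustration, not a
# complete confusable-character detector.
def mixed_script?(label)
  has_latin    = label.match?(/\p{Latin}/)
  has_cyrillic = label.match?(/\p{Cyrillic}/)
  has_latin && has_cyrillic
end

# "google.com" with the two o's replaced by Cyrillic U+043E.
spoofed = "g\u043E\u043Egle.com"

puts mixed_script?(spoofed)
puts mixed_script?("google.com")
```

A check like this would also reject perfectly legitimate mixed-script text, so it is better suited to raising a warning than to blocking input outright.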
As somebody who speaks a little Arabic, and generally geeks out about Unicode and right-to-left this is a personal favorite. It turns out that in addition to the built-in Unicode character direction there are also some characters for explicitly controlling character direction. This is a case where adding support for something produces some unexpected security issues. Packet Storm Security has a great, easy to read paper on the subject [PDF].
By forcing the direction you can make ‘foo’ appear as ‘oof’, which seems innocuous enough. Where things get interesting is when programs try to augment text with auto-linking. I’ve been doing some Ruby on Rails work and oftentimes projects use the auto_link() helper function. If a user provides the text that is being passed into auto_link() you can end up with:
auto_link("Change your password at #{[0x202E].pack('U')}http://MALWARE.COM/long/and/impressive/secure/looking/url/here/moc.knabitic.www://ptth")
Which in both Firefox and Safari looks like so:
Note that this still links to malware.com.
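One possible mitigation, sketched here under my own assumptions, is to strip the explicit direction-control characters from user-provided text before it is auto-linked. The constant and method names are hypothetical, and whether removing these characters is acceptable depends on your product, since they have legitimate uses in mixed-direction text.

```ruby
# Explicit directional formatting characters: LRM/RLM (U+200E, U+200F),
# the embedding and override controls (U+202A..U+202E) and the isolate
# controls (U+2066..U+2069).
BIDI_CONTROLS = /[\u200E\u200F\u202A-\u202E\u2066-\u2069]/

# Hypothetical helper: remove direction controls from untrusted text.
def strip_bidi_controls(text)
  text.gsub(BIDI_CONTROLS, "")
end

rlo = [0x202E].pack("U") # RIGHT-TO-LEFT OVERRIDE, as in the example above
untrusted = "Change your password at #{rlo}http://MALWARE.COM/"

puts strip_bidi_controls(untrusted)
```

With the override removed, the URL is displayed in its stored order, so the reader sees the host name that the link actually points to.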
Hopefully this was educational and not too dry. I recommend reading the Unicode Security Considerations document as well as the closely related Unicode Security Mechanisms document if you’re interested in other possible security issues. I didn’t even touch on the lower level buffer overflow errors in text processing … most people reading this are using a sufficiently high level language such that they assume they can ignore that.
After reading Alex Payne’s post on heroism (Don’t Be A Hero) I have to say I was a little irked. I disagree somewhat on the details of what defines a hero in this context and that seems to be the crux of my discomfort. I don’t think heroes have to work until four in the morning. Nor do I think a hero creates inherently lower quality software. A hero is someone so dedicated and passionate about what they are doing that they are willing to work hard and deliver when other people are not (and for some people, what they are passionate about is not low-quality “feature work”). For some “heroes” this becomes late nights, for others early mornings, and for still others it’s a during-the-day activity with no extra time. I’ll be honest, that last case is pretty rare, because the passionate usually see time as flexible and success as a rigid goal.
I was never really sure how to sum up my feelings on the post until last night. Oddly, it was the Eye of The Tiger scene in Persepolis that enlightened me. I’ve seen Rocky many times – and even once in the last few weeks – but somehow seeing that well-worn scene re-used highlighted my feelings. What’s great about Rocky is that it gives me a way to sum up not just The Hero, but also the personalities that often surround them.
The role of support is one that is often overlooked but shouldn’t be marginalized. Many would-be heroes fail because nobody is there to provide the unwavering support needed to keep up the hard work and dedication. While The Hero does things out of a sense of passion even that can be worn down by the continual hardships created (See: Anti-the-hero later in this post). Without The Supporter’s positive support it’s easy to get discouraged. This positive support contrasts the criticism of The Catalyst (more on that next) to create a balance that sustains The Hero. The role of The Supporter is so obvious there is not much to add other than it’s a small but critical requirement.
The Catalyst is someone unfazed by heroism who pushes the hero to continue. This person is key in that they provide the criticism. This is the criticism discussed previously by Alex (Criticism, Cheerleading, and Negativity) and constitutes the best undiluted feedback The Hero is going to get. The Catalyst is usually the only person as hard on The Hero as they themselves are. Without The Catalyst, The Hero will plateau at some point where they feel they are trying to make it but they’re still a bum. The pre-catalyst stage is essentially where the movie Rocky begins.
I’ve often said: “People love to see an underdog win. Almost as much as they want to see a hero fail”. I started saying this early in my working life as what I thought was great satirical hyperbole. It turns out that long after the fact I’ve only confirmed it is true in some cases. Throughout Rocky Paulie acts in selfish and crude ways that undermine the success of Rocky and the happiness of his own sister. Paulie is a believable character because there are people like this throughout our lives … they are not directly against us but through their action they continually derail us.
I’m not saying that anyone against heroism is out to stop The Hero. In some cases it’s simply a misunderstanding of why The Hero exists. It’s not to be a martyr, nor to prove something to someone. It’s simply to prove to themselves that they can go fifteen rounds with Apollo Creed and come out on the other side. There is an important point in the finale of Rocky: he doesn’t win the fight. It’s about passion, not success.
At the end of the film, after the fight but before finding out he lost in a split decision, Rocky has the following to say to Apollo Creed:
Apollo Creed: Ain’t gonna be no rematch.
Rocky: Don’t want one.
The Hero does not exist to please you. The Hero does not exist to feel better than you. The Hero exists to feel better than they did when they got up this morning. Where Alex and I can certainly agree is that The Hero is a detriment if he or she is focused on fire-fighting and slap-dash software development. Where we seem to diverge is that I have seen many a hero whose passion is high quality software AND features. The issue isn’t heroism, it’s heroism with the wrong focus.
Unicode support in Ruby doesn’t get much attention. Most of the information about it focuses on MySQL more than it does on actual Ruby support. Ruby can read and write Unicode data without much trouble but actually working with it, and moreover making sure it does not get corrupted, is one of the lesser visited back-alleys of Ruby. Hopefully I can make some more time to blog about other Ruby/Unicode interaction but I have to start somewhere so Regular Expressions are as good a place as any. Perhaps better since they’re their own dark art.
When it comes to Unicode and Regular Expressions the \w escape (for matching word characters) is the most commonly misused. Ruby makes this situation all the more difficult by changing behavior based on a global variable, $KCODE.
When most programmers use the \w escape they mean [a-zA-Z0-9_] (the ASCII meaning usually attributed to [[:word:]], which is an extension rather than a class POSIX itself defines) and Ruby will work like that … until the $KCODE changes. Once $KCODE is set to u (Unicode) the \w escape starts matching any word character in any language, including things like ش or ㌳. Check out gist 274731 for a working example, or the similar patch to the OAuth gem, which shows that this isn’t only theoretical. It isn’t just complex things like OAuth request signatures; imagine this as a validation on a user name (which would allow some of the commonly confused characters, like í).
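If what you mean is the ASCII set, one way to sidestep the question is to spell the class out rather than lean on \w. The sketch below is my own illustration, not the code from the gist: valid_username? is a hypothetical name and it assumes Ruby 2.4 or later (for match?). Note that $KCODE only applies to Ruby 1.8; later versions document \w as ASCII-only and offer [[:word:]] and \p{Word} for the Unicode-aware meaning.

```ruby
# Explicit ASCII-only class, anchored to the whole string, so the result
# does not depend on how a given Ruby version interprets \w.
ASCII_USERNAME = /\A[a-zA-Z0-9_]+\z/

# Hypothetical validation helper for a user name.
def valid_username?(name)
  name.match?(ASCII_USERNAME)
end

puts valid_username?("mzsanford")
puts valid_username?("m\u00EDke") # contains í (U+00ED)
```

Whether user names should be restricted to ASCII at all is a product decision; the point is only that the pattern now says exactly what it accepts.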
Space (\s): The Final Frontier
Another common misconception about Regular Expressions is that the \s escape handles all space characters. While it does match more than “ ” (U+0020) alone it’s by no means complete. There are a multitude of space-like characters in the Unicode standard but when it comes to natural language there is a small subset that will suffice in the vast majority of cases. In fact, U+0020 will cover most languages but fails on East Asian ideographic writing systems (which don’t space separate words, as I’ve mentioned in the past) where the full-width space (U+3000) is used.
If you’re well versed in Regular Expressions you might consider POSIX character classes the answer to the problem. The POSIX standard defines the longer named character class [[:space:]] but it’s a direct equivalent to the \s escape. For a practical demonstration check out gist 274725 over on github.
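A minimal sketch of the “small subset” approach: list the extra space characters you care about next to \s. The choice of U+00A0 (no-break space) and U+3000 (full-width space) is my assumption about a reasonable subset, and the names are hypothetical. Listing them explicitly means the result does not depend on how a particular Ruby version treats \s or [[:space:]].

```ruby
# \s plus two space-like characters that commonly show up in real text:
# U+00A0 (no-break space) and U+3000 (ideographic / full-width space).
SPACE_LIKE = /[\s\u00A0\u3000]+/

# Hypothetical helper: split text on any run of space-like characters.
def split_on_spaces(text)
  text.split(SPACE_LIKE)
end

p split_on_spaces("foo\u3000bar baz")
```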
Not every country and language uses the same numeral system. One thing that makes programming slightly easier is that the Arabic numeral system (0123456789) has become more or less the standard throughout the computing world. This convenience has allowed Ruby (and most other languages) to ignore the alternate numbering systems Unicode allows. A rather contrived example is that of braille but a much more common one is the numeral system used in Egypt, the so-called “Arabic-Indic” digits (٠١٢٣٤٥٦٧٨٩). As you can see from gist 274737 on github the \d escape does not match any of these (nor does [[:digit:]]) and String#to_i doesn’t handle them either. Again, the good news is how widespread the Arabic numeral system has become.
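If you do need to accept these digits, one simple approach is to map them onto 0-9 before parsing. This is only a sketch: to_ascii_digits is a hypothetical helper, it covers just the Arabic-Indic block (U+0660 to U+0669) and none of the other digit sets in Unicode, and it assumes a Ruby version whose String#tr understands multibyte ranges (1.9 or later).

```ruby
# Hypothetical helper: translate Arabic-Indic digits (U+0660..U+0669)
# to their ASCII equivalents so String#to_i can parse them.
def to_ascii_digits(text)
  text.tr("\u0660-\u0669", "0-9")
end

arabic_indic = "\u0663\u0664" # the digits three and four

p arabic_indic.to_i                  # to_i stops at the first non-ASCII digit
p to_ascii_digits(arabic_indic).to_i
```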
No programming language handles Unicode perfectly, and Regular Expressions are very often problematic corners of Unicode support. This isn’t Ruby specific and to be totally fair Ruby does a better job than some others. Like all posts this isn’t exhaustive as much as an introduction to some of the most common issues. If you’re interested in more information feel free to contact me on Twitter (@mzsanford) or apply to work with me on the interesting problems I’m finding every day.
Tokenization refers to splitting any data into chunks, and in the case of this post I’m focusing on splitting text into words. The process of turning free-form text into individual pieces of information (words, phrases, sentences, etc.) is something that natural language processing (NLP) researchers have been interested in for years. There is a whole field of study on the subject that this post does not hope to even touch on. For developers with no language experience this process is usually overlooked as absurdly simple. I mean split(/\W+/), right? If you nodded then this is for you. If you think that was overly simple this will probably be old hat.
English, as my native language, turns out to be one of the easier languages to do basic tokenization on. By basic I actually refer to split(/\W+/). This will split a sentence into words and it will be correct in many cases. Like everything else in English, it’s pretty much defined by its exceptions. Hyphenated words are a pretty obvious stumbling block but there are some others as well.
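To see a couple of those exceptions in action, here’s the basic split applied to a sentence with a hyphenated word and a contraction. The sentence and the method name are made up for the illustration.

```ruby
# Basic tokenization: split on runs of non-word characters.
def basic_tokens(text)
  text.split(/\W+/)
end

p basic_tokens("The well-known editor doesn't care")
# => ["The", "well", "known", "editor", "doesn", "t", "care"]
```

The hyphenated word comes back as two tokens and the contraction is cut at the apostrophe, which may or may not matter depending on what you do with the tokens next.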
Depending on what you’re planning to use data for there is some more processing that might be needed. Data normalization is a normal step in almost every process so I don’t think anyone will find that surprising. Like any other normalization this is very dependent on what you plan to do, and this is where English can get tricky. Here are a few of the common normalization tasks with words:
The same thing goes for many other European languages. What varies between these languages isn’t the delimiters, which are still spaces and punctuation, but instead the amount of normalization needed. While your project might not need stemming for English it’s possible that the vowel changes in German conjugation will require it. One other small detail people skip is that \w (or the inverse, \W) may not match accented characters depending on your programming environment.
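Here’s a small sketch of that last detail, assuming Ruby 1.9 or later where \w and \W are documented as ASCII-only. The method names are hypothetical, and using Unicode letter and number properties is just one way to keep accented words whole.

```ruby
# ASCII-only view of "word characters": accented letters count as
# delimiters, so they are cut out of the tokens.
def ascii_tokens(text)
  text.split(/\W+/)
end

# Unicode-aware alternative: collect runs of letters, numbers and
# underscores using Unicode properties.
def unicode_tokens(text)
  text.scan(/[\p{L}\p{N}_]+/)
end

text = "caf\u00E9 au lait" # "café au lait"

p ascii_tokens(text)
p unicode_tokens(text)
```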
While talking about normalization of languages people may or may not speak it’s always good to give an example of something language-specific. I speak German so it’s a natural choice for an example. I implied above that the major difference between English and other European languages is normalization, but there is a German-specific issue that blurs the line between tokenization and normalization … decomposition.
German is pretty notorious for having long, silly sounding words. What people often miss is that these are actually compounds. For example, sliced sugar beets used during sugar processing are called zuckerrübenschnitzel:
Now, that seems like a bit of a specialized word to me, however this was a label in a museum. You see, it’s actually three words: zucker (sugar), rüben (beet), schnitzel (slices). So, while that’s one token, it can be decomposed into 3 words … whether you need to do that decomposition or not depends on your application. How you actually do that is a whole ‘nother blog post.
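Just to give a flavor of what decomposition means, here’s a deliberately naive sketch: a longest-prefix-first dictionary lookup with backtracking. The three-word dictionary and the method name are invented for this example. Real German decompounding has to deal with linking letters, ambiguous splits and a much larger vocabulary, none of which this handles.

```ruby
# Tiny illustrative dictionary (an assumption, not a real word list).
# "r\u00FCben" is "rüben".
DICTIONARY = ["zucker", "r\u00FCben", "schnitzel"].freeze

# Hypothetical helper: split a compound into dictionary words, trying the
# longest prefix first and backtracking when the remainder cannot be split.
# Returns an array of parts, or nil when no full decomposition exists.
def decompose(word, dictionary = DICTIONARY)
  return [] if word.empty?

  word.length.downto(1) do |len|
    head = word[0, len]
    next unless dictionary.include?(head)

    rest = decompose(word[len..-1], dictionary)
    return [head] + rest if rest
  end
  nil
end

p decompose("zuckerr\u00FCbenschnitzel")
```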
So, English can be hard to process depending on how you plan to use the data. Chinese, Japanese and Korean are pretty much always hard. Lately I’ve been working on the problem of Japanese and I’ll do a full post on that at a later date. For English speakers, consider how you would process things if there were no spaces between words. The three writing systems in Japanese provide some clues but the essence of the problem is the word delimiters. I’ll leave it at this for now.
Without going into the nitty gritty of Japanese tokenization there is a pretty good example that comes to mind. Imagine a system that lets you post short messages, and that many of those contain links. Now assume that at display time you want to automatically link the URLs that appear in the message. Now, here we go in Japanese (well, some Japanese characters laid out like a short message):
ののhttp://example.com/のの
Auto-linking requires that you identify the URL in the midst of all of the other text. While this is a pretty simple problem I should point out that the default auto-linking in many languages and libraries does not handle this correctly. The easiest route is to use what you know about valid URL characters (simplified to host name only, other URL components are left to people willing to read the RFC):
message.gsub(%r{http://[a-z0-9.-]+\.[a-z]{2,}/?}i) { |url| … }
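Here’s a runnable version of the same idea, as a sketch. The method name auto_link_hosts and the anchor markup in the block are my own assumptions, the pattern is still limited to host names, and a real implementation would HTML-escape the surrounding text as well.

```ruby
# Host-name-only URL pattern: because the character class is limited to
# ASCII letters, digits, dots and hyphens, the match stops at the first
# Japanese character.
URL_PATTERN = %r{https?://[a-z0-9.-]+\.[a-z]{2,}/?}i

# Hypothetical helper: wrap each matched URL in an anchor tag.
def auto_link_hosts(message)
  message.gsub(URL_PATTERN) { |url| %(<a href="#{url}">#{url}</a>) }
end

puts auto_link_hosts("\u306E\u306Ehttp://example.com/\u306E\u306E")
```

The characters on either side of the URL stay outside the link, which is exactly the part that whitespace-based auto-linkers tend to get wrong.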
I’ve mentioned Arabic before and it’s normally either ignored when talking about internationalization or it’s skipped after some hand waving about right-to-left. Arabic is a phonetic language every bit as much as English (arguably more so … I’m looking at you faux pas) with an alphabet and spaces between words. To some extent it’s just like the European languages I mentioned above … almost.
Arabic relies very heavily on prefixes and suffixes connected directly to words. The most ubiquitous of these is the definite article ال (Al-, meaning “the”). It’s questionable if this is strictly a tokenization problem, but it does mean that using Arabic data without specific normalization is of very limited usefulness. Possession is also represented by a suffix attached directly to the word, as are a myriad of other things. This is sort of like the German example above, only it affects so many words that you’ll have to tackle it sooner if you plan to find meaningful data.
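As a very rough illustration of that kind of normalization, the sketch below strips a leading definite article from a token. It is intentionally naive and the names are hypothetical: some words simply begin with those two letters, and other attached prefixes and suffixes are ignored entirely. Real Arabic stemming needs a proper morphological approach.

```ruby
# The definite article: alef (U+0627) followed by lam (U+0644).
DEFINITE_ARTICLE = "\u0627\u0644"

# Hypothetical helper: drop a leading definite article, but only when at
# least two characters would remain, as a crude guard against mangling
# very short words.
def strip_definite_article(word)
  if word.start_with?(DEFINITE_ARTICLE) && word.length > DEFINITE_ARTICLE.length + 1
    word[DEFINITE_ARTICLE.length..-1]
  else
    word
  end
end

the_book = "\u0627\u0644\u0643\u062A\u0627\u0628" # "the book"
puts strip_definite_article(the_book)
```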
When English speaking developers first encounter languages like Hebrew or Arabic where things are written from right to left they react in one of two ways. Either they see this as insurmountable to support in their application or they feel the opposite and assume that since they have UTF-8 everything will just work. While most modern programming languages support UTF-8 encoding that does not mean that everything does it correctly, and often the right-to-left layout is an overlooked part of UTF-8 support. This post hopes to clarify a little bit about right-to-left processing and Arabic in particular since I speak some of that and it inspired this post.
For the more detail oriented please note that I’ve skipped any discussion of endian-ness.
This is a common source of confusion with right-to-left languages, even for advanced developers. When English-speaking developers think of text they think of bytes streaming from left to right, top to bottom, the same way they read. While that’s the way we visualize the data it is, in fact, just a string of bytes without any direction. It’s best to break with any confusing directionality and think of the bytes as running in a single top-to-bottom line. Here’s some sample English text to belabor the point:
abcd
0x61 0x62 0x63 0x64

becomes:

a 0x61
b 0x62
c 0x63
d 0x64
With that little visual exercise out of the way I can move on to right-to-left languages. If you look at the second part of the above it is the same order as the bytes used for right-to-left languages. The first character a native speaker would write is the first character in the data stream, the second comes next, and so on. This is nothing revolutionary but I can’t count the number of times I have heard skilled developers say things like “but in Arabic the string is backwards”. It’s very easy to fall into the trap, don’t be fooled. The string isn’t stored in reverse order, it is displayed in reverse order.
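You can check the storage order for yourself. This sketch lists each character of a two-letter Arabic string with its UTF-8 bytes, in the order they sit in memory. The helper name is made up, and it assumes a Ruby with encoding-aware strings (1.9 or later).

```ruby
# Hypothetical helper: pair each character with its UTF-8 bytes, in
# storage (logical) order.
def logical_bytes(text)
  text.each_char.map do |ch|
    [ch, ch.bytes.map { |b| format("0x%02X", b) }]
  end
end

# Alef (U+0627) is typed first, then beh (U+0628).
logical_bytes("\u0627\u0628").each do |char, bytes|
  puts "#{char} #{bytes.join(' ')}"
end
```

Alef comes out first even though a terminal or editor will draw it on the right.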
As stated above the bytes for a right-to-left string are stored in the same logical order but are displayed in reverse. That sentence almost makes the assumption that UTF-8 “just handles” right-to-left seem correct. The main problem is that it’s all up to the display program to do things correctly. If your application is using a web browser or OS standard text control you’re probably using the OS text layout engine. These modern layout engines are probably going to work out fine, I know they do in all of the OSes I’ve used recently. Where things get more interesting is in graphics processing libraries. If you are writing (or using) a graphics processing library that focuses on primitive drawing (lines, shapes, etc.) it’s very likely the text layout engine was an afterthought. It’s also pretty likely it was added by an English-speaking developer with no thought toward non-Latin scripts (right-to-left as well as ideographic systems like Chinese).
That’s all well and good but it doesn’t explain how layout engines should be handling right-to-left layout. I’ve never designed a text layout engine … it’s hard and the OS native ones do a great job. I’m not writing this to explain how to write a layout engine. Firstly it’s a complicated subject of which I only know what I need, and secondly I would caution anyone against writing such a thing again. What I want to cover is the basics of how the bytes in the same order as Latin scripts end up the other direction. Oddly that contrast is best covered in the next section, where you’ll see them together.
Text with mixed character sets is very common across the internet. A big part of this is that HTML and HTTP are both run on the Latin script (hell, they’re all English and English abbreviations). This means the HTML markup and things like URLs need to co-exist with right-to-left content in many places. Website names are a perfect example of that. The basis of directionality in Unicode is that all directionality is defined on a per-character basis. I’ll start with an example and explain from there.
Text: abابab

Unicode  Bytes   Letter
-------  ------  ------
U+0061   0x61    a
U+0062   0x62    b
U+0627   0xD8A7  ا
U+0628   0xD8A8  ب
U+0061   0x61    a
U+0062   0x62    b
There is an algorithm for bidirectional character layout, but I find it’s easiest to think of it as: a group of characters with the same directionality is processed by the layout engine together. This means that a group of right-to-left characters surrounded by left-to-right characters will be reversed, as a native speaker would expect. This also means that a single character of a different directionality does not break those around it (like aبc). I’ll use bytes so it’s clear what I mean. If you look at the displayed text in the example above you’ll see that after 0x62 you next see 0xD8A8, which is the fourth character in the stream. When the layout engine reaches the third character (0xD8A7) it finds a directionality of right-to-left, then the fourth character (0xD8A8), which is also right-to-left. Since these are both right-to-left they are displayed as such. The following character is once again 0x61, which is again a change in directionality.
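That mental model can be sketched in a few lines. To be clear, this is not the Unicode Bidirectional Algorithm, just the simplification described above: characters are grouped into runs of the same direction and right-to-left runs are reversed for display. Treating only Arabic and Hebrew script characters as right-to-left, treating everything else (including spaces, digits and punctuation) as left-to-right, and the method names themselves are all assumptions of this sketch, which needs Ruby 2.4 or later.

```ruby
# Simplification: only Arabic and Hebrew script characters count as
# right-to-left; everything else is treated as left-to-right.
RTL_CHARS = /[\p{Arabic}\p{Hebrew}]/

def direction(char)
  char.match?(RTL_CHARS) ? :rtl : :ltr
end

# Hypothetical helper: turn logical (stored) order into a left-to-right
# list of what would be drawn, by reversing each right-to-left run.
def visual_order(text)
  text.each_char
      .chunk_while { |a, b| direction(a) == direction(b) }
      .map { |run| direction(run.first) == :rtl ? run.reverse.join : run.join }
      .join
end

stored = "ab\u0627\u0628ab" # a, b, alef, beh, a, b
visual_order(stored).each_char { |c| puts format("U+%04X", c.ord) }
```

Printing code points rather than the string itself avoids your terminal applying its own bidirectional layout on top of the result.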
I started working on this post because I have an interest in languages and computers. But probably more so because I speak some Arabic and it has some interesting text layout issues. Right-to-left is one of the most obvious issues, but the character connection is one that I have seen unimplemented most often (Adobe Flash, TextMate, etc.). Arabic is written with characters that connect to the subsequent character, sort of like cursive in English. Arabic complicates that a bit more by having some characters that connect on both sides (like ب) and others that only connect on the right (like و). This post isn’t about Arabic letter forms but it shows where text layout engines are more complicated than people think. Let’s look at one quick example of what characters I type versus what is displayed.
I Type (and store in a file): ل ل ل

Unicode  Bytes   Letter
-------  ------  ------
U+0644   0xD984  ل
U+0644   0xD984  ل
U+0644   0xD984  ل

Displayed As: للل

Which are actually the characters …

Unicode  Bytes     Letter
-------  --------  ------
U+FEDF   0xEFBB9F  ﻟ
U+FEE0   0xEFBBA0  ﻠ
U+FEDE   0xEFBB9E  ﻞ
As you can see the characters stored and the characters displayed are all different. This choice to use different characters in order to make the letters connect is done by the layout engine. Writing this was actually quite hard since I did it in TextMate and it uses a text layout engine that does not connect characters.
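If you want to check byte values like these yourself, Ruby can compute the UTF-8 encoding of any code point. The helper name utf8_hex is made up for this sketch.

```ruby
# Hypothetical helper: the UTF-8 bytes of a code point as a hex string.
def utf8_hex(codepoint)
  "0x" + [codepoint].pack("U").bytes.map { |b| format("%02X", b) }.join
end

puts utf8_hex(0x0644) # lam as typed and stored
puts utf8_hex(0xFEDF) # lam, initial presentation form
puts utf8_hex(0xFEE0) # lam, medial presentation form
puts utf8_hex(0xFEDE) # lam, final presentation form
```

The stored letter takes two bytes in UTF-8 while each presentation form takes three.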
This was a very simple explanation of right-to-left character display. Nothing revolutionary but the idea was to point out that applications usually fall between the two initial reactions of dread and the expectation it will all “just work because I’ve gots teh Unicode”. This post leaves out the very large localization issues that right-to-left languages create. I’ll use web design for some examples, since it’s something I know and something you’re likely to know as well:
This skips the cultural issues of Hebrew and Arabic localization (obviously different in many ways), but I want to touch a bit on that last one since it’s a favorite of mine. Imagine the stock photo of a soaring business chart with the climbing red line and no scale. Now imagine if you read from right-to-left, and thus thought of the X-axis as reversed … you just told everyone you’re failing more every day. Good job.
There have been a few questions on the Twitter API development list asking about how search.twitter.com is able to detect the language of a tweet. The methods used are nothing new to the field of natural language processing (NLP), but most developers haven’t studied much NLP. I’ll cover the industry standard method we’re using, as well as the shortcomings.
I’m a language geek but not a linguist or NLP scientist so I started with a knowledge of programming but not of the existing techniques for language detection. I was able to recognize spoken and written languages I didn’t speak and that sparked my interest in what I was gleaning that information from. I’m no prodigy so there must be some simple mental process. I thought language-specific search would be nice so I read up and started on the code.
My first thought was that you can determine a language by using some of the most common words. I spent a lazy Saturday afternoon thinking about it and came up with an idea. While ‘die’ is a word in English, it’s a very common article in German (feminine ‘the’). I started to think about how I could leverage that knowledge to detect languages but ran into a wall. You see, I speak German so it wasn’t a good explanation for my ability to pick up the difference between spoken Chinese and Korean. Those two bring up a good point: the way I tell them apart in writing (characters) differs from how I do it in speech (tones). In languages, the most common words tend to be the shortest, and with only a limited number of syllables to go on it seemed like my common-words method was doomed to failure.
It seemed clear my first idea wasn’t going to work but it seemed close to a statistical method for identifying a language. That phrase ‘statistical method’ made me think of conference papers so I started searching those. That reading not only brought me up to speed on the current state of language detection, but increased my general language interest.
I was pleasantly surprised to find out I wasn’t totally off base in looking at distribution of the data. It’s the basis of statistical analysis so how could I really be that far off, but being new to all of this I feared I had made the most rookie of mistakes. It turns out that if you chop words up into groups of letters and store the distribution for each language they are different enough to let you determine a language with pretty good accuracy.
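To make the idea concrete, here is a toy sketch of a letter-group (n-gram) distribution check. It is not the Nutch code and not what search.twitter.com runs: the function names, the choice of trigrams, the cosine similarity measure, and the two tiny made-up training sentences are all assumptions for illustration. A real profile needs far more training text per language.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Relative frequency of letter n-grams, with '_' marking word boundaries."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def similarity(p, q):
    """Cosine similarity between two n-gram distributions."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def detect(text, profiles):
    """Pick the language whose stored distribution is closest to the sample's."""
    sample = ngram_profile(text)
    return max(profiles, key=lambda lang: similarity(sample, profiles[lang]))


# Tiny, made-up training text; purely illustrative.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(detect("the dog and the cat", PROFILES))
print(detect("der hund und die katze", PROFILES))
```

Padding each word with a boundary marker lets the profile capture common prefixes and suffixes, which tend to be quite distinctive between languages.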
This method is very successful but it requires that you have a large set of training data for each language in order to get an accurate distribution. There are some academic collections you can use, but not for a commercial product. My background is in web crawlers, so crawling a series of sites for each language seemed reasonable. The problem was, without language detection I wouldn’t know what language it was. A bit of a catch-22.
Enter Wikipedia. Their data is divided by language, freely available, large, and created on a variety of subjects by a variety of authors. In my brief NLP and language reading I had already learned that the subject, author and audience of a work will have a large influence on the types of words used. For example, if you did your English training on legal contracts you would think we use much more Latin than we really do.
The code for doing this character distribution work can be found in Nutch. That code was hard to find when searching for language detection. Being a crawler developer I remembered seeing it and verified it works based on character distribution. My crawlers and my language hobby were coming together at last. I did some crawling for additional languages from Wikipedia and then realized that ideographic languages fail using this method since they don’t use as restrictive an alphabet.
While the character distribution method handles languages using the Latin alphabet really well it does break down on some other alphabets. It works surprisingly well on languages using the Arabic alphabet (Arabic, Farsi, etc), as well as Cyrillic, Hebrew and a slew of others. I am guessing any semi-phonetic language is going to match that pattern. Where it has problems is Kanji, since a word is not made up of a combination of characters from a small set.
I don’t speak Chinese or Japanese but I used them as a good example of two ideographic languages that the character distribution method fails to differentiate. What’s interesting between these two is that Japanese actually uses several different character sets for text. You can see in the Unicode table there are Kanji, Hiragana and Katakana … which got me thinking: What if you used the statistical distribution of character sets?
As it turns out this gives reasonably good results. Good enough that they are worth keeping. I removed all ideographic training data from the character distribution check and made the code try a second method where it checks the character set distribution. A bit of manual evaluation and some checks for minimum confidence later and it seems like we are sorting Chinese from Japanese correctly often enough to make me, a non-speaker, happy.
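A rough sketch of a character set distribution check might look like the following. It is an assumption-heavy illustration rather than the actual implementation: the function names, the `kana_threshold` and `min_confidence` values, and the simple "any meaningful share of kana means Japanese" rule are all invented here, whereas the approach described above compares against trained distributions. Scripts are identified from Unicode character names in Python's standard `unicodedata` module.

```python
import unicodedata
from collections import Counter

# Unicode name prefixes mapped to a coarse script label.
SCRIPT_PREFIXES = (
    ("HIRAGANA", "hiragana"),
    ("KATAKANA", "katakana"),
    ("CJK UNIFIED IDEOGRAPH", "han"),
)


def script_of(ch):
    """Coarse script label for one character, based on its Unicode name."""
    name = unicodedata.name(ch, "")
    for prefix, script in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return script
    return "other"


def script_distribution(text):
    """Fraction of non-space characters belonging to each script."""
    counts = Counter(script_of(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {script: c / total for script, c in counts.items()} if total else {}


def guess_cjk_language(text, kana_threshold=0.1, min_confidence=0.5):
    """Return 'ja', 'zh', or None when too little of the text is CJK.

    Both thresholds are hypothetical values chosen for this sketch.
    """
    dist = script_distribution(text)
    kana = dist.get("hiragana", 0.0) + dist.get("katakana", 0.0)
    han = dist.get("han", 0.0)
    if kana + han < min_confidence:
        return None  # not confident enough to answer
    return "ja" if kana >= kana_threshold else "zh"


print(guess_cjk_language("これは日本語のテキストです"))
print(guess_cjk_language("这是中文文本"))
print(guess_cjk_language("hello world"))
```

The minimum-confidence check matters for short or mixed text: a tweet that is mostly Latin characters with one ideograph should fall through to another method instead of being forced into Chinese or Japanese.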
I’m not a linguist. I’m not a computational linguist. I’m a programmer who is fascinated with language. I learned the basics of language detection and extended it a little to cover ideographic scripts. I’ve had some success and I hope this helps you have a little too. Our training data still needs some work, but I think overall I’ve found a solution that is pretty damn good for the cost … which was my time and a little CPU.
Tech manager of international at Twitter.
Old movie fan.
Living in SF.
This blog is not affiliated with my employer, Twitter, and reflects only my personal views.