Qordoba’s Open Source StringExtractor

See if this sounds familiar: a few months ago, one of Qordoba’s enterprise clients signed up with us to start a large-scale localization project for their application. As a prerequisite, several of their engineers sat down to identify all of the text strings that would need to be mapped before a localization process could render translated versions of their software.

It quickly became apparent that the hard-coded text string mapping, key generation, and replacement would take their small engineering team about six months to complete. For obvious reasons, this timeframe was unacceptable. Frustrated, they came to Qordoba’s tech team to see if there was a solution or suggestion that could help them expedite the project.

Thus the Qordoba StringExtractor was born. It helps customers who are creating segmented interfaces eliminate one of the biggest engineering bottlenecks and barriers to creating and managing multiple versions of their apps. The Qordoba StringExtractor, which will soon launch in a more advanced beta version powered by machine learning, has saved Qordoba’s customers upwards of 77% of the time they estimated manual string mapping and key replacement would have taken.

If you are new to major UI refreshes or to localizing a code base, one of the largest bottlenecks is prepping the code for versioning. How much work that involves depends on the volume of code. We have seen customers whose codebase would require a team of 4-5 developers working full-time for 5 weeks to manually:

  • (1) copy/paste and apply regex to extract every string one by one
  • (2) generate new, unique keys
  • (3) replace each string with its key

How does Qordoba’s StringExtractor work?

StringExtractor works via a Python package, the Qordoba CLI, which uses an underlying lexer and lets you run three simple terminal commands:

  • Extract – copy/pasting and applying regex to identify and pull out every string
  • Generate – generating new unique keys based on the text
  • Execute – plugging the keys in at the location of each string

To get a better grasp of how we designed our StringExtractor and how it works, you should download the open source version from our GitHub repo and try it along with the simple XML file extraction scenario I outline next.

To install the open-source command line tool, use pip (shown here on a Mac): `pip install qordoba==0.2.0a1`

Select an example XML file of your choosing, or find a sample one here (link). Create the following folder structure with the XML file in your input_dir. This will later become your app repository.

Step 1: Extract + confirm

During the first step of extraction, the Qordoba StringExtractor takes your input directory path and your report path.

The report directory will later hold a file that contains all your extracted strings. To start, please execute the extract command with: qor i18n-extract -i input_dir -r report_dir

Our model scans your file formats and executes a lexer which parses the files, identifies and then extracts strings from your codebase.
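To make the idea concrete, here is a minimal sketch of what extracting strings from an XML file could look like. It is not the Qordoba lexer: it uses Python's standard XML parser, and the function name `extract_strings` and the record layout (`tag`, `text`) are assumptions made for illustration, not the real report format.

```python
import json
import xml.etree.ElementTree as ET


def extract_strings(xml_text):
    """Collect every non-empty text node as a {"tag", "text"} record.

    Hypothetical helper: the real StringExtractor report may look different.
    Text that follows a closing tag (element.tail) is ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter():
        text = (element.text or "").strip()
        if text:
            records.append({"tag": element.tag, "text": text})
    return records


sample = (
    "<resources>"
    "<string name='title'>Welcome back</string>"
    "<string name='cta'>Sign in</string>"
    "</resources>"
)

# Print the extracted strings as a JSON report, one record per string.
report = extract_strings(sample)
print(json.dumps(report, indent=2))
```

A real lexer also has to deal with attributes, comments, and other file formats, which is why the tool ships its own.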

Once this is complete, open the newly created report within your report directory. This JSON report should contain your extracted strings.

Note: If you want to customize the extraction process down the line, it is relatively simple to build your own lexer. I will give an example in a future post.

As the last part of step one, just before you generate keys, we highly recommend that you go through the extracted strings and check for errors and for strings you don’t want to generate keys for in later steps. Remove any of these unnecessary strings from the report. (Qordoba’s Strings Management Platform provides customers an interface that helps expedite this process.)

Step 2: Generate keys

Now you are ready to perform the second phase of extraction: generating keys. Whether you are localizing your code or developing segmented versions for different customers or software types, you need a unique key for each string. Keys should also be meaningful. For example, instead of hard coding “Hello” / “Hola” / “你好” / “What’s up?”, you should generate a key such as “Greeting.” Oftentimes, this second phase is the most time-consuming for a development team.

In an effort to solve this, Qordoba helps teams by generating unique Qor UUIDs that we merge with the three most important words within your string in order to correlate them to the context.
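As a rough illustration of the idea, the sketch below builds a key from up to three words of the string plus a short random suffix. Everything here is an assumption: the real keys come from Qordoba's key API, "most important" is approximated by word length after dropping a small hand-picked stopword list, and `uuid.uuid4()` stands in for a Qor UUID.

```python
import re
import uuid

# Tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "to", "of", "and", "or", "in", "on",
             "for", "your", "you", "is", "are"}


def make_key(text, max_words=3):
    """Build a readable key: up to three content words plus a unique suffix."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    unique = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    content = [w for w in unique if w not in STOPWORDS] or unique
    # Heuristic stand-in for "most important": the longest words win.
    top = set(sorted(content, key=len, reverse=True)[:max_words])
    ordered = [w for w in content if w in top]  # restore reading order
    suffix = uuid.uuid4().hex[:8]  # random part that keeps the key unique
    return "_".join(ordered + [suffix])


print(make_key("Please confirm your email address"))
```

The readable prefix tells a developer what the string is about, while the random suffix keeps keys from colliding when two strings share the same words.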

The generate command scans the file in report_dir and generates a new key for every string by calling our key API. Generated keys are added to a new report, which is stored in a new directory. NOTE: We store the keys in a new directory in case any connectivity issues occur. This makes it easier to do a quick spot check and confirm that each string has a corresponding unique key and, if necessary, that all of the reports have been processed.

In addition, Qordoba makes sure that identical strings are associated with the same key, and that you don’t have any duplicate or redundant keys.
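One simple way to get both guarantees is to keep a lookup from string to key and a set of keys already handed out. The sketch below assumes a caller-supplied `make_key` function; `assign_keys` and the numeric collision suffix are hypothetical and are not how the Qordoba key API necessarily works.

```python
def assign_keys(strings, make_key):
    """Map each distinct string to exactly one key, with no key used twice."""
    keys_by_string = {}
    used_keys = set()
    for text in strings:
        if text in keys_by_string:
            continue  # a repeated string reuses the key it already has
        base = make_key(text)
        key, counter = base, 1
        while key in used_keys:  # two different strings produced the same key
            counter += 1
            key = f"{base}_{counter}"
        keys_by_string[text] = key
        used_keys.add(key)
    return keys_by_string


# Deliberately naive key function so that a collision shows up in the demo.
def first_word(text):
    return text.lower().split()[0]


keys = assign_keys(["Sign in", "Sign up", "Sign in"], first_word)
print(keys)
```

In this demo the repeated "Sign in" keeps a single key, and "Sign up" gets a suffixed key instead of sharing one with a different string.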

To execute the generate command, type: qor i18n-generate -r report_dir -e report_key_dir

Step 3: Replace strings with keys

The final stage of your code preparation parses your files, locates the lines listed in the report, and replaces each hard-coded string with its associated key.
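A stripped-down version of that replacement step might look like the sketch below. It is an assumption-heavy illustration: it only swaps whole XML text nodes using plain string replacement, whereas the actual tool works from the locations recorded in its report. The function name and the example keys are made up.

```python
def replace_strings(source, keys_by_string):
    """Swap each hard-coded text node for its key.

    Matching on ">text<" means only whole text nodes are replaced,
    so "Sign in" does not touch the node "Sign in now".
    """
    for text, key in keys_by_string.items():
        source = source.replace(f">{text}<", f">{key}<")
    return source


source = (
    "<resources>"
    "<string>Sign in</string>"
    "<string>Sign in now</string>"
    "</resources>"
)
# Hypothetical keys, in the shape a key-generation step might produce.
keys = {
    "Sign in": "sign_in_ab12cd34",
    "Sign in now": "sign_in_now_ef56ab78",
}

localized = replace_strings(source, keys)
print(localized)
```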

To execute the final command, run: qor i18n-execute -i input_dir -r report_key_dir
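If you prefer to script the three steps, a small wrapper like the one below could chain them. The commands and flags are the ones shown in this post; the helper names, the directory names, and the dry-run default are assumptions. Running with `dry_run=False` requires the `qor` CLI to be installed, and in practice you would pause after the extract step to review the report before generating keys.

```python
import shlex
import subprocess


def build_commands(input_dir, report_dir, report_key_dir):
    """Return the three CLI calls from this post as argument lists."""
    return [
        ["qor", "i18n-extract", "-i", input_dir, "-r", report_dir],
        ["qor", "i18n-generate", "-r", report_dir, "-e", report_key_dir],
        ["qor", "i18n-execute", "-i", input_dir, "-r", report_key_dir],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(command, check=True)


commands = build_commands("input_dir", "report_dir", "report_key_dir")
run_pipeline(commands)  # dry run: prints the commands without executing them
```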

Having worked with a number of development teams across industries and countries that were looking to start large-scale localization/internationalization projects for their code, or to create different software versions for segmentation, Qordoba developed this tool and set of recommendations to help expedite the i18n/l10n extraction process.

We think this is a game changer for development teams, and hope that you find this guide helpful.
