Hi there,
I am ingesting a large corpus of data that contains a string field of ASCII text. The text is quite varied and includes all sorts of commas, single and double quotes, backslashes, forward slashes and so on.
I get a lot of insert failures because it is unclear even how to escape a simple double quote (I escape it with a backslash, but the backslash comes back when I read the value).
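For reference, this is roughly what I do today when building the insert query (the entity/attribute names and the helper are just illustrative):

```python
# Illustrative sketch of my current approach: escape backslashes first,
# then double quotes, before embedding the value in a double-quoted
# TypeQL string literal. The "document"/"body" names are made up.
def escape_for_typeql(value: str) -> str:
    return value.replace("\\", "\\\\").replace('"', '\\"')

raw = 'He said "hello" and the path was C:\\temp\\data'
query = f'insert $x isa document, has body "{escape_for_typeql(raw)}";'

# The problem: when I read the attribute back over gRPC, the backslashes
# I added for escaping come back as part of the stored value.
```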
I am thinking there could be two ways to solve this:
we allow BASE64-encoded string values, together with a type flag (ASCII/UTF)
we enable a sort of raw insertion mode via the data API instead of the query API
I think the first option is simpler to implement; a rough sketch of what I mean is below. @james.williams
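To make the first option concrete, this is purely a sketch of the client-side round trip I have in mind (not an existing feature):

```python
import base64

def encode_value(text: str) -> str:
    # Encode the raw text so the stored string contains only base64
    # characters; no quotes or backslashes are left to escape.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_value(stored: str) -> str:
    # Decode on the way back out of the database.
    return base64.b64decode(stored).decode("utf-8")

original = 'A "tricky" value with \\ and / and , characters'
assert decode_value(encode_value(original)) == original
```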
We’re looking at this internally and aiming to align with the standard SQL & NoSQL ways of handling strings (i.e. backslashes used to escape characters aren’t stored in the database).
Based on what you’ve said in your post, only double quotes and backslashes that aren’t intended to escape characters should be an issue for you right now. Have you had a look at using our concept API? Here’s a link: Concept API | Vaticle.
Hello James,
That is correct. For now my most common use cases involve URL strings (backslashes, ampersands etc.) and text strings with single/double quotes.
Can you elaborate more on the Concept API?
I don’t see an entity method that would allow me to insert a thing programmatically?
I have access to the attributes immediately before creating the TypeQL string, and immediately after retrieving them via a gRPC call, so I am happy to change either step to suit your recommendation.
Are you using the concept API in Python as a workaround as suggested in Escape hell: handling string values - #4 by james.williams? My understanding is that this will insert strings as they are provided. Let me know if that isn’t working for you.
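Roughly, the concept API lets you create the entity and attach the attribute without building a TypeQL string at all, so the value is passed through as-is. Here is a minimal sketch with the Python client, assuming a 2.x client and placeholder `person`/`name` types; the exact casting and method names depend on your client version:

```python
from typedb.client import TypeDB, SessionType, TransactionType

with TypeDB.core_client("localhost:1729") as client:
    with client.session("my-database", SessionType.DATA) as session:
        with session.transaction(TransactionType.WRITE) as tx:
            # Look up types already defined in the schema (placeholder names).
            person_type = tx.concepts().get_entity_type("person")
            name_type = tx.concepts().get_attribute_type("name")

            # Create the instances directly; the string value is handed over
            # unchanged, with no TypeQL string literal to escape.
            person = person_type.as_remote(tx).create()
            name = name_type.as_remote(tx).as_string().put('raw "value" with \\ backslashes')
            person.as_remote(tx).set_has(name)

            tx.commit()
```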
Um, yeah, at the moment I transform the JSON models into TypeQL, because I thought it was a beautiful idea. This works perfectly, even with TypeQL object definitions of >50 lines, so it would be a hassle to convert to API calls for inserts.
I imagine the best interim workaround is to hard-code a fix in my gRPC attribute retrieval code, so that if there are multiple backslashes I divide the count by two, or some such hack. Can you advise on the best approach to handle this during extraction?
You could certainly try! Prefacing this with the advice that this approach is difficult and error-prone across large datasets: halve each run of backslashes, rounding up when the count is odd (e.g. 4 backslashes become 2, 5 become 3, 6 become 3), since the odd one out may be escaping a non-backslash character.
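If you do go that route, this is a rough sketch of the kind of post-processing I mean on the extraction side (again, fragile, and the rounding is only a heuristic):

```python
import re
from math import ceil

def collapse_backslashes(stored: str) -> str:
    # Replace each run of backslashes with ceil(n / 2) backslashes, on the
    # assumption that roughly half were added as escapes (for an odd run,
    # the last backslash may be escaping a non-backslash character).
    return re.sub(r"\\+", lambda m: "\\" * ceil(len(m.group(0)) / 2), stored)

stored = r"C:\\temp\\\dir"           # as returned: runs of 2 and 3 backslashes
print(collapse_backslashes(stored))  # -> C:\temp\\dir (runs halved, rounded up)
```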