Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of double #2055
I just tried to reproduce this using the environment at https://github.com/masseyke/es-spark-docker and it worked fine for me. I used:
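A minimal PySpark sketch of that kind of check (the host, port, and index name are assumptions for the docker environment, not the verbatim commands):

```python
# Sketch: read the sample index via es-hadoop and run the first actions tried.
# "localhost"/"9200" are assumptions for a local docker setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-geoip-check").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")
      .option("es.port", "9200")
      .load("kibana_sample_data_ecommerce"))

df.printSchema()   # succeeds: the schema is derived from the index mapping
print(df.count())  # also succeeds; per this thread, only df.show() trips the error
```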
Here's the output:
I'll upgrade to a newer Elasticsearch/es-hadoop to see if that makes a difference. I'm wondering, though, if upgrading to a newer Spark might fix your problem.
It also works with elasticsearch 8.6.0 and elasticsearch-spark-30_2.12-8.6.0.jar in my environment (otherwise the same as the one I described previously).
It works for me with spark 3.1.3 as well. Here's what I've got:
Otherwise I'm just installing "Sample eCommerce orders" in Kibana and running the code I pasted above.
Hi, thanks for trying to reproduce the issue. I missed one step in the code: only df.show() fails. printSchema() and count() work fine for me as well. Could you please try that too? Thanks,
Oh that makes sense. I can reproduce it now. The problem is a lat or lon that doesn't have a decimal. Here's the data I used to reproduce it:
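In outline (the index name and coordinates below are illustrative stand-ins, not the verbatim originals), the key is a geo_point field where one document has integer lat/lon and another has decimals:

```
PUT geo-test
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}

POST geo-test/_doc
{ "location": { "lat": 40.12, "lon": -71.34 } }

POST geo-test/_doc
{ "location": { "lat": 40, "lon": -71 } }
```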
Note that you have to have at least one document in there that does have decimals. And then in spark:
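A hedged sketch of that read, with connection defaults assumed:

```python
# Reading the index above; documents whose lat/lon were written as JSON
# integers violate the double-typed Spark schema once rows are materialized.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .load("geo-test"))

df.show()  # java.lang.Integer is not a valid external type for schema of double
```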
I'll see if this is easily fixable.
Thanks for that. I am new to Spark and not well versed in it. Q1: Is there a way to read the whole Elasticsearch JSON record as a schemaless column in a DataFrame? Q2: Is there a way to read the Elasticsearch columns as plain strings, rather than as interpreted rich datatypes? That way many of these read errors with Elasticsearch could be avoided.
Hi, I noticed that Elasticsearch has a coerce feature, so there can be silent data corruption. Is there a way to overcome this in Spark? https://xeraa.net/blog/2020_elasticsearch-coerce-float-to-integer-or-long/
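(For context: coerce is an Elasticsearch mapping parameter, not something es-hadoop or Spark controls. A hedged sketch of disabling it on a numeric field, so mistyped values are rejected at index time instead of silently converted; index and field names are illustrative:)

```
PUT my-index
{
  "mappings": {
    "properties": {
      "bytes": { "type": "integer", "coerce": false }
    }
  }
}

POST my-index/_doc
{ "bytes": 5.2 }   // rejected: coercion from float to integer is disabled
```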
This unfortunately does not look like an easy one to fix. The reason is that the only information we have in es-hadoop is that elasticsearch has handed us an integer, so that's what we pass to spark (and spark isn't being very flexible here). A little further away in the stack we have access to the schema, but that only tells us that we're in a geo_point field (and unfortunately the geo_point field type has several variants). I'll see if my colleague has any better ideas when he's back next week.
The esJsonRDD method is what you want -- https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-read-json.
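Since esJsonRDD is part of the connector's Scala API, a minimal Scala sketch (the index name is illustrative; sc is an existing SparkContext):

```scala
// esJsonRDD yields (document id, raw JSON source) pairs, skipping
// schema inference entirely.
import org.elasticsearch.spark._

val rdd = sc.esJsonRDD("kibana_sample_data_ecommerce")
rdd.take(1).foreach { case (id, json) => println(s"$id -> $json") }
```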
Thanks for the comments,
Thanks,
Hi there, I'm currently experiencing a rather similar issue on "official" sample data: I'm using the "Kibana Sample Data Logs", and my PySpark code fails at the same point. Is the problem in the data or in es-hadoop? Code:
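In outline, the failing read looks like this (connection settings are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-sample-logs").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")
      .option("es.port", "9200")
      .load("kibana_sample_data_logs"))

df.show()  # fails with the same Integer-vs-double encoding error
```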
Version Info
Java version: 1.8.0_382
PySpark Info:
Elasticsearch for Hadoop has a Scala API, and you would have to use that here. The issue occurs because the library reads the schema of your index, and some data inside Elasticsearch does not match that inferred schema. Elasticsearch stores the original document as JSON and, at index time, parses it into the mapped datatype in Lucene for searching. So when the library parses the original JSON (not the indexed form), it can infer the wrong datatype and error out. For example, a field may be mapped as float while the JSON holds values like 1.2, 1, 2.3; the 1 fails because it parses as an integer, not a float. To overcome this, either correct the problematic records or use the Scala API (not PySpark) and its esJsonRDD method.
I see...
Issue description
Trying a simple example of reading Elasticsearch GeoIP data from the sample kibana_sample_data_ecommerce index:
https://www.elastic.co/guide/en/kibana/8.6/get-started.html
If we try to read the geoip.location field, which is a geo_point field, df.show() errors with the following message:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of double
I tried various options without any success.
Steps to reproduce
Code:
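A minimal sketch consistent with the description above (the Elasticsearch endpoint is a placeholder, and es.nodes.wan.only is an assumption for a remote cluster):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-geoip").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "<elasticsearch-host>")   # placeholder
      .option("es.port", "9200")
      .option("es.nodes.wan.only", "true")          # assumption for a remote cluster
      .load("kibana_sample_data_ecommerce")
      .select("geoip.location"))
```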
Run df.show()
Stack trace:
Version Info
Jar: elasticsearch-spark-30_2.12-8.6.0.jar
OS: Google Dataproc 3-node default cluster
JVM :
Hadoop/Spark:
ES-Hadoop :
ES :
PySpark Version Info: